Finding best matches in a list gives wrong results. #77

ardoi · 2015-02-13T10:41:09Z

The process.extract function doesn't seem to handle capitalised queries well with some scorers:

>>> import fuzzywuzzy.fuzz, fuzzywuzzy.process
>>> fuzzywuzzy.fuzz.partial_ratio('Santa Ana', 'Santa Ana')
100
>>> fuzzywuzzy.process.extract('Santa Ana', ['Santa Ana', 'Manta'], scorer=fuzzywuzzy.fuzz.partial_ratio)
[('Manta', 80), ('Santa Ana', 77)]

This is because here the choice string is processed but the query string is not:

>>> fuzzywuzzy.process.extract('santa ana', ['Santa Ana', 'Manta'], scorer=fuzzywuzzy.fuzz.partial_ratio)
[('Santa Ana', 100), ('Manta', 80)]

With some other scorers (e.g., WRatio) things work fine because those scorers process both strings internally anyway.

A workaround I use right now is:

>>> fuzzywuzzy.process.extract('Santa Ana', ['Santa Ana', 'Manta'], scorer=fuzzywuzzy.fuzz.partial_ratio, processor=lambda x: x)
[('Santa Ana', 100), ('Manta', 80)]
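
To make the asymmetry concrete without depending on fuzzywuzzy internals, here is a minimal standard-library sketch. Everything in it is a stand-in and an assumption: `partial_ratio_sketch`, `lower_strip`, and `extract_choices_only` are hypothetical helpers that only approximate the behaviour described above (a case-sensitive partial scorer, a lowercasing processor, and an extract that processes the choices but not the query). Exact scores may differ from the library's.

```python
from difflib import SequenceMatcher


def partial_ratio_sketch(a, b):
    """Hypothetical stand-in for a case-sensitive partial scorer (0-100).

    Compares the shorter string against every same-length window of the
    longer one and keeps the best similarity.
    """
    shorter, longer = sorted((a, b), key=len)
    if not shorter:
        return 0
    width = len(shorter)
    best = max(
        SequenceMatcher(None, shorter, longer[i:i + width]).ratio()
        for i in range(len(longer) - width + 1)
    )
    return int(round(100 * best))


def lower_strip(s):
    """Hypothetical stand-in for a default processor: lowercase and trim."""
    return s.lower().strip()


def extract_choices_only(query, choices, scorer, processor):
    """Mimics the reported behaviour: the processor runs on choices only."""
    scored = [(choice, scorer(query, processor(choice))) for choice in choices]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


choices = ["Santa Ana", "Manta"]

# Capitalised query against lowercased choices: the exact match loses points.
mismatched = extract_choices_only("Santa Ana", choices, partial_ratio_sketch, lower_strip)

# Lowercasing the query by hand puts both sides on the same footing.
lowered = extract_choices_only("santa ana", choices, partial_ratio_sketch, lower_strip)

# An identity processor leaves both sides untouched, like the workaround.
identity = extract_choices_only("Santa Ana", choices, partial_ratio_sketch, lambda x: x)

print(mismatched)
print(lowered)
print(identity)
```

In this sketch the first call ranks 'Manta' above 'Santa Ana', while the other two put the exact match first with a score of 100. That is the same pattern as in the sessions above.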


josegonzalez · 2015-02-13T16:39:49Z

@acslater00 Thoughts?

acslater00 · 2015-02-19T21:25:01Z

Yeah this definitely looks broken.

Unfortunately I don't think it's safe to run processor() on the query, because that isn't and hasn't been part of the spec. It makes sense when processing is just doing an encoding conversion and running .lower(), but it wouldn't make sense in a situation like this:

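from fuzzywuzzy import fuzz, process

choices = [
    "[2014-01-01] abcd",
    "[2014-01-01] cdef",
    "[2014-01-01] ghij",
]

def strip_date(s):
    # Drop the 13-character "[YYYY-MM-DD] " prefix from each choice.
    return s[13:]

process.extract("abcde", choices, processor=strip_date, scorer=fuzz.ratio)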

This example is contrived but not too unrealistic.
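
A quick sketch of why applying the processor to the query would go wrong here. It uses only plain Python and assumes the prefix is the 13 characters "[2014-01-01] " (bracketed date plus one space); the variable names are illustrative.

```python
def strip_date(s):
    # Drop the 13-character "[YYYY-MM-DD] " prefix from a choice string.
    return s[13:]


choice = "[2014-01-01] abcd"
query = "abcde"

# On a choice, the processor removes the date and keeps the payload.
processed_choice = strip_date(choice)

# On the query, which has no date prefix, the same slice removes everything.
processed_query = strip_date(query)

print(repr(processed_choice), repr(processed_query))
```

The query collapses to an empty string, so every choice would be scored against nothing. A processor written for the shape of the choices cannot be assumed to be safe on the query.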

In retrospect there probably shouldn't be a default processor at all, since the default scorer processes its own input. However, changing that would not be backwards compatible.

The problem here, I think, is that there is a mismatch between the expected input to the custom scorer (in this case, fuzz.partial_ratio) and the input to extract. Another reasonable solution is the following:

In [28]: process.extract(
    ...:     utils.full_process("Santa Ana"),
    ...:     ['Santa Ana', 'Manta'],
    ...:     scorer=fuzz.partial_ratio)
Out[28]: [('Santa Ana', 100), ('Manta', 80)]

1. Set processor=str so that this design flaw in fuzzywuzzy (seatgeek/fuzzywuzzy#77) doesn't result in incorrect ratio calculations caused by lowercasing only one of the strings being compared.
2. Use a simple ratio for the string comparison instead of a weighted average with more complex ratios, which are less appropriate to the task.

paulbodean88 pushed a commit to paulbodean88/fuzzywuzzy that referenced this issue Sep 30, 2016

fix issue seatgeek#77

831c4ae

This was referenced Oct 29, 2016

Strange results that depends on sort and case #141

Open

Clarify default behaviour of extract / Add tests for matching strings #142

Merged

josegonzalez closed this as completed in #142 Nov 1, 2016

nol13 mentioned this issue Dec 4, 2016

Remove query processing and default processor #150

Closed
