This repository has been archived by the owner on Aug 26, 2024. It is now read-only.

Finding best matches in a list gives wrong results. #77

Closed
ardoi opened this issue Feb 13, 2015 · 2 comments · Fixed by #142

Comments

@ardoi

ardoi commented Feb 13, 2015

The process.extract function doesn't seem to handle capitalised queries well with some scorers:

>>> fuzzywuzzy.fuzz.partial_ratio('Santa Ana', 'Santa Ana')
100
>>> fuzzywuzzy.process.extract('Santa Ana', ['Santa Ana', 'Manta'], scorer=fuzzywuzzy.fuzz.partial_ratio)
[('Manta', 80), ('Santa Ana', 77)]

This happens because the choice strings are run through the processor but the query string is not. Lowercasing the query by hand gives the expected result:

>>> fuzzywuzzy.process.extract('santa ana', ['Santa Ana', 'Manta'], scorer=fuzzywuzzy.fuzz.partial_ratio)
[('Santa Ana', 100), ('Manta', 80)]

With some other scorers (e.g., WRatio) things work fine because those scorers process both strings internally anyway.

A workaround I use right now is:

>>> fuzzywuzzy.process.extract('Santa Ana', ['Santa Ana', 'Manta'], scorer=fuzzywuzzy.fuzz.partial_ratio, processor=lambda x: x)
[('Santa Ana', 100), ('Manta', 80)]
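
For illustration, the asymmetry can be reproduced without fuzzywuzzy at all. The sketch below is an assumption-laden stand-in: `simple_ratio` uses the standard library's difflib rather than any fuzzywuzzy scorer, and `score` is a hypothetical helper that lowercases the choice and, optionally, the query. It only shows that processing one side of a case-sensitive comparison lowers the score of an otherwise identical string; the exact numbers will differ from fuzzywuzzy's.

```python
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Stand-in, case-sensitive scorer (not fuzzywuzzy's): similarity as an int in 0..100."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def score(query, choice, process_query):
    """Lowercase the choice always, and the query only when process_query is True."""
    processed_choice = choice.lower()
    processed_query = query.lower() if process_query else query
    return simple_ratio(processed_query, processed_choice)


# Both sides processed: the strings are identical after lowercasing.
both = score("Santa Ana", "Santa Ana", process_query=True)

# Only the choice processed: 'S' and 'A' no longer match 's' and 'a'.
one_sided = score("Santa Ana", "Santa Ana", process_query=False)

print(both, one_sided)
```

With any case-sensitive scorer, the one-sided score falls below the two-sided one, which is the mismatch described above.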
@josegonzalez
Contributor

@acslater00 Thoughts?

@acslater00
Contributor

Yeah this definitely looks broken.

Unfortunately, I don't think it's safe to run processor() on the query, because that isn't and hasn't been part of the spec. It makes sense when processing is just an encoding conversion plus .lower(), but it wouldn't make sense in a situation like this:

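from fuzzywuzzy import fuzz, process

choices = [
    "[2014-01-01] abcd",
    "[2014-01-01] cdef",
    "[2014-01-01] ghij",
]

def strip_date(s):
    # Drop the 13-character "[2014-01-01] " prefix (the date plus the space after it).
    return s[13:]

# The processor is meant for the choices only; the query carries no date prefix.
process.extract("abcde", choices, processor=strip_date, scorer=fuzz.ratio)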

This example is contrived but not too unrealistic.
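
To make the hazard concrete, here is a minimal standalone sketch of what would happen if a choice-specific processor like this were also applied to the query. `PREFIX_LEN` is a name introduced here for illustration, and this `strip_date` assumes every choice starts with a date prefix of exactly that shape.

```python
# Length of a "[YYYY-MM-DD] " prefix, including the trailing space (illustrative constant).
PREFIX_LEN = len("[2014-01-01] ")


def strip_date(s):
    """Drop a fixed-width date prefix; only meaningful for strings that have one."""
    return s[PREFIX_LEN:]


# Applied to a choice, the processor does what it was written for.
print(repr(strip_date("[2014-01-01] abcd")))  # 'abcd'

# Applied to the query, which has no prefix, it discards the whole string.
print(repr(strip_date("abcde")))  # ''
```

The query is shorter than the prefix, so slicing leaves nothing to compare against, and every choice would be scored against an empty string.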

In retrospect, there probably shouldn't be a default processor at all, since the default scorer processes its own input. However, changing that would not be backward compatible (BC).

The problem here, I think, is that there is a mismatch between the expected input to the custom scorer (in this case, fuzz.partial_ratio) and the input to extract. Another reasonable solution is the following:

In [28]: process.extract(
    utils.full_process("Santa Ana"),
    ['Santa Ana', 'Manta'],
    scorer=fuzz.partial_ratio)
Out[28]: [('Santa Ana', 100), ('Manta', 80)]
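
The idea behind pre-processing the query can be sketched without the library. The `normalize` function below is a hypothetical, ASCII-only approximation of this kind of clean-up (lowercase, turn runs of other characters into a single space, trim); it is not fuzzywuzzy's utils.full_process and may differ from it in details such as Unicode handling.

```python
import re


def normalize(s):
    """Rough, ASCII-only normalisation: lowercase, collapse non-alphanumerics to spaces, trim."""
    return re.sub(r"[^a-z0-9]+", " ", s.lower()).strip()


query = "Santa Ana"
choices = ["Santa Ana", "Manta"]

# Normalise the query once, up front, so it is in the same form as the processed choices.
clean_query = normalize(query)
clean_choices = [normalize(c) for c in choices]

print(clean_query)    # santa ana
print(clean_choices)  # ['santa ana', 'manta']
```

Once both sides have been through the same normalisation, a case-sensitive scorer sees the matching choice as an exact match.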

ethanwhite added a commit to ethanwhite/core-transient that referenced this issue Nov 4, 2015
1. Set processor=str so that this design flaw in fuzzywuzzy:
seatgeek/fuzzywuzzy#77
doesn't result in incorrect ratio calculations by lowercasing
only one of the strings to be compared.
2. Use a simple ratio for the string comparison instead of a weighted
average with more complex ratios which are less appropriate to the
task.
ethanwhite added a commit to weecology/bbc-data-rescue that referenced this issue Jan 26, 2016
1. Set processor=str so that this design flaw in fuzzywuzzy:
seatgeek/fuzzywuzzy#77
doesn't result in incorrect ratio calculations by lowercasing
only one of the strings to be compared.
2. Use a simple ratio for the string comparison instead of a weighted
average with more complex ratios which are less appropriate to the
task.
paulbodean88 pushed a commit to paulbodean88/fuzzywuzzy that referenced this issue Sep 30, 2016