-
Notifications
You must be signed in to change notification settings - Fork 876
Broken partial_ratio functionality with python-Levenshtein #79
Comments
Yes! It was driving me crazy today! I was trying this: fuzz.partial_ratio('Solo Tango Paisaje / Pablo Rodriguez - Corina Hererra / Planetango XIII'.lower(), 'Herrera'.lower()) and I get 29 score. 29 score matching Hererra with Herrera? Removing random parts of the first string seems to improve the score. This seems kind of a critical bug though, and I see now it's been over a year it was posted. |
Indeed removing python-Levenshtein fixed the problem! Thanks! |
Thanks to this bug - i didnt install python-Levenshtein yet :-) |
@acslater00 thoughts on this? |
Interesting -- I'll take a look at this |
seems like this is a known issue ztane/python-Levenshtein#16 My assumption was that get_matching_blocks() would always return the longest continuous block, but that seems to not be the case in some edge cases. |
this is a pretty critical bug, are there plans to work on this? Warning should probably be removed in the meantime |
@eliot1019 this code works well enough for our use cases, but if you'd like to fork/fix python-levenshtein, then maybe some progress can be made? The warning is because you're using a slower implementation which doesn't quite have the same output either, depending on your input. |
This issue is continuing to give people grief - see this SO question and my answer. For a really minimal example, consider:
This returns 50. It should be 100. A slightly more real-world example:
This returns 67. It should be 100. Why is this happening? There are two possible sets of minimum operations to convert There is potential for this bug to pop up any time that the shorter string is found as a non-continuous sequence in the longer string. This is unexpected, hard to account for, and potentially very problematic. I don't see this as being an isolated edge case. Perhaps it would be sensible to remove the use of |
@andrewguy I agree with this. I already use a implementation based on difflib in my implementation in RapidFuzz, which works correctly. The main argument for the implementation of python-Levenshtein appears to be it's performance. However the difflib based implementation in RapidFuzz is in no way slower. @acslater00 @josegonzalez I would be willing to port my implementation to FuzzyWuzzy. Probably the simplest solution would be a get_matching_blocks API in RapidFuzz, so there is no addition maintenance effort for SeatGeek. |
The partial_ratio calculation seems to yield incorrect results for certain combinations of strings when it uses the python-Levenshtein SequenceMatcher.
This works well:
But changing the longer string slightly, while not affecting the target:
Digging deeper, it appears that the get_matching_blocks() method returns substrings that do not actually match the string we are searching for, so the subsequent ratio calculations are performed on a set of poorly-matched ones.
Removing python-Levenshtein and using the python-only SequenceMatcher makes that method perform its job correctly. I couldn't figure out what it was about certain strings that made it break, after trying a whole bunch.
To top it off, the python-Levenshtein library appears to have been left unsupported for a while now. Any ideas? Maybe for now, removing the recommendation to use python-Levenshtein would let code run correctly, if not as fast? Thanks!
The text was updated successfully, but these errors were encountered: