This repository has been archived by the owner on Aug 26, 2024. It is now read-only.

Broken partial_ratio functionality with python-Levenshtein #79

Open
BorisVa opened this issue Feb 23, 2015 · 10 comments

Comments

@BorisVa

BorisVa commented Feb 23, 2015

The partial_ratio calculation seems to yield incorrect results for certain combinations of strings when it uses the python-Levenshtein SequenceMatcher.

This works well:

> fuzz.partial_ratio('this is a test', 'is this is a not this is a test!')
> 100

But changing the longer string slightly, while not affecting the target:

> fuzz.partial_ratio('this is a test', 'is this is a not really thing this is a test!') 
> 92

Digging deeper, it appears that the get_matching_blocks() method returns substrings that do not actually match the string we are searching for, so the subsequent ratio calculations are performed on poorly matched substrings.

Removing python-Levenshtein and using the pure-Python SequenceMatcher makes that method perform its job correctly. After trying a whole bunch of strings, I couldn't figure out what it is about certain ones that makes it break.

To top it off, the python-Levenshtein library appears to have been unmaintained for a while now. Any ideas? Maybe for now, removing the recommendation to use python-Levenshtein would let code run correctly, if not as fast? Thanks!
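For comparison, here is a quick check with the stdlib difflib SequenceMatcher (the implementation FuzzyWuzzy falls back to when python-Levenshtein is absent), using the failing pair from above. Unlike the python-Levenshtein version described in this report, difflib's get_matching_blocks() does recover the full contiguous match:

```python
from difflib import SequenceMatcher

short = 'this is a test'
long = 'is this is a not really thing this is a test!'

m = SequenceMatcher(None, short, long)
for block in m.get_matching_blocks():
    print(block)

# The largest matching block spans the entire shorter string,
# since "this is a test" occurs verbatim in the longer one.
assert max(b.size for b in m.get_matching_blocks()) == len(short)
```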

@stefdoerr

Yes! It was driving me crazy today! I was trying this:

fuzz.partial_ratio('Solo Tango Paisaje / Pablo Rodriguez - Corina Hererra / Planetango XIII'.lower(), 'Herrera'.lower())

and I get a score of 29. A score of 29 for matching "Hererra" against "Herrera"? Removing random parts of the first string seems to improve the score.
I will try your suggestion of removing Levenshtein.

This seems like a critical bug, though, and I see now that it's been over a year since it was posted.

@stefdoerr

Indeed, removing python-Levenshtein fixed the problem! Thanks!

@urupvog

urupvog commented Jun 21, 2016

Thanks to this bug, I haven't installed python-Levenshtein yet :-)

@josegonzalez
Contributor

@acslater00 thoughts on this?

@acslater00
Contributor

Interesting -- I'll take a look at this

@acslater00
Contributor

Seems like this is a known issue: ztane/python-Levenshtein#16

My assumption was that get_matching_blocks() would always return the longest continuous block, but that appears not to hold in some edge cases.

@eliot1019

This is a pretty critical bug; are there plans to work on it? The warning should probably be removed in the meantime.

@josegonzalez
Contributor

@eliot1019 this code works well enough for our use cases, but if you'd like to fork/fix python-Levenshtein, then maybe some progress can be made? The warning exists because without the library you're using a slower implementation, which doesn't quite produce the same output either, depending on your input.

@andrewguy

This issue is continuing to give people grief - see this SO question and my answer.

For a really minimal example, consider:

fuzz.partial_ratio('test', 't e s test t')

This returns 50. It should be 100.

A slightly more real-world example:

fuzz.partial_ratio('fat', 'find a fat cat')

This returns 67. It should be 100.

Why is this happening? There are two possible sets of minimum operations to convert "test" into "t e s test t". python-Levenshtein chooses the one that involves deleting the continuous "test" substring. I don't see this as an issue with python-Levenshtein, as the calculation of Levenshtein distance doesn't explicitly give weight to continuous blocks.

There is potential for this bug to pop up any time the shorter string occurs as a non-contiguous subsequence in the longer string. This is unexpected, hard to account for, and potentially very problematic; I don't see it as an isolated edge case. Perhaps it would be sensible to remove the use of python-Levenshtein until a solution is found? difflib does seem to handle this better.
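The behavior partial_ratio is supposed to have — score the shorter string against its best-matching window of the longer one — can be sketched as a brute-force loop. This is a hypothetical helper for illustration, not FuzzyWuzzy's actual code (which picks candidate windows via get_matching_blocks rather than scanning every offset):

```python
from difflib import SequenceMatcher

def partial_ratio_sketch(short, long):
    # Hypothetical reference implementation: compare the shorter string
    # against every same-length window of the longer one and keep the
    # best similarity ratio, scaled to 0-100.
    if len(short) > len(long):
        short, long = long, short
    best = 0.0
    for i in range(len(long) - len(short) + 1):
        window = long[i:i + len(short)]
        best = max(best, SequenceMatcher(None, short, window).ratio())
    return round(100 * best)

# Both failing examples above contain the shorter string verbatim,
# so the brute-force scan finds a perfect window and returns 100.
print(partial_ratio_sketch('test', 't e s test t'))   # prints 100
print(partial_ratio_sketch('fat', 'find a fat cat'))  # prints 100
```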

@maxbachmann

maxbachmann commented May 2, 2021

@andrewguy I agree with this. I already use an implementation based on difflib in RapidFuzz, which works correctly. The main argument for python-Levenshtein's implementation appears to be its performance; however, the difflib-based implementation in RapidFuzz is in no way slower.

@acslater00 @josegonzalez I would be willing to port my implementation to FuzzyWuzzy. Probably the simplest solution would be a get_matching_blocks API in RapidFuzz, so there is no additional maintenance effort for SeatGeek.
However, looking at recent pull requests, it appears that SeatGeek does not have the time to review them. In case you're interested, please let me know, since I do not want to waste my time on a solution that will not be reviewed.
