This repository has been archived by the owner on Aug 26, 2024. It is now read-only.

Broken partial_ratio functionality with python-Levenshtein #79

Open
BorisVa opened this issue Feb 23, 2015 · 10 comments

Comments

@BorisVa

BorisVa commented Feb 23, 2015

The partial_ratio calculation seems to yield incorrect results for certain combinations of strings when it uses the python-Levenshtein SequenceMatcher.

This works well:

> fuzz.partial_ratio('this is a test', 'is this is a not this is a test!')
> 100

But changing the longer string slightly, while not affecting the target:

> fuzz.partial_ratio('this is a test', 'is this is a not really thing this is a test!') 
> 92

Digging deeper, it appears that the get_matching_blocks() method returns substrings that do not actually match the string we are searching for, so the subsequent ratio calculations are performed on poorly matched substrings.

Removing python-Levenshtein and using the pure-Python SequenceMatcher makes that method perform its job correctly. After trying a whole bunch of strings, I couldn't figure out what it is about certain ones that makes it break.

To top it off, the python-Levenshtein library appears to have been unmaintained for a while now. Any ideas? Maybe for now, removing the recommendation to use python-Levenshtein would let code run correctly, if not as fast? Thanks!
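For comparison, here is a quick check with the stdlib difflib SequenceMatcher (the implementation FuzzyWuzzy falls back to when python-Levenshtein is absent), using the failing pair from above. Unlike the python-Levenshtein version described in this report, difflib's get_matching_blocks() does recover the full contiguous match:

```python
from difflib import SequenceMatcher

short = 'this is a test'
long = 'is this is a not really thing this is a test!'

m = SequenceMatcher(None, short, long)
for block in m.get_matching_blocks():
    print(block)

# The largest matching block spans the entire shorter string,
# since "this is a test" occurs verbatim in the longer one.
assert max(b.size for b in m.get_matching_blocks()) == len(short)
```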

@stefdoerr

Yes! It was driving me crazy today! I was trying this:

fuzz.partial_ratio('Solo Tango Paisaje / Pablo Rodriguez - Corina Hererra / Planetango XIII'.lower(), 'Herrera'.lower())

and I get a score of 29. A score of 29 for matching "Hererra" against "Herrera"? Removing random parts of the first string seems to improve the score.
I will try your suggestion of removing Levenshtein.

This seems like a critical bug, though, and I see now that it's been over a year since it was posted.

@stefdoerr

Indeed, removing python-Levenshtein fixed the problem! Thanks!

@urupvog

urupvog commented Jun 21, 2016

Thanks to this bug, I haven't installed python-Levenshtein yet :-)

@josegonzalez
Contributor

@acslater00 thoughts on this?

@acslater00
Contributor

Interesting -- I'll take a look at this

@acslater00
Contributor

Seems like this is a known issue: ztane/python-Levenshtein#16

My assumption was that get_matching_blocks() would always return the longest continuous block, but that appears not to hold in some edge cases.

@eliot1019

This is a pretty critical bug; are there plans to work on it? The warning should probably be removed in the meantime.

@josegonzalez
Contributor

@eliot1019 this code works well enough for our use cases, but if you'd like to fork/fix python-Levenshtein, then maybe some progress can be made? The warning exists because without the library you're using a slower implementation, which doesn't quite produce the same output either, depending on your input.

@andrewguy

This issue is continuing to give people grief - see this SO question and my answer.

For a really minimal example, consider:

fuzz.partial_ratio('test', 't e s test t')

This returns 50. It should be 100.

A slightly more real-world example:

fuzz.partial_ratio('fat', 'find a fat cat')

This returns 67. It should be 100.

Why is this happening? There are two possible sets of minimum operations to convert "test" into "t e s test t". python-Levenshtein chooses the one that involves deleting the continuous "test" substring. I don't see this as an issue with python-Levenshtein, as the calculation of Levenshtein distance doesn't explicitly give weight to continuous blocks.

There is potential for this bug to pop up any time the shorter string occurs as a non-contiguous subsequence in the longer string. This is unexpected, hard to account for, and potentially very problematic; I don't see it as an isolated edge case. Perhaps it would be sensible to remove the use of python-Levenshtein until a solution is found? difflib does seem to handle this better.
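The behavior partial_ratio is supposed to have — score the shorter string against its best-matching window of the longer one — can be sketched as a brute-force loop. This is a hypothetical helper for illustration, not FuzzyWuzzy's actual code (which picks candidate windows via get_matching_blocks rather than scanning every offset):

```python
from difflib import SequenceMatcher

def partial_ratio_sketch(short, long):
    # Hypothetical reference implementation: compare the shorter string
    # against every same-length window of the longer one and keep the
    # best similarity ratio, scaled to 0-100.
    if len(short) > len(long):
        short, long = long, short
    best = 0.0
    for i in range(len(long) - len(short) + 1):
        window = long[i:i + len(short)]
        best = max(best, SequenceMatcher(None, short, window).ratio())
    return round(100 * best)

# Both failing examples above contain the shorter string verbatim,
# so the brute-force scan finds a perfect window and returns 100.
print(partial_ratio_sketch('test', 't e s test t'))   # prints 100
print(partial_ratio_sketch('fat', 'find a fat cat'))  # prints 100
```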

@maxbachmann

maxbachmann commented May 2, 2021

@andrewguy I agree with this. I already use an implementation based on difflib in RapidFuzz, which works correctly. The main argument for python-Levenshtein's implementation appears to be its performance; however, the difflib-based implementation in RapidFuzz is in no way slower.

@acslater00 @josegonzalez I would be willing to port my implementation to FuzzyWuzzy. Probably the simplest solution would be a get_matching_blocks API in RapidFuzz, so there is no additional maintenance effort for SeatGeek.
However, looking at recent pull requests, it appears that SeatGeek does not have the time to review them. In case you're interested, please let me know, since I do not want to waste my time on a solution that will not be reviewed.
