Possible bug for partial_ratio? #16

EdgarMCR · 2022-01-07T13:01:21Z

Hi,
Thank you very much for this package, it is really great!

I noticed unexpected behavior for the below example. It is probably just me not understanding how partial_ratio is supposed to work but I want to mention it in case it is a bug.

Sometimes partial_ratio returns a ratio under one hundred even though there is an exact match. The script below shows this behavior. With the longer string it fails, but if I shorten it, it returns the expected output.

Tested on Windows 10 with Python 3.8

If this is a bug and the solution is not obvious to you, let me know and I will have a look if I can find the issue.

import thefuzz
from thefuzz import fuzz


print('thefuzz.__version__ = {}'.format(thefuzz.__version__))

s1 = 'one, two'
s2 = 'If more than one Critical Illness is diagnosed at the same time, only one benefit will be payable. ' \
     'That benefit shall be based on the ' \
     'larger Benefit Amount of those diagnosed. Reoccurrence Diagnosis Benefit Once benefits have been paid for a ' \
     'Critical Illness, benefits are payable for that same Critical Illness up to  one two  times per Insured ' \
     'Individual per lifetime '

ratio = fuzz.partial_ratio(s1, s2)
print(ratio) # 50

s1 = 'one two'
ratio = fuzz.partial_ratio(s1, s2)
print('Exact match is in text but the ratio is ' + str(ratio)) # 57

s1 = 'one, two'
s2 = 'Critical Illness up to  one two  times per Insured Individual per lifetime'
ratio = fuzz.partial_ratio(s1, s2)
print(ratio) # 88

s1 = 'one two'
s2 = 'Critical Illness up to  one two  times per Insured Individual per lifetime'
ratio = fuzz.partial_ratio(s1, s2)
print(ratio) # 100


MNassar17 · 2022-01-10T19:57:59Z

I have the same problem, and it can be reproduced easily with this example:

from thefuzz import fuzz

s1 = "paris buratta wasabi water melon diet sparkling subtotal s.chg"
s2 = "subtotal"
score = fuzz.partial_ratio(s1, s2)
print(score)

This problem happens when 'python-Levenshtein' is installed. When I removed it, thefuzz gave the expected output of 100 for this example, but it still shows the same problem with other examples.

Any ideas?

EdgarMCR · 2022-01-10T20:11:54Z

I tried with RapidFuzz and it gives the correct result, which suggests this is a bug. For me, RapidFuzz was a drop-in replacement.

MNassar17 · 2022-01-10T21:00:24Z

Thank you so much, but the problem with RapidFuzz is that it doesn't have the StringMatcher class, so I can't get the exact matching position within the string. Do you have any ideas about this?

maxbachmann · 2022-01-18T11:34:25Z

This is caused by ztane/python-Levenshtein#16.
It is worth noting that this is not a bug in python-Levenshtein. get_matching_blocks in python-Levenshtein has an API compatible with difflib's. However, it does NOT guarantee that the result will include the longest common substring. It only guarantees that the result follows an optimal alignment according to the Levenshtein distance (which is not the case for difflib). So fuzzywuzzy/thefuzz simply misuses the library.
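To illustrate why the choice of matching blocks matters, here is a minimal sketch of a block-based partial ratio built on the standard library's difflib. The function name partial_ratio_sketch is hypothetical, and this is not the actual thefuzz code, only an approximation of the idea: each matching block proposes one window of the longer string, so the score can reach 100 only if some block lines up with the exact occurrence.

```python
from difflib import SequenceMatcher


def partial_ratio_sketch(s1: str, s2: str) -> int:
    """Block-based partial ratio using difflib for alignment (illustrative only)."""
    shorter, longer = (s1, s2) if len(s1) <= len(s2) else (s2, s1)
    if not shorter:
        return 0

    # autojunk=False stops difflib from ignoring frequent characters in long strings.
    matcher = SequenceMatcher(None, shorter, longer, autojunk=False)

    best = 0.0
    for a_start, b_start, _size in matcher.get_matching_blocks():
        # Each block proposes one window of len(shorter) inside `longer`.
        window_start = max(b_start - a_start, 0)
        window = longer[window_start:window_start + len(shorter)]
        ratio = SequenceMatcher(None, shorter, window, autojunk=False).ratio()
        best = max(best, ratio)

    return int(round(100 * best))
```

Because difflib builds its blocks around the longest contiguous match, an exact occurrence of the shorter string is always proposed as a window here. A matcher that returns blocks from an arbitrary optimal Levenshtein alignment gives no such guarantee, so the same loop can miss the exact occurrence.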

> Thank you so much, but the problem with RapidFuzz is that it doesn't have the StringMatcher class, so I can't get the exact matching position within the string. Do you have any ideas about this?

Assuming you are looking for a drop-in replacement for difflib, you might be interested in cydifflib (https://github.com/maxbachmann/CyDifflib), which is a fast replacement for difflib.
In addition, RapidFuzz provides an API, rapidfuzz.string_metric.levenshtein_editops, which returns one of the optimal alignments (as with python-Levenshtein, this does not always include the longest common substring).
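For the matching-position question, a minimal sketch using only the standard library's difflib may help (cydifflib is described above as a drop-in replacement, so the same calls should apply there, but that is an assumption). The helper name locate_longest_common_substring is hypothetical.

```python
from difflib import SequenceMatcher


def locate_longest_common_substring(text: str, query: str):
    """Return (start, size) of the longest block of `query` found in `text`."""
    matcher = SequenceMatcher(None, text, query, autojunk=False)
    # Explicit bounds keep this working on Python 3.8, where they are required.
    match = matcher.find_longest_match(0, len(text), 0, len(query))
    return match.a, match.size


text = "paris buratta wasabi water melon diet sparkling subtotal s.chg"
start, size = locate_longest_common_substring(text, "subtotal")
print(text[start:start + size])  # prints: subtotal
```

This returns only the longest contiguous block. For a fuzzy match that spans several blocks, get_matching_blocks() on the same matcher lists every aligned piece with its offsets.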

Note that this issue is well known for fuzzywuzzy: seatgeek/fuzzywuzzy#79
However, for SeatGeek this appears to work well enough (they accept incorrect results in some cases in exchange for better performance).

purplecrow2020 · 2022-07-21T08:52:17Z

@MNassar17 Sorry, this is off-topic, but what benefits do you get from using the exact matching position? I am exploring this for a work use case.

maxbachmann mentioned this issue Feb 15, 2023

replace python-Levenshtein with rapidfuzz #10

Merged
