Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible bug for partial_ratio? #16

Open
EdgarMCR opened this issue Jan 7, 2022 · 5 comments
Open

Possible bug for partial_ratio? #16

EdgarMCR opened this issue Jan 7, 2022 · 5 comments

Comments

@EdgarMCR
Copy link

EdgarMCR commented Jan 7, 2022

Hi,
Thank you very much for this package, it is really great!

I noticed unexpected behavior for the below example. It is probably just me not understanding how partial_ratio is supposed to work but I want to mention it in case it is a bug.

Sometimes partial_ratio returns a ratio under one hundred even though there is an exact match. The below script shows this behavior. With a longer string it fails but if a shorten it, it returns the expected output.

Tested on Windows 10 with Python 3.8

If this is a bug and the solution is not obvious to you, let me know and I will have a look if I can find the issue.

import thefuzz
from thefuzz import fuzz


print('thefuzz.__version__ = {}'.format(thefuzz.__version__))

s1 = 'one, two'
s2 = 'If more than one Critical Illness is diagnosed at the same time, only one benefit will be payable. ' \
     'That benefit shall be based on the ' \
     'larger Benefit Amount of those diagnosed. Reoccurrence Diagnosis Benefit Once benefits have been paid for a ' \
     'Critical Illness, benefits are payable for that same Critical Illness up to  one two  times per Insured ' \
     'Individual per lifetime '

ratio = fuzz.partial_ratio(s1, s2)
print(ratio) # 50

s1 = 'one two'
ratio = fuzz.partial_ratio(s1, s2)
print('Exact match is in text but the ratio is ' + str(ratio)) # 57

s1 = 'one, two'
s2 = 'Critical Illness up to  one two  times per Insured Individual per lifetime'
ratio = fuzz.partial_ratio(s1, s2)
print(ratio) # 88

s1 = 'one two'
s2 = 'Critical Illness up to  one two  times per Insured Individual per lifetime'
ratio = fuzz.partial_ratio(s1, s2)
print(ratio) # 100
@MNassar17
Copy link

MNassar17 commented Jan 10, 2022

I have the same problem and it can be reproduced easily using this example

s1 = "paris buratta wasabi water melon diet sparkling subtotal s.chg"
s2 = "subtotal"
score = fuzz.partial_ratio(s1, s2)
print(score)

This problem happens when 'python-Levenshtein' is installed. when I removed it, the fuzz give the expected output of 100%. but it still has the same problem with other examples.

Any idea ?

@EdgarMCR
Copy link
Author

I tried with RapidFuzz and that give the correct result. This seems to suggest this is a bug. For me, RapidFuzz was a drop-in replacement.

@MNassar17
Copy link

Thank you so much, but the problem with RapidFuzz is that it doesn't have the StringMatcher class so I can't get the exact matching position within the string. do you have any idea about this ?

@maxbachmann
Copy link
Contributor

maxbachmann commented Jan 18, 2022

This is caused by ztane/python-Levenshtein#16.
It is worth noting that this is no bug in python-Levenshtein. get_matching_blocks in python-Levenshtein has a compatible API to difflib. However, it does NOT guarantee that it will include the longest common substring. It only guarantees that the result follows an optimal alignment according to the Levenshtein distance (which is not the case for difflib). So fuzzywuzzy/thefuzz simply misuses the library.

Thank you so much, but the problem with RapidFuzz is that it doesn't have the StringMatcher class so I can't get the exact matching position within the string. do you have any idea about this ?

Assuming you search for a drop in replacement for difflib, you might be interested in cydifflib (https://github.com/maxbachmann/CyDifflib), which is a fast replacement for difflib.
In addition, Rapidfuzz provides an API rapidfuzz.string_metric.levenshtein_editops, which returns one of the optimal alignments (similar to python-Levenshtein this does not always include the longest common substring)

Note that this issue is well known for fuzzywuzzy: seatgeek/fuzzywuzzy#79
However, for SeatGeek this appears to work well enough (they value performance over incorrect results in some cases)

@purplecrow2020
Copy link

purplecrow2020 commented Jul 21, 2022

@MNassar17 sorry its out of refrence but using the exact matching position what sort of benefits do u extract, exploring for some work use case

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants