Skip to content
This repository has been archived by the owner on Aug 26, 2024. It is now read-only.

fuzz.ratio is not always commutative #173

Open
Alberth289346 opened this issue Aug 5, 2017 · 6 comments
Open

fuzz.ratio is not always commutative #173

Alberth289346 opened this issue Aug 5, 2017 · 6 comments

Comments

@Alberth289346
Copy link

One of the properties that you'd expect is that the order of the arguments doesn't matter, ie fuzz.ratio(A, B) == fuzz.ratio(B, A) for all strings A and B. This property is also listed at https://en.wikipedia.org/wiki/Edit_distance#Properties as one of the fundamental properties of editing distance.

While it seems to hold most of the times, unfortunately this is not always the case:

from fuzzywuzzy import fuzz

s1 = "tehchqt"
s2 = "tehctht"

print("distance {} -> {}: {}".format(s1, s2, fuzz.ratio(s1, s2)))
print("distance {} -> {}: {}".format(s2, s1, fuzz.ratio(s2, s1)))

gives

/usr/lib/python3.5/site-packages/fuzzywuzzy/fuzz.py:32: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning
  warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')
distance tehchqt -> tehctht: 86
distance tehctht -> tehchqt: 71

I haven't tested it with the levenshtein package.

@DavidCEllis
Copy link
Contributor

DavidCEllis commented Sep 30, 2017

With python-Levenshtein it appears to be commutative (at least for this example).

>>> from fuzzywuzzy.fuzz import ratio
>>> s1 = 'tehchqt'
>>> s2 = 'tehctht'
>>> ratio(s1, s2)
86
>>> ratio(s2, s1)
86

But I can confirm that it is not commutative without python-Levenshtein.

>>> from fuzzywuzzy.fuzz import ratio
<<WARNINGS>>
>>> s1 = 'tehchqt'
>>> s2 = 'tehctht'
>>> ratio(s1, s2)
86
>>> ratio(s2, s1)
71

I think the issue here is using difflib which doesn't actually give true edit distances - from the docs

This does not yield minimal edit sequences, but does tend to yield matches that “look right” to people.

If you put these examples through the SequenceMatcher in difflib you will get the same (but unrounded) results.

@Alberth289346
Copy link
Author

difflib sounds as a likely cause indeed. A relatively simple fix for this case could be to try both directions and take the min or max of them. That would fix the reported discrepancy, although it makes the problem of determining what is actually measured more complicated, in particular if difflib result and python-Levenshtein result are not the same.

In the mean time, we decided that normalized equality notion did not work for our project, and instead opted for true edit distance. As such we are not waiting for a fix of this issue.

@josegonzalez
Copy link
Contributor

Probably no good way to fix this if you are using difflib, and documenting that it isn't commutative is our best shot.

@akashbudhia
Copy link

How can I force fuzz.partial_ratio allow only 1st argument as substring and not the other way round.

For example following example will be giving misleading matches:
fuzz.partial_ratio("An Act of Wisdom Play", "Play") => 100%

@josegonzalez
Copy link
Contributor

I'm not implementing that in fuzzywuzzy, but you are free to do so in a wrapper function by just checking if one string is in another. Note that you'll definitely incur a performance penalty.

@akashbudhia
Copy link

akashbudhia commented Jan 5, 2018 via email

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

4 participants