Wrong (non symmetric) results for some strings if Levenshtein module is not present #5

Open
spotamianos opened this issue Oct 6, 2021 · 8 comments

Comments

@spotamianos

If the python-Levenshtein library is not installed, then fuzz.ratio('abc', 'cadb') returns 57 and fuzz.ratio('cadb', 'abc') returns 29.
If the python-Levenshtein library is used, then both calls return 57.

Similar errors happen for other strings too, e.g.:
'abcd', 'cbda'
'ONYZBOHON', 'ZRKFULFORD'
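
The asymmetry can be reproduced with the standard library alone. The sketch below assumes that the fallback path scores a pair as `difflib.SequenceMatcher(None, s1, s2).ratio()` scaled to 0-100 and rounded; `fallback_ratio` is a hypothetical helper name, not part of thefuzz.

```python
from difflib import SequenceMatcher


def fallback_ratio(s1, s2):
    # Assumed to mirror the pure-Python fallback: difflib ratio scaled to 0-100.
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))


pairs = [("abc", "cadb"), ("abcd", "cbda"), ("ONYZBOHON", "ZRKFULFORD")]
for s1, s2 in pairs:
    forward = fallback_ratio(s1, s2)
    backward = fallback_ratio(s2, s1)
    verdict = "symmetric" if forward == backward else "NOT symmetric"
    print(s1, s2, forward, backward, verdict)
```

For the first pair this gives 57 in one order and 29 in the other, matching the numbers reported above.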

@spotamianos
Author

Another example is strings "9038119840000001" and "AS19970718106866036". We get 3 different answers:
levenshtein library: 34
no levenshtein library: 17
no levenshtein library and order reversed: 23

@rnegron

rnegron commented Nov 30, 2021

This is probably because, without the Levenshtein library, this package falls back to a different algorithm: the standard library's difflib. See this issue from the previous name of this project: seatgeek/fuzzywuzzy#128
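
A short sketch of why argument order can matter with difflib. `SequenceMatcher` picks the longest matching block first, preferring the one that starts earliest in the first sequence, and then only looks to the left and right of that block. Its ratio is 2*M/T, where M is the number of matched characters and T is the combined length, so a different first pick can change M. `matched_chars` is an illustrative helper, not a library function.

```python
from difflib import SequenceMatcher


def matched_chars(s1, s2):
    # Sum the sizes of the matching blocks difflib settles on (the M in 2*M/T).
    blocks = SequenceMatcher(None, s1, s2).get_matching_blocks()
    return sum(block.size for block in blocks)


# 'abc' first: 'a' is matched first, which leaves room to also match 'b'.
print(matched_chars("abc", "cadb"))  # 2, so 2*2/7 rounds to 57
# 'cadb' first: 'c' is matched first, against the last character of 'abc',
# so nothing to its right in 'abc' is left to match.
print(matched_chars("cadb", "abc"))  # 1, so 2*1/7 rounds to 29
```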

@saeed2402

saeed2402 commented Aug 25, 2022

This is all very confusing. Could someone please explain what "If the python-Levenshtein library is not installed" means? Does it mean that if both "thefuzz" and "python-Levenshtein" are installed on the machine, we should get the same results for "fuzz.ratio('abc', 'cadb')" and "fuzz.ratio('cadb', 'abc')"?
I have both of them installed and still get different results.

@maxbachmann
Contributor

Yes if both are installed you should get the same results:

>>> from thefuzz import fuzz
>>> fuzz.ratio('abc', 'cadb')
57
>>> fuzz.ratio('cadb', 'abc')
57
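
For comparison, a pure-Python sketch of a score that is symmetric by construction. It assumes that the Levenshtein-backed ratio behaves like a normalized insert/delete similarity, 2*LCS/(len(s1)+len(s2)), where LCS is the length of the longest common subsequence. `lcs_length` and `indel_ratio` are illustrative names, not functions from thefuzz or python-Levenshtein.

```python
def lcs_length(s1, s2):
    # Longest common subsequence length via dynamic programming, one row at a time.
    prev = [0] * (len(s2) + 1)
    for ch1 in s1:
        curr = [0]
        for j, ch2 in enumerate(s2, start=1):
            if ch1 == ch2:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_ratio(s1, s2):
    # Assumed formula: matched characters counted twice, over the combined length.
    total = len(s1) + len(s2)
    if total == 0:
        return 100
    return int(round(100.0 * 2 * lcs_length(s1, s2) / total))


print(indel_ratio("abc", "cadb"))  # 57
print(indel_ratio("cadb", "abc"))  # 57
```

Because the longest common subsequence does not depend on argument order, this score cannot differ between the two calls.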

@saeed2402

@maxbachmann I am using this library in Amazon Redshift in a Python UDF. Both libraries are installed on my Amazon Redshift cluster, running "select * from pg_library;" gives me "python_levenshtein, setuptools and thefuzz"

and I still get 29 for "SELECT f_fuzzy_string_match( 'cadb', 'abc');"

@maxbachmann
Contributor

Then you do not have a valid installation of python-Levenshtein. You can check whether thefuzz is falling back to difflib (the following prints True if it is):

from thefuzz import fuzz
import difflib
print(fuzz.SequenceMatcher == difflib.SequenceMatcher)

@saeed2402

@maxbachmann Something is not right here mate. In "StringMatcher.py", line 11 is "from Levenshtein import *", which looks like the project you maintain at: https://github.com/maxbachmann/Levenshtein
Whereas your reference and all the others point to https://pypi.org/project/python-Levenshtein/ , which apparently is not maintained anymore.
Can you please clarify which one is the correct one to have installed?
Also, your project "Levenshtein" errors out when used with Redshift Python UDF:

ERROR: File "/rdsdbdata/user_lib/1/0/105733.zip/Levenshtein/__init__.py", line 17
    __author__: str = "Max Bachmann"
    ^
SyntaxError: invalid syntax. Please look at svl_udf_log for more information
Detail:
-----------------------------------------------
error: File "/rdsdbdata/user_lib/1/0/105733.zip/Levenshtein/__init__.py", line 17
    __author__: str = "Max Bachmann"
    ^
SyntaxError: invalid syntax. Please look at svl_udf_log for more information
code: 10000
context: UDF
query: 0
location: udf_client.cpp:364
process: padbmaster [pid=3125]
-----------------------------------------------
[ErrorId: 1-63081de5-55324c9c41e8b72f3fdf7abb]
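
The line flagged by the SyntaxError annotates a variable with `: str`, syntax that exists only from Python 3.6 onward, so the interpreter running the UDF is probably older than that. A minimal diagnostic sketch, written to run on both Python 2 and 3, that reports the interpreter version and whether `import Levenshtein` succeeds; `describe_levenshtein_import` is a hypothetical helper name.

```python
import sys


def describe_levenshtein_import():
    """Return a one-line report on whether `import Levenshtein` works here."""
    version = "%d.%d" % (sys.version_info[0], sys.version_info[1])
    try:
        import Levenshtein  # noqa: F401
    except Exception as exc:
        # Covers ImportError as well as SyntaxError raised by an interpreter
        # that is too old for the package's source code.
        return "Python %s: import failed (%s: %s)" % (
            version, type(exc).__name__, exc)
    return "Python %s: import OK" % version


print(describe_levenshtein_import())
```

If the import fails, thefuzz would be expected to fall back to difflib, which would explain the 29.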

@maxbachmann
Contributor

Both should work.

4 participants