Mismatch between different implementations of Levenshtein #47

Closed
angelosalatino opened this issue Feb 1, 2023 · 5 comments
Labels
question Further information is requested

Comments

@angelosalatino

Hi there,

thank you for this work. I love this library because it is 100x faster than its competitors (e.g., strsimpy).

However, I have noticed that for the same pair of strings, your implementation returns a different similarity value.

from strsimpy.normalized_levenshtein import NormalizedLevenshtein
import Levenshtein


a = 'database system'
b = 'database systems'

print("Ratio from Strsimpy")
normalized_levenshtein = NormalizedLevenshtein()
print(normalized_levenshtein.similarity(a, b))

print("Ratio from Levenshtein")
print(Levenshtein.seqratio(a, b))

As a result I get:

Ratio from Strsimpy
0.9375
Ratio from Levenshtein
0.967741935483871

I have checked with online tools, and it seems that the similarity between a and b is 0.9375 (check here https://awsm-tools.com/levenshtein-distance?form%5Bsource%5D=database+system&form%5Btarget%5D=database+systems), which is in line with Strsimpy.

Do you know why we get different similarity values?

Thanks a lot
Angelo

@maxbachmann
Member

Levenshtein.ratio is based on the Indel distance, not on the Levenshtein distance. The Indel distance is basically the Levenshtein distance without substitutions: only insertions and deletions are allowed, so replacing a character costs 2 (one deletion plus one insertion). You can get the normalized Levenshtein similarity using rapidfuzz:

>>> from rapidfuzz.distance import Levenshtein, Indel
>>> a = 'database system'
>>> b = 'database systems'
>>> Levenshtein.normalized_similarity(a, b)
0.9375
>>> Levenshtein.normalized_similarity(a, b, weights=(1,1,2))
0.967741935483871
>>> Indel.normalized_similarity(a, b)
0.967741935483871
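
To see where the two numbers come from, here is a minimal pure-Python sketch of the arithmetic. The helper weighted_levenshtein is a hypothetical name written for illustration and is not part of either library. It assumes the similarity is normalized by the longer string's length for uniform weights and by the sum of both lengths for Indel, which reproduces the values quoted in this thread.

```python
def weighted_levenshtein(s, t, ins=1, dele=1, sub=1):
    """Edit distance via the classic dynamic-programming recurrence.

    Hypothetical helper for illustration only; not a library function.
    """
    prev = [j * ins for j in range(len(t) + 1)]
    for i, cs in enumerate(s, start=1):
        cur = [i * dele]
        for j, ct in enumerate(t, start=1):
            cur.append(min(
                prev[j] + dele,                          # delete cs
                cur[j - 1] + ins,                        # insert ct
                prev[j - 1] + (0 if cs == ct else sub),  # match or substitute
            ))
        prev = cur
    return prev[-1]


a = 'database system'    # 15 characters
b = 'database systems'   # 16 characters

# Uniform weights: plain Levenshtein distance (one insertion here).
lev = weighted_levenshtein(a, b)
# Substitution cost 2 is the same as forbidding substitutions: Indel distance.
indel = weighted_levenshtein(a, b, sub=2)

# Assumed normalizations: longer length for Levenshtein, summed lengths for Indel.
lev_sim = 1 - lev / max(len(a), len(b))     # 1 - 1/16
indel_sim = 1 - indel / (len(a) + len(b))   # 1 - 1/31
print(lev_sim, indel_sim)
```

Both distances are 1 for this pair; only the denominator differs (16 versus 31), which is why the Indel-based ratio comes out higher.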

When processing large amounts of data, I recommend using one of the processing functions in rapidfuzz.process. They are a lot faster, since they do not need to switch between Python and C++ for each string comparison.

maxbachmann added the "question" (Further information is requested) label on Feb 1, 2023
@angelosalatino
Author

Thanks for the clarification and the lead. I will test this one. Thanks again

@R-N

R-N commented Jun 5, 2023

Why doesn't the Levenshtein library use the Levenshtein distance for ratio?

@maxbachmann
Member

Why doesn't the Levenshtein library use the Levenshtein distance for ratio?

The original author implemented it like this for some unknown reason and I kept it like this for backwards compatibility. I agree it is pretty confusing, so at least I made sure to mention the indel distance in the documentation.

@R-N

R-N commented Jun 5, 2023

I think it should be mentioned in the readme too, since that is what people read first. I mean, I would expect the Levenshtein library to use the Levenshtein distance unless specified otherwise. It's not like "Indel" is in the function name.

3 participants