You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
{{ message }}
This repository has been archived by the owner on Aug 26, 2024. It is now read-only.
As per the documentation of the library, it is mentioned that it uses Levenshtein distance for computing the differences between sequences. But upon close inspection, I find that it actually uses the SequenceMatcher function from the difflib library. And this function, as per documentation uses the Ratcliff/Obershelp pattern-matching algorithm.
As per the definitions, Levenshtein distance is the minimum number of edits needed to transform one string into the other, and Ratcliff/Obershelp pattern-matching algorithm computes the doubled number of matching characters divided by the total number of characters in the two strings. A close related post comparing both.
And when I run an example, I get the same result for SequenceMatcher and ratio function in fuzzywuzzy.
from difflib import SequenceMatcher
from fuzzywuzzy import fuzz
s = SequenceMatcher(None, "abcd", "bcde")
s.ratio()
# 0.75
fuzz.ratio("abcd", "bcde")
# 75
If I compute the Levenshtein distance manually between the two strings, I guess it will be just 2. In this case, how does it become that it uses Levenshtein distance as it is mentioned in the documentation?
The text was updated successfully, but these errors were encountered:
The default behavior doesn't compute levenshtein distance but instead computes Ratcliff/Obershelp match ratio, which is similar but not identical to jaccard similarity.
If python-levenshtein is installed, the "levenshtein ratio" is calculated.
This behavior doesn't appear to be documented anywhere beyond the abstract warning in the readme that installing python-levenshtein "may result in differing results for certain cases."
The second line of the readme asserts:
It uses Levenshtein Distance to calculate the differences between sequences in a simple-to-use package
Considering the default behavior never computes levenshtein, this behavior and the calculations used for computing different similarity functions should be more clearly documented.
As per the documentation of the library, it is mentioned that it uses Levenshtein distance for computing the differences between sequences. But upon close inspection, I find that it actually uses the
SequenceMatcher
function from thedifflib
library. And this function, as per documentation uses the Ratcliff/Obershelp pattern-matching algorithm.As per the definitions, Levenshtein distance is the minimum number of edits needed to transform one string into the other, and Ratcliff/Obershelp pattern-matching algorithm computes the doubled number of matching characters divided by the total number of characters in the two strings. A close related post comparing both.
And when I run an example, I get the same result for
SequenceMatcher
andratio
function in fuzzywuzzy.If I compute the Levenshtein distance manually between the two strings, I guess it will be just 2. In this case, how does it become that it uses Levenshtein distance as it is mentioned in the documentation?
The text was updated successfully, but these errors were encountered: