Is it using Levenshtein distance or the Ratcliff/Obershelp pattern-matching algorithm? #225

pras135 · 2018-12-30T18:28:07Z

As per the documentation of the library, it is mentioned that it uses Levenshtein distance for computing the differences between sequences. But upon close inspection, I find that it actually uses the SequenceMatcher function from the difflib library. And this function, as per documentation uses the Ratcliff/Obershelp pattern-matching algorithm.

As per the definitions, Levenshtein distance is the minimum number of edits needed to transform one string into the other, and Ratcliff/Obershelp pattern-matching algorithm computes the doubled number of matching characters divided by the total number of characters in the two strings. A close related post comparing both.

And when I run an example, I get the same result for SequenceMatcher and ratio function in fuzzywuzzy.

from difflib import SequenceMatcher
from fuzzywuzzy import fuzz
s = SequenceMatcher(None, "abcd", "bcde")
s.ratio()
# 0.75
fuzz.ratio("abcd", "bcde")
# 75

If I compute the Levenshtein distance manually between the two strings, I guess it will be just 2. In this case, how does it become that it uses Levenshtein distance as it is mentioned in the documentation?

The text was updated successfully, but these errors were encountered:

josegonzalez · 2018-12-30T23:01:43Z

@acslater00 Can you answer this?

nol13 · 2019-01-01T14:36:15Z

if python-Levenshtein is installed, it uses '(lensum - ldist)/(lensum))'

where ldist is calculated with replacement cost of 2.

https://github.com/ztane/python-Levenshtein/blob/master/Levenshtein/_levenshtein.c#L760

dmarx · 2020-05-26T17:26:42Z

So to summarize:

The default behavior doesn't compute levenshtein distance but instead computes Ratcliff/Obershelp match ratio, which is similar but not identical to jaccard similarity.
If python-levenshtein is installed, the "levenshtein ratio" is calculated.

This behavior doesn't appear to be documented anywhere beyond the abstract warning in the readme that installing python-levenshtein "may result in differing results for certain cases."

The second line of the readme asserts:

It uses Levenshtein Distance to calculate the differences between sequences in a simple-to-use package

Considering the default behavior never computes levenshtein, this behavior and the calculations used for computing different similarity functions should be more clearly documented.

josegonzalez · 2020-05-28T20:45:31Z

Well seems like a good PR for someone to file to the readme to clarify what it does.

maxbachmann mentioned this issue Apr 2, 2023

replace python-Levenshtein with rapidfuzz seatgeek/thefuzz#10

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Is it using Levenshtein distance or the Ratcliff/Obershelp pattern-matching algorithm? #225

Is it using Levenshtein distance or the Ratcliff/Obershelp pattern-matching algorithm? #225

pras135 commented Dec 30, 2018

josegonzalez commented Dec 30, 2018

nol13 commented Jan 1, 2019

dmarx commented May 26, 2020

josegonzalez commented May 28, 2020

Is it using Levenshtein distance or the Ratcliff/Obershelp pattern-matching algorithm? #225

Is it using Levenshtein distance or the Ratcliff/Obershelp pattern-matching algorithm? #225

Comments

pras135 commented Dec 30, 2018

josegonzalez commented Dec 30, 2018

nol13 commented Jan 1, 2019

dmarx commented May 26, 2020

josegonzalez commented May 28, 2020