Skip to content
This repository has been archived by the owner on Aug 26, 2024. It is now read-only.

Is it using Levenshtein distance or the Ratcliff/Obershelp pattern-matching algorithm? #225

Open
pras135 opened this issue Dec 30, 2018 · 4 comments

Comments

@pras135
Copy link

pras135 commented Dec 30, 2018

As per the documentation of the library, it is mentioned that it uses Levenshtein distance for computing the differences between sequences. But upon close inspection, I find that it actually uses the SequenceMatcher function from the difflib library. And this function, as per documentation uses the Ratcliff/Obershelp pattern-matching algorithm.

As per the definitions, Levenshtein distance is the minimum number of edits needed to transform one string into the other, and Ratcliff/Obershelp pattern-matching algorithm computes the doubled number of matching characters divided by the total number of characters in the two strings. A close related post comparing both.

And when I run an example, I get the same result for SequenceMatcher and ratio function in fuzzywuzzy.

from difflib import SequenceMatcher
from fuzzywuzzy import fuzz
s = SequenceMatcher(None, "abcd", "bcde")
s.ratio()
# 0.75
fuzz.ratio("abcd", "bcde")
# 75

If I compute the Levenshtein distance manually between the two strings, I guess it will be just 2. In this case, how does it become that it uses Levenshtein distance as it is mentioned in the documentation?

@josegonzalez
Copy link
Contributor

@acslater00 Can you answer this?

@nol13
Copy link
Contributor

nol13 commented Jan 1, 2019

if python-Levenshtein is installed, it uses '(lensum - ldist)/(lensum))'

where ldist is calculated with replacement cost of 2.

https://github.com/ztane/python-Levenshtein/blob/master/Levenshtein/_levenshtein.c#L760

@dmarx
Copy link

dmarx commented May 26, 2020

So to summarize:

  1. The default behavior doesn't compute levenshtein distance but instead computes Ratcliff/Obershelp match ratio, which is similar but not identical to jaccard similarity.

  2. If python-levenshtein is installed, the "levenshtein ratio" is calculated.

This behavior doesn't appear to be documented anywhere beyond the abstract warning in the readme that installing python-levenshtein "may result in differing results for certain cases."

The second line of the readme asserts:

It uses Levenshtein Distance to calculate the differences between sequences in a simple-to-use package

Considering the default behavior never computes levenshtein, this behavior and the calculations used for computing different similarity functions should be more clearly documented.

@josegonzalez
Copy link
Contributor

Well seems like a good PR for someone to file to the readme to clarify what it does.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants