This repository has been archived by the owner on Aug 26, 2024. It is now read-only.

Clarification between the Levenshtein Distance algorithm and how it is implemented here #248

Open
samarth12 opened this issue Oct 2, 2019 · 2 comments

Comments

samarth12 commented Oct 2, 2019

I am trying to wrap my head around how it calculates the Levenshtein Distance between two strings, since the docs clearly state that this is what it uses.

The Levenshtein Distance algorithm counts the minimum number of edits needed to turn one string into the other, using insertion, deletion, and substitution of single characters. Each of these operations counts as one edit when calculating the score.

Here are a couple of examples:

Example 1
s1 = 'hello'
s2 = 'hell'

Levenshtein Score = 1 (it requires 1 edit, addition of 'o')

Example 2
s1 = 'hello'
s2 = 'hella'

Levenshtein Score = 1 (it requires 1 edit, substitution of 'a' to 'o')

Plugging these scores into the fuzzywuzzy formula (len(s1) + len(s2) - LevenshteinScore) / (len(s1) + len(s2)):

Example 1: (5+4-1)/9 = 89%
Example 2: (5+5-1)/10 = 90%
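To double-check my numbers, here is a quick sketch of the textbook (unit-cost) Levenshtein distance together with the formula above (my own sketch, not fuzzywuzzy's actual code):

```python
def levenshtein(s1, s2):
    # Classic dynamic-programming edit distance; every edit costs 1.
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (c1 != c2)))   # substitution (cost 1)
        prev = curr
    return prev[-1]

def ratio(s1, s2):
    # Normalization from the formula above:
    # (len(s1) + len(s2) - distance) / (len(s1) + len(s2))
    total = len(s1) + len(s2)
    return round(100 * (total - levenshtein(s1, s2)) / total)

print(ratio('hello', 'hell'))   # 89
print(ratio('hello', 'hella'))  # 90
```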

Now fuzzywuzzy does return the same score for Example 1, but not for Example 2: the score for Example 2 is 80%. On investigating how it calculates the distances under the hood, I found that it counts the 'substitution' operation as 2 operations rather than 1 (as defined for Levenshtein). I understand that it uses the difflib library, but I just want to know why it is called Levenshtein Distance when it actually is not.

I am just trying to figure out why there is a distinction here. What does it mean or explain? Basically: why use 2 operations for substitution rather than 1 as defined in Levenshtein Distance, while still calling it Levenshtein Distance? Does it have something to do with gaps in sentences? Is this a standard way of converting LD to a normalized similarity score?
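Here is a sketch that reproduces the 80% score I am seeing, assuming substitution is simply weighted as 2 (again my own sketch, not the library's code):

```python
def weighted_levenshtein(s1, s2):
    # Edit distance with insertion = 1, deletion = 1, substitution = 2.
    # A substitution of cost 2 is never cheaper than a delete plus an
    # insert, so this equals the insert/delete-only edit distance.
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            curr.append(min(prev[j] + 1,                          # deletion
                            curr[j - 1] + 1,                      # insertion
                            prev[j - 1] + (0 if c1 == c2 else 2)))  # substitution
        prev = curr
    return prev[-1]

def ratio(s1, s2):
    # Same normalization as before, but with the weighted distance.
    total = len(s1) + len(s2)
    return round(100 * (1 - weighted_levenshtein(s1, s2) / total))

print(ratio('hello', 'hell'))   # 89
print(ratio('hello', 'hella'))  # 80
```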

Would love if somebody could give me some insight, thank you!

taraldb commented Nov 27, 2019

Hi,

It uses python-Levenshtein as a replacement for difflib.
The implementation uses Levenshtein for speed, but it doesn't actually return the Levenshtein distance.

The ratio is based on the real minimal edit distance.

maxbachmann commented Dec 16, 2020

The Levenshtein distance implementation uses the following weights:

  • insertion: 1
  • deletion: 1
  • replacement: 2

and normalizes the result the following way: 100 * (1 - distance / (len(s1)+len(s2)))
This provides results that are very close to those of difflib.

In fact, at least the German Wikipedia article on the Levenshtein distance lists this as a variant called the weighted Levenshtein distance. The "normal" Levenshtein distance would be normalized as 100 * (1 - distance / max(len(s1), len(s2))).
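A quick sketch contrasting the two normalizations for the plain unit-cost distance:

```python
def levenshtein(s1, s2):
    # Unit-cost edit distance (insertion, deletion, substitution all cost 1).
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (c1 != c2)))
        prev = curr
    return prev[-1]

def norm_sum(s1, s2):
    # Normalization over the summed lengths: 100 * (1 - d / (len1 + len2))
    return round(100 * (1 - levenshtein(s1, s2) / (len(s1) + len(s2))))

def norm_max(s1, s2):
    # Usual normalization: 100 * (1 - d / max(len1, len2))
    return round(100 * (1 - levenshtein(s1, s2) / max(len(s1), len(s2))))

print(norm_sum('hello', 'hella'))  # 90
print(norm_max('hello', 'hella'))  # 80
```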
