This repository has been archived by the owner on Aug 26, 2024. It is now read-only.

Clarification between the Levenshtein Distance algorithm and how it is implemented here #248

Open
samarth12 opened this issue Oct 2, 2019 · 2 comments

Comments

samarth12 commented Oct 2, 2019

I am trying to wrap my head around how it calculates the Levenshtein Distance between two strings, since the docs clearly state that this is what it uses.

The Levenshtein Distance algorithm counts the minimum number of edits needed to turn one string into the other, using insertion, deletion, and substitution of single characters. Each of these operations counts as one edit when calculating the score.

Here are a couple of examples:

Example 1
s1 = 'hello'
s2 = 'hell'

Levenshtein Score = 1 (it requires 1 edit, addition of 'o')

Example 2
s1 = 'hello'
s2 = 'hella'

Levenshtein Score = 1 (it requires 1 edit, substitution of 'a' to 'o')

Plugging these scores into the fuzzywuzzy formula (len(s1) + len(s2) - LevenshteinScore) / (len(s1) + len(s2)):

Example 1: (5+4-1)/9 = 89%
Example 2: (5+5-1)/10 = 90%
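To double-check my numbers, here is a quick sketch of the textbook (unit-cost) Levenshtein distance together with the formula above (my own sketch, not fuzzywuzzy's actual code):

```python
def levenshtein(s1, s2):
    # Classic dynamic-programming edit distance; every edit costs 1.
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (c1 != c2)))   # substitution (cost 1)
        prev = curr
    return prev[-1]

def ratio(s1, s2):
    # Normalization from the formula above:
    # (len(s1) + len(s2) - distance) / (len(s1) + len(s2))
    total = len(s1) + len(s2)
    return round(100 * (total - levenshtein(s1, s2)) / total)

print(ratio('hello', 'hell'))   # 89
print(ratio('hello', 'hella'))  # 90
```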

Now fuzzywuzzy does return the same score for Example 1, but not for Example 2: the score for Example 2 is 80%. On investigating how it calculates the distances under the hood, I found that it counts the 'substitution' operation as 2 operations rather than 1 (as defined for Levenshtein). I understand that it uses the difflib library, but I just want to know why it is called Levenshtein Distance when it actually is not.

I am just trying to figure out why there is a distinction here. What does it mean or explain? Basically: why use 2 operations for substitution rather than 1 as defined in Levenshtein Distance, while still calling it Levenshtein Distance? Does it have something to do with gaps in sentences? Is this a standard way of converting LD to a normalized similarity score?
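Here is a sketch that reproduces the 80% score I am seeing, assuming substitution is simply weighted as 2 (again my own sketch, not the library's code):

```python
def weighted_levenshtein(s1, s2):
    # Edit distance with insertion = 1, deletion = 1, substitution = 2.
    # A substitution of cost 2 is never cheaper than a delete plus an
    # insert, so this equals the insert/delete-only edit distance.
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            curr.append(min(prev[j] + 1,                          # deletion
                            curr[j - 1] + 1,                      # insertion
                            prev[j - 1] + (0 if c1 == c2 else 2)))  # substitution
        prev = curr
    return prev[-1]

def ratio(s1, s2):
    # Same normalization as before, but with the weighted distance.
    total = len(s1) + len(s2)
    return round(100 * (1 - weighted_levenshtein(s1, s2) / total))

print(ratio('hello', 'hell'))   # 89
print(ratio('hello', 'hella'))  # 80
```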

Would love if somebody could give me some insight, thank you!

taraldb commented Nov 27, 2019

Hi,

It uses python-Levenshtein as a replacement for difflib.
The implementation uses Levenshtein for speed, but it doesn't actually return the Levenshtein distance.

The ratio is based on the real minimal edit distance.

maxbachmann commented Dec 16, 2020

The Levenshtein distance implementation uses the following weights:

  • insertion: 1
  • deletion: 1
  • replacement: 2

and normalizes the result the following way: 100 * (1 - distance / (len(s1)+len(s2)))
This provides results that are very close to those of difflib.

In fact, at least the German Wikipedia article on the Levenshtein distance lists this as a variant called the weighted Levenshtein distance. The "normal" Levenshtein distance would be normalized as 100 * (1 - distance / max(len(s1), len(s2))).
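A quick sketch contrasting the two normalizations for the plain unit-cost distance:

```python
def levenshtein(s1, s2):
    # Unit-cost edit distance (insertion, deletion, substitution all cost 1).
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (c1 != c2)))
        prev = curr
    return prev[-1]

def norm_sum(s1, s2):
    # Normalization over the summed lengths: 100 * (1 - d / (len1 + len2))
    return round(100 * (1 - levenshtein(s1, s2) / (len(s1) + len(s2))))

def norm_max(s1, s2):
    # Usual normalization: 100 * (1 - d / max(len1, len2))
    return round(100 * (1 - levenshtein(s1, s2) / max(len(s1), len(s2))))

print(norm_sum('hello', 'hella'))  # 90
print(norm_max('hello', 'hella'))  # 80
```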
