You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
{{ message }}
This repository has been archived by the owner on Aug 26, 2024. It is now read-only.
I am trying to wrap my head around how it calculates the Levenshtein Distance between two strings, as the docs clearly mention that it is using that.
The Levenshtein Distance algorithm counts looks for the minimum number of edits between the two strings. That can be achieved using addition, deletion, and substitution of a character in the string. All these operations are counted as a single operation when calculating the score.
Levenshtein Score = 1 (it requires 1 edit, substitution of 'a' to 'o')
Plugging these scores into the Fuzzywuzzy formula (len(s1)+len(s2) - LevenshteinScore)/((len(s1)+len(s2)):
Example 1: (5+4-1)/9 = 89%
Example 2: (5+5-1)/10 = 90%
Now the fuzzywuzzy does return the same score for Example 1, but not for example 2. The score for example 2 is 80%. On investigating how it is calculating the distances under the hood, I found out that it counts the 'substitution' operation as 2 operations rather than 1 (as defined for Levenshtein). I understand that it uses the difflib library but I just want to know why is it called Levenshtein Distance, when it actually is not?
I am just trying to figure out why is there a distinction here? What does it mean or explain? Basically the reason for using 2 operations for substitution rather than one as defined in Levenshtein Distance and still calling it Levenshtein Distance. Is it got something to do with the gaps in sentences? Is this a standard way of converting LD to a normalized similarity score?
Would love if somebody could give me some insight, thank you!
The text was updated successfully, but these errors were encountered:
The Levenshtein distance implementation uses the following weights:
insertion: 1
deletion: 1
replacement: 2
and normalizes the result the following way: 100 * (1 - distance / (len(s1)+len(s2)))
this is closer to
this provides results, that are very close to the results of difflib.
In fact at least in the german wikipedia article of the levenshtein distance this is listed as a variant called weighted levenshtein distance. The "normal" Levenshtein distance would be normalised using 100 * (1 - distance / max(len(s1)+len(s2)))
I am trying to wrap my head around how it calculates the Levenshtein Distance between two strings, as the docs clearly mention that it is using that.
The Levenshtein Distance algorithm counts looks for the minimum number of edits between the two strings. That can be achieved using addition, deletion, and substitution of a character in the string. All these operations are counted as a single operation when calculating the score.
Here is are a couple of examples:
Example 1
s1 = 'hello'
s2 = 'hell'
Levenshtein Score = 1 (it requires 1 edit, addition of 'o')
Example 2
s1 = 'hello'
s2 = 'hella'
Levenshtein Score = 1 (it requires 1 edit, substitution of 'a' to 'o')
Plugging these scores into the Fuzzywuzzy formula (len(s1)+len(s2) - LevenshteinScore)/((len(s1)+len(s2)):
Example 1: (5+4-1)/9 = 89%
Example 2: (5+5-1)/10 = 90%
Now the fuzzywuzzy does return the same score for Example 1, but not for example 2. The score for example 2 is 80%. On investigating how it is calculating the distances under the hood, I found out that it counts the 'substitution' operation as 2 operations rather than 1 (as defined for Levenshtein). I understand that it uses the difflib library but I just want to know why is it called Levenshtein Distance, when it actually is not?
I am just trying to figure out why is there a distinction here? What does it mean or explain? Basically the reason for using 2 operations for substitution rather than one as defined in Levenshtein Distance and still calling it Levenshtein Distance. Is it got something to do with the gaps in sentences? Is this a standard way of converting LD to a normalized similarity score?
Would love if somebody could give me some insight, thank you!
The text was updated successfully, but these errors were encountered: