This repository has been archived by the owner on Aug 26, 2024. It is now read-only.

Faulty result of partial ratio (without python-Levenshtein) #264

Open
funytan opened this issue Feb 17, 2020 · 3 comments

Comments

funytan commented Feb 17, 2020

It is known that the partial_ratio calculation yields incorrect results for some combinations of strings when it uses the python-Levenshtein SequenceMatcher (see #79 (comment)).

However, after uninstalling python-Levenshtein, fuzzywuzzy still returns a wrong score for certain strings:

> fuzz.partial_ratio('home sweet home', ' home sweet home no.12, fsfaf, fsffs, fsdf fsdf, sf, sfs. jl. home sweet home df.df, sdfds, sdf. sdf, sdf sdf, sdf, sdfdf. df home sweet home no.12, fsdf, sdfd, sdf fdsf, sdf, sdf. jl. home sweet home, fdsf, sdf, sdf sdf, sdf, fdg. jl. home sweet home no.12, gfg, fg, fg fg, df, gfg. gg. df, df, df df, df df, df. df. home sweet home df.12, df, df, df df, df, df. df. home sweet home, df, df. df, df df, df, df. gg. df df. home sweet home, df, df. df, df df, df, df. df. home sweet home no.12, df, df, df df, df, df. df. home sweet home df.df, df, df, df df, df, df')

> 13

Interestingly, installing python-Levenshtein gives the correct score of 100 for this input.

The problem seems to occur when a short string is compared against a much longer one.

Has anyone faced this before?

@funytan funytan changed the title Behavior of partial ratio without python-Levenshtein Faulty result of partial ratio (without python-Levenshtein) Feb 17, 2020

aniketcomps commented Mar 5, 2020

I noticed that if you delete the leading space in the longer string, the expected score of 100 is returned. I couldn't figure out why.
If your goal is to compute similarity against a long string, stripping leading and trailing spaces might do the trick.
PS: I am using the pure-Python SequenceMatcher, not python-Levenshtein.

fuzz.partial_ratio('home sweet home', 'home sweet home no.12, fsfaf, fsffs, fsdf fsdf, sf, sfs. jl. home sweet home df.df, sdfds, sdf. sdf, sdf sdf, sdf, sdfdf. df home sweet home no.12, fsdf, sdfd, sdf fdsf, sdf, sdf. jl. home sweet home, fdsf, sdf, sdf sdf, sdf, fdg. jl. home sweet home no.12, gfg, fg, fg fg, df, gfg. gg. df, df, df df, df df, df. df. home sweet home df.12, df, df, df df, df, df. df. home sweet home, df, df. df, df df, df, df. gg. df df. home sweet home, df, df. df, df df, df, df. df. home sweet home no.12, df, df, df df, df, df. df. home sweet home df.df, df, df, df df, df, df')

Out[32]: 100

funytan commented Mar 12, 2020

@aniketcomps thanks! It works fine after deleting the leading space, but when I also removed the first word and the space after it, it fails again! Haha. I'm using the pure-Python SequenceMatcher as well.

fuzz.partial_ratio('home sweet home', 'sweet home no.12, fsfaf, fsffs, fsdf fsdf, sf, sfs. jl. home sweet home df.df, sdfds, sdf. sdf, sdf sdf, sdf, sdfdf. df home sweet home no.12, fsdf, sdfd, sdf fdsf, sdf, sdf. jl. home sweet home, fdsf, sdf, sdf sdf, sdf, fdg. jl. home sweet home no.12, gfg, fg, fg fg, df, gfg. gg. df, df, df df, df df, df. df. home sweet home df.12, df, df, df df, df, df. df. home sweet home, df, df. df, df df, df, df. gg. df df. home sweet home, df, df. df, df df, df, df. df. home sweet home no.12, df, df, df df, df, df. df. home sweet home df.df, df, df, df df, df, df')

Out[773]: 13

@maxbachmann

As I described in #279, this is most likely caused by the automatic junk heuristic of difflib, which fuzzywuzzy does not deactivate.
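
To make that explanation concrete, here is a minimal sketch that uses only the standard library. It assumes the pure-Python path builds a `difflib.SequenceMatcher` with the default `autojunk=True`. Per the difflib documentation, when the second sequence has at least 200 items, any item whose duplicates make up more than 1% of it is treated as "popular" and ignored when looking for matches. The `haystack` string is a synthetic stand-in for the long string in this issue, and `longest_block` and `partial_ratio_sketch` are hypothetical helpers written for illustration, not fuzzywuzzy's actual code.

```python
from difflib import SequenceMatcher


def longest_block(short, long_, autojunk=True):
    """Return the size of the largest matching block difflib reports."""
    matcher = SequenceMatcher(None, short, long_, autojunk=autojunk)
    return max(block.size for block in matcher.get_matching_blocks())


def partial_ratio_sketch(s1, s2, autojunk=True):
    """Hypothetical partial-ratio-style score (0-100); not fuzzywuzzy's code."""
    shorter, longer = (s1, s2) if len(s1) <= len(s2) else (s2, s1)
    matcher = SequenceMatcher(None, shorter, longer, autojunk=autojunk)
    best = 0.0
    for block in matcher.get_matching_blocks():
        # Align the short string against the long one at this block's offset.
        start = max(block.b - block.a, 0)
        window = longer[start:start + len(shorter)]
        # The window is short, so the heuristic would not trigger here anyway.
        score = SequenceMatcher(None, shorter, window, autojunk=False).ratio()
        best = max(best, score)
    return int(round(100 * best))


needle = "home sweet home"
# Synthetic stand-in for the long string: more than 200 characters, and
# every character repeats often enough to be classed as "popular".
haystack = ", ".join([needle + " no.12"] * 20)

print(longest_block(needle, " " + haystack))                  # leading space, autojunk on
print(longest_block(needle, haystack))                        # starts with needle, autojunk on
print(longest_block(needle, " " + haystack, autojunk=False))  # leading space, autojunk off
print(partial_ratio_sketch(needle, " " + haystack))
print(partial_ratio_sketch(needle, " " + haystack, autojunk=False))
```

With the heuristic on and a leading space, difflib should report no non-empty block at all, so the only alignment left to score is the tail of the long string and the result is low. When the long string happens to start with the short one, the match is still found because difflib extends a match from the very first position, which would explain why deleting the leading space "fixes" the score but deleting the first word breaks it again. Passing `autojunk=False` should recover the full 15-character block in every case.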
