This repository has been archived by the owner on Aug 26, 2024. It is now read-only.

Partial_Ratio not working #279

Open
aW3st opened this issue Aug 27, 2020 · 5 comments

Comments

@aW3st

aW3st commented Aug 27, 2020

Having some weird issues using partial ratio. Here's the code:

from fuzzywuzzy import fuzz

test_string = ('completed transactions settlement date trade date '
               'symbol name transaction type account type quantity price commissions & fees amount '
               '12/23 12/23 dividend '
               'appreciation etf dividend - - - $441.99 12/23 12/23 '
               'vig dividend appreciation etf reinvestment cash')

'etf' in test_string # returns True
fuzz.partial_ratio('etf', test_string)

Without python-Levenshtein this returns 33; with python-Levenshtein it returns 67. My understanding of the method is that it should be 100, since there's a substring that's a perfect match. Any ideas?

(on python 3.8, btw)

@XDGFX

XDGFX commented Aug 31, 2020

I'm having the same issue. I would also expect a score of 100 from the calls below:

>>> artists_a
'carvar & clock'
>>> artists_b
'carvar clock'
>>> fuzz.partial_ratio(artists_a, artists_b)
83
>>> fuzz.partial_ratio(artists_b, artists_a)
83

I also tried without python-Levenshtein, as suggested in #79, but got the exact same result.

@XDGFX

XDGFX commented Aug 31, 2020

Possibly replace partial_ratio with partial_token_sort_ratio, as mentioned in this Stack Overflow answer. In both our examples it seemed to work as expected.

@maxbachmann

maxbachmann commented Sep 1, 2020

partial_ratio searches for the best alignment between two strings and then calculates fuzz.ratio for this alignment. In @aW3st's case the word 'etf' is part of the second string, so you would expect a result of 100. That's not the case in your example, @XDGFX.
'carvar & clock' and 'carvar clock' are not substrings of each other. However, partial_token_sort_ratio works because it re-sorts the words to 'carvar clock &' and 'carvar clock', after which 'carvar clock' is a substring of 'carvar clock &' ;)
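To make the alignment idea concrete, here is a minimal brute-force sketch using only the standard library. It is not fuzzywuzzy's actual implementation: `sliding_partial_ratio` and `sort_tokens` are hypothetical helper names, and the sketch simply scores the shorter string against every equally long window of the longer one.

```python
from difflib import SequenceMatcher


def sliding_partial_ratio(s1: str, s2: str) -> int:
    """Best ratio (0-100) of the shorter string against every
    equally long window of the longer string (brute force)."""
    shorter, longer = sorted((s1, s2), key=len)
    if not shorter:
        return 0
    best = 0.0
    for start in range(len(longer) - len(shorter) + 1):
        window = longer[start:start + len(shorter)]
        # autojunk=False so difflib never discards frequent characters
        score = SequenceMatcher(None, shorter, window, autojunk=False).ratio()
        best = max(best, score)
    return round(100 * best)


def sort_tokens(s: str) -> str:
    """Split on whitespace, sort the tokens, and rejoin them.
    Where '&' ends up depends on the sort order used."""
    return " ".join(sorted(s.split()))


# 'etf' occurs verbatim, so one window matches exactly: 100
print(sliding_partial_ratio("etf", "dividend appreciation etf reinvestment"))

# No 12-character window of 'carvar & clock' equals 'carvar clock': below 100
print(sliding_partial_ratio("carvar & clock", "carvar clock"))

# After sorting tokens, one string is a substring of the other: 100
print(sliding_partial_ratio(sort_tokens("carvar & clock"),
                            sort_tokens("carvar clock")))
```

The sketch is quadratic and meant only to show why a verbatim substring should score 100 while the '&' example should not.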

@aW3st, you tried both with and without python-Levenshtein, and both give wrong results, for different reasons.

  1. python-Levenshtein has a known bug in finding the optimal alignment between strings, which is probably the bug you're encountering here as well. See: Broken partial_ratio functionality with python-Levenshtein #79 (comment)
  2. Without python-Levenshtein, fuzzywuzzy falls back to difflib. Here the problem appears to come from difflib's automatic junk heuristic (autojunk), which is enabled by default. Fixing it would require changing

     m = SequenceMatcher(None, shorter, longer)

     to

     m = SequenceMatcher(None, shorter, longer, False)
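The effect of the junk heuristic can be reproduced with difflib alone. The sketch below is an illustration, not fuzzywuzzy's code: `longest_block` is a hypothetical helper, and the string is the one from the original report. When the second sequence is at least 200 items long, difflib treats very frequent items as "popular" and leaves them out of its match index, so the exact 'etf' match is missed unless autojunk is turned off.

```python
from difflib import SequenceMatcher

test_string = ('completed transactions settlement date trade date '
               'symbol name transaction type account type quantity price '
               'commissions & fees amount '
               '12/23 12/23 dividend '
               'appreciation etf dividend - - - $441.99 12/23 12/23 '
               'vig dividend appreciation etf reinvestment cash')


def longest_block(a: str, b: str, autojunk: bool):
    """Return the longest matching block between a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=autojunk)
    # Explicit bounds keep this working on Python versions before 3.9
    return matcher.find_longest_match(0, len(a), 0, len(b))


with_heuristic = longest_block('etf', test_string, autojunk=True)
without_heuristic = longest_block('etf', test_string, autojunk=False)

# The heuristic only applies because the string has at least 200 characters
print(len(test_string))
# Size of the longest match found with and without the heuristic
print(with_heuristic.size, without_heuristic.size)
```

With autojunk disabled, the longest block covers all three characters of 'etf'; with the default setting it is shorter, which is consistent with the low score reported above.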

As a side note, my library rapidfuzz provides the same string matching algorithm without this problem, so your example string returns a score of 100, as you expected.

@aW3st
Author

aW3st commented Sep 1, 2020

Thanks Max, I'll give your library a shot!

@thomkav

thomkav commented Sep 1, 2020

@maxbachmann Hi Max, I'm working with @aW3st on a project. We've swapped fuzzywuzzy for your library, and we're seeing great performance. Thanks!
