This repository has been archived by the owner on Aug 26, 2024. It is now read-only.

The behavior of partial_ratio is unexpected #224

Open
refraction-ray opened this issue Dec 5, 2018 · 2 comments

Comments

@refraction-ray

fuzzywuzzy == 0.17.0 with Python 3.6
See the examples below.

summary = 'the rising field of spin caloritronics focuses on the interactions between spin and heat currents in a magnetic material; the observation of the spin seebeck effect opened the route to this branch of research. this paper reports the results of a round robin test performed by five partners on a single device highlighting the reproducibility problems related to the measurements of the spin seebeck coefficient, the quantity that describes the strength of the spin seebeck effect. this work stimulated the search for more reproducible measurement methods through the analysis of the systematic effects.'

from fuzzywuzzy import fuzz

# Group 1: keywords that appear verbatim in the summary text
(
    fuzz.partial_ratio("quantity", summary),
    fuzz.partial_ratio("measurements", summary),
    fuzz.partial_ratio("five partners on a single device", summary),
    fuzz.partial_ratio("opened", summary),
    fuzz.partial_ratio("and", summary),
    fuzz.partial_ratio("reproducibility problems", summary),
)
# output: (100, 33, 38, 17, 0, 33)

# Group 2: keywords that do not appear in the summary text
(
    fuzz.partial_ratio("noway", summary),
    fuzz.partial_ratio("no such way", summary),
    fuzz.partial_ratio("buy and sell", summary),
)
# output: (20, 27, 33)

I can't fully understand the behavior of partial_ratio. Based on the experiments above, the returned scores are inconsistent: keywords that appear verbatim in the summary can score lower than keywords that do not appear at all.

@refraction-ray
Author

refraction-ray commented Dec 10, 2018

Some updates.
Firstly, some thoughts on the implementation of partial_ratio in

for block in blocks:
    long_start = block[1] - block[0] if (block[1] - block[0]) > 0 else 0
    long_end = long_start + len(shorter)
    long_substr = longer[long_start:long_end]

It is possible that the only block in blocks has the form (len(shorter), len(longer), 0), i.e. not even one character was matched between the two strings. In my humble opinion, the score returned in this case should simply be zero, rather than running the further comparison in the code above.
(Of course, if the blocks returned by get_matching_blocks are correct, the further comparison also returns zero, so the two approaches are consistent.)
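
To make the zero-match case concrete, here is a minimal sketch using only difflib. The two strings are illustrative and chosen to share no characters; the slicing mirrors the loop quoted above rather than reproducing the library's code.

```python
from difflib import SequenceMatcher

# Illustrative strings with no characters in common.
shorter, longer = "abc", "uvwxyz"

blocks = SequenceMatcher(None, shorter, longer).get_matching_blocks()
print(blocks)  # only the terminating block of size 0 is returned

# Same arithmetic as the quoted loop: align the shorter string on the block.
block = blocks[0]
long_start = max(block.b - block.a, 0)
long_substr = longer[long_start:long_start + len(shorter)]

# Comparing against that window still finds nothing in common.
score = SequenceMatcher(None, shorter, long_substr).ratio()
print(long_substr, score)
```

When the matching blocks are correct, the extra comparison is therefore redundant but harmless; an early return of zero would only skip that work.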

Secondly, the unexpected behavior of partial_ratio comes from difflib.SequenceMatcher. See the example below.

from difflib import SequenceMatcher
# the same long string
summary = 'the rising field of spin caloritronics focuses on the interactions between spin and heat currents in a magnetic material; the observation of the spin seebeck effect opened the route to this branch of research. this paper reports the results of a round robin test performed by five partners on a single device highlighting the reproducibility problems related to the measurements of the spin seebeck coefficient, the quantity that describes the strength of the spin seebeck effect. this work stimulated the search for more reproducible measurement methods through the analysis of the systematic effects.'

m = SequenceMatcher(None, "and", summary)
m.get_matching_blocks()
# output
# [Match(a=3, b=602, size=0)]
# Although "and" occurs in the long summary string, get_matching_blocks fails to find it
# and returns a single Match with size=0, meaning that not even one character matched.

By tracking this down, I finally located the origin of the problem: the autojunk option of the difflib.SequenceMatcher class. For long strings (200 characters or more), autojunk should be set to False. To quote the docstring:

Optional arg autojunk should be set to False to disable the "automatic junk heuristic" that treats popular elements as junk.

Reading the difflib source shows that when autojunk=True, which is the default behavior, and the second string is at least 200 characters long, every character that accounts for more than 1% of that string is treated as junk and ignored when looking for matches!
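
The effect can be reproduced with a small self-contained sketch. The text below is an illustrative stand-in for the summary (a 37-character phrase repeated ten times, so every character is "popular"); only the documented autojunk argument of SequenceMatcher is used.

```python
from difflib import SequenceMatcher

# Illustrative long text: 370 characters, every character occurs at least 10 times.
text = "the quick brown fox and the lazy dog " * 10

# Default behavior (autojunk=True): popular characters are ignored.
with_junk = SequenceMatcher(None, "and", text).get_matching_blocks()

# Heuristic disabled: the keyword is located as expected.
without_junk = SequenceMatcher(None, "and", text, autojunk=False).get_matching_blocks()

print(with_junk)      # only the terminating size-0 block
print(without_junk)   # first block covers the three characters of "and"

first = without_junk[0]
print(text[first.b:first.b + first.size])
```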

In sum, to match keywords in long strings, partial_ratio should perhaps expose an autojunk option. After locating the origin of the issue, I found that it duplicates #214.
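
As a rough illustration of what such an option could look like, the sketch below reimplements the scoring loop on top of difflib with an autojunk parameter. The function name partial_ratio_sketch and its signature are hypothetical and not part of fuzzywuzzy; it follows the loop quoted earlier, and the 0.995 cut-off and rounding are assumptions about how the final score is derived.

```python
from difflib import SequenceMatcher


def partial_ratio_sketch(s1, s2, autojunk=False):
    """Hypothetical partial-ratio variant with a switchable autojunk heuristic."""
    shorter, longer = (s1, s2) if len(s1) <= len(s2) else (s2, s1)

    matcher = SequenceMatcher(None, shorter, longer, autojunk=autojunk)
    scores = []
    for block in matcher.get_matching_blocks():
        # Align the shorter string on the matching block, as in the quoted loop.
        long_start = max(block.b - block.a, 0)
        long_substr = longer[long_start:long_start + len(shorter)]

        ratio = SequenceMatcher(None, shorter, long_substr, autojunk=autojunk).ratio()
        if ratio > 0.995:  # assumed cut-off for treating a window as a perfect match
            return 100
        scores.append(ratio)

    # get_matching_blocks always ends with a size-0 block, so scores is never empty.
    return int(round(100 * max(scores)))


# Illustrative long text (370 characters) in which every character is "popular".
text = "the quick brown fox and the lazy dog " * 10

score_default = partial_ratio_sketch("and", text, autojunk=True)
score_fixed = partial_ratio_sketch("and", text, autojunk=False)
print(score_default, score_fixed)
```

Defaulting the hypothetical parameter to False is a design choice for keyword matching; callers who rely on the heuristic for speed on very long inputs could still pass autojunk=True.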

@rbrand21

Curious whether autojunk will be added as an option to partial_ratio? I have the same issue.
