The behavior of partial_ratio is unexpected #224

refraction-ray · 2018-12-05T06:49:52Z

fuzzywuzzy == 0.17.0 with Python 3.6
See instances below.

summary = 'the rising field of spin caloritronics focuses on the interactions between spin and heat currents in a magnetic material; the observation of the spin seebeck effect opened the route to this branch of research. this paper reports the results of a round robin test performed by five partners on a single device highlighting the reproducibility problems related to the measurements of the spin seebeck coefficient, the quantity that describes the strength of the spin seebeck effect. this work stimulated the search for more reproducible measurement methods through the analysis of the systematic effects.'

from fuzzywuzzy import fuzz
# below is the group of matching with keywords exactly in the summary text

(
fuzz.partial_ratio("quantity", summary), 
fuzz.partial_ratio("measurements", summary), 
fuzz.partial_ratio("five partners on a single device", summary),
fuzz.partial_ratio("opened", summary), 
fuzz.partial_ratio("and", summary),  
fuzz.partial_ratio("reproducibility problems", summary)
)
# output (100, 33, 38, 17, 0, 33)

# below is the group of matching with keywords not in the summary text
(
fuzz.partial_ratio("noway", summary), 
fuzz.partial_ratio("no such way", summary), 
fuzz.partial_ratio("buy and sell", summary)
)
# output (20,27,33)

I can't fully understand the behavior of partial_ratio, based on the above experiments, it seems the returned matching scores are not so consistent.

The text was updated successfully, but these errors were encountered:

refraction-ray · 2018-12-10T07:30:09Z

Some updates.
Firstly, some thoughts on the implementation of partial_ratio in

fuzzywuzzy/fuzzywuzzy/fuzz.py

Lines 56 to 59 in 778162c

    
           for block in blocks: 
        
               long_start = block[1] - block[0] if (block[1] - block[0]) > 0 else 0 
        
               long_end = long_start + len(shorter) 
        
               long_substr = longer[long_start:long_end]

There is the possibility that the only block in blocks is of the form (len(shorter), len(longer), 0), i.e. there is not even one char matched between the two strings. In my humble opinion, the score returned in this case should be directly zero, instead of further comparison as the above code.
(Of course, if the blocks returned by get_matching_blocks is correct, then the further comparison above also returns zero which should be consistent.)

Secondly, the unexpected behavior of partial_ratio is from difflib.SequenceMatcher. See instance below.

from difflib import SequenceMatcher
# the same long string
summary = 'the rising field of spin caloritronics focuses on the interactions between spin and heat currents in a magnetic material; the observation of the spin seebeck effect opened the route to this branch of research. this paper reports the results of a round robin test performed by five partners on a single device highlighting the reproducibility problems related to the measurements of the spin seebeck coefficient, the quantity that describes the strength of the spin seebeck effect. this work stimulated the search for more reproducible measurement methods through the analysis of the systematic effects.'

m = SequenceMatcher(None, "and", summary)
m.get_matching_blocks()
# output
# [Match(a=3, b=602, size=0)]
# though "and" is in the long string of summary, the get_matching_blocks function fails to capture it
# and return only one Match with size=0 indicating no match for even one char.

By track this, I finally located the origin of the problem: the autojunk option in difflib.SequenceMatcher class. For long string (len>200), autojunk should set to be False, otherwise, quoted from the docstring

Optional arg autojunk should be set to False to disable the "automatic junk heuristic" that treats popular elements as junk.

By reading the source code of difflib, all chars in the string with occurrence more than 1% would be automatically deleted from the string if autojunk=True which is the default behavior!

In sum, to match keywords in long strings, we should perhaps provide the option of autojunk in the partial_ratio function. After locating the origin of the issue, I find it replicated with #214

rbrand21 · 2019-03-10T01:51:39Z

Curious if autoJunk will be added as an option to partial_ratio? I have same issue.

maxbachmann mentioned this issue May 5, 2021

Disable automatic junk heuristic of difflib #303

Closed

maxbachmann mentioned this issue Apr 2, 2023

replace python-Levenshtein with rapidfuzz seatgeek/thefuzz#10

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

The behavior of partial_ratio is unexpected #224

The behavior of partial_ratio is unexpected #224

refraction-ray commented Dec 5, 2018

refraction-ray commented Dec 10, 2018 •

edited

Loading

rbrand21 commented Mar 10, 2019

The behavior of partial_ratio is unexpected #224

The behavior of partial_ratio is unexpected #224

Comments

refraction-ray commented Dec 5, 2018

refraction-ray commented Dec 10, 2018 • edited Loading

rbrand21 commented Mar 10, 2019

refraction-ray commented Dec 10, 2018 •

edited

Loading