The behavior of partial_ratio is unexpected #224
Comments
Some updates. Lines 56 to 59 in 778162c
It is possible that the only block in blocks is of the form (len(shorter), len(longer), 0), i.e. not even one char matched between the two strings. In my humble opinion, the score returned in this case should be zero directly, without the further comparison done in the code above. (Of course, if the blocks returned by get_matching_blocks are correct, the further comparison also returns zero, so the two approaches should be consistent.)
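The early return suggested above could look roughly like the following. This is a minimal sketch built only on difflib, not fuzzywuzzy's actual code: the function name partial_ratio_sketch is hypothetical, and the windowing logic is an assumption about how a partial match is scored.

```python
from difflib import SequenceMatcher


def partial_ratio_sketch(s1, s2):
    """Hypothetical partial-ratio scorer with an early exit when nothing matches."""
    shorter, longer = (s1, s2) if len(s1) <= len(s2) else (s2, s1)
    if not shorter:
        return 0

    blocks = SequenceMatcher(None, shorter, longer).get_matching_blocks()

    # difflib always appends a sentinel block (len(a), len(b), 0).
    # If every block has size 0, not a single char matched: return 0 directly.
    if all(block.size == 0 for block in blocks):
        return 0

    best = 0.0
    for block in blocks:
        # Align a window of len(shorter) chars in the longer string with this block.
        start = max(block.b - block.a, 0)
        window = longer[start:start + len(shorter)]
        score = SequenceMatcher(None, shorter, window).ratio()
        if score > 0.995:
            return 100
        best = max(best, score)
    return int(round(100 * best))


print(partial_ratio_sketch("abc", "xyz"))
print(partial_ratio_sketch("and", "spin and heat"))
```

The early exit does not change the score when the blocks are correct; it only makes the "nothing matched" case explicit.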
Secondly, the unexpected behavior of partial_ratio can be reproduced with difflib directly:
from difflib import SequenceMatcher
# the same long string
summary = 'the rising field of spin caloritronics focuses on the interactions between spin and heat currents in a magnetic material; the observation of the spin seebeck effect opened the route to this branch of research. this paper reports the results of a round robin test performed by five partners on a single device highlighting the reproducibility problems related to the measurements of the spin seebeck coefficient, the quantity that describes the strength of the spin seebeck effect. this work stimulated the search for more reproducible measurement methods through the analysis of the systematic effects.'
m = SequenceMatcher(None, "and", summary)
m.get_matching_blocks()
# output
# [Match(a=3, b=602, size=0)]
# though "and" is in the long string of summary, the get_matching_blocks function fails to capture it
# and returns only one Match with size=0, indicating that not even one char matched.
By tracking this down, I finally located the origin of the problem: the autojunk heuristic of SequenceMatcher.
According to the difflib source code, every char that accounts for more than 1% of the string is automatically treated as junk and excluded from matching if autojunk is enabled (the default) and the string is at least 200 chars long. In sum, to match keywords in long strings, we should perhaps provide the option of disabling autojunk.
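The effect can be checked with difflib alone by passing autojunk=False to SequenceMatcher. The snippet below is a small sketch: the repeated sentence is made-up test data (not the summary from the report), chosen so that the string is longer than 200 chars and each of the chars a, n and d is frequent enough to be dropped by the heuristic.

```python
from difflib import SequenceMatcher

# Made-up long string: 480 chars, in which "a", "n" and "d" each occur 20 times.
text = "the quick brown fox jumps over the lazy dog and " * 10
keyword = "and"

# Default behavior: autojunk=True, so popular chars in the second sequence are ignored.
default_blocks = SequenceMatcher(None, keyword, text).get_matching_blocks()

# Same comparison with the heuristic switched off.
no_autojunk_blocks = SequenceMatcher(
    None, keyword, text, autojunk=False
).get_matching_blocks()

print(default_blocks)
print(no_autojunk_blocks)
```

With the default settings only the size-0 sentinel block should come back, while with autojunk=False the first block should cover the whole keyword.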
Curious if autoJunk will be added as an option.
fuzzywuzzy == 0.17.0 with Python 3.6
See instances below.
I can't fully understand the behavior of partial_ratio. Based on the above experiments, the returned matching scores do not seem consistent.