process is slow #239

gamesguru · 2019-06-08T17:29:38Z

It takes 30 seconds to process 1.1 million product names.

The npm library fuzzysort is currently much faster.

maxbachmann · 2020-12-16T15:38:58Z

You should provide an example so that people can reproduce your test. Some of the interesting points:

Which ratio (scorer) did you use with process?
Did you use the faster version that relies on python-Levenshtein?
What is the fuzzysort score based on?

gamesguru · 2020-12-16T16:13:29Z

Hi, you can get sample data here, for example, and split it on whitespace into a list: https://www.damienelliott.com/wp-content/uploads/2020/07/Lorem-ipsum-dolor-sit-amet.txt

As for the other points:

I used token_set_ratio. It appears to be the slowest scorer, but only by a factor of 1.5-2x, which means ratio or simple_ratio still takes up to tens of seconds.
Yes, I have installed python-Levenshtein. I got rid of the related warning messages early on.
Not exactly sure; afaik she has implemented a custom in-place algorithm: https://github.com/farzher/fuzzysort/blob/master/fuzzysort.js
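To illustrate why token_set_ratio tends to cost more than a plain ratio, here is a simplified sketch of the token-set idea: it builds three strings from the shared and unshared tokens and runs three pairwise comparisons instead of one. This is not the FuzzyWuzzy source. The names `ratio_sketch` and `token_set_ratio_sketch` are made up for this example, and `difflib.SequenceMatcher` stands in for the Levenshtein-based ratio, so the scores may differ from what the library returns.

```python
from difflib import SequenceMatcher


def ratio_sketch(a: str, b: str) -> int:
    """Similarity in 0..100 (difflib used as a stand-in scorer; an assumption)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_set_ratio_sketch(a: str, b: str) -> int:
    """Simplified token-set comparison (hypothetical helper, not the library code)."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())

    # Sorted token groups: shared tokens and the tokens unique to each side.
    common = " ".join(sorted(tokens_a & tokens_b))
    only_a = " ".join(sorted(tokens_a - tokens_b))
    only_b = " ".join(sorted(tokens_b - tokens_a))

    combined_a = f"{common} {only_a}".strip()
    combined_b = f"{common} {only_b}".strip()

    # Three comparisons per pair of strings, versus one for a plain ratio.
    return max(
        ratio_sketch(common, combined_a),
        ratio_sketch(common, combined_b),
        ratio_sketch(combined_a, combined_b),
    )


print(token_set_ratio_sketch("new york mets", "mets new york"))
```

With three comparisons plus tokenizing and sorting, a slowdown of a small constant factor over a single ratio call is what one would expect.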

maxbachmann · 2020-12-16T17:07:56Z

I performed a quick test using the following test code:

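from timeit import timeit

# Setup code run once before each timed statement; "{}" is filled in with the library name.
setup = """
from {} import process, fuzz

with open("Lorem-ipsum-dolor-sit-amet.txt") as f:
    text = f.read()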

words = text.split()
query = words[0]
words = words[1:]
"""

print(timeit(
"process.extract(query, words, scorer=fuzz.token_set_ratio, processor=None)", setup=setup.format("rapidfuzz"), number=1
))

print(timeit(
"process.extract(query, words, scorer=fuzz.token_set_ratio, processor=None, score_cutoff=80)", setup=setup.format("rapidfuzz"), number=1
))

print(timeit(
"process.extract(query, words, scorer=fuzz.token_set_ratio, processor=None)", setup=setup.format("fuzzywuzzy"), number=1
))

This compares the runtime of FuzzyWuzzy with that of RapidFuzz, which provides an improved implementation of the same algorithms.
The runtimes I got were:

RapidFuzz: 1.22120649999124 sec
FuzzyWuzzy: 8.835493099992163 sec

RapidFuzz was roughly 7 times faster in this test, so using it might be enough for your requirements. fuzzysort appears to use a completely different algorithm that is not based on the Levenshtein distance, so adding a similar algorithm might be an option.
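For a sense of what a matcher that is not based on Levenshtein distance can look like, here is a minimal sketch of subsequence-style matching. It is an illustration only and is not fuzzysort's actual algorithm. The function names and the scoring rule (penalize skipped characters) are assumptions made up for this example, and the greedy leftmost match does not always find the best-scoring alignment.

```python
from typing import List, Optional


def subsequence_score(query: str, target: str) -> Optional[int]:
    """Return 0 or a negative score if `query` is a subsequence of `target`, else None.

    Hypothetical scoring: subtract one point per skipped character, so a
    prefix match scores 0 and scattered matches score lower.
    """
    query, target = query.lower(), target.lower()
    penalty = 0
    pos = 0
    for ch in query:
        idx = target.find(ch, pos)  # greedy: take the leftmost possible match
        if idx == -1:
            return None  # a query character is missing, so no match
        penalty += idx - pos  # characters skipped since the previous match
        pos = idx + 1
    return -penalty


def rank(query: str, choices: List[str], limit: int = 5) -> List[str]:
    """Return up to `limit` matching choices, best score first."""
    scored = []
    for choice in choices:
        score = subsequence_score(query, choice)
        if score is not None:
            scored.append((score, choice))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [choice for _, choice in scored[:limit]]


print(rank("fzs", ["fuzzysort", "fizz", "fzs"]))
```

Each comparison here scans the target once, whereas the textbook dynamic-programming Levenshtein computation does work proportional to the product of the two string lengths. That difference is one plausible reason a matcher of this kind can be faster, at the cost of not tolerating substitutions or transpositions.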

maxbachmann mentioned this issue Apr 2, 2023

replace python-Levenshtein with rapidfuzz seatgeek/thefuzz#10

Merged
