This repository has been archived by the owner on Aug 26, 2024. It is now read-only.
process is slow #239
You should provide examples for people to reproduce your test. Some of the interesting points:
Hi, you can get sample data here, for example, and split it on whitespace into a list: https://www.damienelliott.com/wp-content/uploads/2020/07/Lorem-ipsum-dolor-sit-amet.txt As for the other points,
I performed a quick test using the following test code:

from timeit import timeit

# The same setup is reused for both libraries; "{}" is filled in with the module name.
setup = """
from {} import process, fuzz
with open("Lorem-ipsum-dolor-sit-amet.txt") as fw:
    text = fw.read()
words = text.split()
query = words[0]
words = words[1:]
"""

# RapidFuzz, no score cutoff
print(timeit(
    "process.extract(query, words, scorer=fuzz.token_set_ratio, processor=None)",
    setup=setup.format("rapidfuzz"), number=1
))
# RapidFuzz, with score_cutoff=80
print(timeit(
    "process.extract(query, words, scorer=fuzz.token_set_ratio, processor=None, score_cutoff=80)",
    setup=setup.format("rapidfuzz"), number=1
))
# FuzzyWuzzy, no score cutoff
print(timeit(
    "process.extract(query, words, scorer=fuzz.token_set_ratio, processor=None)",
    setup=setup.format("fuzzywuzzy"), number=1
))

This compares the runtime of FuzzyWuzzy with an improved version of the same algorithms from RapidFuzz.
So RapidFuzz might be enough for your requirements. FuzzySort appears to use a completely different algorithm that is not based on the Levenshtein distance, so adding a similar algorithm might be an option.
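For readers unfamiliar with the Levenshtein distance mentioned above, here is a minimal pure-Python sketch of it, with an optional cutoff that gives up early once the distance can no longer stay within a bound. This is only an illustration of the general idea. It is not how FuzzyWuzzy or RapidFuzz implement their scorers, and the name `bounded_levenshtein` and the `max_dist` parameter are hypothetical.

```python
from typing import Optional


def bounded_levenshtein(a: str, b: str, max_dist: Optional[int] = None) -> Optional[int]:
    """Return the Levenshtein distance between a and b.

    If max_dist is given and the distance exceeds it, return None instead.
    (Hypothetical helper for illustration; not part of any library.)
    """
    # The length difference is a lower bound on the distance,
    # so some pairs can be rejected without any further work.
    if max_dist is not None and abs(len(a) - len(b)) > max_dist:
        return None

    # Keep the shorter string in the inner loop to use less memory per row.
    if len(a) < len(b):
        a, b = b, a

    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        # The final distance is never smaller than the smallest value in a row,
        # so once a whole row exceeds the bound we can stop.
        if max_dist is not None and min(current) > max_dist:
            return None
        previous = current

    distance = previous[-1]
    if max_dist is not None and distance > max_dist:
        return None
    return distance


print(bounded_levenshtein("kitten", "sitting"))
print(bounded_levenshtein("kitten", "sitting", max_dist=2))
```

The cost of the full computation grows with the product of the two string lengths, which is one reason a per-pair cutoff or a different matching strategy can matter when scanning a very large list of choices.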
takes 30 seconds to process 1.1 million product names
the npm library fuzzysort is much faster currently