Skip to content
This repository has been archived by the owner on Aug 26, 2024. It is now read-only.

process is slow #239

Open
gamesguru opened this issue Jun 8, 2019 · 3 comments
Open

process is slow #239

gamesguru opened this issue Jun 8, 2019 · 3 comments

Comments

@gamesguru
Copy link

takes 30 seconds to process 1.1 million product names

the npm library fuzzysort is much faster currently

@maxbachmann
Copy link

You should provide examples for people to reproduce your test. Some of the interesting points:

  • what's the ratio you used with process
  • did you use the faster version using python-Levenshtein
  • whats the score of fuzzysort based on

@gamesguru
Copy link
Author

Hi, you could obtain data for example here, split by space into list: https://www.damienelliott.com/wp-content/uploads/2020/07/Lorem-ipsum-dolor-sit-amet.txt

As for the other points,

  1. I used token_set_ratio, and though it appears to be the slowest.. only by a factor of 1.5-2x, which means ratio or simple_ratio is still taking up to tens of seconds.
  2. Yes, I have installed python-Levenshtein. I dispelled any such warning messages early on.
  3. Not exactly sure, afaik she has implemented a custom in-place algorithm: https://github.com/farzher/fuzzysort/blob/master/fuzzysort.js

@maxbachmann
Copy link

maxbachmann commented Dec 16, 2020

I performed a quick test using the following test code:

setup="""
from {} import process, fuzz

with open("Lorem-ipsum-dolor-sit-amet.txt") as fw:
  text = fw.read()

words = text.split()
query = words[0]
words = words[1:]
"""

print(timeit(
"process.extract(query, words, scorer=fuzz.token_set_ratio, processor=None)", setup=setup.format("rapidfuzz"), number=1
))

print(timeit(
"process.extract(query, words, scorer=fuzz.token_set_ratio, processor=None, score_cutoff=80)", setup=setup.format("rapidfuzz"), number=1
))

print(timeit(
"process.extract(query, words, scorer=fuzz.token_set_ratio, processor=None)", setup=setup.format("fuzzywuzzy"), number=1
))

This compares the runtime for FuzzyWuzzy and an improved version of these algorithms from RapidFuzz
The runtime I got was:

RapidFuzz: 1.22120649999124 sec
FuzzyWuzzy: 8.835493099992163 sec

So it might be enough for your requirements to use RapidFuzz. FuzzySort appears to use a completely different algorithm, that is not based on the Levenshtein distance. So it might be an option to add a similar algorithm.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants