I searched the issues and found no similar issues.
Component
Other, Transforms/Other
What happened + What you expected to happen
import os

from dpk_rep_removal.runtime import RepRemoval

# file1: path to the input parquet file (not defined in this snippet)
RepRemoval(
    input_folder=os.path.dirname(file1),
    output_folder="files-rep_removal",
    rep_removal_contents_column_name="text",
    rep_removal_num_threads=1,
).transform()
12:11:53 INFO - pipeline id pipeline_id
12:11:53 INFO - code location None
12:11:53 INFO - data factory data_ is using local data access: input_folder - /Users/shahrokhdaijavad/.cache/huggingface/hub/datasets--HuggingFaceFW--fineweb/snapshots/0f039043b23fe1d4eed300b504aa4b4a68f1c7ba/data/CC-MAIN-2013-20 output_folder - files-rep_removal
12:11:53 INFO - data factory data_ max_files -1, n_sample -1
12:11:53 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
12:11:53 INFO - orchestrator rep_removal started at 2025-02-10 12:11:53
12:11:53 INFO - Number of files is 1, source profile {'max_file_size': 2048.0454998016357, 'min_file_size': 2048.0454998016357, 'total_file_size': 2048.0454998016357}
12:12:16 INFO - encoding parquet
12:51:53 INFO - making suffix array
12:51:53 INFO - Starting the deduplication process for file: /var/folders/7f/dcj_kvt1153fqpphsj6jj8w40000gn/T/tmp2_wwama8/save_dir/parquet
cpu speed: 3228 MHz, Cores: 10
12:51:53 INFO - timeout is: 45743.31654275093
12:51:53 INFO - Scheduling 96 jobs to create dataset parts.
gpu_usage: 0.00%, GPU speed: 0 MHz
Thank you, @shivdeep-singh-ibm! We talked about this exact solution yesterday. Thanks for spelling it out.
@Hajar-Emami This is exactly what we were talking about yesterday!
If you add resize to the list of data-prep-toolkit-transforms that you pip install, then you can use it exactly as Shivdeep has above, to make the file smaller (e.g., 1000 rows) before rep_removal.
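To illustrate what "making the file smaller (e.g., 1000 rows)" amounts to, here is a minimal sketch of the row-chunking idea using only the Python standard library. It is not the Resize transform's implementation or API: `split_rows` and `documents` are hypothetical names, and the real transform works on parquet tables rather than Python lists.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def split_rows(rows: Iterable[T], max_rows: int) -> Iterator[List[T]]:
    """Yield consecutive chunks holding at most max_rows rows each."""
    if max_rows <= 0:
        raise ValueError("max_rows must be positive")
    it = iter(rows)
    while True:
        # Take the next max_rows items; an empty chunk means the input is exhausted.
        chunk = list(islice(it, max_rows))
        if not chunk:
            return
        yield chunk


# Hypothetical stand-in for the 'text' column of one large input file.
documents = [f"document {i}" for i in range(2500)]

chunks = list(split_rows(documents, max_rows=1000))
print([len(c) for c in chunks])  # [1000, 1000, 500]
```

Each chunk would then be written out as its own smaller file, so that rep_removal processes inputs of bounded size instead of one 2 GB file.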
Many thanks, @shivdeep-singh-ibm. Yes, as we discussed with @shahrokhDaijavad and @touma-I, we should include the Resize step before running any of the GneissWeb recipe's components.
Reproduction script
Run the following on a Mac M1 with 16GB memory
Anything else
No response
OS
MacOS (limited support)
Python
3.10.x
Are you willing to submit a PR?