
Rep_removal for large data files crashes on 16GB memory #1035

Open
shahrokhDaijavad opened this issue Feb 10, 2025 · 3 comments
Labels
bug Something isn't working

Comments

@shahrokhDaijavad
Member

Search before asking

  • I searched the issues and found no similar issues.

Component

Other, Transforms/Other

What happened + What you expected to happen

from dpk_rep_removal.runtime import RepRemoval
RepRemoval(input_folder=os.path.dirname(file1),
           output_folder="files-rep_removal",
           rep_removal_contents_column_name='text',
           rep_removal_num_threads=1,
           ).transform()

12:11:53 INFO - pipeline id pipeline_id
12:11:53 INFO - code location None
12:11:53 INFO - data factory data_ is using local data access: input_folder - /Users/shahrokhdaijavad/.cache/huggingface/hub/datasets--HuggingFaceFW--fineweb/snapshots/0f039043b23fe1d4eed300b504aa4b4a68f1c7ba/data/CC-MAIN-2013-20 output_folder - files-rep_removal
12:11:53 INFO - data factory data_ max_files -1, n_sample -1
12:11:53 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
12:11:53 INFO - orchestrator rep_removal started at 2025-02-10 12:11:53
12:11:53 INFO - Number of files is 1, source profile {'max_file_size': 2048.0454998016357, 'min_file_size': 2048.0454998016357, 'total_file_size': 2048.0454998016357}
12:12:16 INFO - encoding parquet
12:51:53 INFO - making suffix array
12:51:53 INFO - Starting the deduplication process for file: /var/folders/7f/dcj_kvt1153fqpphsj6jj8w40000gn/T/tmp2_wwama8/save_dir/parquet

cpu speed: 3228 MHz, Cores: 10

12:51:53 INFO - timeout is: 45743.31654275093
12:51:53 INFO - Scheduling 96 jobs to create dataset parts.

gpu_usage: 0.00%, GPU speed: 0 MHz

Reproduction script

Run the following on a Mac M1 with 16GB memory

REPO_ID = "HuggingFaceFW/fineweb"
FILENAME = "data/CC-MAIN-2013-20/000_00000.parquet"
file1=hf_hub_download(repo_id=REPO_ID, filename=FILENAME, repo_type="dataset")
from dpk_rep_removal.runtime import RepRemoval
RepRemoval(input_folder= os.path.dirname(file1),
            output_folder= "files-rep_removal",
            rep_removal_contents_column_name='text', 
            rep_removal_num_threads=1,
            ).transform()

Anything else

No response

OS

MacOS (limited support)

Python

3.10.x

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
shahrokhDaijavad added the bug label on Feb 10, 2025
@shivdeep-singh-ibm
Collaborator

Since we are getting out-of-memory errors while using this transform, we should run two transforms in sequence:

a) Resize
b) RepRemoval

Something like the following; we can choose resize_max_rows_per_table such that it does not give an OOM error.

import os
from huggingface_hub import hf_hub_download

REPO_ID = "HuggingFaceFW/fineweb"
FILENAME = "data/CC-MAIN-2013-20/000_00000.parquet"
file1 = hf_hub_download(repo_id=REPO_ID, filename=FILENAME, repo_type="dataset")

from dpk_resize.runtime import Resize
Resize(input_folder=os.path.dirname(file1),
       output_folder="output",
       resize_max_rows_per_table=1000).transform()

from dpk_rep_removal.runtime import RepRemoval
RepRemoval(input_folder="output",
           output_folder="files-rep_removal",
           rep_removal_contents_column_name='text',
           rep_removal_num_threads=1,
           ).transform()

@shahrokhDaijavad
Member Author

shahrokhDaijavad commented Feb 12, 2025

Thank you, @shivdeep-singh-ibm! We talked about this exact solution yesterday. Thanks for spelling it out.

@Hajar-Emami This is exactly what we were talking about yesterday!
If you add resize to the list of data-prep-toolkit-transforms that you pip install, you can use it exactly as Shivdeep shows above to split the file into smaller tables (e.g., 1000 rows each) before rep_removal.
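For reference, the install would look something like the line below. This is only a sketch: the exact extras names for the resize and rep_removal transforms in the data-prep-toolkit-transforms package are assumptions here and should be checked against the package README.

pip install "data-prep-toolkit-transforms[resize,rep_removal]"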

@Hajar-Emami
Contributor

Hajar-Emami commented Feb 12, 2025

Many thanks, @shivdeep-singh-ibm. Yes, as we discussed with @shahrokhDaijavad and @touma-I, we should include the Resize step before running any of the GneissWeb recipe's components.
