
Rep_removal for large data files crashes on 16GB memory #1035

Open
shahrokhDaijavad opened this issue Feb 10, 2025 · 3 comments
Labels
bug Something isn't working

Comments

@shahrokhDaijavad
Member

Search before asking

  • I searched the issues and found no similar issues.

Component

Other, Transforms/Other

What happened + What you expected to happen

from dpk_rep_removal.runtime import RepRemoval
RepRemoval(input_folder=os.path.dirname(file1),
           output_folder="files-rep_removal",
           rep_removal_contents_column_name='text',
           rep_removal_num_threads=1,
           ).transform()

12:11:53 INFO - pipeline id pipeline_id
12:11:53 INFO - code location None
12:11:53 INFO - data factory data_ is using local data access: input_folder - /Users/shahrokhdaijavad/.cache/huggingface/hub/datasets--HuggingFaceFW--fineweb/snapshots/0f039043b23fe1d4eed300b504aa4b4a68f1c7ba/data/CC-MAIN-2013-20 output_folder - files-rep_removal
12:11:53 INFO - data factory data_ max_files -1, n_sample -1
12:11:53 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
12:11:53 INFO - orchestrator rep_removal started at 2025-02-10 12:11:53
12:11:53 INFO - Number of files is 1, source profile {'max_file_size': 2048.0454998016357, 'min_file_size': 2048.0454998016357, 'total_file_size': 2048.0454998016357}
12:12:16 INFO - encoding parquet
12:51:53 INFO - making suffix array
12:51:53 INFO - Starting the deduplication process for file: /var/folders/7f/dcj_kvt1153fqpphsj6jj8w40000gn/T/tmp2_wwama8/save_dir/parquet

cpu speed: 3228 MHz, Cores: 10

12:51:53 INFO - timeout is: 45743.31654275093
12:51:53 INFO - Scheduling 96 jobs to create dataset parts.

gpu_usage: 0.00%, GPU speed: 0 MHz

Reproduction script

Run the following on a Mac M1 with 16GB memory

REPO_ID = "HuggingFaceFW/fineweb"
FILENAME = "data/CC-MAIN-2013-20/000_00000.parquet"
file1=hf_hub_download(repo_id=REPO_ID, filename=FILENAME, repo_type="dataset")
from dpk_rep_removal.runtime import RepRemoval
RepRemoval(input_folder= os.path.dirname(file1),
            output_folder= "files-rep_removal",
            rep_removal_contents_column_name='text', 
            rep_removal_num_threads=1,
            ).transform()

Anything else

No response

OS

MacOS (limited support)

Python

3.10.x

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
shahrokhDaijavad added the bug label on Feb 10, 2025
@shivdeep-singh-ibm
Collaborator

Since we are getting out-of-memory errors while using this transform, we should run two transforms in sequence:

a) Resize
b) RepRemoval

Something like the following; we can choose resize_max_rows_per_table such that it does not give an OOM error.

import os
from huggingface_hub import hf_hub_download

REPO_ID = "HuggingFaceFW/fineweb"
FILENAME = "data/CC-MAIN-2013-20/000_00000.parquet"
file1 = hf_hub_download(repo_id=REPO_ID, filename=FILENAME, repo_type="dataset")

from dpk_resize.runtime import Resize
Resize(input_folder=os.path.dirname(file1),
       output_folder="output",
       resize_max_rows_per_table=1000).transform()

from dpk_rep_removal.runtime import RepRemoval
RepRemoval(input_folder="output",
           output_folder="files-rep_removal",
           rep_removal_contents_column_name='text',
           rep_removal_num_threads=1,
           ).transform()

@shahrokhDaijavad
Member Author

shahrokhDaijavad commented Feb 12, 2025

Thank you, @shivdeep-singh-ibm! We talked about this exact solution yesterday. Thanks for spelling it out.

@Hajar-Emami This is exactly what we were talking about yesterday!
If you add resize to the list of data-prep-toolkit-transforms that you pip install, you can use it exactly as Shivdeep shows above to split the file into smaller tables (e.g., 1000 rows each) before rep_removal.
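For reference, the install would look something like the line below. This is only a sketch: the exact extras names for the resize and rep_removal transforms in the data-prep-toolkit-transforms package are assumptions here and should be checked against the package README.

pip install "data-prep-toolkit-transforms[resize,rep_removal]"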

@Hajar-Emami
Contributor

Hajar-Emami commented Feb 12, 2025

Many thanks, @shivdeep-singh-ibm. Yes, as we discussed with @shahrokhDaijavad and @touma-I, we should include the Resize step before running any of the GneissWeb recipe's components.
