
Implementing Bloom Annotator #978

Open · wants to merge 34 commits into base: dev
Conversation

@ian-cho (Collaborator) commented Jan 27, 2025

Why are these changes needed?

Add an additional column indicating whether each document is present in a pre-existing Bloom filter model. This is specifically intended for the GneissWeb release.
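In outline, the annotation step looks like the sketch below. This is a minimal illustration assuming a pybloom_live-style filter API and pyarrow tables; the function name and the output column name is_in_GneissWeb are hypothetical placeholders, not the PR's actual code.

```python
# Minimal sketch of the annotation idea, not the PR's actual implementation.
# Assumes a pybloom_live-style filter; "annotate" and "is_in_GneissWeb" are
# hypothetical names.
import pyarrow as pa
from pybloom_live import BloomFilter

def annotate(table: pa.Table, bloom: BloomFilter, doc_column: str = "contents") -> pa.Table:
    # Flag each document: 1 if the Bloom filter (probably) contains it, else 0.
    flags = [int(doc in bloom) for doc in table.column(doc_column).to_pylist()]
    return table.append_column("is_in_GneissWeb", pa.array(flags, pa.int8()))
```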

@shahrokhDaijavad (Member)

@ian-cho I reviewed the README file and ran python local_python.py successfully on my laptop. Please add details to issue #981.
Also, as is done for all transforms, please add a simple notebook for the Python runtime (and a link to it from the README file).

@ian-cho (Collaborator, Author) commented Jan 28, 2025

@shahrokhDaijavad @touma-I I added a notebook for the Python runtime and linked to it from the README file: https://github.com/ian-cho/data-prep-kit/blob/dev/transforms/universal/bloom/bloom_python.ipynb

@shahrokhDaijavad (Member)

@ian-cho I haven't had time to test the notebook, but I just realized that you are using the bf.bloom model. Is it open source and available to download, so that we don't need to include it in the repo?

@ian-cho (Collaborator, Author) commented Jan 29, 2025

@shahrokhDaijavad Thanks for pointing this out. I am not currently aware of any small, downloadable models suitable for this demonstration, so I trained a small model locally. In the future, we can replace this local model path with a Hugging Face model once one is ready.
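For context, a small demo filter like bf.bloom could be produced along these lines. This is a sketch under the assumption of a pybloom_live-style library; the actual tool, corpus, and parameters used to train the local model may differ.

```python
# Hedged sketch of building a tiny demo Bloom filter; the real bf.bloom may
# have been produced with a different library or parameters.
from pybloom_live import BloomFilter

bf = BloomFilter(capacity=10_000, error_rate=0.001)  # demo-sized filter
for doc in ["example document one", "example document two"]:  # placeholder corpus
    bf.add(doc)

with open("bf.bloom", "wb") as f:
    bf.tofile(f)  # serialize the filter so the transform can load it from disk
```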

@shahrokhDaijavad (Member)


Thank you, @ian-cho. The model's size (240KB) is quite small, but to include it statically in an IBM-owned open repo, we need permission from IBM. I will bring this up with Bhatta, who may be able to get permission for us.

@shahrokhDaijavad (Member)

@BishwaBhatta As you may know, @ian-cho is using a small model that he trained himself for this transform. This is fine for testing, but we cannot merge the code with that model included in the outer repo. So, if this small model will ultimately be replaced by an HF model, should we take it out before merging the code, after successful testing?

@ian-cho (Collaborator, Author) commented Jan 31, 2025

@shahrokhDaijavad Hi, thanks for asking.

  • I tested the gneissweb.bloom (28GB) model that Yohei trained, and the current bloom transform code works with it. So, from my side, it is fine to remove the small model and move forward, but I am open to @BishwaBhatta's opinion.
  • @BishwaBhatta asked whether the code can process FineWeb parquet files, which have more than 30 columns. Among other differences, the document column there is called text, not contents. I think we went through this naming issue for HAP, so shall we stick with the name contents instead of text, as in [test1.parquet](https://github.com/ian-cho/data-prep-kit/blob/dev/transforms/universal/bloom/test-data/input/test1.parquet)? Otherwise, I would like to upload a new test1.parquet (a renaming sketch follows this comment).
    Thank you!
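As referenced in the second bullet above, a FineWeb shard with a text column can be adapted without touching the transform. A sketch using pyarrow; the file paths are placeholders:

```python
# Sketch: rename FineWeb's "text" column to "contents" before running the
# transform. File paths here are placeholders, not part of the PR.
import pyarrow.parquet as pq

table = pq.read_table("fineweb_shard.parquet")
table = table.rename_columns(
    [("contents" if name == "text" else name) for name in table.column_names]
)
pq.write_table(table, "fineweb_shard_contents.parquet")
```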

@shahrokhDaijavad (Member)


Thank you, @ian-cho, for testing with the large model. For consistency with everything else in DPK, I think we should stick with contents instead of text for the document column.

@ian-cho (Collaborator, Author) commented Jan 31, 2025

@shahrokhDaijavad OK, then may I comment out this line? The transform does not require the column to be named text or contents. That way, we can stick with the contents naming, and users do not have to rename the column in FineWeb parquet files.
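An alternative to commenting the check out entirely is to validate against a configurable column name. A hedged sketch; the function and the parameter name doc_column are hypothetical, not the PR's actual config:

```python
# Hedged sketch: a configurable schema guard instead of a hard-coded
# "contents" check; "validate_schema" and "doc_column" are hypothetical names.
import pyarrow as pa

def validate_schema(table: pa.Table, doc_column: str = "contents") -> None:
    if doc_column not in table.column_names:
        raise ValueError(
            f"expected document column {doc_column!r}, found {table.column_names}"
        )
```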

@shahrokhDaijavad (Member)


@touma-I Do you see any problem with what @ian-cho is asking?

@shahrokhDaijavad (Member)

@ian-cho Sorry that I did not test the notebook until today. I did a make venv and, in my venv, ran python local_python.py successfully. I had first deleted the local output folder (it should not have been put in the repo, since it gets created by the run, so please delete it). After the run, the output folder was created and contained the metadata.json and test1.parquet files. However, when I run the notebook, I get the error below:
[screenshot: notebook error]
What am I missing? Can you please look into this? Thanks.

@ian-cho (Collaborator, Author) commented Feb 3, 2025

Hi @shahrokhDaijavad, I have no idea why it failed. Your and @touma-I's help would be appreciated!

touma-I and others added 2 commits February 4, 2025 13:42
Signed-off-by: Maroun Touma <[email protected]>
Signed-off-by: SHAHROKH DAIJAVAD <[email protected]>
@shahrokhDaijavad (Member)

@ian-cho After @touma-I fixed the workflow test failure, I saw that test-src was failing, and I pushed the same fix we used for the new API; now everything is passing in CI/CD.
I think the only remaining issue is that we still have the small bf.bloom file included. If it is a meaningless model that is just for testing, it may be OK to keep it (like the small, meaningless parquet files we use for testing).

@ian-cho (Collaborator, Author) commented Feb 5, 2025

@touma-I @shahrokhDaijavad Thanks for your fixes! If it is fine to keep bf.bloom, retaining it would make the repo self-contained. Also, I added two comments here and here, for the user's convenience, on filtering FineWeb down to GneissWeb, if that is okay.

@shahrokhDaijavad (Member)

Thank you, @ian-cho. Some GneissWeb fastText models were added to Hugging Face yesterday. Who is adding the gneissweb.bloom (28GB) model that Yohei trained to HF?
I am approving this PR and passing it to @touma-I before it is merged.

@shahrokhDaijavad (Member) left a review:

LGTM.

Signed-off-by: SHAHROKH DAIJAVAD <[email protected]>
@shahrokhDaijavad (Member)

@ian-cho As I was going over the README file with @touma-I today, we felt it was not clear enough to our audience in the open-source community why we are doing this. So, I made some changes to the README file. Can you please review my changes and see if you want to modify or add anything? Thanks.

@ian-cho (Collaborator, Author) commented Feb 7, 2025

Hi @shahrokhDaijavad, @touma-I, thanks. The added summary looks good to me.

@ian-cho (Collaborator, Author) commented Feb 14, 2025

@touma-I @shahrokhDaijavad @BishwaBhatta (@issei-ibm) I updated the model path in the scripts to ibm-granite/GneissWeb.bloom. However, during on-the-fly testing, I ran into an OOM error due to the model's 28GB size. Any ideas? Thanks a lot!
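For context, on-the-fly loading from the Hugging Face Hub typically follows the pattern below. The repo id is taken from the comment above, but the filename inside the repo is an assumption for illustration.

```python
# Sketch of on-the-fly model download; the filename is an assumption, not
# confirmed from the repo.
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="ibm-granite/GneissWeb.bloom",
    filename="gneissweb.bloom",
)
# Loading the resulting 28GB filter fully into memory is what can trigger
# the OOM error on smaller machines.
```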

@shahrokhDaijavad (Member)

Thank you for trying this out, @ian-cho. I am not surprised, because I would definitely get an OOM error on my laptop, which has 16GB of memory. How much memory does your laptop have? When you said a couple of weeks ago (see above) that you tried the 28GB model Yohei trained, how did you do it?
In any case, if we have some idea of the minimum memory needed for the run to succeed when the 28GB file is downloaded on the fly, we should put that in the README file. Most community users, though, won't have access to machines with more than 16GB of memory. Maybe you should put back the small bf.bloom so that local testing succeeds, and in the README we can point and link to the real model file on HF/IBM-granite.

@ian-cho (Collaborator, Author) commented Feb 15, 2025

@shahrokhDaijavad No, on my laptop, loading the 28GB model in the on-the-fly test succeeded. When I push it to DPK, one test fails with an OOM error, as shown below:
[screenshot: CI/CD test log, 2025-02-15]
See from line 497.

@shahrokhDaijavad (Member)

Great. Thank you, @ian-cho. So, the OOM happens on the Ubuntu server that runs the CI/CD tests. Please keep the code as is, with the real model. If your laptop has more than 16GB of memory, please let me know, and I will add a sentence to the README file stating that "using the large model file will require a machine with more than 16GB of memory".

@ian-cho (Collaborator, Author) commented Feb 15, 2025

@shahrokhDaijavad Hi, thanks. My laptop has more than 16GB of memory.

Signed-off-by: SHAHROKH DAIJAVAD <[email protected]>
@shahrokhDaijavad (Member)

@ian-cho Can you please review the changes I made to the README and to the notebook? Thanks.

@ian-cho (Collaborator, Author) commented Feb 17, 2025

Hi @shahrokhDaijavad, thank you for making these changes. It looks good to me. Just two quick questions:

  • Is a laptop with 16GB of memory sufficient to load the 28GB Bloom filter model? (Mine actually has 64GB.)
  • As for bloom_python.ipynb, does only the OOM error remain, with everything else fine after the modification? Thanks a lot.

@shahrokhDaijavad (Member)

@ian-cho I made a small change to the comment cell of the notebook that mentions pip install. With the new APIs, we don't need to pip install data-prep-toolkit anymore.
As for your first question: on my laptop, which has only 16GB of memory, I got a Jupyter kernel crash when I ran the notebook today, so 16GB is not enough.
For your second question: yes, I am fine with approving this PR (even with the CI/CD error), but I am not able to merge, and Maroun has asked me not to ask the other admins who can merge until he comes back next week. Is there an urgent need to merge this before then?
I am able to create a pip-installable release on PyPI that includes the bloom transform, even while it is still in PR and not merged. Is there an urgent need for that?

@ian-cho (Collaborator, Author) commented Feb 18, 2025

@shahrokhDaijavad Thank you for the modification. In Hugging Face's release of the GneissWeb Bloom filter, the link to DPK currently points to the dev branch of my personal Git repository. It should point to the DPK outer branch after the merge.

Understood, @ian-cho. We will change the link after the merge.
