
Enabling gneissweb_classification transform by using multiple fasttext classifiers simultaneously #1046

Open · wants to merge 28 commits into base: dev
Conversation

@ran-iwamoto (Contributor)

Why are these changes needed?

To enable the use of multiple classifiers.

Related issue number (if any).

Issue #1034

@ran-iwamoto (Contributor Author)

@touma-I @shahrokhDaijavad

@ran-iwamoto (Contributor Author)

There is a test error above, but when I ran make test on my laptop, all four test cases succeeded.

@shahrokhDaijavad (Member)

Thank you, @ran-iwamoto, for doing this implementation. We will put the HF token issue, which causes the CI/CD test-src to fail, aside for now.

I tried this on my laptop and make test succeeded. However, make run-cli-sample failed with the error shown in the attached screenshot. It looks like you have not changed transform_python.py. Can you please look into this? Thanks.
(Screenshot: 2025-02-13 at 10:32:54)

@ran-iwamoto (Contributor Author)

@shahrokhDaijavad Thank you for reporting the issue. I updated the Makefile, and both make run-cli-sample and make run-cli-ray-sample now succeed.

@shahrokhDaijavad (Member)

Great! Thanks, @ran-iwamoto. Now all CI/CD tests are passing, too!
I haven't tried it myself, but don't you need to change the same parameters in the two notebooks as well?

@ran-iwamoto (Contributor Author)

@shahrokhDaijavad I updated the two notebooks; thank you for pointing that out!

@shahrokhDaijavad (Member)

Thank you, @ran-iwamoto. I have tested make run-cli-sample and the notebooks with the default of one classifier. I still have to test specifying a list of multiple classifiers as parameters, e.g., by creating a notebook that does this. Have you done such a test yourself?

@shahrokhDaijavad (Member)

Hi @ran-iwamoto and @issei-ibm, I got instructions today that in all Python code, notebooks, and the README in this transform, we should use the 5 IBM GneissWeb models that Bhatta has uploaded to HF here: https://huggingface.co/organizations/ibm-granite/activity/all (excluding Bloom), instead of model.bin from facebook/fasttext-language-identification. The README file should mention all of them, but the code and notebooks should use only one of them as an example. Will you be able to do this on Monday? Thanks.

@ran-iwamoto (Contributor Author)

Sure.
Do we also need to change the test data?
I don't have good example sentences for test data, which is why I am using the language-identification model as the default.
And if we extract sentences from somewhere, we may need data clearance.
How do you handle this in other transforms?

@ran-iwamoto (Contributor Author)

I did a test using five classifiers in local.py, and it worked.
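As a rough sketch of what running several classifiers over the same documents looks like, the snippet below applies each classifier to every document and collects one label column per classifier. The function and column names here are hypothetical stand-ins; the real transform loads fastText models (e.g. via fasttext.load_model) rather than these toy functions:

```python
# Hypothetical sketch: apply several classifiers to each document and
# collect one label column per classifier, as the transform does for
# multiple fastText models. The classifiers below are stand-ins for
# real fastText model predictions.

def classify_all(docs, classifiers):
    """Return {column_name: [label per doc]} for each named classifier."""
    return {name: [clf(d) for d in docs] for name, clf in classifiers.items()}

# Stand-in classifiers (real ones would call model.predict(text)).
classifiers = {
    "quality_label": lambda text: "hq" if len(text) > 20 else "lq",
    "lang_label": lambda text: "en",
}
docs = ["short", "a considerably longer example document"]
result = classify_all(docs, classifiers)
# result["quality_label"] → ["lq", "hq"]
```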

@shahrokhDaijavad (Member) commented Feb 17, 2025

Thanks, @ran-iwamoto. Good question. I think the best example to look at is: https://github.com/Hajar-Emami/data-prep-kit/blob/GneissWeb_notebook/examples/notebooks/GneissWeb/GneissWeb.ipynb. This is the full GneissWeb recipe that Hajar is working on (not complete yet). The idea is to start from a CommonCrawl dataset publicly available on Hugging Face and run it through the multiple transforms used in creating the GneissWeb dataset. At the moment, downloading from HF is commented out, and she is using as input a small subset of a CC dataset that we have in the repo here: https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/rep_removal/test-data/input/test1.parquet (it goes through rep_removal first and then uses the quality annotator and other classifiers). I think you can use this dataset without the rep_removal step and just use one classifier, such as GneissWeb.Quality_annotator, as an example (for testing and the notebook), while mentioning the other ones in the README with links. Does this make sense?

@shahrokhDaijavad (Member)

> I did a test using five classifiers in local.py, and it worked.

Thank you. That's good.

@ran-iwamoto (Contributor Author)

@shahrokhDaijavad Done. Could you please check it?

@shahrokhDaijavad (Member)

@ran-iwamoto Nice job! Thank you very much. I checked all the files that you have changed, and everything looks good! (the Python files, the parquet data file, the README, and the notebooks). I will also test running it tomorrow (Monday my time) and approve it.

@shahrokhDaijavad (Member)

@ran-iwamoto I made some changes in the two notebooks:

  1. I changed some comment cells for consistency of the parameters with what is being run in the notebook, plus the new way we pip install the python and ray transforms (although pip install is not being used in these notebooks).
  2. I changed the Python notebook so that it uses two models on one command line, and tested it successfully. I didn't think I needed to do the same thing for the Ray notebook.
  3. The CI/CD test-src failure is again the result of HF authentication, so we ignore it for now.

Please review my changes and comment if what I have done makes sense. Thanks.

@ran-iwamoto (Contributor Author)

I checked the changes, thank you!

@shahrokhDaijavad (Member) left a review:

LGTM.

@shahrokhDaijavad (Member)

@ran-iwamoto I just noticed that you haven't added n_processes_cli_param to the table in the README file (it is in the command-line options, but it should be in the table, too). Is the default value equal to 1? Since this table is also shown in the comment cells in the two notebooks, the two notebooks should show this as well. If possible, please set its value to 2 in the Python notebook and run it with that value, so we have an example of it. Thanks.
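For context, the n_processes parameter governs how many worker processes perform the classification. A minimal, hypothetical sketch of that kind of parallelism is shown below; the label function is a stand-in for a real fastText prediction, not the transform's actual code:

```python
# Hypothetical sketch: classify documents with a pool of worker
# processes, analogous to setting gcls_n_processes=2 in the transform.
from multiprocessing import Pool

def label(text):
    # Stand-in for a fastText model prediction.
    return "hq" if len(text) > 20 else "lq"

def classify(docs, n_processes=2):
    """Classify documents using n_processes worker processes."""
    with Pool(processes=n_processes) as pool:
        return pool.map(label, docs)

if __name__ == "__main__":
    labels = classify(["short", "a considerably longer example document"],
                      n_processes=2)
    # labels == ["lq", "hq"]
```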

@shahrokhDaijavad (Member) left a review:

Show the gcls_n_processes parameter in README table and in the notebooks.

@shahrokhDaijavad (Member)

@ran-iwamoto One other thing about the README file. Please add your name and email as a Contributor to the README file, like this one: https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/doc_quality/README.md
Thanks.

@ran-iwamoto (Contributor Author)

Done.

@shahrokhDaijavad (Member) left a review:

Thank you very much, @ran-iwamoto for making all the changes.
