
Enabling gneissweb_classification transform by using multiple fasttext classifiers simultaneously #1046

Open · wants to merge 28 commits into base: dev
Conversation

@ran-iwamoto (Contributor)

Why are these changes needed?

To enable the use of multiple classifiers.

Related issue number (if any).

Issue #1034

@ran-iwamoto (Contributor Author)

@touma-I @shahrokhDaijavad

@ran-iwamoto (Contributor Author)

There is a test error above, but when I ran make test on my laptop, all four test cases succeeded.

@shahrokhDaijavad (Member)

Thank you, @ran-iwamoto, for doing this implementation. We will put the HF token issue, which causes the CI/CD test-src to fail, aside for now.

I tried this on my laptop and make test succeeded. However, make run-cli-sample failed with the error shown in the attached screenshot. It looks like you have not changed transform_python.py. Can you please look into this? Thanks.
(Screenshot: 2025-02-13 at 10:32:54)

@ran-iwamoto (Contributor Author)

@shahrokhDaijavad Thank you for reporting the issue. I updated the Makefile, and both make run-cli-sample and make run-cli-ray-sample now succeed.

@shahrokhDaijavad (Member)

Great! Thanks, @ran-iwamoto. Now all CI/CD tests are passing, too!
I haven't tried it myself, but don't you need to change the same parameters in the two notebooks as well?

@ran-iwamoto (Contributor Author)

@shahrokhDaijavad I updated the two notebooks; thank you for pointing that out!

@shahrokhDaijavad (Member)

Thank you, @ran-iwamoto. I have tested make run-cli-sample and the notebooks with the default of one classifier. I still have to test specifying a list of multiple classifiers as parameters, e.g., by creating a notebook that does this. Have you done such a test yourself?

@shahrokhDaijavad (Member)

Hi @ran-iwamoto and @issei-ibm, I got instructions today that in all Python code, notebooks, and the README in this transform, we should use the 5 IBM GneissWeb models that Bhatta has uploaded to HF here: https://huggingface.co/organizations/ibm-granite/activity/all (excluding Bloom), instead of model.bin from facebook/fasttext-language-identification. The README file should mention all of them, but the code and notebooks should use only one of them as an example. Will you be able to do this on Monday? Thanks.

@ran-iwamoto (Contributor Author)

Sure.
Do we also need to change the test data?
I don't have good example sentences for test data, which is why I am using the language-identification model as the default.
And if we extract sentences from somewhere, we may need data clearance.
How do you handle this in other transforms?

@ran-iwamoto (Contributor Author)

I did a test using five classifiers in local.py, and it worked.
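As a rough sketch of what running several classifiers over the same documents looks like, the snippet below applies each classifier to every document and collects one label column per classifier. The function and column names here are hypothetical stand-ins; the real transform loads fastText models (e.g. via fasttext.load_model) rather than these toy functions:

```python
# Hypothetical sketch: apply several classifiers to each document and
# collect one label column per classifier, as the transform does for
# multiple fastText models. The classifiers below are stand-ins for
# real fastText model predictions.

def classify_all(docs, classifiers):
    """Return {column_name: [label per doc]} for each named classifier."""
    return {name: [clf(d) for d in docs] for name, clf in classifiers.items()}

# Stand-in classifiers (real ones would call model.predict(text)).
classifiers = {
    "quality_label": lambda text: "hq" if len(text) > 20 else "lq",
    "lang_label": lambda text: "en",
}
docs = ["short", "a considerably longer example document"]
result = classify_all(docs, classifiers)
# result["quality_label"] → ["lq", "hq"]
```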

@shahrokhDaijavad (Member) commented Feb 17, 2025

Thanks, @ran-iwamoto. Good question. I think the best example to look at is: https://github.com/Hajar-Emami/data-prep-kit/blob/GneissWeb_notebook/examples/notebooks/GneissWeb/GneissWeb.ipynb. This is the full GneissWeb recipe that Hajar is working on (not complete yet). The idea is to start from a CommonCrawl dataset publicly available on Hugging Face and run it through the multiple transforms used in creating the GneissWeb dataset. At the moment, downloading from HF is commented out, and she is using as input a small subset of a CC dataset that we have in the repo here: https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/rep_removal/test-data/input/test1.parquet (it goes through rep_removal first and then uses the quality annotator and other classifiers). I think you can use this dataset without the rep_removal step and just use one classifier, such as GneissWeb.Quality_annotator, as an example (for testing and the notebook), while mentioning the other ones in the README with links. Does this make sense?

@shahrokhDaijavad (Member)

> I did a test using five classifiers in local.py, and it worked.

Thank you. That's good.

@ran-iwamoto (Contributor Author)

@shahrokhDaijavad Done. Could you please check it?

@shahrokhDaijavad (Member)

@ran-iwamoto Nice job! Thank you very much. I checked all the files that you have changed, and everything looks good! (the Python files, the parquet data file, the README, and the notebooks). I will also test running it tomorrow (Monday my time) and approve it.

@shahrokhDaijavad (Member)

@ran-iwamoto I made some changes in the two notebooks:

  1. I changed some comment cells for consistency of the parameters with what is being run in the notebook, plus the new way we pip install the python and ray transforms (although pip install is not being used in these notebooks).
  2. I changed the Python notebook so that it uses two models on one command line, and tested it successfully. I didn't think I needed to do the same thing for the Ray notebook.
  3. The CI/CD test-src failure is again the result of HF authentication, so we ignore it for now.

Please review my changes and comment if what I have done makes sense. Thanks.

@ran-iwamoto (Contributor Author)

I checked the changes, thank you!

@shahrokhDaijavad (Member) left a review:

LGTM.

@shahrokhDaijavad (Member)

@ran-iwamoto I just noticed that you haven't added n_processes_cli_param to the table in the README file (it is in the command-line options, but it should be in the table, too). Is the default value equal to 1? Since this table is also shown in the comment cells in the two notebooks, the two notebooks should show this as well. If possible, please set its value to 2 in the Python notebook and run it with that value, so we have an example of it. Thanks.
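For context, the n_processes parameter governs how many worker processes perform the classification. A minimal, hypothetical sketch of that kind of parallelism is shown below; the label function is a stand-in for a real fastText prediction, not the transform's actual code:

```python
# Hypothetical sketch: classify documents with a pool of worker
# processes, analogous to setting gcls_n_processes=2 in the transform.
from multiprocessing import Pool

def label(text):
    # Stand-in for a fastText model prediction.
    return "hq" if len(text) > 20 else "lq"

def classify(docs, n_processes=2):
    """Classify documents using n_processes worker processes."""
    with Pool(processes=n_processes) as pool:
        return pool.map(label, docs)

if __name__ == "__main__":
    labels = classify(["short", "a considerably longer example document"],
                      n_processes=2)
    # labels == ["lq", "hq"]
```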

@shahrokhDaijavad (Member) left a review:

Show the gcls_n_processes parameter in README table and in the notebooks.

@shahrokhDaijavad (Member)

@ran-iwamoto One other thing about the README file. Please add your name and email as a Contributor to the README file, like this one: https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/doc_quality/README.md
Thanks.

@ran-iwamoto (Contributor Author)

Done.

@shahrokhDaijavad (Member) left a review:

Thank you very much, @ran-iwamoto for making all the changes.
