-
Notifications
You must be signed in to change notification settings - Fork 174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enabling gneissweb_classification transform by using multiple fasttext classifiers simultaneously #1046
base: dev
Are you sure you want to change the base?
Conversation
Test error above. |
Thank you, @ran-iwamoto, for doing this implementation. We will put the HF token issue, which causes the CI/CD test-src to fail, aside for now. I tried this on my laptop and |
@shahrokhDaijavad Thank you for telling the issue. I updated Makefile, |
Great! Thanks, @ran-iwamoto. Now, all CI/CD tests are passing too! |
1b2b8af
to
50220a2
Compare
@shahrokhDaijavad I changed two notebooks, thank you for pointing out! |
Thank you, @ran-iwamoto. I have tested |
Hi, @ran-iwamoto and @issei-ibm I got instructions today that in all Python code, notebooks and README in this transform, we should use the 5 IBM GneissWeb models that Bhatta has uploaded to HF here: https://huggingface.co/organizations/ibm-granite/activity/all (excluding Bloom) instead of |
Sure. |
I did a test using five classifier in local.py, it worked. |
Thanks, @ran-iwamoto. Good question. I think the best example to look at is: https://github.com/Hajar-Emami/data-prep-kit/blob/GneissWeb_notebook/examples/notebooks/GneissWeb/GneissWeb.ipynb. This is the full Gneissweb recipe that Hajar is working on (not complete yet). The idea is to start from a CommonCrawl data set publicly available in HuggingFace and run through multiple transforms used in creating GneissWeb dataset. At the moment, downloading from HF is commented out, and she is using a small subset of a CC data set as input that we have in the repo here: https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/rep_removal/test-data/input/test1.parquet (goes through rep_removal first and then uses quality annotator and other classifiers). I think you can use this dataset without the rep_removal step and just use one classifier like GneissWeb.Quality_annotator as an example (for testing and notebook), while mentioning the other ones in the README with links. Does this make sense? |
Thank you. That's good. |
@shahrokhDaijavad Done. Could you please check it? |
@ran-iwamoto Nice job! Thank you very much. I checked all the files that you have changed, and everything looks good! (the Python files, the parquet data file, the README, and the notebooks). I will also test running it tomorrow (Monday my time) and approve it. |
Signed-off-by: SHAHROKH DAIJAVAD <[email protected]>
Signed-off-by: SHAHROKH DAIJAVAD <[email protected]>
@ran-iwamoto I made some changes in the two notebooks:
Please review my changes and comment if what I have done makes sense. Thanks. |
I check the changes, thank you! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
@ran-iwamoto I just noticed that you haven't added the n_processes_cli_param to the table in the README file (it is in the command line options, but it should be in the table, too). Is the default value equal to 1? Since this table is also shown in the comment cells in the 2 notebooks, the 2 notebooks should also show this. If possible, please set its value to 2 in the python notebook and run it with this value, so we have an example of it. Thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Show the gcls_n_processes parameter in README table and in the notebooks.
@ran-iwamoto One other thing about the README file. Please add your name and email as a Contributor to the README file, like this one: https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/doc_quality/README.md |
done |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much, @ran-iwamoto for making all the changes.
Why are these changes needed?
For using multiple classifiers
Related issue number (if any).
Issue #1034