Implementing Bloom Annotator #978
base: dev
Conversation
@shahrokhDaijavad @touma-I I added a notebook for the Python runtime and also linked to it from the README file: https://github.com/ian-cho/data-prep-kit/blob/dev/transforms/universal/bloom/bloom_python.ipynb
@ian-cho I haven't had time to test the notebook, but I just realized that you are using a bf.bloom model. Is that open source and available to download, so we don't need to include it in the repo?
@shahrokhDaijavad Thanks for pointing this out. Currently, I am not aware of any small, downloadable models suitable for this demonstration, so I trained a small model locally. In the future, we can replace this local model path with a Hugging Face model once one is ready.
Thank you, @ian-cho. The size of the model (240 KB) is quite small, but for us to include it statically in an IBM-owned open repo, we need permission from IBM. I will bring this up with Bhatta, who may be able to get permission for us.
@BishwaBhatta As you may know, @ian-cho is using a small model that he has trained himself for this transform. This is fine for testing, but we cannot merge the code with that model included in the outer repo. So, if this small model will not be used at the end and an HF model will replace it after successful testing, should we take it out before merging the code?
@shahrokhDaijavad Hi, thanks for asking.
…
Thank you, @ian-cho, for testing with the large model. For consistency with everything else in DPK, I think we should stick with contents instead of text for the document column.
@shahrokhDaijavad OK, then can I comment out this line? The transform does not need text or contents. By doing this, we can stick to contents.
@touma-I Do you see any problem with what @ian-cho is asking?
@ian-cho Sorry that I did not test the notebook until today. I did a …
Hi, @shahrokhDaijavad, I have no idea why it failed. Your and @touma-I's help would be appreciated!
Signed-off-by: Maroun Touma <[email protected]>
Signed-off-by: SHAHROKH DAIJAVAD <[email protected]>
@ian-cho After @touma-I fixed the workflow test failure, I saw that test-src was failing. I pushed the same fix that we did for using the new API, and now everything is passing in CI/CD.
@touma-I @shahrokhDaijavad Thanks for your fix! Retaining bf.bloom would make the repo self-contained, if it is fine to keep it. Also, I added two comments here and here for the user's convenience when filtering FineWeb down to GneissWeb, if that is okay.
LGTM.
Signed-off-by: SHAHROKH DAIJAVAD <[email protected]>
@ian-cho As I was going over the README file with @touma-I today, we felt that it was not clear enough for our audience in the open-source community why we are doing this. So, I made some changes in the README file. Can you please review my changes and see if you want to modify/add anything? Thanks.
Hi, @shahrokhDaijavad @touma-I , thanks. The added summary looks good to me. |
@touma-I @shahrokhDaijavad @BishwaBhatta (@issei-ibm) I updated the model path in the scripts to ibm-granite/GneissWeb.bloom. However, during on-the-fly testing, I ran into an OOM error due to the model's 28GB size. Any ideas? Thanks a lot!
Thank you for trying this out, @ian-cho. I am not surprised, because I would definitely get an OOM on my laptop, which has 16GB of memory. How much memory do you have on your laptop? When you said a couple of weeks ago (see above) that you tried it with the 28GB model that Yohei trained, how did you do it?
@shahrokhDaijavad No, on my laptop I was able to load the 28GB model in the on-the-fly test. When I pushed it to DPK, one test failed due to an OOM error, as follows:
Great. Thank you, @ian-cho. So, the OOM happens in the Ubuntu server that runs the CI/CD tests. Please keep the code as is, with the real model. If your laptop has more than 16GB of memory, please let me know, and I will add a sentence to the README file stating that "using the large model file will require a machine with more than 16GB of memory".
@shahrokhDaijavad Hi, thanks. My laptop has more than 16GB of memory.
Signed-off-by: SHAHROKH DAIJAVAD <[email protected]>
@ian-cho Can you please review the changes I made to the README and to the notebook? Thanks.
Hi, @shahrokhDaijavad, thank you for making these changes. It looks good to me. Just two quick questions:
…
Signed-off-by: SHAHROKH DAIJAVAD <[email protected]>
@ian-cho I made a small change in the comment cell of the notebook that mentions pip install. With the new APIs, we don't need to pip install data-prep-toolkit anymore.
@shahrokhDaijavad Thank you for the modification. In Hugging Face's release of the GneissWeb Bloom filter, the link to DPK currently points to the dev branch of my personal Git repository. It should point to the DPK outer branch after the merge.
Understood, @ian-cho. We will change the link after the merge.
Why are these changes needed?
Add an additional column indicating whether each document is present in the pre-existing Bloom filter model. This is specifically intended for the GneissWeb release.
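The annotation step can be illustrated with a minimal, self-contained sketch. Note this is not the actual DPK transform or the GneissWeb model: the TinyBloom class, its hashing scheme, and the column names (contents, in_bloom) are all hypothetical stand-ins, since the real transform loads a pre-trained .bloom file such as ibm-granite/GneissWeb.bloom.

```python
import hashlib

class TinyBloom:
    """Minimal Bloom filter, for illustration only (not the real model)."""

    def __init__(self, size_bits: int = 1 << 16, num_hashes: int = 4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        # Set all bit positions for this item.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # Membership test: all positions set => "probably present";
        # any position clear => "definitely not present".
        return all((self.bits[pos // 8] >> (pos % 8)) & 1
                   for pos in self._positions(item))

# Build a filter over the retained document set, then annotate each row
# with a membership flag (1 = likely in the filter, 0 = definitely not).
bloom = TinyBloom()
for doc in ["doc-a", "doc-b"]:
    bloom.add(doc)

rows = [{"contents": c} for c in ["doc-a", "doc-b", "doc-c"]]
for row in rows:
    row["in_bloom"] = 1 if row["contents"] in bloom else 0

print([row["in_bloom"] for row in rows])  # [1, 1, 0]
```

Bloom filters never produce false negatives, so a 0 is definitive; a 1 carries a small false-positive probability that shrinks with filter size, which is why the production GneissWeb filter is so large (28GB).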