-
Notifications
You must be signed in to change notification settings - Fork 174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Bug] HAP transform crashes if "contents" is empty #1048
Comments
@ian-cho Can you please make this fix too and submit a PR? Thanks. |
Thanks. How about using -1 instead of 0 to indicate the content is empty? 0 has the specific meaning (NON-HAP) Hi @burn2l, I did not quite get why you need to append {"doc_id": 51, "contents": " "} in the test file. However, from my understanding, when processing a large number of parquet files, there is a possibility that the |
I thought the hap_score range was 0.0 to 1.0 so an empty file with no HAP content should be 0. I don't see much value in using a value outside this range to indicate that the content is empty since that fact can be easily determined by inspecting the contents. |
@burn2l Hi, thanks for raising this issue and the problem definitely need a fix. From my perspective, two points:
got it. |
Personally I think it's confusing to have to interpret the score in the range 0.0-1.0 its a HAP score but when if -1 means "no-tokens-found". If later transforms try to compute some statistics in a collection of documents then the -1 values will complicate the process. In practice I believe the model will never give a score of 0 for non-empty text, e.g. the single character 'I' gets 0.007196 and '-' gets 0.014725. Samples with empty contents should probably be removed earlier as they produce no training tokens. I see that the 'inner' doc_hap transform takes the reverse approach and gives missing text a hap score of 0.99, expecting the sample to be removed later by a threshold filter. Basically I believe the HAP score should reflect the presence of HAP text ... blank or missing text has no HAP content and therefore should have a score of 0 |
@burn2l Indeed, the model rarely assigns a score of 0 to non-empty documents, making it distinguishable from truly empty content. Users are responsible for handling empty documents themselves. @shahrokhDaijavad (@touma-I ) I agree to use 0 to indicate empty contents and will update the PR accordingly. We consider the issue resolved unless you have any further input. |
Search before asking
Component
Transforms/Other
What happened + What you expected to happen
If a sample has an empty "contents" field HAP fails with:
Exception executing transform max() arg is an empty sequence
Reproduction script
Append a 51st sample to test-data/input/test1.parquet such as:
{"doc_id": 51, "contents": " "}
and run pytest test/test_hap.py
Anything else
Can be fixed by changing line 72 in transforms/universal/hap/dpk_hap/transform.py to
doc_scores.append(max(temp, default=0.0))
OS
Red Hat Enterprise Linux (RHEL)
Python
3.11.x
Are you willing to submit a PR?
The text was updated successfully, but these errors were encountered: