Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] HAP transform crashes if "contents" is empty #1048

Open
2 tasks done
burn2l opened this issue Feb 13, 2025 · 7 comments
Open
2 tasks done

[Bug] HAP transform crashes if "contents" is empty #1048

burn2l opened this issue Feb 13, 2025 · 7 comments
Labels
bug Something isn't working

Comments

@burn2l
Copy link

burn2l commented Feb 13, 2025

Search before asking

  • I searched the issues and found no similar issues.

Component

Transforms/Other

What happened + What you expected to happen

If a sample has an empty "contents" field HAP fails with:

Exception executing transform max() arg is an empty sequence

Reproduction script

Append a 51st sample to test-data/input/test1.parquet such as:
{"doc_id": 51, "contents": " "}
and run pytest test/test_hap.py

Anything else

Can be fixed by changing line 72 in transforms/universal/hap/dpk_hap/transform.py to
doc_scores.append(max(temp, default=0.0))

OS

Red Hat Enterprise Linux (RHEL)

Python

3.11.x

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@shahrokhDaijavad
Copy link
Member

@ian-cho Can you please make this fix too and submit a PR? Thanks.

@ian-cho
Copy link
Collaborator

ian-cho commented Feb 19, 2025

Can be fixed by changing line 72 in transforms/universal/hap/dpk_hap/transform.py to doc_scores.append(max(temp, default=0.0))

Thanks. How about using -1 instead of 0 to indicate the content is empty? 0 has the specific meaning (NON-HAP)

Hi @burn2l, I did not quite get why you need to append {"doc_id": 51, "contents": " "} in the test file. However, from my understanding, when processing a large number of parquet files, there is a possibility that the contents column may be empty. Is it the same issue you were referring to?

@burn2l
Copy link
Author

burn2l commented Feb 19, 2025

I thought the hap_score range was 0.0 to 1.0 so an empty file with no HAP content should be 0. I don't see much value in using a value outside this range to indicate that the content is empty since that fact can be easily determined by inspecting the contents.
I added the 51st line to the test data as a way to demonstrate the problem of an empty contents, and to test the fix.
Note that #1052 is also a case of empty contents ... i.e. no tokens.

@ian-cho
Copy link
Collaborator

ian-cho commented Feb 20, 2025

@burn2l Hi, thanks for raising this issue and the problem definitely need a fix. From my perspective, two points:

  • If we use 0, users processing a large number of Parquet files may see many instances with a score of 0 and mistakenly assume their files contain no HAP terms. However, this could simply be due to the existence of empty documents. This change would remove the warning functionality for users.
  • The current transform behaves the same whether using 0 or -1, as we take the maximum value as the final hap score.

I added the 51st line to the test data as a way to demonstrate the problem of an empty contents, and to test the fix.

got it.

@burn2l
Copy link
Author

burn2l commented Feb 20, 2025

Personally I think it's confusing to have to interpret the score in the range 0.0-1.0 its a HAP score but when if -1 means "no-tokens-found". If later transforms try to compute some statistics in a collection of documents then the -1 values will complicate the process.

In practice I believe the model will never give a score of 0 for non-empty text, e.g. the single character 'I' gets 0.007196 and '-' gets 0.014725. Samples with empty contents should probably be removed earlier as they produce no training tokens.

I see that the 'inner' doc_hap transform takes the reverse approach and gives missing text a hap score of 0.99, expecting the sample to be removed later by a threshold filter.

Basically I believe the HAP score should reflect the presence of HAP text ... blank or missing text has no HAP content and therefore should have a score of 0

@ian-cho
Copy link
Collaborator

ian-cho commented Feb 21, 2025

@burn2l Indeed, the model rarely assigns a score of 0 to non-empty documents, making it distinguishable from truly empty content. Users are responsible for handling empty documents themselves.

@shahrokhDaijavad (@touma-I ) I agree to use 0 to indicate empty contents and will update the PR accordingly. We consider the issue resolved unless you have any further input.

@shahrokhDaijavad
Copy link
Member

Thanks, @ian-cho and @burn2l. No further input from me, after your great discussion here. Looking forward to the updated PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants