[Bug] HAP transform crashes if "contents" is empty #1048

burn2l · 2025-02-13T21:22:38Z

Search before asking

I searched the issues and found no similar issues.

Component

Transforms/Other

What happened + What you expected to happen

If a sample has an empty "contents" field HAP fails with:

Exception executing transform max() arg is an empty sequence

Reproduction script

Append a 51st sample to test-data/input/test1.parquet such as:
{"doc_id": 51, "contents": " "}
and run pytest test/test_hap.py

Anything else

Can be fixed by changing line 72 in transforms/universal/hap/dpk_hap/transform.py to
doc_scores.append(max(temp, default=0.0))

OS

Red Hat Enterprise Linux (RHEL)

Python

3.11.x

Are you willing to submit a PR?

Yes I am willing to submit a PR!

shahrokhDaijavad · 2025-02-18T18:52:42Z

@ian-cho Can you please make this fix too and submit a PR? Thanks.

ian-cho · 2025-02-19T00:59:47Z

Can be fixed by changing line 72 in transforms/universal/hap/dpk_hap/transform.py to doc_scores.append(max(temp, default=0.0))

Thanks. How about using -1 instead of 0 to indicate the content is empty? 0 has the specific meaning (NON-HAP)

Hi @burn2l, I did not quite get why you need to append {"doc_id": 51, "contents": " "} in the test file. However, from my understanding, when processing a large number of parquet files, there is a possibility that the contents column may be empty. Is it the same issue you were referring to?

burn2l · 2025-02-19T13:21:29Z

I thought the hap_score range was 0.0 to 1.0 so an empty file with no HAP content should be 0. I don't see much value in using a value outside this range to indicate that the content is empty since that fact can be easily determined by inspecting the contents.
I added the 51st line to the test data as a way to demonstrate the problem of an empty contents, and to test the fix.
Note that #1052 is also a case of empty contents ... i.e. no tokens.

ian-cho · 2025-02-20T04:47:15Z

@burn2l Hi, thanks for raising this issue and the problem definitely need a fix. From my perspective, two points:

If we use 0, users processing a large number of Parquet files may see many instances with a score of 0 and mistakenly assume their files contain no HAP terms. However, this could simply be due to the existence of empty documents. This change would remove the warning functionality for users.
The current transform behaves the same whether using 0 or -1, as we take the maximum value as the final hap score.

I added the 51st line to the test data as a way to demonstrate the problem of an empty contents, and to test the fix.

got it.

burn2l · 2025-02-20T14:21:55Z

Personally I think it's confusing to have to interpret the score in the range 0.0-1.0 its a HAP score but when if -1 means "no-tokens-found". If later transforms try to compute some statistics in a collection of documents then the -1 values will complicate the process.

In practice I believe the model will never give a score of 0 for non-empty text, e.g. the single character 'I' gets 0.007196 and '-' gets 0.014725. Samples with empty contents should probably be removed earlier as they produce no training tokens.

I see that the 'inner' doc_hap transform takes the reverse approach and gives missing text a hap score of 0.99, expecting the sample to be removed later by a threshold filter.

Basically I believe the HAP score should reflect the presence of HAP text ... blank or missing text has no HAP content and therefore should have a score of 0

ian-cho · 2025-02-21T07:47:43Z

@burn2l Indeed, the model rarely assigns a score of 0 to non-empty documents, making it distinguishable from truly empty content. Users are responsible for handling empty documents themselves.

@shahrokhDaijavad (@touma-I ) I agree to use 0 to indicate empty contents and will update the PR accordingly. We consider the issue resolved unless you have any further input.

shahrokhDaijavad · 2025-02-21T15:22:54Z

Thanks, @ian-cho and @burn2l. No further input from me, after your great discussion here. Looking forward to the updated PR.

burn2l added the bug Something isn't working label Feb 13, 2025

shahrokhDaijavad mentioned this issue Feb 14, 2025

[Bug] HAP transform crashes when using a GPU #1047

Open

2 tasks

burn2l mentioned this issue Feb 17, 2025

[Bug] Extreme Tokenize transform fails when the number of documents is not equal to the number of tokens sets #1052

Closed

2 tasks

ian-cho mentioned this issue Feb 19, 2025

Update transform.py #1058

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug] HAP transform crashes if "contents" is empty #1048

[Bug] HAP transform crashes if "contents" is empty #1048

burn2l commented Feb 13, 2025

shahrokhDaijavad commented Feb 18, 2025

ian-cho commented Feb 19, 2025 •

edited

Loading

burn2l commented Feb 19, 2025 •

edited

Loading

ian-cho commented Feb 20, 2025

burn2l commented Feb 20, 2025

ian-cho commented Feb 21, 2025

shahrokhDaijavad commented Feb 21, 2025

[Bug] HAP transform crashes if "contents" is empty #1048

[Bug] HAP transform crashes if "contents" is empty #1048

Comments

burn2l commented Feb 13, 2025

Search before asking

Component

What happened + What you expected to happen

Reproduction script

Anything else

OS

Python

Are you willing to submit a PR?

shahrokhDaijavad commented Feb 18, 2025

ian-cho commented Feb 19, 2025 • edited Loading

burn2l commented Feb 19, 2025 • edited Loading

ian-cho commented Feb 20, 2025

burn2l commented Feb 20, 2025

ian-cho commented Feb 21, 2025

shahrokhDaijavad commented Feb 21, 2025

ian-cho commented Feb 19, 2025 •

edited

Loading

burn2l commented Feb 19, 2025 •

edited

Loading