
Explicit Encoding Handling for PDF Parsing #8905

Open

JasperLS opened this issue Feb 21, 2025 · 1 comment
Labels
P1 (High priority, add to the next sprint) · type:feature (New feature or request)

Comments


JasperLS commented Feb 21, 2025

Is your feature request related to a problem? Please describe.
PDFs with non-UTF-8 encoding (e.g., ANSI, cp1252) are not indexed correctly in Haystack’s document pipeline. This results in missing text, corrupted characters (e.g., (cid:xx) artifacts), or unreadable embeddings. I’d like the Haystack PDF parsing components to support automatic encoding detection and conversion, plus an explicit encoding selection option.

Describe the solution you'd like
Enhance the PDF parsing components by:

- Auto-detecting the encoding before indexing, using a library such as chardet or cchardet.
- Providing an explicit encoding parameter (e.g., encoding="utf-8" or encoding="auto") in PDFToTextConverter, PDFPlumberConverter, and PyMuPDFConverter.
- Converting extracted text to UTF-8 before it is passed to the embedding pipeline (see the sketch below).
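A minimal sketch of what the detection-and-conversion step could look like; the to_utf8 helper and the encoding="auto" convention are hypothetical illustrations of the request, not existing Haystack API:

```python
# Hypothetical helper: guess the encoding of extracted raw bytes with chardet
# and normalize the result to a UTF-8 Python string before indexing.
import chardet

def to_utf8(raw: bytes, encoding: str = "auto") -> str:
    """Decode raw bytes, auto-detecting the source encoding if requested."""
    if encoding == "auto":
        guess = chardet.detect(raw)  # e.g. {'encoding': 'Windows-1252', 'confidence': 0.73}
        encoding = guess["encoding"] or "utf-8"  # fall back if detection fails
    return raw.decode(encoding, errors="replace")
```

A converter could run such a helper on the extracted text before handing documents to the embedding pipeline.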

@julian-risch added the P1 (High priority, add to the next sprint) and type:feature (New feature or request) labels on Feb 21, 2025
lbux (Contributor) commented Feb 22, 2025

From my research into how both PyPDF and PDFMiner handle text extraction for #8491, I’ve found that the presence of (cid:x) values often signals that the PDF itself is missing the necessary character-to-Unicode mappings. This tends to happen when the fonts or character encodings in the PDF are incomplete or poorly defined. In that case, the issue might not be with the extraction itself but with the underlying PDF structure.

For PDFMiner, the (cid:x) output usually appears because it defaults to showing raw character IDs when it cannot map characters to Unicode. This happens when the PDF uses fonts with no corresponding Unicode mapping (a common occurrence with custom or embedded fonts). If you open the PDF in a viewer, try copying the text and pasting it into a text editor: if the result is gibberish, that usually confirms the issue lies in the PDF’s encoding itself.
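For anyone who wants to reproduce this outside Haystack, pdfminer.six shows the same behavior directly ("document.pdf" is a placeholder for an affected file):

```python
# Extract text with pdfminer.six; on a PDF whose fonts lack Unicode maps,
# the output contains "(cid:NN)" placeholders instead of readable characters.
from pdfminer.high_level import extract_text

text = extract_text("document.pdf")
print(text[:200])
```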

Similarly, with PyPDF, when certain fonts are not extracted properly, the cause can be the absence of a translation table (such as the /ToUnicode field for embedded fonts), which makes it difficult to decode the characters properly.
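As a rough diagnostic, one can walk a page’s font resources with pypdf and flag fonts that carry no /ToUnicode entry; this is a sketch assuming a recent pypdf, with "document.pdf" again a placeholder:

```python
from pypdf import PdfReader

reader = PdfReader("document.pdf")
for page_num, page in enumerate(reader.pages):
    resources = page.get("/Resources")
    if resources is None:
        continue
    fonts = resources.get_object().get("/Font")  # resolve indirect references
    if fonts is None:
        continue
    for name, ref in fonts.get_object().items():
        font = ref.get_object()
        if "/ToUnicode" not in font:  # no character-to-Unicode translation table
            print(f"page {page_num}: font {name} has no /ToUnicode mapping")
```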

So, while we may be unable to correct the poorly extracted text, we can do a post-conversion cleanup to strip the raw character IDs that the converters emit as fallback output, preventing these artifacts from affecting downstream tasks (although this could result in some data loss, depending on how often the converter falls back). One approach could be to use the DocumentCleaner component’s remove_substrings or remove_regex options to clean the unwanted patterns.
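A hedged sketch of that cleanup, assuming Haystack 2.x’s DocumentCleaner and its remove_regexps parameter:

```python
from haystack import Document
from haystack.components.preprocessors import DocumentCleaner

# Strip "(cid:NN)" artifacts left behind by converters that cannot map glyphs.
cleaner = DocumentCleaner(remove_regexps=[r"\(cid:\d+\)"])

docs = [Document(content="Readable text (cid:72)(cid:101) more readable text")]
result = cleaner.run(documents=docs)
print(result["documents"][0].content)  # artifacts removed
```

Any leftover double spaces should also be collapsed by the component’s default remove_extra_whitespaces behavior.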
