
Explicit Encoding Handling for PDF Parsing #8905

Open

JasperLS opened this issue Feb 21, 2025 · 1 comment
Labels
P1 (High priority, add to the next sprint) · type:feature (New feature or request)

Comments


JasperLS commented Feb 21, 2025

Is your feature request related to a problem? Please describe.
PDFs with non-UTF-8 encoding (e.g., ANSI, cp1252) are not indexed correctly in Haystack’s document pipeline. This results in missing text, corrupted characters (e.g., (cid:xx) artifacts), or unreadable embeddings. I’d like the Haystack PDF parsing components to support automatic encoding detection and conversion, plus an explicit encoding selection option.

Describe the solution you'd like
Enhance the PDF parsing components by:

- Auto-detecting the encoding before indexing, using a library such as chardet or cchardet.
- Providing an explicit encoding parameter (e.g., encoding="utf-8" or encoding="auto") in PDFToTextConverter, PDFPlumberConverter, and PyMuPDFConverter.
- Converting extracted text to UTF-8 before it is passed to the embedding pipeline (see the sketch below).
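A minimal sketch of what the detection-and-conversion step could look like; the to_utf8 helper and the encoding="auto" convention are hypothetical illustrations of the request, not existing Haystack API:

```python
# Hypothetical helper: guess the encoding of extracted raw bytes with chardet
# and normalize the result to a UTF-8 Python string before indexing.
import chardet

def to_utf8(raw: bytes, encoding: str = "auto") -> str:
    """Decode raw bytes, auto-detecting the source encoding if requested."""
    if encoding == "auto":
        guess = chardet.detect(raw)  # e.g. {'encoding': 'Windows-1252', 'confidence': 0.73}
        encoding = guess["encoding"] or "utf-8"  # fall back if detection fails
    return raw.decode(encoding, errors="replace")
```

A converter could run such a helper on the extracted text before handing documents to the embedding pipeline.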

@julian-risch added the P1 (High priority, add to the next sprint) and type:feature (New feature or request) labels on Feb 21, 2025
lbux (Contributor) commented Feb 22, 2025

From my research into how both PyPDF and PDFMiner handle text extraction for #8491, I’ve found that the presence of (cid:x) values often signals that the PDF itself is missing the necessary character-to-Unicode mappings. This tends to happen when the fonts or character encodings in the PDF are incomplete or poorly defined. In that case, the issue might not be with the extraction itself but with the underlying PDF structure.

For PDFMiner, the (cid:x) output usually appears because it defaults to showing raw character IDs when it cannot map characters to Unicode. This happens when the PDF uses fonts with no corresponding Unicode mapping (a common occurrence with custom or embedded fonts). If you open the PDF in a viewer, try copying the text and pasting it into a text editor: if the result is gibberish, that usually confirms the issue lies in the PDF’s encoding itself.
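For anyone who wants to reproduce this outside Haystack, pdfminer.six shows the same behavior directly ("document.pdf" is a placeholder for an affected file):

```python
# Extract text with pdfminer.six; on a PDF whose fonts lack Unicode maps,
# the output contains "(cid:NN)" placeholders instead of readable characters.
from pdfminer.high_level import extract_text

text = extract_text("document.pdf")
print(text[:200])
```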

Similarly, with PyPDF, when certain fonts are not extracted properly, the cause can be the absence of a translation table (such as the /ToUnicode field for embedded fonts), which makes it difficult to decode the characters properly.
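As a rough diagnostic, one can walk a page’s font resources with pypdf and flag fonts that carry no /ToUnicode entry; this is a sketch assuming a recent pypdf, with "document.pdf" again a placeholder:

```python
from pypdf import PdfReader

reader = PdfReader("document.pdf")
for page_num, page in enumerate(reader.pages):
    resources = page.get("/Resources")
    if resources is None:
        continue
    fonts = resources.get_object().get("/Font")  # resolve indirect references
    if fonts is None:
        continue
    for name, ref in fonts.get_object().items():
        font = ref.get_object()
        if "/ToUnicode" not in font:  # no character-to-Unicode translation table
            print(f"page {page_num}: font {name} has no /ToUnicode mapping")
```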

So, while we may be unable to correct the poorly extracted text, we can do a post-conversion cleanup to strip the raw character IDs that the converters emit as fallback output, preventing these artifacts from affecting downstream tasks (although this could result in some data loss, depending on how often the converter falls back). One approach could be to use the DocumentCleaner component’s remove_substrings or remove_regex options to clean the unwanted patterns.
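A hedged sketch of that cleanup, assuming Haystack 2.x’s DocumentCleaner and its remove_regexps parameter:

```python
from haystack import Document
from haystack.components.preprocessors import DocumentCleaner

# Strip "(cid:NN)" artifacts left behind by converters that cannot map glyphs.
cleaner = DocumentCleaner(remove_regexps=[r"\(cid:\d+\)"])

docs = [Document(content="Readable text (cid:72)(cid:101) more readable text")]
result = cleaner.run(documents=docs)
print(result["documents"][0].content)  # artifacts removed
```

Any leftover double spaces should also be collapsed by the component’s default remove_extra_whitespaces behavior.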
