Is your feature request related to a problem? Please describe.
PDFs with non-UTF-8 encoding (e.g., ANSI, cp1252) are not indexed correctly in Haystack’s document pipeline. This results in missing text, corrupted characters (e.g., (cid:xx) artifacts), or unusable embeddings. I am requesting an enhancement that adds automatic encoding detection and conversion to Haystack’s PDF parsing components, along with explicit encoding selection options.
Describe the solution you'd like
Enhance the PDF parsing components by:

- Auto-detecting encoding before indexing using libraries like chardet or cchardet.
- Providing an explicit encoding parameter (e.g., encoding="utf-8" or encoding="auto") in PDFToTextConverter, PDFPlumberConverter, and PyMuPDFConverter.
- Converting extracted text to UTF-8 before it is passed to the embedding pipeline (a rough sketch of this flow follows the list).
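The sketch below is hypothetical and not an existing Haystack API; it only illustrates what an encoding="auto" option could do under the hood. The decode_to_utf8 helper and the cp1252 fallback are assumptions, and it presumes the converter can hand over raw text bytes and that chardet is installed.

```python
# Hypothetical sketch only: not an existing Haystack API. It assumes raw text
# bytes are available and uses chardet to guess the encoding before
# normalising everything to UTF-8 ahead of the embedding pipeline.
import chardet


def decode_to_utf8(raw: bytes, encoding: str = "auto") -> str:
    """Decode raw bytes to a Python str, guessing the encoding when asked."""
    if encoding == "auto":
        detected = chardet.detect(raw)
        # Fall back to cp1252 when detection is inconclusive (an assumption).
        encoding = detected["encoding"] or "cp1252"
    # errors="replace" keeps indexing going even on undecodable byte sequences.
    return raw.decode(encoding, errors="replace")


# Illustrative use with bytes that came from an ANSI/cp1252 source.
sample = "Résumé – Präzision".encode("cp1252")
print(decode_to_utf8(sample))  # detection quality depends on the input; illustrative only
```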
From my research into how both PyPDF and PDFMiner handle text extraction for #8491, I’ve found that the presence of cid:x values often signals that the PDF itself is missing necessary character-to-Unicode mappings. This tends to happen when the fonts or character encodings in the PDF are incomplete or poorly defined. In this case the issue might not be with the extraction itself but with the underlying PDF structure.
For PDFMiner, the cid:x error usually happens because it defaults to showing raw character IDs when it cannot map characters to Unicode. This happens when the PDF uses fonts with no corresponding Unicode mapping (a common occurrence with custom or embedded fonts). If you open the PDF in a viewer, try copying the text and pasting it into a text editor. If it results in gibberish, that usually confirms that the issue lies within the PDF's encoding itself.
Similarly, with PyPDF, fonts that are not extracted properly are often missing a translation table (such as the /ToUnicode field for embedded fonts), which makes it difficult to decode the characters correctly.
So, while we may be unable to recover the poorly extracted text, we can do a post-conversion cleanup that strips the raw character IDs the converters emit as fallback output, preventing these artifacts from affecting downstream tasks (although this could mean some data loss, depending on how often the converter falls back). One approach could be to use the DocumentCleaner component's remove_substrings or remove_regex options to remove the unwanted patterns, as sketched below.
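A minimal sketch of that cleanup, assuming a Haystack 2.x pipeline and that the artifacts appear literally as "(cid:NN)" in the extracted text; the regex and the sample content are assumptions for illustration.

```python
# Sketch of the suggested post-conversion cleanup with DocumentCleaner.
# The regex assumes the artifacts appear literally as "(cid:NN)" in the text.
from haystack import Document
from haystack.components.preprocessors import DocumentCleaner

cleaner = DocumentCleaner(remove_regex=r"\(cid:\d+\)")

docs = [Document(content="Revenue grew (cid:49)(cid:48)% year over year.")]
result = cleaner.run(documents=docs)
print(result["documents"][0].content)  # the (cid:NN) artifacts are stripped
```

Placing this cleaner right after the converter in the indexing pipeline would keep the artifacts out of the embedder, at the cost of losing whatever characters the converter could not map.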