'startxref' keyword not found
General-Purpose PDF Library for Java and .NET
Can't extract text from the attached PDF
Error:
EXCEPTION: org.pdfclown.util.parsers.PostScriptParseException: 'startxref' keyword not found.
   at org.pdfclown.tokens.FileParser.RetrieveXRefOffset()
   at org.pdfclown.tokens.Reader.ReadInfo()
   at org.pdfclown.files.File..ctor(IInputStream stream)
   at org.pdfclown.files.File..ctor(String path)
   at Digitaldoc.WebAPI.Services.Extractors.PdfToText.Extract()
Hi,
I examined your PDF file and found that it is corrupted by alien data at its tail (apparently part of an HTML file).
The behavior of PDF Clown conforms to the PDF 1.7 specification, which states in Appendix H (implementation note 18) that "Acrobat viewers require only that the %%EOF marker appear somewhere within the last 1024 bytes of the file". Because the alien data in your file is larger than 1 KB, the %%EOF marker, and the 'startxref' keyword that precedes it, fall outside the range the parser scans, so parsing fails.
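To make the failure mode concrete, here is a minimal sketch of a strict tail scan, assuming the parser only looks at the last 1024 bytes. The class and method names (StrictTailScan, findInTail) are hypothetical and are not PDF Clown's actual code.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration only; not part of PDF Clown. */
final class StrictTailScan {
    /** Window size taken from the specification note quoted above. */
    static final int TAIL_WINDOW = 1024;

    /**
     * Returns the index of the last occurrence of the keyword that starts
     * within the last TAIL_WINDOW bytes of the file, or -1 if there is none.
     */
    static int findInTail(byte[] file, String keyword) {
        byte[] key = keyword.getBytes(StandardCharsets.US_ASCII);
        int windowStart = Math.max(0, file.length - TAIL_WINDOW);
        // Walk backwards so the occurrence closest to the end of the file wins.
        for (int i = file.length - key.length; i >= windowStart; i--) {
            if (matchesAt(file, key, i)) {
                return i;
            }
        }
        return -1;
    }

    private static boolean matchesAt(byte[] data, byte[] key, int pos) {
        for (int j = 0; j < key.length; j++) {
            if (data[pos + j] != key[j]) {
                return false;
            }
        }
        return true;
    }
}
```

With a well-formed file, findInTail(bytes, "startxref") returns the keyword position. Once more than 1 KB of unrelated bytes follows the %%EOF marker, the keyword lies before the window and the method returns -1, which corresponds to the "keyword not found" error in this report.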
Looking at how other readers behave, I noted that recent versions of Acrobat and Ghostscript are quite relaxed and accept your file without complaint; others, like the Poppler-based Evince, report that the xref table is corrupted and activate a recovery algorithm.
I'm going to introduce a more tolerant check right away so that PDF Clown accepts such abnormal files.
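One way such a tolerant check could look is sketched below. This is an assumption for illustration, not the change that was actually committed: it tries the 1024-byte window first and, if that fails, falls back to scanning the whole file backwards for the last 'startxref'. The names TolerantXRefLocator and retrieveXRefOffset are hypothetical.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch; names are illustrative and not PDF Clown's API. */
final class TolerantXRefLocator {
    private static final byte[] KEYWORD = "startxref".getBytes(StandardCharsets.US_ASCII);
    private static final int TAIL_WINDOW = 1024;

    /** Returns the number written after the last 'startxref', or -1 if none is found. */
    static long retrieveXRefOffset(byte[] file) {
        // First pass: only the window described by the specification note.
        int pos = lastIndexOfKeyword(file, Math.max(0, file.length - TAIL_WINDOW));
        // Fallback: trailing garbage may have pushed the keyword out of the window.
        if (pos < 0) {
            pos = lastIndexOfKeyword(file, 0);
        }
        if (pos < 0) {
            return -1;
        }
        int i = pos + KEYWORD.length;
        // Skip the end-of-line bytes between the keyword and the offset.
        while (i < file.length && isWhitespace(file[i])) {
            i++;
        }
        long offset = 0;
        boolean sawDigit = false;
        // Read the decimal offset; overflow handling is omitted in this sketch.
        while (i < file.length && file[i] >= '0' && file[i] <= '9') {
            offset = offset * 10 + (file[i] - '0');
            sawDigit = true;
            i++;
        }
        return sawDigit ? offset : -1;
    }

    /** Last position >= from where the keyword starts, or -1. */
    private static int lastIndexOfKeyword(byte[] data, int from) {
        for (int i = data.length - KEYWORD.length; i >= from; i--) {
            boolean match = true;
            for (int j = 0; j < KEYWORD.length && match; j++) {
                match = data[i + j] == KEYWORD[j];
            }
            if (match) {
                return i;
            }
        }
        return -1;
    }

    private static boolean isWhitespace(byte b) {
        return b == ' ' || b == '\t' || b == '\r' || b == '\n' || b == '\f' || b == 0;
    }
}
```

A real implementation would probably also verify that the returned offset actually points at a cross-reference section, since trailing junk could in principle contain the keyword itself.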
thank you
Stefano
Last edit: Stefano Chizzolini 2015-05-21
Corrupted files with alien data in the trailing section were not supported.
Fixed on 0.1.2-Fix branch (rev 212) and 0.2.0 trunk (rev 213).
thank you