#69 'startxref' keyword not found

Milestone: 0.1.2.1

Status: closed-fixed

Owner: Stefano Chizzolini

Labels: None

Priority: 5

Updated: 2015-05-21

Created: 2015-05-14

Creator: Willyan Klumb

Private: No

Can't extract text from the attached PDF

Error:

EXCEPTION: org.pdfclown.util.parsers.PostScriptParseException: 'startxref' keyword not found.
   at org.pdfclown.tokens.FileParser.RetrieveXRefOffset()
   at org.pdfclown.tokens.Reader.ReadInfo()
   at org.pdfclown.files.File..ctor(IInputStream stream)
   at org.pdfclown.files.File..ctor(String path)
   at Digitaldoc.WebAPI.Services.Extractors.PdfToText.Extract()

1 Attachment

startxref keyword not found.pdf

Discussion

Stefano Chizzolini - 2015-05-21

Hi,

I examined your PDF file and found that it is corrupted by alien data at its tail (apparently part of an HTML file):

startxref 2751046 %%EOF <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> <html> <head> <title>lapsi.latinoware.org</title> <meta http-equiv="Pragma" content="no-cache" /> <meta http-equiv="charset" content="UTF-8" /> <link rel="stylesheet" type="text/css" href="/assets/3d252854/themes/custom-theme/jquery-ui.custom.css" /> <link rel="stylesheet" type="text/css" href="/assets/3d252854/jquery.jgrowl.css" /> <link rel="stylesheet" type="text/css" href="/themes/2010/common/Stylesheets.css" /> <script type="text/javascript" src="/assets/3d252854/jquery.min.js"></script> <script type="text/javascript" src="/assets/3d252854/initJQuery.js"></script> <script type="text/javascript" src="/assets/3d252854/jquery-ui.custom.min.js"></script> [ ** MORE ALIEN DATA HERE ** ]

The behavior of PDF Clown conforms to the PDF 1.7 specification, which states at point 18, Appendix H, that "Acrobat viewers require only that the %%EOF marker appear somewhere within the last 1024 bytes of the file". Because the alien data in your file exceeds 1 KB, the %%EOF marker (and the 'startxref' keyword just before it) lies outside the region the parser legitimately scans, so parsing fails.
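To illustrate why the trailing data defeats a strict parser, here is a minimal Java sketch of a tail scan limited to the last 1024 bytes. It is not PDF Clown's actual code: the class name `StrictTailScan`, the method `findStartXrefOffset` and the constant `TAIL_WINDOW` are hypothetical names chosen for the example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not PDF Clown's implementation.
final class StrictTailScan {
    // Window size taken from the implementation note quoted above.
    static final int TAIL_WINDOW = 1024;

    /**
     * Returns the number that follows the last 'startxref' keyword found in the
     * final TAIL_WINDOW bytes, or -1 when no such keyword and number are there.
     */
    static long findStartXrefOffset(byte[] file) {
        int from = Math.max(0, file.length - TAIL_WINDOW);
        // ISO-8859-1 maps one byte to one char, so char indexes equal byte indexes.
        String tail = new String(file, from, file.length - from, StandardCharsets.ISO_8859_1);
        int keyword = tail.lastIndexOf("startxref");
        if (keyword < 0) {
            return -1; // the case reported in this ticket
        }
        int i = keyword + "startxref".length();
        // Skip the whitespace between the keyword and the offset.
        while (i < tail.length() && Character.isWhitespace(tail.charAt(i))) {
            i++;
        }
        int start = i;
        while (i < tail.length() && tail.charAt(i) >= '0' && tail.charAt(i) <= '9') {
            i++;
        }
        return i > start ? Long.parseLong(tail.substring(start, i)) : -1;
    }
}
```

With a well-formed file the method returns the cross-reference offset; once more than 1024 bytes of foreign data follow %%EOF, the keyword is no longer inside the window and the method returns -1.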

Investigating the behavior of other readers, I noted that recent versions of Acrobat and Ghostscript are fairly lenient and accept your file without complaint; others, such as the Poppler-based Evince, report that the xref table is corrupted and activate a recovery algorithm.

I'm going to immediately introduce a more tolerant check so that PDF Clown accepts such abnormal files.
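One way such a tolerant check could work is to stop limiting the search to the last 1024 bytes and instead look backward through the whole file for the last 'startxref' keyword that is followed by a number. The Java sketch below only illustrates that idea; the ticket does not describe how the fix was actually implemented, and the names `LenientTailScan`, `findStartXrefOffset` and `parseOffset` are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a lenient lookup, not the actual PDF Clown fix.
final class LenientTailScan {
    private static final String KEYWORD = "startxref";

    /**
     * Searches the whole file backward for a 'startxref' keyword followed by a
     * number and returns that number, or -1 when none exists.
     */
    static long findStartXrefOffset(byte[] file) {
        // ISO-8859-1 maps one byte to one char; fine for a sketch, although a
        // real parser would scan the stream in chunks instead of copying it.
        String text = new String(file, StandardCharsets.ISO_8859_1);
        int keyword = text.lastIndexOf(KEYWORD);
        while (keyword >= 0) {
            long offset = parseOffset(text, keyword + KEYWORD.length());
            if (offset >= 0) {
                return offset;
            }
            // Try the previous occurrence; a negative start index yields -1.
            keyword = text.lastIndexOf(KEYWORD, keyword - 1);
        }
        return -1;
    }

    // Parses the run of decimal digits that follows optional whitespace at 'from'.
    private static long parseOffset(String text, int from) {
        int i = from;
        while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
            i++;
        }
        int start = i;
        while (i < text.length() && text.charAt(i) >= '0' && text.charAt(i) <= '9') {
            i++;
        }
        return i > start ? Long.parseLong(text.substring(start, i)) : -1;
    }
}
```

A production version would probably try the strict 1024-byte window first and fall back to the wider search only on failure, so that conformant files pay no extra cost.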

thank you
Stefano

Last edit: Stefano Chizzolini 2015-05-21

Stefano Chizzolini - 2015-05-21

Corrupted files with alien data in the trailing section were not supported.

Fixed on 0.1.2-Fix branch (rev 212) and 0.2.0 trunk (rev 213).

thank you


Stefano Chizzolini - 2015-05-21

status: open --> closed-fixed
