#69 'startxref' keyword not found

0.1.2.1
closed-fixed
None
5
2015-05-21
2015-05-14
No

Can't extract text from the attached PDF

Error:

EXCEPTION: org.pdfclown.util.parsers.PostScriptParseException: 'startxref' keyword not found.
   at org.pdfclown.tokens.FileParser.RetrieveXRefOffset()
   at org.pdfclown.tokens.Reader.ReadInfo()
   at org.pdfclown.files.File..ctor(IInputStream stream)
   at org.pdfclown.files.File..ctor(String path)
   at Digitaldoc.WebAPI.Services.Extractors.PdfToText.Extract()
1 Attachments

Discussion

  • Stefano Chizzolini

    Hi,

    I examined your PDF file and found that it is corrupted by alien data in its tail (apparently part of an HTML file):

    startxref
    2751046
    %%EOF
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html>
    <head>
    <title>lapsi.latinoware.org</title>
    <meta http-equiv="Pragma" content="no-cache" />
    <meta http-equiv="charset" content="UTF-8" />
    <link rel="stylesheet" type="text/css" href="/assets/3d252854/themes/custom-theme/jquery-ui.custom.css" />
    <link rel="stylesheet" type="text/css" href="/assets/3d252854/jquery.jgrowl.css" />
    <link rel="stylesheet" type="text/css" href="/themes/2010/common/Stylesheets.css" />
    <script type="text/javascript" src="/assets/3d252854/jquery.min.js"></script>
    <script type="text/javascript" src="/assets/3d252854/initJQuery.js"></script>
    <script type="text/javascript" src="/assets/3d252854/jquery-ui.custom.min.js"></script>
    [ ** MORE ALIEN DATA HERE ** ]
    

    PDF Clown's behavior conforms to the PDF 1.7 specification, which states at point 18 of Appendix H that "Acrobat viewers require only that the %%EOF marker appear somewhere within the last 1024 bytes of the file". Because the alien data in your file exceeds 1 KB, the %%EOF marker falls outside the range the parser inspects, so parsing fails.
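To make the 1024-byte rule concrete, here is a minimal Python sketch of a strict tail lookup. It is not PDF Clown's code: the function name `find_startxref_strict` and the constant `TAIL_WINDOW` are hypothetical, and the error messages only imitate the one in the report.

```python
TAIL_WINDOW = 1024  # bytes inspected at the end of the file (assumed from the note quoted above)


def find_startxref_strict(data: bytes, window: int = TAIL_WINDOW) -> int:
    """Return the xref offset, looking for the markers only in the last `window` bytes."""
    tail = data[-window:]
    eof = tail.rfind(b"%%EOF")
    if eof < 0:
        raise ValueError("'%%EOF' marker not found in the file tail")
    # 'startxref' must precede the %%EOF marker that was just located.
    keyword = tail.rfind(b"startxref", 0, eof)
    if keyword < 0:
        raise ValueError("'startxref' keyword not found")
    fields = tail[keyword + len(b"startxref"):eof].split()
    if not fields:
        raise ValueError("xref offset missing after 'startxref'")
    return int(fields[0])
```

With a well-formed tail such as `startxref\n2751046\n%%EOF\n` the function returns the offset; once more than 1024 bytes of unrelated data follow `%%EOF`, both markers fall outside the inspected window and the lookup raises, which matches the failure described in this ticket.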

    Looking at other readers, I noted that recent versions of Acrobat and Ghostscript are quite relaxed and accept your file without complaint; others, like the Poppler-based Evince, report that the xref table is corrupted and activate a recovery algorithm.

    I'm going to immediately introduce a resilient condition to let PDF Clown accept such abnormal files.
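The ticket does not show the actual change, so the following Python sketch is only one hypothetical way such a resilient condition could look: try the last 1024 bytes first, then fall back to the last `%%EOF` anywhere in the file. The name `find_startxref_relaxed` and its parameters are assumptions for illustration, not PDF Clown's API.

```python
def find_startxref_relaxed(data: bytes, window: int = 1024) -> int:
    """Return the xref offset, tolerating trailing garbage after %%EOF."""
    # First attempt: the marker lies within the last `window` bytes.
    eof = data.rfind(b"%%EOF", max(0, len(data) - window))
    if eof < 0:
        # Fallback (assumed behavior): accept the last %%EOF anywhere in the file.
        eof = data.rfind(b"%%EOF")
    if eof < 0:
        raise ValueError("'%%EOF' marker not found")
    # 'startxref' must precede the chosen %%EOF marker.
    keyword = data.rfind(b"startxref", 0, eof)
    if keyword < 0:
        raise ValueError("'startxref' keyword not found")
    fields = data[keyword + len(b"startxref"):eof].split()
    if not fields:
        raise ValueError("xref offset missing after 'startxref'")
    return int(fields[0])
```

A fallback like this would be fooled if the trailing garbage itself contained the bytes `%%EOF`, so a real parser would likely also check that the recovered offset points at a valid cross-reference section.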

    thank you
    Stefano

     

    Last edit: Stefano Chizzolini 2015-05-21
  • Stefano Chizzolini

    Corrupted files with alien data in the trailing section were not supported.

    Fixed on 0.1.2-Fix branch (rev 212) and 0.2.0 trunk (rev 213).

    thank you

     
  • Stefano Chizzolini

    • status: open --> closed-fixed
     
