JHove diagnosing error messages from the PDF-hul module

2014-07-10
2014-07-11
  • Larry Baker
    2014-07-10

    I have a PDF-1.1 file that was originally written by a plotting program I wrote. Adobe Reader and Acrobat complain and will not read the file; Mac OS X Preview reads the file just fine. By default, JHove categorizes it as an ASCII text file. I found Edit->Select module->PDF-hul and selected it.

    The PDF-hul module says

    Messages->ErrorMessage: Invalid page tree node, Offset: 566448
    

    File offset 566448 is at the boundary between PDF objects 7 and 31, below:

    7 0 obj
    <<
       /Type /Pages
       /Count 8
       /Parent 32
       /Kids [ 6 0 R 10 0 R 13 0 R 16 0 R 19 0 R 22 0 R 25 0 R 28 0 R ]
    >>
    endobj
    31 0 obj
    <<
       /Type /Page
       /Parent 33 0 R
       /MediaBox [ -18 -18 554 562 ]
       /Resources <<
          /ProcSet [ /PDF /Text ]
          /Font <<
             /Helvetica 5 0 R
          >>
       >>
       /Contents 29 0 R
    >>
    endobj
    

    I assume the error is found after object 7 is read, which is a page tree node.
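
    For readers who want to look at the bytes on either side of a reported offset, the following is a minimal sketch, not part of JHOVE. The helper name `context_around` and the sample bytes are hypothetical stand-ins; the real file is not reproduced here.

```python
import os
import tempfile


def context_around(path, offset, before=80, after=80):
    """Return (bytes just before offset, bytes starting at offset)."""
    start = max(0, offset - before)
    with open(path, "rb") as f:
        f.seek(start)
        data = f.read(offset - start + after)
    split = offset - start
    return data[:split], data[split:]


# Hypothetical stand-in for the real file: two adjacent objects.
sample = (
    b"7 0 obj\n<< /Type /Pages >>\nendobj\n"
    b"31 0 obj\n<< /Type /Page >>\nendobj\n"
)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample)

# In the real case the offset would be the one reported by PDF-hul.
offset = sample.index(b"31 0 obj")
head, tail = context_around(tmp.name, offset, before=7, after=8)
os.remove(tmp.name)

print(head)  # what ends just before the offset
print(tail)  # what begins at the offset
```

    Reading in binary mode matters here, since PDF offsets are byte offsets and text-mode decoding could shift them.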

    Page tree nodes are pretty simple, especially in PDF-1.1. I do not see why PDF-hul thinks there is a problem.

    The file has 10 pages. The top level of the page tree has /Count 10 and two /Kids. The first /Kids is object 7. Object 7 has eight pages; there is another object in the page tree that contains the other two. I'm a bit suspicious because I put up to eight children in a page tree node. Adobe's PDF V1.0 reference manual says their products don't put more than six children in a page tree node. Supposedly, the number of children does not matter.
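
    The /Count bookkeeping described above can be checked mechanically. This is a minimal sketch that models the page tree as a plain dictionary; the object numbers are the ones that appear in this post's listings, and the `tree` layout and `count_leaves` helper are illustrative assumptions, not any library's API.

```python
# Intermediate page tree nodes, keyed by object number.
# Any object number that is not a key here is treated as a leaf page.
tree = {
    32: {"count": 10, "kids": [7, 33]},
    7: {"count": 8, "kids": [6, 10, 13, 16, 19, 22, 25, 28]},
    33: {"count": 2, "kids": [31, 36]},
}


def count_leaves(tree, num):
    """Count the leaf pages reachable from object `num`."""
    node = tree.get(num)
    if node is None:
        return 1  # a leaf /Page object
    return sum(count_leaves(tree, kid) for kid in node["kids"])


# A node is inconsistent when its /Count differs from its leaf total.
mismatches = [
    num for num, node in tree.items()
    if node["count"] != count_leaves(tree, num)
]
print(mismatches)
```

    An empty list means every /Count agrees with the number of leaf pages beneath it, so the eight-children node by itself is not a counting error.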

    I also wrote tools to parse and pretty-print PDF files. Here's what the object tree (structure only; actual page contents are not printed) looks like for this file, printed in the order the tree would be walked:

    $ pdf_object_tree fpsli2570.pdf 
    trailer => <<
       /Size 37
       /Root 2 0 R
       /Info 1 0 R
    >>
    2 0 R => <<
       /Type /Catalog
       /Pages 32 0 R
    >>
    32 0 R => <<
       /Type /Pages
       /Count 10
       /Kids [ 7 0 R 33 0 R ]
    >>
    7 0 R => <<
       /Type /Pages
       /Count 8
       /Parent 32
       /Kids [ 6 0 R 10 0 R 13 0 R 16 0 R 19 0 R 22 0 R 25 0 R 28 0 R ]
    >>
    6 0 R => <<
       /Type /Page
       /Parent 7 0 R
       /MediaBox [ -18 -18 554 562 ]
       /Resources << /ProcSet [ /PDF /Text ] /Font << /Helvetica 5 0 R >> >>
       /Contents 3 0 R
    >>
    5 0 R => <<
       /Type /Font
       /Subtype /Type1
       /BaseFont /Helvetica
    >>
    3 0 R => <<
       /Length 4 0 R
    >>
    4 0 R => 62484
    10 0 R => <<
       /Type /Page
       /Parent 7 0 R
       /MediaBox [ -18 -18 554 562 ]
       /Resources << /ProcSet [ /PDF /Text ] /Font << /Helvetica 5 0 R >> >>
       /Contents 8 0 R
    >>
    8 0 R => <<
       /Length 9 0 R
    >>
    9 0 R => 63411
    13 0 R => <<
       /Type /Page
       /Parent 7 0 R
       /MediaBox [ -18 -18 554 562 ]
       /Resources << /ProcSet [ /PDF /Text ] /Font << /Helvetica 5 0 R >> >>
       /Contents 11 0 R
    >>
    11 0 R => <<
       /Length 12 0 R
    >>
    12 0 R => 56707
    16 0 R => <<
       /Type /Page
       /Parent 7 0 R
       /MediaBox [ -18 -18 554 562 ]
       /Resources << /ProcSet [ /PDF /Text ] /Font << /Helvetica 5 0 R >> >>
       /Contents 14 0 R
    >>
    14 0 R => <<
       /Length 15 0 R
    >>
    15 0 R => 61193
    19 0 R => <<
       /Type /Page
       /Parent 7 0 R
       /MediaBox [ -18 -18 554 562 ]
       /Resources << /ProcSet [ /PDF /Text ] /Font << /Helvetica 5 0 R >> >>
       /Contents 17 0 R
    >>
    17 0 R => <<
       /Length 18 0 R
    >>
    18 0 R => 59732
    22 0 R => <<
       /Type /Page
       /Parent 7 0 R
       /MediaBox [ -18 -18 554 562 ]
       /Resources << /ProcSet [ /PDF /Text ] /Font << /Helvetica 5 0 R >> >>
       /Contents 20 0 R
    >>
    20 0 R => <<
       /Length 21 0 R
    >>
    21 0 R => 63397
    25 0 R => <<
       /Type /Page
       /Parent 7 0 R
       /MediaBox [ -18 -18 554 562 ]
       /Resources << /ProcSet [ /PDF /Text ] /Font << /Helvetica 5 0 R >> >>
       /Contents 23 0 R
    >>
    23 0 R => <<
       /Length 24 0 R
    >>
    24 0 R => 66679
    28 0 R => <<
       /Type /Page
       /Parent 7 0 R
       /MediaBox [ -18 -18 554 562 ]
       /Resources << /ProcSet [ /PDF /Text ] /Font << /Helvetica 5 0 R >> >>
       /Contents 26 0 R
    >>
    26 0 R => <<
       /Length 27 0 R
    >>
    27 0 R => 66035
    33 0 R => <<
       /Type /Pages
       /Count 2
       /Parent 32
       /Kids [ 31 0 R 36 0 R ]
    >>
    31 0 R => <<
       /Type /Page
       /Parent 33 0 R
       /MediaBox [ -18 -18 554 562 ]
       /Resources << /ProcSet [ /PDF /Text ] /Font << /Helvetica 5 0 R >> >>
       /Contents 29 0 R
    >>
    29 0 R => <<
       /Length 30 0 R
    >>
    30 0 R => 62585
    36 0 R => <<
       /Type /Page
       /Parent 33 0 R
       /MediaBox [ -18 -18 554 562 ]
       /Resources << /ProcSet [ /PDF /Text ] /Font << /Helvetica 5 0 R >> >>
       /Contents 34 0 R
    >>
    34 0 R => <<
       /Length 35 0 R
    >>
    35 0 R => 54164
    1 0 R => <<
       /Type /Info
       /Creator (USGS Viewer.PostScript V3.15)
       /Author (SAMOA::KLEIN \(SEIS [333,1]\))
       /Title (PUB:[KLEIN.KIHOLO]POSTSCRIPT.PDF;1)
       /CreationDate (D:20110317101434)
       /ModDate (D:20110317101434)
    >>
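
    One detail visible in the listing: the /Parent entries of objects 7 and 33 print as the bare integer 32, while the leaf pages use the indirect form (for example 7 0 R). Later editions of the PDF Reference describe /Parent as an indirect reference. Whether this is what PDF-hul or Adobe Reader objects to is an assumption, not something confirmed in this thread. A minimal sketch of a check over a listing in the format shown above (the function name and the abbreviated sample are hypothetical):

```python
import re

# Matches an indirect reference such as "32 0 R".
INDIRECT_REF = re.compile(r"^\d+\s+\d+\s+R$")


def non_reference_parents(listing):
    """Return (object number, /Parent value) pairs where the value
    is not written as an indirect reference."""
    current = None
    problems = []
    for line in listing.splitlines():
        header = re.match(r"\s*(\d+)\s+(\d+)\s+R\s+=>", line)
        if header:
            current = int(header.group(1))
            continue
        parent = re.match(r"\s*/Parent\s+(.*\S)\s*$", line)
        if parent and not INDIRECT_REF.match(parent.group(1)):
            problems.append((current, parent.group(1)))
    return problems


# Abbreviated transcription of the listing above.
listing = """\
32 0 R => <<
   /Type /Pages
   /Count 10
   /Kids [ 7 0 R 33 0 R ]
>>
7 0 R => <<
   /Type /Pages
   /Count 8
   /Parent 32
>>
6 0 R => <<
   /Type /Page
   /Parent 7 0 R
>>
33 0 R => <<
   /Type /Pages
   /Count 2
   /Parent 32
>>
"""

problems = non_reference_parents(listing)
print(problems)
```

    On this sample the check flags objects 7 and 33, the two intermediate page tree nodes.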
    

    I must be missing something obvious. My problem is that neither Adobe Reader, Acrobat, nor JHove's PDF-hul module tells me what it doesn't like. I just want to know why.

    If anyone can help, I would appreciate it.

    Thank you,

    Larry Baker
    US Geological Survey

    P.S. The user that sent me this file cannot recreate it. I read it on my Mac with Preview and did a Save As... to PDF, which rewrote the file in PDF-1.3, which made Adobe Reader happy.

     
  • Gary McGath
    2014-07-10

    The PDF module has a history of bugs relating to page trees, and this could be one more. If other software doesn't complain, I'd be inclined to call this a JHOVE bug.

    JHOVE is ten years old, and I'm only occasionally playing with the code. There was talk about the Open Planets Foundation's picking it up, but I haven't gotten any responses on that lately. The person who was interested may have moved to another job.

    Incidentally, current active work on JHOVE (such as it is) is on GitHub, in case anyone wants to pick up on this.

     
  • Larry Baker
    2014-07-11

    Gary,

    Thanks. My problem is that Adobe Reader and Adobe Acrobat won't read the file I've got, and they won't tell me why. Unfortunately, JHOVE doesn't say what it doesn't like either. Trouble is, this file is several years old and it would be very difficult for the scientist who wants to read it to recreate it. Luckily, Apple's Preview on my desktop Mac read it just fine.

    This is a PDF-1.1 file. I wrote the plotting library he used that wrote it. PDF-1.1 is from 1995; it is really simple. I just don't see what's wrong. Adobe has a bad habit of ignoring their own published specs. If I knew what would make Reader happy, I would hack my code to get past this. It remains an academic question now. I don't like it when my stuff doesn't work, even if it is not my fault.

    Larry Baker
    US Geological Survey