#32 PDF module error with TeX-created documents

Milestone: None
Status: pending-fixed
Owner: Gary McGath
Labels: Modules (29)
Priority: 5
Updated: 2013-11-02
Created: 2012-02-28
Creator: Gary McGath
Private: No

User Chris Yocum reports:

Anyway, here is the output that I am getting. You can try this on any TeX generated document and it should give you the same results.

java.lang.ClassCastException: edu.harvard.hul.ois.jhove.module.pdf.PdfSimpleObject cannot be cast to edu.harvard.hul.ois.jhove.module.pdf.PdfDictionary
at edu.harvard.hul.ois.jhove.module.PdfModule.readDocCatalogDict(Unknown Source)
at edu.harvard.hul.ois.jhove.module.PdfModule.parse(Unknown Source)
at edu.harvard.hul.ois.jhove.JhoveBase.processFile(Unknown Source)
at edu.harvard.hul.ois.jhove.JhoveBase.process(Unknown Source)
at edu.harvard.hul.ois.jhove.JhoveBase.dispatch(Unknown Source)
at Jhove.main(Unknown Source)

Discussion

  • Gary McGath
    2012-09-05

    Could you attach a file that exhibits this problem?

     
  • Gary McGath
    2012-11-09

    Does this work for you with JHOVE 1.8?

     
  • Gary McGath
    2012-11-09

    • status: open --> pending
     
  • Thomas Fischer
    2013-03-04

    I can confirm this bug, although my file was produced by Acrobat Distiller rather than TeX. The file is attached.
    Here is my complete output:

    Jhove (Rel. 1.9, 2012-12-17)
     Date: 2013-03-04 13:59:26 CET
     RepresentationInformation: b6c99639fc62e6a7430b78f6d8494931_http___www_bolagsverket_se_polopoly_fs_1_5530__Menu_general_column_content_file_p25_personinformation.pdf
      ReportingModule: PDF-hul, Rel. 1.7 (2012-08-12)
      LastModified: 2013-01-04 12:22:13 CET
      Size: 80219
      Format: PDF
      Version: 1.6
      Status: Not well-formed
      SignatureMatches:
       PDF-hul
      ErrorMessage: Unexpected error in findFonts: java.lang.ClassCastException: edu.harvard.hul.ois.jhove.module.pdf.PdfSimpleObject cannot be cast to edu.harvard.hul.ois.jhove.module.pdf.PdfDictionary
       Offset: 1849
      MIMEtype: application/pdf
      PDFMetadata: 
       Objects: 0
       FreeObjects: 1
       IncrementalUpdates: 0
       DocumentCatalog: 
        PageLayout: SinglePage
        PageMode: UseNone
       Filters: 
        FilterPipeline: FlateDecode
       Fonts: 
        TrueType: 
         Font: 
          BaseFont: CBMFOF+Garamond
          FontSubset: true
          FirstChar: 32
          LastChar: 246
          FontDescriptor: 
           FontName: CBMFOF+Garamond
           Flags: Serif, Nonsymbolic
           FontBBox: -139, -307, 1063, 986
           FontFile2: true
          Encoding: WinAnsiEncoding
       XMP: <x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.2-c001 63.139439, 2010/09/27-13:37:26        ">
       <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
          <rdf:Description rdf:about=""
                xmlns:dc="http://purl.org/dc/elements/1.1/">
             <dc:format>application/pdf</dc:format>
             <dc:creator>
                <rdf:Seq>
                   <rdf:li>Bolagsverket</rdf:li>
                </rdf:Seq>
             </dc:creator>
             <dc:title>
                <rdf:Alt>
                   <rdf:li xml:lang="x-default">Produktbeskrivning P25_Personinformation</rdf:li>
                </rdf:Alt>
             </dc:title>
          </rdf:Description>
          <rdf:Description rdf:about=""
                xmlns:xmp="http://ns.adobe.com/xap/1.0/">
             <xmp:CreateDate>2008-10-13T15:55:07+02:00</xmp:CreateDate>
             <xmp:CreatorTool>PScript5.dll Version 5.2.2</xmp:CreatorTool>
             <xmp:ModifyDate>2012-08-17T15:56:07+02:00</xmp:ModifyDate>
             <xmp:MetadataDate>2012-08-17T15:56:07+02:00</xmp:MetadataDate>
          </rdf:Description>
          <rdf:Description rdf:about=""
                xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
             <pdf:Producer>Acrobat Distiller 8.1.0 (Windows)</pdf:Producer>
          </rdf:Description>
          <rdf:Description rdf:about=""
                xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/">
             <xmpMM:DocumentID>uuid:c90d60fd-280e-4af3-bf14-87f96badb896</xmpMM:DocumentID>
             <xmpMM:InstanceID>uuid:dde7d516-b11d-4d86-be2a-5cc56c489a1d</xmpMM:InstanceID>
          </rdf:Description>
       </rdf:RDF>
    </x:xmpmeta>
       Pages: 
        Page: 
         Label: 1
        Page: 
         Label: 2
        Page: 
         Label: 3
        Page: 
         Label: 4
        Page: 
         Label: 5
        Page: 
         Label: 6
        Page: 
         Label: 7
    
     
  • Gary McGath
    2013-03-04

    JHOVE is getting caught because it's seeing a keyword where it expects a font dictionary in a page node's resources. As far as I can tell from reading the spec, this is incorrect PDF. I've fixed it so that instead of throwing an exception it reports that it failed to see a font dictionary. This is in the checked-in PdfModule.java.

    This seems to imply that many TeX-generated PDFs are broken. If there's something I've missed and a keyword object is valid in this context, please let me know. At least now the error message is more to the point, and there won't be a stack dump.
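The change described above (report the problem instead of throwing) can be sketched roughly as follows. This is a minimal illustration, not the checked-in PdfModule.java code: the nested classes are hypothetical stand-ins that only borrow the names seen in the stack trace, and `checkFontResource` is an invented helper.

```java
// Sketch only: the nested classes are simplified stand-ins named after the
// classes in the stack trace; JHOVE's real classes have different members.
public class FontResourceCheck {

    static class PdfObject { }

    static class PdfSimpleObject extends PdfObject {
        final String value;
        PdfSimpleObject(String value) { this.value = value; }
    }

    static class PdfDictionary extends PdfObject { }

    /**
     * Hypothetical helper: returns null when the resource is a dictionary,
     * otherwise an error message the caller can report. An unchecked cast
     * such as (PdfDictionary) resource would throw ClassCastException here.
     */
    static String checkFontResource(PdfObject resource) {
        if (resource instanceof PdfDictionary) {
            return null; // safe to cast and read the font entries
        }
        String found = (resource == null)
                ? "null"
                : resource.getClass().getSimpleName();
        return "Expected font dictionary, found " + found;
    }

    public static void main(String[] args) {
        // A dictionary passes the check, so no message is produced.
        System.out.println(checkFontResource(new PdfDictionary()));
        // A keyword object yields a message instead of a stack dump.
        System.out.println(checkFontResource(new PdfSimpleObject("keyword")));
    }
}
```

The type is tested before the cast, so an unexpected object type becomes an ordinary error message rather than an uncaught exception.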

     
  • Gary McGath
    2013-03-04

    • status: pending --> pending-fixed
    • milestone: -->
     
  • Thomas Fischer
    2013-06-05

    The fix doesn't seem to cover all cases. I was able to create a PDF file using pdfLaTeX which recreates the crash in 1.10b2. The crash is triggered as soon as I include the MinionPro font (i.e. commenting out the MinionPro package makes JHOVE run OK):

    \documentclass{article}
    \usepackage[T1]{fontenc}
    \usepackage[utf8]{inputenc}
    \usepackage[lf]{MinionPro}

    \begin{document}
    ABC
    \end{document}

    The output looks like this:

    java.lang.ClassCastException: edu.harvard.hul.ois.jhove.module.pdf.PdfSimpleObject cannot be cast to edu.harvard.hul.ois.jhove.module.pdf.PdfDictionary
    at edu.harvard.hul.ois.jhove.module.PdfModule.readDocCatalogDict(Unknown Source)
    at edu.harvard.hul.ois.jhove.module.PdfModule.parse(Unknown Source)
    at edu.harvard.hul.ois.jhove.JhoveBase.processFile(Unknown Source)
    at edu.harvard.hul.ois.jhove.JhoveBase.process(Unknown Source)
    at edu.harvard.hul.ois.jhove.JhoveBase.dispatch(Unknown Source)
    at Jhove.main(Unknown Source)
    Jhove (Rel. 1.9, 2013-05-28)
    Date: 2013-06-05 10:08:04 CEST
    RepresentationInformation: /tmp/test.pdf
    ReportingModule: PDF-hul, Rel. 1.7 (2012-08-12)
    LastModified: 2013-06-05 10:00:09 CEST
    Size: 42554
    Format: PDF
    Status: Not well-formed
    SignatureMatches:
    PDF-hul
    ErrorMessage: No document catalog dictionary
    Offset: 0
    MIMEtype: application/pdf

    BTW, both the version from CVS and the tar-ball report version number 1.9 instead of 1.10b2 or something else.

     
  • Gary McGath
    2013-06-05

    Re Thomas Fischer: I'm not getting a crash, and it looks from the output you've posted as if JHOVE is in fact running to completion after writing out a stack dump. However, either JHOVE isn't processing the file properly, or the file is broken and Acrobat is able to open it anyway. (This may hinge on fine points of what "broken" means.) I'm seeing that in trying to read the document catalog dictionary, JHOVE is instead getting a keyword of "rstChar". This is most likely a fragment of a "FirstChar" keyword.

    There is legitimately a bug, but I'm afraid it will have to stay open for version 1.10. Hopefully I or someone else will find a fix for it later.
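To illustrate how a fragment like "rstChar" can arise, here is a toy sketch in which a reader starts a few bytes past the beginning of a name. It does not diagnose the actual cause in JHOVE. The `readToken` helper and the sample bytes are made up for the example and are not JHOVE's tokenizer.

```java
import java.nio.charset.StandardCharsets;

// Toy example: shows that starting to read a few bytes past the start of
// "/FirstChar" yields the fragment "rstChar". Not JHOVE code.
public class OffsetFragmentDemo {

    /** Collects bytes from the given offset up to the next whitespace. */
    static String readToken(byte[] data, int offset) {
        StringBuilder sb = new StringBuilder();
        for (int i = offset; i < data.length; i++) {
            char c = (char) (data[i] & 0xFF);
            if (Character.isWhitespace(c)) {
                break;
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up fragment of a font dictionary body.
        byte[] data = "/FirstChar 32 /LastChar 246"
                .getBytes(StandardCharsets.US_ASCII);

        // Reading at the correct offset gives the whole name.
        System.out.println(readToken(data, 0)); // prints "/FirstChar"

        // Reading three bytes too late gives a bare fragment, which a
        // parser would see as a keyword rather than a name or dictionary.
        System.out.println(readToken(data, 3)); // prints "rstChar"
    }
}
```

If something like this is happening, the offset used to locate the catalog object would be a natural place to look, though that is only a guess from the symptom.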

     
  • Denis Bitouzé
    2013-11-02

    > There is legitimately a bug, but I'm afraid it will have to stay open for version 1.10. Hopefully I or someone else will find a fix for it later.

    Hi,

    Is this bug still present in the current version of JHOVE (1.11)?

    Best regards.