PDF module error with TeX-created documents
User Chris Yocum reports:
Anyway, here is the output that I am getting. You can try this on any TeX-generated document and it should give you the same results.
java.lang.ClassCastException: edu.harvard.hul.ois.jhove.module.pdf.PdfSimpleObject cannot be cast to edu.harvard.hul.ois.jhove.module.pdf.PdfDictionary
at edu.harvard.hul.ois.jhove.module.PdfModule.readDocCatalogDict(Unknown Source)
at edu.harvard.hul.ois.jhove.module.PdfModule.parse(Unknown Source)
at edu.harvard.hul.ois.jhove.JhoveBase.processFile(Unknown Source)
at edu.harvard.hul.ois.jhove.JhoveBase.process(Unknown Source)
at edu.harvard.hul.ois.jhove.JhoveBase.dispatch(Unknown Source)
at Jhove.main(Unknown Source)
Could you attach a file that exhibits this problem?
Does this work for you with JHOVE 1.8?
I can confirm this bug, although my file was produced by Acrobat Distiller rather than TeX. The file is attached.
Here is my complete output:
JHOVE is getting caught because it's seeing a keyword where it expects a font dictionary in a page node's resources. As far as I can tell from reading the spec, this is incorrect PDF. I've fixed it so that instead of throwing an exception it reports that it failed to see a font dictionary. This is in the checked-in PdfModule.java.
This seems to imply that many TeX-generated PDFs are broken. If there's something I've missed and a keyword object is valid in this context, please let me know. At least now the error message is more to the point, and there won't be a stack dump.
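For readers following along, the kind of change described above (check the object's type before casting, and report a message instead of letting a ClassCastException escape) can be sketched roughly as follows. This is not the checked-in PdfModule.java code. The three Pdf* classes here are hypothetical stand-ins that only share names with JHOVE's classes, and asFontDictionary and the message text are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class FontDictCheck {
    /**
     * Returns obj as a dictionary, or null after recording a message when obj
     * is something else (for example a bare keyword). Hypothetical helper.
     */
    static PdfDictionary asFontDictionary(PdfObject obj, List<String> messages) {
        if (obj instanceof PdfDictionary) {
            return (PdfDictionary) obj;
        }
        // Report the problem instead of casting blindly and dumping a stack trace.
        messages.add("Expected font dictionary in page resources, found "
                + (obj == null ? "nothing" : obj.getClass().getSimpleName()));
        return null;
    }

    public static void main(String[] args) {
        List<String> messages = new ArrayList<>();
        asFontDictionary(new PdfDictionary(), messages);          // accepted silently
        asFontDictionary(new PdfSimpleObject("rstChar"), messages); // reported, no exception
        System.out.println(messages);
    }
}

// Hypothetical stand-ins; the real classes live in
// edu.harvard.hul.ois.jhove.module.pdf and look different.
class PdfObject {}

class PdfSimpleObject extends PdfObject {
    final String token;
    PdfSimpleObject(String token) { this.token = token; }
}

class PdfDictionary extends PdfObject {}
```

The same guard would be needed at every place where a parsed object is cast to a dictionary, not only in the font-resource path.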
The fix doesn't seem to cover all cases. I was able to create a PDF file using pdfLaTeX which reproduces the crash in 1.10b2. The crash is triggered as soon as I include the MinionPro font (i.e. commenting out the MinionPro package makes JHOVE run OK):
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage[lf]{MinionPro}
\begin{document}
ABC
\end{document}
The output looks like this:
java.lang.ClassCastException: edu.harvard.hul.ois.jhove.module.pdf.PdfSimpleObject cannot be cast to edu.harvard.hul.ois.jhove.module.pdf.PdfDictionary
at edu.harvard.hul.ois.jhove.module.PdfModule.readDocCatalogDict(Unknown Source)
at edu.harvard.hul.ois.jhove.module.PdfModule.parse(Unknown Source)
at edu.harvard.hul.ois.jhove.JhoveBase.processFile(Unknown Source)
at edu.harvard.hul.ois.jhove.JhoveBase.process(Unknown Source)
at edu.harvard.hul.ois.jhove.JhoveBase.dispatch(Unknown Source)
at Jhove.main(Unknown Source)
Jhove (Rel. 1.9, 2013-05-28)
Date: 2013-06-05 10:08:04 CEST
RepresentationInformation: /tmp/test.pdf
ReportingModule: PDF-hul, Rel. 1.7 (2012-08-12)
LastModified: 2013-06-05 10:00:09 CEST
Size: 42554
Format: PDF
Status: Not well-formed
SignatureMatches:
PDF-hul
ErrorMessage: No document catalog dictionary
Offset: 0
MIMEtype: application/pdf
BTW, both the version from CVS and the tarball report version number 1.9 instead of 1.10b2 or something else.
Re Thomas Fischer: I'm not getting a crash, and it looks from the output you've posted as if JHOVE is in fact running to completion after writing out a stack dump. However, either JHOVE isn't processing the file properly, or the file is broken and Acrobat is able to open it anyway. (This may hinge on fine points of what "broken" means.) I'm seeing that in trying to read the document catalog dictionary, JHOVE is instead getting a keyword of "rstChar". This is most likely a fragment of a "FirstChar" keyword.
There is legitimately a bug, but I'm afraid it will have to stay open for version 1.10. Hopefully I or someone else will find a fix for it later.
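To illustrate why a fragment like "rstChar" shows up as a keyword: if a tokenizer starts reading a few bytes into a name such as /FirstChar, the leading slash is gone and the remainder looks like a bare keyword. The sketch below only demonstrates that effect. It says nothing about why JHOVE ends up at the wrong position in these files, and readToken and the sample dictionary are hypothetical, not JHOVE's tokenizer.

```java
import java.nio.charset.StandardCharsets;

class MisalignedOffsetDemo {
    /**
     * Crude token reader: takes an optional leading '/', then reads until
     * whitespace or a PDF delimiter. Hypothetical, for illustration only.
     */
    static String readToken(byte[] data, int offset) {
        StringBuilder sb = new StringBuilder();
        int i = offset;
        if (i < data.length && data[i] == '/') {
            sb.append('/');
            i++;
        }
        while (i < data.length) {
            char c = (char) (data[i] & 0xFF);
            if (Character.isWhitespace(c) || "/<>[]()".indexOf(c) >= 0) {
                break;
            }
            sb.append(c);
            i++;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String dict = "<< /Type /Font /FirstChar 32 /LastChar 126 >>";
        byte[] data = dict.getBytes(StandardCharsets.US_ASCII);
        int start = dict.indexOf("/FirstChar");
        // Correct offset: the token is read as a name.
        System.out.println(readToken(data, start));
        // Three bytes too far: the token is read as a bare keyword.
        System.out.println(readToken(data, start + 3));
    }
}
```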
Hi,
Is this bug still present in the current version, JHOVE 1.11?
Best regards.
Moved to GitHub for triage and testing.