#99 PdfExtractor fails on Adobe Illustrator files

Status: closed-invalid
Owner: Antoni Mylka
Labels: general (25)
Priority: 5
Updated: 2009-09-25
Created: 2009-08-19
Private: No

When running the XMPExtractor tests, an Adobe Illustrator file is detected as having the MIME type application/pdf, and the PdfExtractor then fails with the following exception:

org.semanticdesktop.aperture.extractor.ExtractorException: java.io.IOException: Error getting pdf version:java.lang.NumberFormatException: For input string: "°¢"
at org.semanticdesktop.aperture.extractor.pdf.PdfExtractor.extract(PdfExtractor.java:58)
at org.semanticdesktop.aperture.extractor.xmp.XMPTestHelper.extractMetadataRegistry(XMPTestHelper.java:100)
at org.semanticdesktop.aperture.extractor.xmp.XMPExtractorSDKSamplesTest.testXMPSDKSamplesWitRegistry(XMPExtractorSDKSamplesTest.java:98)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:73)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:46)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:180)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:41)
at org.junit.runners.ParentRunner$1.evaluate(ParentRunner.java:173)
at org.junit.runners.ParentRunner.run(ParentRunner.java:220)
at org.junit.runners.Suite.runChild(Suite.java:115)
at org.junit.runners.Suite.runChild(Suite.java:23)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:180)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:41)
at org.junit.runners.ParentRunner$1.evaluate(ParentRunner.java:173)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31)
at org.junit.runners.ParentRunner.run(ParentRunner.java:220)
at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:46)
at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
Caused by: java.io.IOException: Error getting pdf version:java.lang.NumberFormatException: For input string: "°¢"
at org.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:166)
at org.semanticdesktop.aperture.extractor.pdf.PdfExtractor.extract(PdfExtractor.java:54)
... 32 more

The offending file can be found in the XMP 4.4 toolkit here: http://www.adobe.com/devnet/xmp/

The file is located in the samples folder and it is named BlueSquare.ai.

Discussion

  • According to filext.com, PDF files start with "%PDF-1." and Illustrator files start with "%!PS-". However, this particular .ai file starts with "%PDF-1.4", so it matches the PDF magic number. When I rename it to give it a .pdf extension, it also opens fine in Acroread.

    I don't have Illustrator myself. Can you check whether other .ai files (preferably obtained from a variety of sources) also have the PDF magic number? If so, we can create an entry for .ai files in mimetypes.xml as a subtype of PDF: a file that matches the PDF magic number and has a .ai file extension will then be classified as an Illustrator file. However, I am only comfortable with this fix if it covers the general case of .ai files.

    BTW: when I drop the file on Aperture's File Inspector (which does not have the XMPExtractor yet), I don't get any stacktrace. All metadata is extracted fine by PdfExtractor.

     
  • I created a test for this file (see my comment to issue number 2839990). It works with PdfExtractor. Please examine the code you used for this test (or paste it here). Otherwise I will close this issue.

     
  • Antoni Mylka
    2009-09-25

    I committed the test in rev 2077. It confirms that the PDF extractor works correctly on the Adobe Illustrator file. There has been no reaction for more than two weeks, so I am closing this issue as 'invalid'. Ryan, if you do manage to reproduce that exception with the current trunk, please reopen.

     
  • Antoni Mylka
    2009-09-25

    • milestone: 893322 -->
    • assigned_to: nobody --> mylka
    • status: open --> closed-invalid
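
The classification rule proposed in the first comment (a file that matches the PDF magic number and has a .ai extension is treated as an Illustrator file) can be sketched roughly as below. This is only an illustration of the idea, not how Aperture's mimetypes.xml mechanism is implemented. The class and method names are hypothetical, the "application/illustrator" and "application/octet-stream" strings are assumed placeholders, and the check uses the shorter prefix "%PDF-" rather than "%PDF-1." so that it does not depend on the version digit.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, not part of Aperture: illustrates "PDF magic number
// plus .ai extension => Illustrator subtype of PDF".
final class AiMimeSketch {

    // Prefix shared by PDF headers such as "%PDF-1.4".
    private static final byte[] PDF_MAGIC =
            "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // True if the first bytes of the file match the PDF magic number.
    static boolean hasPdfMagic(byte[] header) {
        if (header == null || header.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (header[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    // Returns an assumed MIME type string for the given file name and header.
    static String classify(String fileName, byte[] header) {
        if (!hasPdfMagic(header)) {
            // Assumed fallback; other magic numbers are out of scope here.
            return "application/octet-stream";
        }
        boolean aiExtension =
                fileName.toLowerCase(Locale.ROOT).endsWith(".ai");
        // The Illustrator type is a refinement of PDF, so a PDF extractor
        // would still be a valid choice for it.
        return aiExtension ? "application/illustrator" : "application/pdf";
    }

    public static void main(String[] args) {
        byte[] header = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(classify("BlueSquare.ai", header));
        System.out.println(classify("BlueSquare.pdf", header));
    }
}
```

As noted in the discussion, such a rule is only worth adopting if .ai files in general carry the PDF magic number. An older .ai file starting with "%!PS-" would fall through to the fallback branch in this sketch.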