#27 ClassCastException issue when extracting graphics

open
nobody
None
5
2008-06-17
2008-06-17
Anonymous
No

Hello

I am evaluating PDFBox 0.7.3 to extract images from a batch of PDF files. These PDF files are all scanned documents. The extracted images will then be passed to an OCR program to extract the text.
During execution, about 15% of the documents fail with one of two exceptions:
-------------------------------------------------
java.lang.ClassCastException: org.pdfbox.cos.COSArray cannot be cast to org.pdfbox.cos.COSDictionary
at org.pdfbox.pdmodel.graphics.xobject.PDCcitt$TiffWrapper.buildHeader(PDCcitt.java:501)
at org.pdfbox.pdmodel.graphics.xobject.PDCcitt$TiffWrapper.<init>(PDCcitt.java:363)
at org.pdfbox.pdmodel.graphics.xobject.PDCcitt$TiffWrapper.<init>(PDCcitt.java:354)
at org.pdfbox.pdmodel.graphics.xobject.PDCcitt.write2OutputStream(PDCcitt.java:128)
at PDFBox1.parseDocument(PDFBox1.java:237)
at PDFBox1.processAll(PDFBox1.java:108)
at PDFBox1.main(PDFBox1.java:468)
Failed to process - reason: Failed to parse file
-------------------------------------------------
java.lang.ArrayIndexOutOfBoundsException
at java.lang.System.arraycopy(Native Method)
at org.pdfbox.pdmodel.graphics.predictor.None.decode(None.java:71)
at org.pdfbox.pdmodel.graphics.xobject.PDPixelMap.getRGBImage(PDPixelMap.java:154)
at org.pdfbox.pdmodel.graphics.xobject.PDPixelMap.write2OutputStream(PDPixelMap.java:166)
at PDFBox1.parseDocument(PDFBox1.java:237)
at PDFBox1.processAll(PDFBox1.java:108)
at PDFBox1.main(PDFBox1.java:468)
-------------------------------------------------
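
One possible cause of the first exception, offered as an assumption since no sample file is available: the PDF format allows a stream's /DecodeParms entry to be either a single dictionary or an array of dictionaries (one per filter, with null entries permitted), so code that always casts the entry to a dictionary fails on the array form. The sketch below models this with plain Java collections, not the PDFBox COS classes; `DecodeParmsSketch` and `decodeParmsFor` are hypothetical names used only for illustration.

```java
import java.util.List;
import java.util.Map;

public class DecodeParmsSketch {
    /**
     * Returns the decode parameters that apply to the filter at filterIndex,
     * or null when none are present (the caller should then use defaults).
     * A Map stands in for a dictionary and a List for an array.
     */
    @SuppressWarnings("unchecked")
    public static Map<String, Object> decodeParmsFor(Object decodeParms, int filterIndex) {
        // Single-dictionary form: the same parameters apply directly.
        if (decodeParms instanceof Map) {
            return (Map<String, Object>) decodeParms;
        }
        // Array form: pick the entry that matches the filter position.
        if (decodeParms instanceof List) {
            List<?> entries = (List<?>) decodeParms;
            if (filterIndex >= 0 && filterIndex < entries.size()
                    && entries.get(filterIndex) instanceof Map) {
                return (Map<String, Object>) entries.get(filterIndex);
            }
        }
        // Missing entry, null entry, or unexpected type.
        return null;
    }
}
```

Checking the type before casting avoids the ClassCastException, and treating a missing entry as "use defaults" avoids a follow-on NullPointerException.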
My problem is that these documents are classified, so I cannot submit a test case.
Basically, I have 2 questions:
1. Since these problems always occur at the same location in the code, can you identify the cause without a test case?
2. Does the CVS version (0.7.4) contain a fix for these problems?

Best regards

JP
dev@softpark.ws

Discussion

  • Logged In: NO

    I ran the same tests using the PDFBox-0.7.4-dev-20080223 version. The first error has been replaced by a new one:
    Processing java.lang.NullPointerException
    at org.pdfbox.pdmodel.graphics.xobject.PDCcitt$TiffWrapper.buildHeader(PDCcitt.java:529)
    at org.pdfbox.pdmodel.graphics.xobject.PDCcitt$TiffWrapper.<init>(PDCcitt.java:372)
    at org.pdfbox.pdmodel.graphics.xobject.PDCcitt$TiffWrapper.<init>(PDCcitt.java:363)
    at org.pdfbox.pdmodel.graphics.xobject.PDCcitt.write2OutputStream(PDCcitt.java:137)
    at PDFBox1.parseDocument(PDFBox1.java:237)
    at PDFBox1.processAll(PDFBox1.java:108)
    at PDFBox1.main(PDFBox1.java:468)
    The second error still occurs:
    java.lang.ArrayIndexOutOfBoundsException
    at java.lang.System.arraycopy(Native Method)
    at org.pdfbox.pdmodel.graphics.predictor.None.decode(None.java:71)
    at org.pdfbox.pdmodel.graphics.xobject.PDPixelMap.getRGBImage(PDPixelMap.java:173)
    at org.pdfbox.pdmodel.graphics.xobject.PDPixelMap.write2OutputStream(PDPixelMap.java:190)
    at PDFBox1.parseDocument(PDFBox1.java:237)
    at PDFBox1.processAll(PDFBox1.java:108)
    at PDFBox1.main(PDFBox1.java:468)
    I used 5000 files for the test, and about 10% fail with one of these two exceptions.
    Is there a solution, or should I use another library to extract graphics from PDF files?

    Best regards

    JP
    dev@softpark.ws
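
Until the root cause is known, a batch run like the one described above can at least keep going when a single image fails. The following is a minimal sketch under stated assumptions: `SafeImageExtraction`, `ImageWriter`, and `extractAll` are hypothetical names, and each `ImageWriter` is assumed to wrap one image-writing call (for example, the `write2OutputStream` call seen in the stack traces).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class SafeImageExtraction {
    /** Hypothetical wrapper around one image-writing step. */
    public interface ImageWriter {
        void write() throws Exception;
    }

    /**
     * Runs every writer and returns a description of each failure
     * instead of letting the first exception abort the whole run.
     */
    public static List<String> extractAll(Map<String, ImageWriter> writers) {
        List<String> failures = new ArrayList<>();
        for (Map.Entry<String, ImageWriter> entry : writers.entrySet()) {
            try {
                entry.getValue().write();
            } catch (Exception e) {
                // Covers runtime exceptions (ClassCastException,
                // ArrayIndexOutOfBoundsException, NullPointerException)
                // as well as checked ones such as IOException.
                failures.add(entry.getKey() + ": " + e);
            }
        }
        return failures;
    }
}
```

The returned list gives the names of the images that need manual attention, while the remaining images are still extracted and can go on to OCR.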