Hello
I am evaluating PDFBox 7.0.13 to extract images from a batch of PDF files. These PDF files are all scanned documents. The extracted images will then be passed to an OCR program to extract the text.
During the run, about 15% of the documents fail with one of two types of errors:
-------------------------------------------------
java.lang.ClassCastException: org.pdfbox.cos.COSArray cannot be cast to org.pdfbox.cos.COSDictionary
at org.pdfbox.pdmodel.graphics.xobject.PDCcitt$TiffWrapper.buildHeader(PDCcitt.java:501)
at org.pdfbox.pdmodel.graphics.xobject.PDCcitt$TiffWrapper.<init>(PDCcitt.java:363)
at org.pdfbox.pdmodel.graphics.xobject.PDCcitt$TiffWrapper.<init>(PDCcitt.java:354)
at org.pdfbox.pdmodel.graphics.xobject.PDCcitt.write2OutputStream(PDCcitt.java:128)
at PDFBox1.parseDocument(PDFBox1.java:237)
at PDFBox1.processAll(PDFBox1.java:108)
at PDFBox1.main(PDFBox1.java:468)
Failed to process - reason: Failed to parse file
-------------------------------------------------
java.lang.ArrayIndexOutOfBoundsException
at java.lang.System.arraycopy(Native Method)
at org.pdfbox.pdmodel.graphics.predictor.None.decode(None.java:71)
at org.pdfbox.pdmodel.graphics.xobject.PDPixelMap.getRGBImage(PDPixelMap.java:154)
at org.pdfbox.pdmodel.graphics.xobject.PDPixelMap.write2OutputStream(PDPixelMap.java:166)
at PDFBox1.parseDocument(PDFBox1.java:237)
at PDFBox1.processAll(PDFBox1.java:108)
at PDFBox1.main(PDFBox1.java:468)
-------------------------------------------------
My problem is that these documents are classified, so I cannot submit a test case.
Basically, I have 2 questions:
1. Since these problems always occur at the same location in the code, can you identify the cause without a test case?
2. Does the CVS version (7.0.14) contain a fix for these problems?
Best regards
JP
dev@softpark.ws
Logged In: NO
I ran the same tests using the PDFBox-0.7.4-dev-20080223 version. The first error has been replaced by a new one:
Processing java.lang.NullPointerException
at org.pdfbox.pdmodel.graphics.xobject.PDCcitt$TiffWrapper.buildHeader(PDCcitt.java:529)
at org.pdfbox.pdmodel.graphics.xobject.PDCcitt$TiffWrapper.<init>(PDCcitt.java:372)
at org.pdfbox.pdmodel.graphics.xobject.PDCcitt$TiffWrapper.<init>(PDCcitt.java:363)
at org.pdfbox.pdmodel.graphics.xobject.PDCcitt.write2OutputStream(PDCcitt.java:137)
at PDFBox1.parseDocument(PDFBox1.java:237)
at PDFBox1.processAll(PDFBox1.java:108)
at PDFBox1.main(PDFBox1.java:468)
The second error still occurs:
java.lang.ArrayIndexOutOfBoundsException
at java.lang.System.arraycopy(Native Method)
at org.pdfbox.pdmodel.graphics.predictor.None.decode(None.java:71)
at org.pdfbox.pdmodel.graphics.xobject.PDPixelMap.getRGBImage(PDPixelMap.java:173)
at org.pdfbox.pdmodel.graphics.xobject.PDPixelMap.write2OutputStream(PDPixelMap.java:190)
at PDFBox1.parseDocument(PDFBox1.java:237)
at PDFBox1.processAll(PDFBox1.java:108)
at PDFBox1.main(PDFBox1.java:468)
I used 5000 files for the test and about 10% fail with one of these two exceptions.
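Since I cannot share the files, here is a minimal sketch of how failures in a batch like this can be tallied by exception class and top stack frame, which at least shows which code location each failure comes from. The class name FailureTally and the FileTask callback are hypothetical names for illustration only; FileTask stands in for the per-file extraction step and is not part of PDFBox.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class FailureTally {

    /** Hypothetical callback standing in for the per-file extraction step. */
    public interface FileTask {
        void process(String fileName) throws Exception;
    }

    /** Builds a grouping key from the exception class and its top stack frame. */
    public static String signature(Throwable t) {
        StackTraceElement[] frames = t.getStackTrace();
        String top = frames.length > 0 ? frames[0].toString() : "<no stack trace>";
        return t.getClass().getName() + " at " + top;
    }

    /** Runs the task on every file and counts failures per signature. */
    public static Map<String, Integer> run(Iterable<String> fileNames, FileTask task) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String name : fileNames) {
            try {
                task.process(name);
            } catch (Exception e) {
                // Keep going so that one bad file does not stop the batch.
                counts.merge(signature(e), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up file names; the task below only simulates a failure.
        List<String> files = Arrays.asList("a.pdf", "b.pdf", "c.pdf");
        Map<String, Integer> counts = run(files, name -> {
            if (name.startsWith("b")) {
                throw new ArrayIndexOutOfBoundsException("simulated failure");
            }
        });
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            System.out.println(entry.getValue() + " x " + entry.getKey());
        }
    }
}
```

In a real run the callback would wrap the image extraction call, and the resulting counts could be posted in place of the documents themselves.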
Is there a solution, or should I use another library to extract images from PDF files?
Best regards
JP
dev@softpark.ws