getKids() Null Pointer Exception when parsing pdf
Brought to you by:
benlitchfield
Here is the top of the stack trace when parsing some
PDF documents:
java.lang.NullPointerException
    at org.pdfbox.pdmodel.PDPageNode.getKids(PDPageNode.java:171)
    at org.pdfbox.pdmodel.PDDocumentCatalog.getPageObjects(PDDocumentCatalog.java:133)
    at org.pdfbox.pdmodel.PDDocumentCatalog.getAllPages(PDDocumentCatalog.java:127)
    at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:172)
    at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:120)
Logged In: YES
user_id=1080046
I've noticed the same problem. Curiously enough I've had the
same PDF file process correctly and then at another time give
this error.
Logged In: YES
user_id=601708
Please attach/link/email a PDF document that has this
problem. I cannot reproduce without a PDF.
Ben
Logged In: YES
user_id=1080046
G'day Ben,
Sorry for the delay in getting back to you; I've been on
holidays. In any case, I'm actually back on this project and
as I said when I index certain PDFs I get an error once, and
then when I try again I do not get the same problem.
Here's a stack trace with the problem (it includes a link to one
of the PDFs in question).
27/10/2004 10:36:41 ERROR::pub=1195: Null Pointer Exception when creating Lucene document (http://www.rta.nsw.gov.au/licensing/downloads/ruh_english.pdf)! (class java.lang.NullPointerException:null). Not indexed.
27/10/2004 10:36:41 ERROR::null
java.lang.NullPointerException
    at org.pdfbox.pdmodel.PDPageNode.getKids(PDPageNode.java:171)
    at org.pdfbox.pdmodel.PDDocumentCatalog.getPageObjects(PDDocumentCatalog.java:168)
    at org.pdfbox.pdmodel.PDDocumentCatalog.getAllPages(PDDocumentCatalog.java:129)
    at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:161)
    at gov.nsw.oit.bookshop.searchengine.CimLucenePdfDocument.addContent(CimLucenePdfDocument.java:241)
    at gov.nsw.oit.bookshop.searchengine.CimLucenePdfDocument.getDocument(CimLucenePdfDocument.java:152)
    at gov.nsw.oit.bookshop.searchengine.PublicationDocument.getDocument(PublicationDocument.java:66)
    at gov.nsw.oit.bookshop.searchengine.PublicationIndexer.addDocument(PublicationIndexer.java:341)
    at gov.nsw.oit.bookshop.searchengine.PublicationIndexer.indexPublications(PublicationIndexer.java:228)
    at gov.nsw.oit.bookshop.searchengine.PublicationIndexer.main(PublicationIndexer.java:162)
It's not a major issue for me at the moment, as the PDFs do
get indexed on subsequent indexing calls; however, it would
be better if it just worked the first time around.
Cheers for your help,
Brendan
Logged In: YES
user_id=601708
When I run "java org.pdfbox.ExtractText tmp\ruh_english.pdf"
I do not get the exception. Please give the nightly release a
try and let me know if you still have this issue.
Ben
Logged In: YES
user_id=1080046
Hi Ben,
I installed the nightly build (btw, prior to this I was using the
latest version available on SourceForge, v0.6.7a) and still got
the same problem.
You'll notice that I'm not using the java
org.pdfbox.ExtractText method to cause this error. I'm
actually accessing this PDF using a class that is based on the
LucenePDFDocument you provided. (I had to add a few extra
fields such as number of pages, renaming your keys and
modifying the getUid method. I could have extended but it
was difficult given that I wanted to change the keys you
used.) Other than these few minor things the
CimLucenePdfDocument you see in the exception stack trace
is the same as your class. I was getting the same problem
when using your class (before I created my new class).
Line 241 in my class refers to the line saying
stripper.writeText(pdfDocument, writer);
I can send you a copy of CimLucenePdfDocument if you
require it.
Cheers for your help. Regards,
Brendan
Logged In: YES
user_id=601708
Is this still a problem? I don't have a test case that
reproduces this issue.
Ben
Logged In: YES
user_id=1080046
G'day Ben,
This problem was still occurring with the document I linked
earlier. It was only occurring when trying to access the
document over an internet connection, so maybe that had
something to do with it.
Unfortunately I'm no longer working for the company where I was
involved with this project, so I cannot give you any more
help right now.
Best of luck,
Brendan
Logged In: YES
user_id=601708
I do not have a test case that exhibits this problem, so I am
closing this case. Please reopen if this is still an issue.
Ben
Logged In: YES
user_id=1088404
I think I've got the same or a very similar problem; see stack
trace below. There are some things of note in my case:
(1) The NPE is only thrown for http URLs.
(2) The exception is not always thrown. If I try often enough to
access a document it will eventually succeed, sometimes taking
20+ attempts.
(3) When running in the debugger (Eclipse), everything works
fine; no NPEs.
Michael
java.lang.NullPointerException
    at org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:194)
    at org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:182)
    at org.pdfbox.pdmodel.PDDocumentCatalog.getAllPages(PDDocumentCatalog.java:131)
    at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:180)
    at org.pdfbox.searchengine.lucene.LucenePDFDocument.addContent(LucenePDFDocument.java:261)
    at org.pdfbox.searchengine.lucene.LucenePDFDocument.getDocument(LucenePDFDocument.java:221)
Logged In: YES
user_id=601708
Alright, I reopened this case. Maybe there is a parsing
problem when using a URL. Not sure what it could be though
off the top of my head; I just use an InputStream.
Do you mean that it consistently fails when using a URL, and
only sometimes works? If it is a public URL, can you give it to
me so I can test using the same scenario?
Ben
Logged In: YES
user_id=1088404
No, I mean the opposite. After retrying often enough every URL
will finally get processed correctly, thus I can give no specific URL
as none fails consistently.
Further downstream, the cause of the NPE is that PDPageNode
is constructed with a null argument. I have no idea why this
happens, and what's more, why it only happens intermittently.
However, that doesn't mean the NPE occurs only rarely. I've
used LucenePDFDocument to index a list of about 100 remote
documents. It took around 30 passes through the list until finally
each and every document was indexed; on each pass I got NPEs
for up to 90% of the documents.
As I wrote, when run in the Eclipse debugger, everything works
flawlessly, not a single NPE or other exception. I have no idea
why this is. I've tried putting the thread to sleep between
accessing the documents and also tried forcing a GC, but that kind
of voodoo didn't change anything.
I've had a look at how data is read from the InputStream and I
can see no obvious(!) problem there, but I have a hunch that
the problem is somewhere in the vicinity.
Michael
Logged In: YES
user_id=1275162
The problem is that InputStream.read(byte b[], int off, int len)
reads only the number of bytes that are currently available in
the buffers. This number can be less than the value of len, which
is often the case with Internet streams.
The following code in PDFParser.parseObject()
    if( pdfSource.available() < 1000 )
    {
        byte[] data = new byte[ 1000 ];
        int amountRead = pdfSource.read( data );
has a problem with that:
1. pdfSource.available() does not return the number of bytes
remaining until the end of the stream.
2. When pdfSource.read( data ) reads e.g. 1 byte, it does not
mean that there are no more bytes!
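To make the two points above concrete, here is a minimal sketch that does not use any PDFBox code. TrickleStream and probe are hypothetical names invented for this demonstration; TrickleStream only imitates a network stream by handing out at most 100 bytes per call, which is an assumption about how a slow connection behaves, not a trace of the real failure.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

public class ShortReadDemo {

    /** Hypothetical stand-in for a network stream: at most 'chunk' bytes per call. */
    static class TrickleStream extends FilterInputStream {
        private final int chunk;

        TrickleStream(InputStream in, int chunk) {
            super(in);
            this.chunk = chunk;
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            // Deliver only what is "currently buffered", never more than chunk.
            return super.read(b, off, Math.min(len, chunk));
        }

        @Override
        public int available() throws IOException {
            // Reports what can be read without blocking, not what is left in total.
            return Math.min(super.available(), chunk);
        }
    }

    /** Returns { available() before reading, result of one read into a 1000-byte buffer }. */
    public static int[] probe(int totalBytes, int chunk) {
        try (InputStream in =
                 new TrickleStream(new ByteArrayInputStream(new byte[totalBytes]), chunk)) {
            int available = in.available();
            int amountRead = in.read(new byte[1000]);
            return new int[] { available, amountRead };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] r = probe(5000, 100);
        // prints: available() = 100, read(data) = 100, bytes in stream = 5000
        System.out.println("available() = " + r[0] + ", read(data) = " + r[1]
            + ", bytes in stream = 5000");
    }
}
```

Both numbers are far below 1000 although 4900 bytes are still to come, so a check like "available() < 1000" followed by a single read would wrongly conclude that the end of the file is near.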
I used the following solution:
1. I added the following method to
org.pdfbox.io.PushBackInputStream,
which ensures that all requested bytes are read. The read method
is called quite often from several places, and we should be
sure that all bytes up to the requested length are read.
    /**
     * Reads up to aLen bytes, looping until the requested
     * length has been read or EOF is reached.
     */
    public int read(byte[] aB, int aOff, int aLen) throws IOException
    {
        int loTotalLength = super.read(aB, aOff, aLen);
        if( loTotalLength == -1 )
        {
            // EOF before anything was read
            return -1;
        }
        while( loTotalLength < aLen )
        {
            int loL = super.read(aB, aOff + loTotalLength, aLen - loTotalLength);
            if( loL == -1 )
            {
                // EOF, return what we have so far
                break;
            }
            loTotalLength += loL;
        }
        return loTotalLength;
    }
2. I modified PDFParser.parseObject() this way:
    ...
    int amountRead = pdfSource.read( data );
    if( amountRead != -1 )
    {
        pdfSource.unread( data, 0, amountRead );
    }
    if( amountRead < 1000 )
    {
        // fewer than 1000 bytes were read, so this can really be the end of the file
        boolean atEndOfFile = true;
    ...
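The effect of the two changes can be checked outside PDFBox with a small self-contained sketch. It is only an illustration under assumptions: readFully, fill, and TrickleStream are hypothetical names, readFully is a static helper with the same loop as the overridden read above rather than the actual PushBackInputStream method, and the 100-byte chunk size is made up to imitate a slow connection.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

public class FullReadDemo {

    /** Hypothetical slow source: at most 'chunk' bytes per call. */
    static class TrickleStream extends FilterInputStream {
        private final int chunk;

        TrickleStream(InputStream in, int chunk) {
            super(in);
            this.chunk = chunk;
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            return super.read(b, off, Math.min(len, chunk));
        }
    }

    /** Keeps reading until len bytes have arrived or EOF; returns -1 if EOF came first. */
    public static int readFully(InputStream in, byte[] buf, int off, int len)
            throws IOException {
        int total = 0;
        while (total < len) {
            int n = in.read(buf, off + total, len - total);
            if (n == -1) {
                break; // EOF, keep what has been read so far
            }
            total += n;
        }
        return (total == 0 && len > 0) ? -1 : total;
    }

    /** Fills a 1000-byte probe buffer from a trickling stream of totalBytes bytes. */
    public static int fill(int totalBytes, int chunk) {
        try (InputStream in =
                 new TrickleStream(new ByteArrayInputStream(new byte[totalBytes]), chunk)) {
            byte[] data = new byte[1000];
            return readFully(in, data, 0, data.length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(fill(5000, 100)); // prints 1000: buffer full, more data follows
        System.out.println(fill(300, 100));  // prints 300: fewer than 1000, really near EOF
        System.out.println(fill(0, 100));    // prints -1: nothing left at all
    }
}
```

With the loop in place, a result below 1000 can only mean that the stream has ended, which is what the "amountRead < 1000" test in parseObject() relies on. The trade-off is that the call now blocks until the full length or EOF arrives.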
Logged In: YES
user_id=601708
This is now fixed, woo hoo!!
Cheers,
Ben