
#74 getKids() Null Pointer Exception when parsing pdf

Status: closed-fixed
Labels: parsing (91)
Priority: 5
Updated: 2006-01-28
Created: 2004-06-17
Creator: Steve
Private: No

Here is the top of the stack trace when parsing some
PDF documents:

java.lang.NullPointerException
    at org.pdfbox.pdmodel.PDPageNode.getKids(PDPageNode.java:171)
    at org.pdfbox.pdmodel.PDDocumentCatalog.getPageObjects(PDDocumentCatalog.java:133)
    at org.pdfbox.pdmodel.PDDocumentCatalog.getAllPages(PDDocumentCatalog.java:127)
    at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:172)
    at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:120)

Discussion

  • Brendan Walker

    Brendan Walker - 2004-07-12

    Logged In: YES
    user_id=1080046

    I've noticed the same problem. Curiously enough I've had the
    same PDF file process correctly and then at another time give
    this error.

     
  • Ben Litchfield

    Ben Litchfield - 2004-08-31
    • assigned_to: nobody --> benlitchfield
     
  • Ben Litchfield

    Ben Litchfield - 2004-09-24

    Logged In: YES
    user_id=601708

    Please attach/link/email a PDF document that has this
    problem. I cannot reproduce without a PDF.

    Ben

     
  • Brendan Walker

    Brendan Walker - 2004-10-27

    Logged In: YES
    user_id=1080046

    G'day Ben,

    Sorry for the delay in getting back to you; I've been on
    holidays. In any case, I'm actually back on this project and
    as I said when I index certain PDFs I get an error once, and
    then when I try again I do not get the same problem.

    Here's a stack trace with the problem (it includes a link to one
    of the PDFs in question):

    27/10/2004 10:36:41 ERROR::pub=1195: Null Pointer Exception when creating Lucene document
    (http://www.rta.nsw.gov.au/licensing/downloads/ruh_english.pdf)!
    (class java.lang.NullPointerException:null). Not indexed.
    27/10/2004 10:36:41 ERROR::null
    java.lang.NullPointerException
        at org.pdfbox.pdmodel.PDPageNode.getKids(PDPageNode.java:171)
        at org.pdfbox.pdmodel.PDDocumentCatalog.getPageObjects(PDDocumentCatalog.java:168)
        at org.pdfbox.pdmodel.PDDocumentCatalog.getAllPages(PDDocumentCatalog.java:129)
        at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:161)
        at gov.nsw.oit.bookshop.searchengine.CimLucenePdfDocument.addContent(CimLucenePdfDocument.java:241)
        at gov.nsw.oit.bookshop.searchengine.CimLucenePdfDocument.getDocument(CimLucenePdfDocument.java:152)
        at gov.nsw.oit.bookshop.searchengine.PublicationDocument.getDocument(PublicationDocument.java:66)
        at gov.nsw.oit.bookshop.searchengine.PublicationIndexer.addDocument(PublicationIndexer.java:341)
        at gov.nsw.oit.bookshop.searchengine.PublicationIndexer.indexPublications(PublicationIndexer.java:228)
        at gov.nsw.oit.bookshop.searchengine.PublicationIndexer.main(PublicationIndexer.java:162)

    It's not a major issue for me at the moment, as the PDFs do
    get indexed on subsequent indexing calls. However, it would
    be better if it just worked the first time around.

    Cheers for your help,
    Brendan

     
  • Ben Litchfield

    Ben Litchfield - 2004-10-27

    Logged In: YES
    user_id=601708

    When I run "java org.pdfbox.ExtractText tmp\ruh_english.pdf"
    I do not get the exception. Please give the nightly release a
    try and let me know if you still have this issue.

    Ben

     
  • Brendan Walker

    Brendan Walker - 2004-10-27

    Logged In: YES
    user_id=1080046

    Hi Ben,

    I installed the nightly build (btw, prior to this I was using the
    latest version available on SourceForge, v0.6.7a) and still got
    the same problem.

    You'll notice that I'm not using the java
    org.pdfbox.ExtractText method to cause this error. I'm
    actually accessing this PDF using a class that is based on the
    LucenePDFDocument you provided. (I had to add a few extra
    fields such as number of pages, renaming your keys and
    modifying the getUid method. I could have extended but it
    was difficult given that I wanted to change the keys you
    used.) Other than these few minor things the
    CimLucenePdfDocument you see in the exception stack trace
    is the same as your class. I was getting the same problem
    when using your class (before I created my new class).

    Line 241 in my class refers to the line saying
    stripper.writeText(pdfDocument, writer);

    I can send you a copy of CimLucenePdfDocument if you
    require it.

    Cheers for your help. Regards,
    Brendan

     
  • Ben Litchfield

    Ben Litchfield - 2004-11-19

    Logged In: YES
    user_id=601708

    Is this still a problem? I don't have a test case that
    reproduces this issue.

    Ben

     
  • Brendan Walker

    Brendan Walker - 2004-11-22

    Logged In: YES
    user_id=1080046

    G'day Ben,

    This problem was still occurring with the document I attached
    earlier. It was only occurring when trying to access the
    document over an internet connection, so maybe that had
    something to do with it.

    Unfortunately I'm no longer working for the company I was
    involved with on this project, so I cannot give you any more
    help right now.

    Best of luck,
    Brendan

     
  • Ben Litchfield

    Ben Litchfield - 2005-01-26
    • status: open --> closed-out-of-date
     
  • Ben Litchfield

    Ben Litchfield - 2005-01-26

    Logged In: YES
    user_id=601708

    I do not have a test case that exhibits this problem, so I am
    closing this case. Please reopen if this is still an issue.

    Ben

     
  • Michael Schuerig

    Logged In: YES
    user_id=1088404

    I think I've got the same or a very similar problem; see stack
    trace below. There are some things of note in my case:

    (1) The NPE is only thrown for http-URLs

    (2) The exception is not always thrown. If I try often enough to
    access a document it will eventually succeed, sometimes taking
    20+ attempts.

    (3) When running in the debugger (Eclipse), everything works
    fine; no NPEs.

    Michael

    java.lang.NullPointerException
        at org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:194)
        at org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:182)
        at org.pdfbox.pdmodel.PDDocumentCatalog.getAllPages(PDDocumentCatalog.java:131)
        at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:180)
        at org.pdfbox.searchengine.lucene.LucenePDFDocument.addContent(LucenePDFDocument.java:261)
        at org.pdfbox.searchengine.lucene.LucenePDFDocument.getDocument(LucenePDFDocument.java:221)

     
  • Ben Litchfield

    Ben Litchfield - 2005-02-16
    • status: closed-out-of-date --> open
     
  • Ben Litchfield

    Ben Litchfield - 2005-02-16

    Logged In: YES
    user_id=601708

    Alright, I reopened this case. Maybe there is a parsing
    problem when using a URL. I'm not sure what it could be off
    the top of my head, though; I just use an InputStream.

    Do you mean that it consistently fails when using a URL, and
    only sometimes works? If it is a public URL, can you give it to
    me so I can test using the same scenario?

    Ben

     
  • Michael Schuerig

    Logged In: YES
    user_id=1088404

    No, I mean the opposite. After retrying often enough every URL
    will finally get processed correctly, thus I can give no specific URL
    as none fails consistently.

    Further downstream, the cause of the NPE is that PDPageNode
    is constructed with a null argument. I have no idea why this
    happens, and what's more, why it only happens intermittently.
    However, that doesn't mean the NPE occurs only rarely. I've
    used LucenePDFDocument to index a list of about 100 remote
    documents. It took around 30 passes through the list until finally
    each and every document was indexed; on each pass I got NPEs
    for up to 90% of the documents.

    As I wrote, when run in the Eclipse debugger, everything works
    flawlessly, not a single NPE or other exception. I have no idea
    why this is. I've tried putting the thread to sleep between
    accessing the documents and also tried GC'ing, but that kind
    of voodoo didn't change anything.

    I've had a look at how data is read from the InputStream and I
    can see no obvious(!) problem there, but I have a hunch that
    the problem is somewhere in the vicinity.

    Michael

     
  • Pavel Vojtechovsky

    Logged In: YES
    user_id=1275162

    The problem is that InputStream.read(byte b[], int off, int len)
    reads only as many bytes as are currently available in the
    buffers. That number can be less than the value of len, which
    is often the case with Internet streams.
    The following code in PDFParser.parseObject()
        if( pdfSource.available() < 1000 )
        {
            byte[] data = new byte[ 1000 ];
            int amountRead = pdfSource.read( data );
    has a problem with that:
    1. pdfSource.available() does not return the number of bytes
    remaining until the end of the stream.
    2. When pdfSource.read( data ) reads e.g. 1 byte, it does not
    mean that there are no more bytes!
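
The two points above can be reproduced without a network. The following is a minimal sketch, not PDFBox code: TrickleInputStream and ShortReadDemo are hypothetical names, and the one-byte-per-call behaviour is an assumption chosen to imitate a slow connection. It shows that a single read(byte[]) may return far fewer bytes than requested while more data remains, and that available() may report 0 before the end of the stream.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ShortReadDemo {

    /** Hypothetical stand-in for a slow network stream: hands out at most one byte per call. */
    static class TrickleInputStream extends FilterInputStream {
        TrickleInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            // Deliver at most one byte, as a socket might when little data has arrived yet.
            return super.read(b, off, Math.min(len, 1));
        }

        @Override
        public int available() {
            // Allowed by the InputStream contract: nothing is known to be readable without blocking.
            return 0;
        }
    }

    /** Number of bytes returned by one read(byte[]) call. */
    static int singleRead(InputStream in, int bufferSize) throws IOException {
        return in.read(new byte[bufferSize]);
    }

    /** Total number of bytes obtained by reading until read() reports -1. */
    static int readUntilEof(InputStream in, int bufferSize) throws IOException {
        byte[] buffer = new byte[bufferSize];
        int total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = new byte[5000];
        InputStream first = new TrickleInputStream(new ByteArrayInputStream(payload));
        System.out.println("available() before any read: " + first.available());
        System.out.println("one read(byte[1000]) returned: " + singleRead(first, 1000));
        InputStream second = new TrickleInputStream(new ByteArrayInputStream(payload));
        System.out.println("reading until -1 returned: " + readUntilEof(second, 1000));
    }
}
```

Only a return value of -1 from read signals the end of the stream, so neither a short read nor a small available() value is a safe end-of-file test.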

    I used the following solution:
    1. I added the following method to
    org.pdfbox.io.PushBackInputStream, which ensures that all
    available bytes are read. The read method is called quite often
    from several places, so we should be sure that all available
    bytes are read.
        /**
         * Reads aLen bytes, or fewer only if EOF is reached first.
         */
        public int read(byte[] aB, int aOff, int aLen) throws IOException
        {
            int loTotalLength = super.read(aB, aOff, aLen);
            if(loTotalLength==-1)
            { //EOF
                return -1;
            }
            // A short read is not EOF: keep reading until the
            // requested length is reached or the stream really ends.
            while(loTotalLength<aLen)
            {
                int loL = super.read(aB, aOff+loTotalLength, aLen-loTotalLength);
                if(loL==-1)
                { //EOF
                    break;
                }
                loTotalLength+=loL;
            }
            return loTotalLength;
        }

    2. I modified PDFParser.parseObject() this way:
        ...
        int amountRead = pdfSource.read( data );
        if( amountRead != -1 )
        {
            pdfSource.unread( data, 0, amountRead );
        }
        if(amountRead<1000)
        {//if there are fewer than 1000 bytes then it can really be the end of the file
            boolean atEndOfFile = true;
        ...
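
For anyone hitting this on a release without the fix, a caller-side workaround may be to copy the remote stream fully into memory before handing it to the parser. The sketch below is an assumption, not the change that went into PDFBox; BufferedSource, readAll and fullyBuffered are hypothetical names. It relies only on the fact that a ByteArrayInputStream never returns short reads while data remains and that its available() is the exact number of bytes left.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class BufferedSource {

    /** Copies every byte of the source into memory, looping until read() reports -1. */
    static byte[] readAll(InputStream source) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = source.read(chunk)) != -1) {
            // n may be much smaller than chunk.length on a network stream; that is fine here.
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }

    /** Returns an in-memory stream over the full content of the source. */
    static InputStream fullyBuffered(InputStream source) throws IOException {
        return new ByteArrayInputStream(readAll(source));
    }

    public static void main(String[] args) throws IOException {
        // Local stand-in for a remote document; a real caller would pass the stream opened from the URL.
        InputStream remote = new ByteArrayInputStream(new byte[3000]);
        InputStream buffered = fullyBuffered(remote);
        System.out.println("bytes available after buffering: " + buffered.available());
    }
}
```

The trade-off is memory: the whole document is held in a byte array, which is acceptable for typical PDFs but not for very large files.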

     
  • Ben Litchfield

    Ben Litchfield - 2006-01-28

    Logged In: YES
    user_id=601708

    This is now fixed, woo hoo!!

    Cheers,
    Ben

     
  • Ben Litchfield

    Ben Litchfield - 2006-01-28
    • status: open --> closed-fixed
     
