#453 Extreme memory usage while extracting text from one pdf

Status: closed-out-of-date
Priority: 5
Updated: 2010-04-07
Created: 2007-08-08
Creator: James W.
Private: No

When I use a -Xmx value of 64m or 128m the following exception is thrown:

Exception in thread "Thread-0" java.lang.OutOfMemoryError: Java heap space
at java.lang.StringCoding.set(StringCoding.java:53)
at java.lang.StringCoding.decode(StringCoding.java:171)
at java.lang.String.<init>(String.java:444)
at java.lang.String.<init>(String.java:516)
at org.fontbox.cmap.CMapParser.createStringFromBytes(CMapParser.java:467)
at org.fontbox.cmap.CMapParser.parse(CMapParser.java:201)
at org.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:509)
at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:380)
at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:325)
at org.pdfbox.util.operator.ShowText.process(ShowText.java:64)
at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
at us.fed.nmcourt.common.pdfbox.NmdLucenePDFDocument.addContent(NmdLucenePDFDocument.java:433)
at us.fed.nmcourt.common.pdfbox.NmdLucenePDFDocument.convertDocument(NmdLucenePDFDocument.java:292)
at us.fed.nmcourt.drs.daemonmanager.handler.AbstractDaemonManagerHandler.writeIncomingDocumentsToIndex(AbstractDaemonManagerHandler.java:355)
at us.fed.nmcourt.drs.daemonmanager.handler.CaddHandler.handle(CaddHandler.java:112)
at us.fed.nmcourt.drs.daemonmanager.MainDaemonRunnable.run(MainDaemonRunnable.java:367)
at java.lang.Thread.run(Thread.java:619)

When I use a -Xmx value of 256m it works. Is there any way you can reduce the memory requirement for extracting the text from this PDF?

The pdf is only 1.1MB in size.
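To put numbers on a report like this, it can help to log which VM is running, the heap ceiling it was given, and the heap in use around the extraction call. The following is only a minimal sketch using standard JDK calls. `HeapProbe` and its methods are hypothetical names and not part of PDFBox, and the workload in `main` is a stand-in for the `PDFTextStripper.writeText` call that appears in the stack trace.

```java
import java.util.function.Supplier;

/** Hypothetical helper (not part of PDFBox): reports JVM heap figures around a task. */
public class HeapProbe {

    /** Maximum heap the JVM will attempt to use (reflects -Xmx), in megabytes. */
    public static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    /** Heap currently in use (allocated minus free), in megabytes. */
    public static long usedHeapMb() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
    }

    /** Runs the task, printing the VM name and heap figures before and after it. */
    public static <T> T measure(String label, Supplier<T> task) {
        System.out.println(label + ": VM=" + System.getProperty("java.vm.name")
                + ", max heap=" + maxHeapMb() + " MB"
                + ", used before=" + usedHeapMb() + " MB");
        T result = task.get();
        System.out.println(label + ": used after=" + usedHeapMb() + " MB");
        return result;
    }

    public static void main(String[] args) {
        // Stand-in workload: allocates a one-million-character string.
        // In the reporter's setup this lambda would wrap the text extraction call.
        String text = measure("extract", () -> new String(new char[1_000_000]));
        System.out.println("chars produced: " + text.length());
    }
}
```

The "used after" figure is only approximate, since it depends on when the garbage collector last ran, but comparing runs under different -Xmx values and VMs gives a rough picture of how much heap the extraction needs.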

Argh. The PDF is larger than 256KB, so it cannot be attached below. I'll email it to you if you give me your email address.

Thanks in advance,

James
jwilson@nmcourt.fed.us

Discussion

  • Ben Litchfield

    Ben Litchfield - 2007-08-08

    Logged In: YES
    user_id=601708
    Originator: NO

    Yes, please upload (ftp.pdfbox.org) or email (ben@benlitchfield.com) me the PDF, and add a quick note here with the filename.

    Thanks,
    Ben

     
  • James W.

    James W. - 2007-08-22

    Logged In: YES
    user_id=1832000
    Originator: YES

    When using the server VM, I have to bump the value for -Xmx up to 512m.

     
  • Ben Litchfield

    Ben Litchfield - 2010-04-07
    • status: open --> closed-out-of-date
     
  • Ben Litchfield

    Ben Litchfield - 2010-04-07

    PDFBox has moved to Apache. Bugs have been moved over to the Apache bug tracking system. If you don't see the bug and it's still not fixed in the current release then please create a new bug on the Apache site.

    http://pdfbox.apache.org

     
