hang while extracting text

Help
jm
2008-01-09
2013-05-28
  • jm
    2008-01-09

    hi,

    First of all, good work: text extraction works in some cases where PDFBox fails, so I am using both now. I am using 3.4, but I have found a PDF that makes the process hang while in jPod code. Any exception would be fine (the PDF is apparently corrupted, though I am not 100% sure), but the library should never hang.

    I can provide the PDF if necessary.

    cheers

     
    • Waldemar Dick
      2008-01-09

      Hi,

      >First of all, good work: text extraction works in some cases where PDFBox fails, so I am using both now. I am using 3.4, but I have found a PDF that makes the process hang while in jPod code. Any exception would be fine (the PDF is apparently corrupted, though I am not 100% sure), but the library should never hang.

      First of all thank you for your feedback.

      >I can provide the PDF if necessary.

      Yes, please do so. To hunt down the bug I will need the PDF document. If the document is publicly available, then either post a link here or open a task in the bug tracker, where you can attach files. If the document contains private data, then please send it directly to me at 'sourceforge(at)devmue.de'.

      Thank you very much.

      wdick

       
    • Waldemar Dick
      2008-01-09

      Hi,

      The PDF document doesn't have a COS trailer, so, yes, it is corrupt. It looks like the last part of the file is missing.

      jPod's document parser has a fallback mode in which it tries to find a COS trailer if the position referenced by the 'startxref' entry is wrong. This fallback parser got caught in an endless loop while searching for the trailer object.
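
      To illustrate the kind of search involved, here is a minimal sketch of a backward scan for the 'trailer' keyword that is guaranteed to terminate. It is not the actual jPod code: the class TrailerScan and the method findLastTrailer are hypothetical names made up for this example, and it assumes the whole file is already in memory as a byte array.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of jPod.
final class TrailerScan {
    private static final byte[] KEYWORD =
            "trailer".getBytes(StandardCharsets.US_ASCII);

    /**
     * Scans backwards from the end of the data for the last occurrence
     * of the "trailer" keyword. Returns its offset, or -1 if the keyword
     * does not occur (for example, because the file is truncated).
     */
    static int findLastTrailer(byte[] data) {
        // The candidate offset strictly decreases on every iteration and
        // the loop stops at 0, so the scan cannot run forever, even on
        // truncated or garbage input.
        for (int start = data.length - KEYWORD.length; start >= 0; start--) {
            if (matchesAt(data, start)) {
                return start;
            }
        }
        return -1;
    }

    // True if the keyword occurs in data at exactly this offset.
    private static boolean matchesAt(byte[] data, int start) {
        for (int i = 0; i < KEYWORD.length; i++) {
            if (data[start + i] != KEYWORD[i]) {
                return false;
            }
        }
        return true;
    }
}
```

      A caller would treat a result of -1 as "no trailer found" and report the document as corrupt with an exception, rather than continuing to search.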

      I committed the fix to de.intarsys.pdf.parser.COSDocumentParser.java in the SourceForge CVS.

      Here is the URL to the diff: http://jpodlib.cvs.sourceforge.net/jpodlib/jpodlib/src/de/intarsys/pdf/parser/COSDocumentParser.java?r1=1.1&r2=1.2

      Thank you for reporting the bug!

      Greetings

      wdick