#6 Error when extracting text

closed-fixed
mtraut
None
5
2010-11-06
2010-11-05
No

When extracting text from the following file:

http://arxiv.org/pdf/astro-ph/0702300

an error
de.intarsys.pdf.content.CSError: unexpected exception
at de.intarsys.pdf.content.CSInterpreter.process(CSInterpreter.java:212)
at de.intarsys.pdf.content.CSDeviceBasedInterpreter.process(CSDeviceBasedInterpreter.java:195)
at de.intarsys.pdf.example.extract.text.ExtractText.extractText(ExtractText.java:67)
at de.intarsys.pdf.example.extract.text.ExtractText.extractText(ExtractText.java:82)
at de.intarsys.pdf.example.extract.text.ExtractText.run(ExtractText.java:108)
at de.intarsys.pdf.example.extract.text.ExtractText.main(ExtractText.java:49)
Caused by: java.lang.NullPointerException
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:485)
at java.lang.StringBuilder.append(StringBuilder.java:184)
at de.intarsys.pdf.content.text.CSTextExtractor.append(CSTextExtractor.java:40)
at de.intarsys.pdf.content.text.CSTextExtractor.onCharacterFound(CSTextExtractor.java:70)
at de.intarsys.pdf.content.text.CSCharacterParser.basicTextShowGlyphs(CSCharacterParser.java:68)
at de.intarsys.pdf.content.CSBasicDevice.textShow(CSBasicDevice.java:480)
at de.intarsys.pdf.content.CSDeviceBasedInterpreter.render_TJ(CSDeviceBasedInterpreter.java:781)
at de.intarsys.pdf.content.CSInterpreter.process(CSInterpreter.java:235)
at de.intarsys.pdf.content.CSInterpreter.process(CSInterpreter.java:206)
... 5 more
(the same stack trace is printed four more times)

appears

(This bug seems similar to an old, closed bug, but I was unable to reopen the previous issue.)

Discussion

  • Piotr Praczyk

    Piotr Praczyk - 2010-11-05

    The error appears when opening this file with the example text extractor.

     
  • Piotr Praczyk

    Piotr Praczyk - 2010-11-05

    font.getNextGlyphsEncoded(is) seems to return an object whose getChars returns null.

    The operator being rendered is TJ.

    byte[] text == {126} (in the call to CSBasicDevice::textShow())
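
    The top frames of the trace (StringBuilder.append called from CSTextExtractor.append) match what happens when a null char[] is appended to a StringBuilder. A minimal sketch of that failure mode, independent of the library; the class and method names here are hypothetical and only illustrate the JDK behaviour:

```java
public class NullCharsDemo {
    /** Returns true if appending the given chars to a StringBuilder throws an NPE. */
    static boolean appendThrowsNpe(char[] chars) {
        StringBuilder sb = new StringBuilder();
        try {
            // StringBuilder.append(char[]) dereferences the array, so null fails here.
            sb.append(chars);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Stand-in (assumption) for a getChars result when no mapping exists for code 126.
        char[] missing = null;
        System.out.println(appendThrowsNpe(missing));            // prints "true"
        System.out.println(appendThrowsNpe(new char[] {'~'}));   // prints "false"
    }
}
```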

     
  • Piotr Praczyk

    Piotr Praczyk - 2010-11-05

    Another file causing problems: http://arxiv.org/pdf/astro-ph/0702300v1
    (the value 126 actually appears when debugging with this file)

     
  • mtraut

    mtraut - 2010-11-06

    Please help us by not mixing up cases - the file attached seems in no way related to the bug...

    The file causing the NPE is a bit strange, and I found no advice in the spec on how to deal with this case. The text extraction mapping is defined by a ToUnicode map that is incomplete: character 126 (0x7e) is not contained in the map. I found no hint on how to handle this in the text extraction direction, so in upcoming releases we simply replace it with notdef or space.

    Even in the differences encoding, where the glyph name "vector" is used, we find no hint. "vector" is not a well-known glyph name we can work with.

    A valid workaround for you is to check for null and replace with space...
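
    The null check could look roughly like the sketch below. It does not use the library's API: the class, the orSpace helper, and the char[][] input are hypothetical stand-ins for whatever the extractor receives per glyph, with null meaning "no Unicode mapping found".

```java
public class NullSafeAppend {
    /** Returns the given chars, or a single space when no mapping was found (null). */
    static char[] orSpace(char[] chars) {
        return chars != null ? chars : new char[] {' '};
    }

    /** Concatenates per-glyph characters, substituting a space for unmapped glyphs. */
    static String extract(char[][] glyphChars) {
        StringBuilder sb = new StringBuilder();
        for (char[] chars : glyphChars) {
            sb.append(orSpace(chars));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The middle entry models the unmapped character code 126 from the report.
        char[][] glyphs = {{'a'}, null, {'b'}};
        System.out.println("[" + extract(glyphs) + "]"); // prints "[a b]"
    }
}
```

    Substituting a space keeps the extraction running at the cost of losing the unmapped character; a visible replacement character would be an alternative if the loss should stay noticeable in the output.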

     
  • mtraut

    mtraut - 2010-11-06
    • status: open --> closed-fixed
     
  • mtraut

    mtraut - 2010-11-06

    fix in next release

     
  • mtraut

    mtraut - 2010-11-06
    • assigned_to: nobody --> mtraut
     