#302 The CMapParser does not recognize essential cmap operators

Status: closed-out-of-date
Labels: parsing (91)
Priority: 5
Updated: 2010-04-07
Created: 2006-02-24
Private: No

The bug is directly related to the following bug I
discovered in the database:
[ 1208652 ] PDFTextStripper.writeText Exception:Unknown
encoding for ..

I'll try to explain it again here and supply enough
resources to fix it.

The problem is that the current implementation of
CMapParser class supports only the beginbfchar and
beginbfrange operators.

This is not enough: calling PDFTextStripper.writeText()
throws an IOException with the following message:
"Unknown encoding for 'Identity-V'".
I also managed to produce the message: "Unknown
encoding for '90ms-RKSJ-H'".

The complete stack trace is:
java.io.IOException: Unknown encoding for 'Identity-V'
    at org.pdfbox.encoding.EncodingManager.getEncoding(EncodingManager.java:83)
    at org.pdfbox.pdmodel.font.PDFont.getEncoding(PDFont.java:627)
    at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:476)
    at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:332)
    at org.pdfbox.util.operator.ShowText.process(ShowText.java:66)
    at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:494)
    at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:207)
    at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:160)
    at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:355)
    at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:268)
    at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:220)

In fact, the cause of this exception is that the
CMapParser does not recognize the begincidchar and
begincidrange operators (in the case of the
90ms-RKSJ-H encoding) or the usecmap operator (in the
case of the Identity-V encoding).
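
To make the missing operators concrete, here is a minimal sketch of how they could be recognized. It is not the PDFBox CMapParser: the class CidCMapSketch, its fields, and the whitespace-only tokenizing are assumptions made for illustration, and the sample values in the usage note are illustrative rather than copied from the real cmap files. A real parser also has to handle PostScript comments, dictionaries, and codespace ranges.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of CID operator handling; not PDFBox code. */
class CidCMapSketch {

    /** Codes start..end map to CIDs firstCid, firstCid + 1, ... */
    static final class CidRange {
        final int start;
        final int end;
        final int firstCid;

        CidRange(int start, int end, int firstCid) {
            this.start = start;
            this.end = end;
            this.firstCid = firstCid;
        }
    }

    /** Name given to "usecmap", or null if the operator was not seen. */
    String parentCMapName;

    /** Entries collected from begincidrange and begincidchar blocks. */
    final List<CidRange> ranges = new ArrayList<>();

    /** Parses a hex string token such as <8140> into an int. */
    private static int hex(String token) {
        return Integer.parseInt(token.substring(1, token.length() - 1), 16);
    }

    /** Scans whitespace-separated tokens for the three operators. */
    static CidCMapSketch parse(String cmapText) {
        CidCMapSketch result = new CidCMapSketch();
        String[] t = cmapText.trim().split("\\s+");
        for (int i = 0; i < t.length; i++) {
            if (t[i].equals("usecmap") && i > 0 && t[i - 1].startsWith("/")) {
                // The operand is a name object such as /Identity-H.
                result.parentCMapName = t[i - 1].substring(1);
            } else if (t[i].equals("begincidrange")) {
                i++;
                // Each entry is: <startCode> <endCode> firstCid
                while (i + 2 < t.length && !t[i].equals("endcidrange")) {
                    result.ranges.add(new CidRange(
                            hex(t[i]), hex(t[i + 1]), Integer.parseInt(t[i + 2])));
                    i += 3;
                }
            } else if (t[i].equals("begincidchar")) {
                i++;
                // Each entry is: <code> cid
                while (i + 1 < t.length && !t[i].equals("endcidchar")) {
                    int code = hex(t[i]);
                    result.ranges.add(new CidRange(
                            code, code, Integer.parseInt(t[i + 1])));
                    i += 2;
                }
            }
        }
        return result;
    }
}
```

For example, feeding it "/Identity-H usecmap" followed by a one-entry begincidrange block and a one-entry begincidchar block should yield a parent name of Identity-H and two collected entries.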

The cmap files for these encodings are not parsed
properly, so the corresponding CMap objects contain
neither one-byte nor two-byte mappings, and the
lookup() method returns null.
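
The null result can be illustrated with a small sketch. The class CMapLookupSketch below is hypothetical and its lookup() returns a CID as an Integer, which is an assumption; the real lookup() in PDFBox may have a different signature. The point is only that a cmap with no parsed mappings of its own (such as Identity-V, which relies on usecmap) needs to delegate to its parent cmap instead of returning null.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a lookup with usecmap fallback; not PDFBox code. */
class CMapLookupSketch {

    private final Map<Integer, Integer> codeToCid = new HashMap<>();

    /** The cmap named by "usecmap", or null if there is none. */
    private final CMapLookupSketch parent;

    CMapLookupSketch(CMapLookupSketch parent) {
        this.parent = parent;
    }

    /** Registers one begincidrange entry: start..end map to firstCid, firstCid + 1, ... */
    void addCidRange(int start, int end, int firstCid) {
        for (int code = start; code <= end; code++) {
            codeToCid.put(code, firstCid + (code - start));
        }
    }

    /** Returns the CID for a code, asking the parent cmap if needed; null if unmapped. */
    Integer lookup(int code) {
        Integer cid = codeToCid.get(code);
        if (cid == null && parent != null) {
            return parent.lookup(code);
        }
        return cid;
    }
}
```

With a parent that maps a range of codes and a child created with that parent but no ranges of its own, the child's lookup() resolves through the parent. A cmap with neither mappings nor a parent still returns null, which is the situation described above.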

I'll attach two samples for the 90ms-RKSJ-H encoding
and one for the Identity-V encoding.

I'll also attach a cmap reference.

Discussion

  • Vladimir Dimchev

    The first 90ms-RKSJ-H sample

     
  • Vladimir Dimchev

    The second 90ms-RKSJ-H sample

     
  • Vladimir Dimchev

    The Identity-V sample

     
  • Ben Litchfield

    Ben Litchfield - 2010-04-07

    PDFBox has moved to Apache. Bugs have been moved over to the Apache bug tracking system. If you don't see the bug and it's still not fixed in the current release then please create a new bug on the Apache site.

    http://pdfbox.apache.org

     
  • Ben Litchfield

    Ben Litchfield - 2010-04-07
    • status: open --> closed-out-of-date
     
