
PDFBox

Tracker: Bugs

5 CMap parse fails during text extract - ID: 1702313
Last Update: Comment added ( nobody )

Unfortunately I cannot supply the PDF file. Any suggestion appreciated.

Exception in thread "main" java.io.IOException: Error: expected the end of a dictionary.
    at org.fontbox.cmap.CMapParser.parseNextToken(CMapParser.java:220)
    at org.fontbox.cmap.CMapParser.parse(CMapParser.java:79)
    at org.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:535)
    at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:387)
    at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:325)
    at org.pdfbox.util.operator.ShowText.process(ShowText.java:64)
    at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
    at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
    at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
    at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
    at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
    at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
    at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
    ...


Submitted: Matt Hillsdon ( matthillsdon ) - 2007-04-17 16:21
Priority: 5
Status: Open
Resolution: None
Assigned: Ben Litchfield
Category: parsing
Group: v1.0 (example)
Visibility: Public


Comments ( 16 )




Date: 2008-05-29 11:11
Sender: nobody

Logged In: NO

Hi both, any progress on this?

Thanks,
Ben


Date: 2008-04-29 18:30
Sender: matthillsdon


I've attached the two CMap streams that prevent text-extract for my PDF.
ExtractFonts didn't find them as they are resources of PDXObjectForm
objects rather than pages.

Perhaps the PDF creation software is at fault? Ben, can you point me to
the relevant specification? It would be good to cope anyway though if
there is a reasonable approach.

There are two issues:
1) CR in seemingly incorrect places, e.g. <0000\r>
2) beginbfchar<0000\r> - the missing whitespace after beginbfchar caused a misparse.

A not-so-nice patch to work around / illustrate these issues is attached.
File Added: bug1702313-1.patch
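
For readers without access to the attachment, the two issues above can be illustrated with a small stand-alone tokenizer. This is a minimal sketch only: it is not the attached patch and not FontBox's CMapParser, and the class name LenientCMapTokenizer and its tokenize method are hypothetical names invented for this example. It assumes that whitespace (including CR) inside a <...> hex string can be ignored, and that an operator such as beginbfchar ends at the next delimiter even when no whitespace follows it. CMap comments, literal strings and other syntax are deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not part of FontBox or PDFBox. */
public class LenientCMapTokenizer {

    private static boolean isWhitespace(char c) {
        return c == ' ' || c == '\t' || c == '\r' || c == '\n' || c == '\f' || c == '\0';
    }

    private static boolean isDelimiter(char c) {
        return c == '<' || c == '>' || c == '[' || c == ']' || c == '/';
    }

    /** Splits CMap text into tokens, tolerating both oddities from this bug. */
    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        int n = input.length();
        while (i < n) {
            char c = input.charAt(i);
            if (isWhitespace(c)) {
                i++;
            } else if (c == '<' && i + 1 < n && input.charAt(i + 1) == '<') {
                tokens.add("<<"); // dictionary start
                i += 2;
            } else if (c == '>' && i + 1 < n && input.charAt(i + 1) == '>') {
                tokens.add(">>"); // dictionary end
                i += 2;
            } else if (c == '<') {
                // Hex string: copy everything up to '>' but drop embedded
                // whitespace, so "<0000\r>" becomes "<0000>" (issue 1).
                StringBuilder hex = new StringBuilder("<");
                i++;
                while (i < n && input.charAt(i) != '>') {
                    if (!isWhitespace(input.charAt(i))) {
                        hex.append(input.charAt(i));
                    }
                    i++;
                }
                hex.append('>');
                i++; // step past the closing '>' (or past the end of input)
                tokens.add(hex.toString());
            } else if (c == '[' || c == ']' || c == '>') {
                tokens.add(String.valueOf(c));
                i++;
            } else {
                // Regular token (operator, number, or /Name): stop at whitespace
                // OR at a delimiter, so "beginbfchar<0000>" splits in two (issue 2).
                int start = i;
                i++; // consume the first character, which may be '/'
                while (i < n && !isWhitespace(input.charAt(i)) && !isDelimiter(input.charAt(i))) {
                    i++;
                }
                tokens.add(input.substring(start, i));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String sample = "1 beginbfchar<0000\r> <0041>\nendbfchar";
        // prints [1, beginbfchar, <0000>, <0041>, endbfchar]
        System.out.println(tokenize(sample));
    }
}
```

Whether a real parser should be this forgiving is a separate question; the sketch only shows that both inputs can be read unambiguously once delimiters, not just whitespace, end a token.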


Date: 2008-04-29 18:17
Sender: matthillsdon


File Added: SWPRNU+Myriad-Bold-Identity-H.cmap


Date: 2008-04-29 18:16
Sender: matthillsdon


File Added: RWPRNU+Univers-Light.cmap


Date: 2008-04-24 14:42
Sender: bmk06


Hi, I've recently come across exactly the same error when attempting to
extract text from a certain PDF. Has there been any progress fixing it? I'm
using pdfbox 0.7.3 and fontbox 0.1.0.

Hope you can help, thanks.

Ben Kirby
kirby.bm@gmail.com


Date: 2007-05-11 12:30
Sender: matthillsdon


Sorry for the delay. Updated extract output at

http://www.hillsdon.net/CMapDocument3.pdf

Stack trace for text extract as before:

Exception in thread "main" java.io.IOException: Error: expected the end of a dictionary.
    at org.fontbox.cmap.CMapParser.parseNextToken(CMapParser.java:269)
    at org.fontbox.cmap.CMapParser.parse(CMapParser.java:117)
    ...

Thanks, Matt.


Date: 2007-05-11 00:22
Sender: benlitchfield (Project Admin)


Hi Matt,

any update?

Ben


Date: 2007-05-03 01:00
Sender: benlitchfield (Project Admin)


OK, I looked at it some more. Please get the latest nightly build and try
running ExtractText on your original PDF again. If it doesn't work, run
ExtractFonts again (using the nightly build) and post the results.

There is some extra data at the end of the CMap stream, and tonight I
happened to fix a parsing issue with extra data at the end of a stream for
a different user. I don't know if this is the same issue, but I'd rather
have you try the nightly build than have me chasing a ghost.

Ben


Date: 2007-05-02 15:59
Sender: matthillsdon


Output with the decryption here
http://www.hillsdon.net/CMapDocument2.pdf

Thanks.


Date: 2007-05-01 23:19
Sender: benlitchfield (Project Admin)


Shoot, I think your document was encrypted. It needs to be decrypted for
the extraction to work; I should have had that as part of the program. Can
you take the attached program and add these lines after the PDDocument.load
call:

    if( doc.isEncrypted() )
    {
        doc.decrypt( "" );
    }

and then resend the CMapDocument.pdf?

Thanks,
Ben


Date: 2007-04-30 15:24
Sender: matthillsdon


Result too large to attach. Please see
http://www.hillsdon.net/CMapDocument.pdf


Date: 2007-04-27 00:58
Sender: benlitchfield (Project Admin)


Attached is a simple Java program that will create a new pseudo PDF
document that contains just the font information. Please run it on the
problem PDF and upload the resulting CMapDocument.pdf.

It is a simple command-line program. Compile it first, then run it like this:

    java ExtractFonts my.pdf

Let me know if you have any questions getting it running.

Ben
File Added: ExtractFonts.java


Date: 2007-04-25 15:39
Sender: matthillsdon


No change unfortunately - with FontBox-0.2.0-dev-20070424 the stack trace
is identical.
Exception in thread "main" java.io.IOException: Error: expected the end of a dictionary.
    at org.fontbox.cmap.CMapParser.parseNextToken(CMapParser.java:269)
    ...




Date: 2007-04-24 00:38
Sender: benlitchfield (Project Admin)


I just updated the CMapParser with a fix for a bug from

https://sourceforge.net/forum/message.php?msg_id=4269559

Please get tonight's FontBox build and give it a try:

http://www.fontbox.org/fontbox


Date: 2007-04-18 13:53
Sender: matthillsdon


Hi Ben, thanks for the quick response.

Using the nightly build [1] the stack trace is the same except for line
numbers:

Exception in thread "main" java.io.IOException: Error: expected the end of a dictionary.
    at org.fontbox.cmap.CMapParser.parseNextToken(CMapParser.java:269)
    at org.fontbox.cmap.CMapParser.parse(CMapParser.java:117)
    at org.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:509)
    at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:380)
    at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:325)
    at org.pdfbox.util.operator.ShowText.process(ShowText.java:64)
    at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
    at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
    at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
    at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
    at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
    at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
    at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
    ...

Extracting the fonts sounds ideal.

[1] http://www.pdfbox.org/dist/PDFBox-0.7.4-dev-20070418.zip


Date: 2007-04-17 16:31
Sender: benlitchfield (Project Admin)


Hi Matt,

Can you try one thing for me first: upgrade to the latest nightly build of
PDFBox ( http://www.pdfbox.org/dist/ ) and see if this is still an issue.
There have been some changes to the CMapParser.

If it is still an issue I think we can write a simple program to extract
just the fonts from your PDF and that should be enough for me to fix the
bug.

Ben






Attached Files ( 4 )

Filename                             Description
ExtractFonts.java                    A simple program to extract fonts and CMap streams
RWPRNU+Univers-Light.cmap            Extracted unparsable cmap 1
SWPRNU+Myriad-Bold-Identity-H.cmap   Extracted unparsable cmap 2
bug1702313-1.patch                   Not so nice patch

Changes ( 4 )

Field        Old Value                                    Date              By
File Added   276314: bug1702313-1.patch                   2008-04-29 18:30  matthillsdon
File Added   276312: SWPRNU+Myriad-Bold-Identity-H.cmap   2008-04-29 18:17  matthillsdon
File Added   276310: RWPRNU+Univers-Light.cmap            2008-04-29 18:16  matthillsdon
File Added   226802: ExtractFonts.java                    2007-04-27 00:58  benlitchfield