
#390 spaces lost

Status: closed-out-of-date

Owner: Ben Litchfield

Labels: text extraction

Priority: 1

Updated: 2010-04-07

Created: 2007-01-15

Creator: tweakerbee

Private: No

During text extraction, spaces are lost in certain PDF documents. I have attached a file in which this problem occurs.

Here PDFTextStripper.getText() returns:
gaandeofincidenteleaardis
whereas it should be
gaande of incidentele aard is

I used today's nightly build (15-01-07), but the problem remains.

Discussion

tweakerbee - 2007-01-15

document with erroneous text extraction

STB336.pdf


tweakerbee - 2007-01-16


I am looking into the problem myself as well, but my complete lack of experience with the Portable Document Format and my being a novice Java programmer are rather limiting.

What I have found out so far is this:
The problem is in the text stream, where a TJ operator is used to show the glyphs. No spaces are encoded in the file; instead, it uses numeric position adjustments between the strings to space out the words. An example is included below.
The code I believe is responsible for extracting the text here (org.pdfbox.util.operator.ShowTextGlyph) does not contain any logic to determine whether a space is needed. Would it be useful to add this here? And would this not break org.pdfbox.util.PDFHighlighter? (I have noticed difficulties with certain PDF documents, and I wouldn't be surprised if the difference in character count originates from this issue.)

Any help would be greatly appreciated.

Example code in STB336.pdf:
[(7 )-278()-278( "&)-278()-278()-278( \))-278()-278(\) \012)-278( '&)-278()]TJ
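To illustrate the idea, here is a minimal sketch of how a space could be inferred from the numbers in a TJ array. It assumes the array has already been decoded into a list of strings and numbers; the class name, method name and the threshold of 200 are all hypothetical and not part of PDFBox. The readable words stand in for the encoded glyph codes of the real file. In a TJ array the numbers are expressed in thousandths of a text space unit and are subtracted from the current position, so a negative value such as -278 moves the next string further to the right.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not a PDFBox class.
class TjSpacingSketch {

    // Joins the strings of a decoded TJ array, inserting one space whenever
    // a numeric adjustment opens a gap of at least spaceThreshold
    // (in thousandths of a text space unit).
    static String joinWithSpaces(List<?> tjArray, float spaceThreshold) {
        StringBuilder out = new StringBuilder();
        for (Object element : tjArray) {
            if (element instanceof String) {
                out.append((String) element);
            } else if (element instanceof Number) {
                // The adjustment is subtracted from the position,
                // so the gap that opens up is its negation.
                float gap = -((Number) element).floatValue();
                boolean endsWithSpace =
                        out.length() > 0 && out.charAt(out.length() - 1) == ' ';
                if (gap >= spaceThreshold && out.length() > 0 && !endsWithSpace) {
                    out.append(' ');
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Illustrative input only; the real file contains encoded glyph codes.
        List<?> tj = Arrays.asList(
                "gaande", -278, "of", -278, "incidentele", -278, "aard", -278, "is");
        System.out.println(joinWithSpaces(tj, 200f));
        // prints "gaande of incidentele aard is"
    }
}
```

A fixed threshold is a simplification: in practice the cut-off would probably have to depend on the width of the font's space glyph, and small adjustments used for kerning must stay below it.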


tweakerbee - 2007-01-16


My previous assumption turned out to be incorrect.
The context.showString() function is responsible for outputting the string. If a space is to be output anywhere, it should probably be there.


tweakerbee - 2007-01-16

priority: 5 --> 1

tweakerbee - 2007-01-16


The problem turned out to be in the word-splitting algorithm: the threshold values there were slightly too conservative.
Using 0.33f (33%) yields proper results for this document. This might split words that are not meant to be split, however.

Maybe this could be made settable through a field in the TextStripper, so that an application can be adjusted to its specific needs more easily?

This issue can be considered solved with the following two changes:

startOfNextWordX = endOfLastTextX + (wordSpacing* 0.33f);
startOfNextWordX = endOfLastTextX + (((wordSpacing+lastWordSpacing)/2f)* 0.33f);
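As a rough sketch of the configurable-field suggestion, the ratio could be held in a field instead of being hard-coded. Everything below is hypothetical: the class WordSplitSketch, its methods and the comparison against the next glyph's x position are assumptions made for illustration, not the actual PDFBox code. Only the two threshold formulas are taken from the lines above.

```java
// Hypothetical illustration of a configurable word-split threshold.
class WordSplitSketch {

    // Fraction of the word spacing that a gap must exceed to count as a
    // word break; 0.33f is the value that worked for STB336.pdf.
    private final float spacingRatio;

    WordSplitSketch(float spacingRatio) {
        this.spacingRatio = spacingRatio;
    }

    // First formula: threshold based on the current word spacing only.
    float startOfNextWordX(float endOfLastTextX, float wordSpacing) {
        return endOfLastTextX + (wordSpacing * spacingRatio);
    }

    // Second formula: threshold based on the average of the current and
    // the previous word spacing.
    float startOfNextWordX(float endOfLastTextX, float wordSpacing,
                           float lastWordSpacing) {
        return endOfLastTextX
                + (((wordSpacing + lastWordSpacing) / 2f) * spacingRatio);
    }

    // Assumed use of the threshold: a space is emitted when the next
    // piece of text starts beyond it.
    boolean isWordBreak(float endOfLastTextX, float nextTextX, float wordSpacing) {
        return nextTextX > startOfNextWordX(endOfLastTextX, wordSpacing);
    }
}
```

With a word spacing of 3 and the previous text ending at x = 100, a ratio of 0.33f puts the threshold near 100.99, so text starting at 101.2 is treated as a new word. A stricter ratio such as 0.5f (an illustrative value) would put the threshold at 101.5 and keep the two pieces together.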


Ben Litchfield - 2010-04-07

PDFBox has moved to Apache. Bugs have been moved over to the Apache bug tracking system. If you don't see the bug and it's still not fixed in the current release then please create a new bug on the Apache site.

http://pdfbox.apache.org


Ben Litchfield - 2010-04-07

status: open --> closed-out-of-date