#4 jPod5 extracts text each letter separately, on some files

Status: closed

Owner: mtraut

Labels: None

Priority: 7

Updated: 2010-06-25

Created: 2009-09-29

Creator: D G

Private: No

I am using the following code to extract text from PDF documents:

protected void extractText(PDPageTree pageTree, StringBuilder sb) {
    for (Iterator it = pageTree.getKids().iterator(); it.hasNext();) {
        PDPageNode node = (PDPageNode) it.next();
        if (node.isPage()) {
            try {
                CSTextExtractor extractor = new CSTextExtractor();
                PDPage page = (PDPage) node;
                CSDeviceBasedInterpreter interpreter = new CSDeviceBasedInterpreter(null, extractor);
                interpreter.process(page.getContentStream(), page.getResources());
                sb.append(extractor.getContent());
            } catch (CSException ex) {
                log.warn(TextExtractionError.NON_FATAL_ERROR, ex, "Error while extracting text from PDF document.");
            }
        } else {
            extractText((PDPageTree) node, sb);
        }
    }
}

When I run it against my PDF document, each letter is extracted as if it were a separate word. Is this a bug, or is there a setting that prevents this from happening? Thanks.

I can provide the PDF document separately. This UI won't let me attach docs > 256K.

Discussion

D G - 2009-09-29

Extraction summary

1.4.3+US+Chart+pack+latest+version+2.pdf.xml

mtraut - 2009-09-30

While I cannot be sure without a look at the document, it seems that the heuristics in de.intarsys.pdf.content.text.CSTextExtractor.onCharacterFound(PDGlyphs, Rectangle2D) simply fail for some reason.

Most often the reason is simply that the document text does not run in the hardcoded writing direction, which results in every character ending up on a line of its own. This is a known missing feature.

In your case, either we still have a problem with the overall text matrix, or the maxDx/maxDy values are not appropriate. Perhaps just raise them and retry?
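To illustrate why a fixed writing direction combined with fixed thresholds breaks down, here is a minimal sketch of a threshold heuristic. It is not the code of CSTextExtractor; the class `ThresholdGrouper`, its `Glyph` type and the `place` helper are hypothetical, and only the parameter names `maxDx`/`maxDy` are borrowed from this thread. It assumes glyphs arrive in drawing order and that text is expected to run left to right.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: a hypothetical threshold heuristic, not jPod's CSTextExtractor. */
final class ThresholdGrouper {

    /** One glyph: its character, the lower-left corner of its box, and its width. */
    static final class Glyph {
        final char ch;
        final double x, y, width;

        Glyph(char ch, double x, double y, double width) {
            this.ch = ch;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    /**
     * Places the characters of text along a straight baseline. Each character
     * advances the pen by (stepX, stepY); a space advances the pen without
     * producing a glyph. Every glyph gets an assumed width of 5 units.
     */
    static List<Glyph> place(String text, double stepX, double stepY) {
        List<Glyph> glyphs = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c != ' ') {
                glyphs.add(new Glyph(c, i * stepX, i * stepY, 5.0));
            }
        }
        return glyphs;
    }

    /** Joins glyphs, starting a new line when dy > maxDy and a new word when dx > maxDx. */
    static String group(List<Glyph> glyphs, double maxDx, double maxDy) {
        StringBuilder sb = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                double dx = g.x - (prev.x + prev.width); // horizontal gap to the previous glyph
                double dy = Math.abs(g.y - prev.y);      // vertical offset of the baseline
                if (dy > maxDy) {
                    sb.append('\n');
                } else if (dx > maxDx) {
                    sb.append(' ');
                }
            }
            sb.append(g.ch);
            prev = g;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Left-to-right text: words are recovered.
        System.out.println(group(place("Hi yo", 5, 0), 2, 2));
        // Same text running bottom to top: every glyph lands on its own line.
        System.out.println(group(place("Hi yo", 0, 5), 2, 2));
    }
}
```

With the horizontal placement the sketch returns `Hi yo`; with the vertical placement each character is separated by a line break, which matches the symptom reported in this ticket.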

Perhaps you can upload the document to a cloud service, Dropbox for example?

D G - 2009-09-30

I've emailed you separately, as I think it may be easiest to exchange files via email.

As for maxDx/maxDy: yes, I saw those and wondered what makes their default values appropriate. It seems there needs to be a more thorough algorithm that does not rely simply on hardcoded (or even configurable) numbers.

Is it possible to analyze the PDF character data more thoroughly, to draw better conclusions about where a given word or sentence begins and ends?

The problem here is that yes, I can raise the values and retry and maybe even get good results for the particular document I am looking at. However, our software is a framework so it needs to work well with a very wide (unpredictable) range of PDF documents, therefore we need a generic solution to the issue.
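One possible direction, sketched here purely as an assumption and not as anything jPod implements, is to derive the word-gap threshold from the glyphs themselves, for example as a fraction of the average glyph width on the line. The class `AdaptiveGapSplitter` and the `ratio` parameter are hypothetical. A ratio is still a tunable number, but it is independent of font size and page scale, unlike an absolute distance.

```java
/** Illustrative only: scale-independent word splitting; not part of jPod. */
final class AdaptiveGapSplitter {

    /**
     * Splits one line of glyphs into words.
     *
     * @param chars the characters in drawing order
     * @param x     left edge of each glyph
     * @param width width of each glyph
     * @param ratio a gap wider than ratio * (average glyph width) starts a new word
     */
    static String split(char[] chars, double[] x, double[] width, double ratio) {
        if (chars.length == 0) {
            return "";
        }
        // Derive the threshold from the data instead of using a fixed distance.
        double total = 0;
        for (double w : width) {
            total += w;
        }
        double threshold = ratio * (total / width.length);

        StringBuilder sb = new StringBuilder();
        sb.append(chars[0]);
        for (int i = 1; i < chars.length; i++) {
            double gap = x[i] - (x[i - 1] + width[i - 1]);
            if (gap > threshold) {
                sb.append(' ');
            }
            sb.append(chars[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        char[] chars = "abcd".toCharArray();
        // The same layout at two scales, differing by a factor of ten.
        System.out.println(split(chars, new double[] {0, 10, 30, 40},
                new double[] {10, 10, 10, 10}, 0.3));
        System.out.println(split(chars, new double[] {0, 1, 3, 4},
                new double[] {1, 1, 1, 1}, 0.3));
    }
}
```

Both calls find the same word break, whereas a fixed absolute threshold of 2 units would detect it only in the larger layout.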

Thanks.

D G - 2009-09-30

priority: 5 --> 7

Nobody/Anonymous - 2009-09-30

Thank you - we will have a look at this.

As far as the current implementation of CSTextExtractor is concerned, surely there is a way to improve it. But so far text extraction is not a key concern of the library; it is more of a "usable example" (that's the only excuse for the hardcoded constants) of the "hey, that's possible, too" kind. Let's see what we can do...

mtraut - 2009-10-05

I'm sorry - I had a look at the document and it's exactly the kind of thing we do not support. The text is written "from bottom to top" on a page rotated by 90 degrees. Until now we simply have not taken the page rotation into account in the total matrix...

I have added page rotation and crop box to the CSDeviceAdapter now, so that extract and search should be able to reflect these settings. It will be included in the next release...
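As a rough sketch of what taking page rotation into account can mean (this is not the CSDeviceAdapter change itself), the following maps unrotated page coordinates into the displayed orientation with `java.awt.geom.AffineTransform`. The class `PageRotation` and method `toDisplaySpace` are hypothetical. It assumes the rotation is a multiple of 90 degrees clockwise, as for the PDF `/Rotate` entry, and that the page box starts at the origin (no crop box offset).

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

/** Illustrative only: maps unrotated page coordinates to the displayed orientation. */
final class PageRotation {

    /**
     * @param rotate clockwise page rotation in degrees (a multiple of 90)
     * @param width  width of the unrotated page
     * @param height height of the unrotated page
     */
    static AffineTransform toDisplaySpace(int rotate, double width, double height) {
        int r = ((rotate % 360) + 360) % 360;
        AffineTransform at = new AffineTransform();
        // The rotation is applied to a point first, then the translation
        // moves the rotated page back into the positive quadrant.
        switch (r) {
            case 90:
                at.translate(0, width);
                at.quadrantRotate(-1); // 90 degrees clockwise with the y axis pointing up
                break;
            case 180:
                at.translate(width, height);
                at.quadrantRotate(2);
                break;
            case 270:
                at.translate(height, 0);
                at.quadrantRotate(1);
                break;
            default:
                break; // 0 degrees: identity
        }
        return at;
    }

    public static void main(String[] args) {
        AffineTransform at = toDisplaySpace(90, 612, 792);
        // Two consecutive glyph origins of text running bottom to top.
        Point2D first = at.transform(new Point2D.Double(100, 200), null);
        Point2D second = at.transform(new Point2D.Double(100, 210), null);
        System.out.println(first + " -> " + second);
    }
}
```

After the transform, the two glyph origins share the same y coordinate and advance along x, so a left-to-right heuristic sees them as neighbours on one line.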

mtraut - 2009-10-05

status: open --> open-accepted

D G - 2009-10-05

Michael,
Thanks for the update and for the quick turnaround. When is the next release scheduled to come out?

D G - 2009-10-05

status: open-accepted --> open

mtraut - 2009-10-05

2 months ago :-)))

Seriously, we are currently working on the next release of CABAReT Stage, and I will try to publish the jPod part shortly after (< 4 weeks).

mtraut - 2010-06-25

fixed

mtraut - 2010-06-25

assigned_to: nobody --> mtraut

status: open --> closed