#4 jPod5 extracts text each letter separately, on some files

Status: closed
Owner: mtraut
Labels: None
Priority: 7
Updated: 2010-06-25
Created: 2009-09-29
Creator: D G
Private: No

I am using the following code to extract text from PDF documents:

protected void extractText(PDPageTree pageTree, StringBuilder sb) {
    // Walk the page tree: leaves are pages, inner nodes are nested page trees.
    for (Iterator it = pageTree.getKids().iterator(); it.hasNext();) {
        PDPageNode node = (PDPageNode) it.next();
        if (node.isPage()) {
            try {
                // Use a fresh extractor per page and append its text to sb.
                CSTextExtractor extractor = new CSTextExtractor();
                PDPage page = (PDPage) node;
                CSDeviceBasedInterpreter interpreter = new CSDeviceBasedInterpreter(null, extractor);
                interpreter.process(page.getContentStream(), page.getResources());
                sb.append(extractor.getContent());
            } catch (CSException ex) {
                // A failure on one page is logged; extraction continues with the next page.
                log.warn(TextExtractionError.NON_FATAL_ERROR, ex, "Error while extracting text from PDF document.");
            }
        } else {
            // Recurse into the nested page tree.
            extractText((PDPageTree) node, sb);
        }
    }
}

When I run it against my PDF document, I get results in which each letter is extracted as if it were a separate word. Is this a bug, or is there a setting I can use to prevent this from happening? Thanks.

I can provide the PDF document separately. This UI won't let me attach docs > 256K.

Discussion

  • mtraut
    2009-09-30

    While I cannot be sure without looking at the document, it seems that the heuristics in de.intarsys.pdf.content.text.CSTextExtractor.onCharacterFound(PDGlyphs, Rectangle2D) simply fail for some reason.

    Most often the reason is that the document text does not run in the hardcoded writing direction, which results in every character ending up on a line of its own. This is a known missing feature.

    In your case, either we still have a problem with the overall text matrix, or the maxDx/maxDy values are not appropriate. Perhaps just raise them and retry?

    Perhaps you can upload the document via the cloud, Dropbox for example?
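
To illustrate the kind of heuristic being discussed, here is a minimal sketch of gap-based character joining. It is not jPod's CSTextExtractor code: the class `GapTextJoiner` is hypothetical, and the meaning given to `maxDx` (largest horizontal gap still inside a word) and `maxDy` (largest vertical shift still on the same line) is an assumption made for this example.

```java
import java.awt.geom.Rectangle2D;

// Illustrative only: NOT jPod's implementation. The semantics of maxDx and
// maxDy below are assumptions made for this sketch.
class GapTextJoiner {
    private final double maxDx; // assumed: largest horizontal gap still inside a word
    private final double maxDy; // assumed: largest vertical shift still on the same line
    private final StringBuilder sb = new StringBuilder();
    private Rectangle2D last;   // bounding box of the previous glyph, null at the start

    GapTextJoiner(double maxDx, double maxDy) {
        this.maxDx = maxDx;
        this.maxDy = maxDy;
    }

    // Called once per glyph, in content-stream order.
    void onCharacter(char c, Rectangle2D box) {
        if (last != null) {
            double dy = Math.abs(box.getMinY() - last.getMinY());
            double dx = box.getMinX() - last.getMaxX();
            if (dy > maxDy) {
                sb.append('\n'); // vertical jump: treat as a new line
            } else if (dx > maxDx) {
                sb.append(' ');  // wide horizontal gap: treat as a word break
            }
        }
        sb.append(c);
        last = box;
    }

    String getContent() {
        return sb.toString();
    }
}
```

With text running left to right, small gaps keep letters together. If the same glyphs advance vertically (as they would on a rotated page whose rotation is ignored), every glyph exceeds `maxDy` and lands on its own line, which matches the symptom reported here.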

     
  • D G
    2009-09-30

    I've emailed you separately, as I think it may be easiest to exchange files via email.

    As for the maxDx/maxDy values: yes, I saw those and wondered what makes their defaults appropriate. In fact, it seems a more thorough algorithm is needed, one that does not rely simply on hardcoded (or even configurable) numbers.

    Is it possible to analyze the PDF character data more thoroughly to draw better conclusions about where a given word or sentence begins and ends?

    The problem here is that yes, I can raise the values and retry, and maybe even get good results for the particular document I am looking at. However, our software is a framework, so it needs to work well with a very wide (unpredictable) range of PDF documents. We therefore need a generic solution to the issue.

    Thanks.
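
One way to reduce the dependence on absolute numbers, sketched here under stated assumptions, is to scale the word-break threshold by the widths of the neighbouring glyphs, so the same rule holds at any font size. This is not something jPod is known to provide: the class `RelativeGapSplitter` and the ratio `GAP_RATIO` are hypothetical, the ratio is still a tuning value, and the sketch handles a single left-to-right line only.

```java
import java.awt.geom.Rectangle2D;
import java.util.List;

// Illustrative only: a scale-relative word-break rule, not a jPod API.
class RelativeGapSplitter {
    // Hypothetical tuning ratio: a gap wider than this fraction of the mean
    // width of the two neighbouring glyphs counts as a word break.
    static final double GAP_RATIO = 0.3;

    // Joins the glyphs of ONE left-to-right line, inserting spaces at wide gaps.
    // chars.charAt(i) is the character whose bounding box is boxes.get(i).
    static String join(String chars, List<? extends Rectangle2D> boxes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < chars.length(); i++) {
            if (i > 0) {
                Rectangle2D prev = boxes.get(i - 1);
                Rectangle2D cur = boxes.get(i);
                double gap = cur.getMinX() - prev.getMaxX();
                double scale = (prev.getWidth() + cur.getWidth()) / 2.0;
                if (gap > GAP_RATIO * scale) {
                    sb.append(' ');
                }
            }
            sb.append(chars.charAt(i));
        }
        return sb.toString();
    }
}
```

Because the threshold grows with the glyphs, a line set in a font ten times larger is split at the same places, whereas a fixed threshold tuned for the small font would break the large text after every letter.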

     
  • D G
    2009-09-30

    • priority: 5 --> 7
     
  • Thank you - we will have a look at this.

    As far as the current implementation of CSTextExtractor is concerned, there is surely a way to improve it. But so far text extraction has not been a key concern of the library. It is more of a "usable example" (that's the only excuse for the hardcoded constants) of the "hey, that's possible, too" kind. Let's see what we can do...

     
  • mtraut
    2009-10-05

    I'm sorry - I had a look at the document and it's exactly the kind of thing we do not support. The text is written "from bottom to top" on a page rotated by 90 degrees. Until now we simply did not take the page rotation into account in the total matrix...

    I have now added page rotation and crop box handling to the CSDeviceAdapter, so that extraction and search should reflect these settings. It will be included in the next release...
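
As a rough illustration of why the rotation matters, the sketch below maps page-space coordinates to display space for a page rotated clockwise by a multiple of 90 degrees. It is not the CSDeviceAdapter change itself: the class `PageRotation` and its methods are hypothetical, and it assumes an uncropped page with its origin at the bottom-left corner and the y axis pointing up.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

// Illustrative only: maps page-space points to display space for a page
// rotated CLOCKWISE by 0, 90, 180 or 270 degrees. Not jPod code.
class PageRotation {
    // width and height are the dimensions of the UNROTATED page.
    static AffineTransform forRotate(int rotate, double width, double height) {
        switch (((rotate % 360) + 360) % 360) {
            case 90:  // (x, y) -> (y, width - x)
                return new AffineTransform(0, -1, 1, 0, 0, width);
            case 180: // (x, y) -> (width - x, height - y)
                return new AffineTransform(-1, 0, 0, -1, width, height);
            case 270: // (x, y) -> (height - y, x)
                return new AffineTransform(0, 1, -1, 0, height, 0);
            default:  // 0, or a value that is not a multiple of 90: leave unrotated
                return new AffineTransform();
        }
    }

    static Point2D toDisplay(AffineTransform tx, double x, double y) {
        return tx.transform(new Point2D.Double(x, y), null);
    }
}
```

Assuming a rotation of 90, glyphs that advance bottom to top in page space (constant x, growing y) come out with constant y and growing x in display space, so a left-to-right heuristic can group them into words again.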

     
  • mtraut
    2009-10-05

    • status: open --> open-accepted
     
  • D G
    2009-10-05

    Michael,
    Thanks for the update and for the quick turnaround. When is the next release scheduled to come out?

     
  • D G
    2009-10-05

    • status: open-accepted --> open
     
  • mtraut
    2009-10-05

    2 months ago :-)))

    Seriously, we are currently working on the next release of CABAReT Stage, and I will try to publish the jPod part shortly after (< 4 weeks).

     
  • mtraut
    2010-06-25

    fixed

     
  • mtraut
    2010-06-25

    • assigned_to: nobody --> mtraut
    • status: open --> closed