#4 jPod5 extracts each letter as a separate word on some files

Status: closed
Owner: mtraut
Labels: None
Priority: 7
Updated: 2010-06-25
Created: 2009-09-29
Creator: D G
Private: No

I am using the following code to extract text from PDF documents:

protected void extractText(PDPageTree pageTree, StringBuilder sb) {
    // Walk the page tree recursively; each kid is either a page or a nested subtree.
    for (Iterator it = pageTree.getKids().iterator(); it.hasNext();) {
        PDPageNode node = (PDPageNode) it.next();
        if (node.isPage()) {
            try {
                // Use a fresh extractor per page and append its text to the buffer.
                CSTextExtractor extractor = new CSTextExtractor();
                PDPage page = (PDPage) node;
                CSDeviceBasedInterpreter interpreter = new CSDeviceBasedInterpreter(null, extractor);
                interpreter.process(page.getContentStream(), page.getResources());
                sb.append(extractor.getContent());
            } catch (CSException ex) {
                log.warn(TextExtractionError.NON_FATAL_ERROR, ex, "Error while extracting text from PDF document.");
            }
        } else {
            extractText((PDPageTree) node, sb);
        }
    }
}

When I run it against my PDF document, each letter is extracted as if it were a separate word. Is this a bug, or is there a setting that prevents this? Thanks.

I can provide the PDF document separately; this UI won't let me attach files larger than 256K.

Discussion

  • mtraut
    2010-06-25

    fixed

  • mtraut
    2010-06-25

    • assigned_to: nobody --> mtraut
    • status: open --> closed