#2 jPod5 extracts each letter of the text separately on some files

Status: closed-duplicate
Owner: mtraut
Labels: None
Priority: 5
Updated: 2009-09-30
Created: 2009-09-29
Creator: D G
Private: No

I am using the following code to extract text from PDF documents:

    protected void extractText(PDPageTree pageTree, StringBuilder sb) {
        for (Iterator it = pageTree.getKids().iterator(); it.hasNext();) {
            PDPageNode node = (PDPageNode) it.next();
            if (node.isPage()) {
                try {
                    CSTextExtractor extractor = new CSTextExtractor();
                    PDPage page = (PDPage) node;
                    CSDeviceBasedInterpreter interpreter = new CSDeviceBasedInterpreter(null, extractor);
                    interpreter.process(page.getContentStream(), page.getResources());
                    sb.append(extractor.getContent());
                } catch (CSException ex) {
                    log.warn(TextExtractionError.NON_FATAL_ERROR, ex, "Error while extracting text from PDF document.");
                }
            } else {
                extractText((PDPageTree) node, sb);
            }
        }
    }

When I run it against my PDF document, each letter is extracted as if it were a separate word. Is this a bug, or is there a setting I can change to prevent it? Thanks.

I can provide the PDF document separately. This UI won't let me attach docs > 256K.
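
For anyone trying to confirm the same symptom on their own files, a rough check on the extracted string may help. The sketch below is plain Java and does not use any jPod API; the class name `SingleLetterCheck`, its methods, and the idea of a threshold are all hypothetical, chosen only for illustration. It measures what fraction of whitespace-separated tokens are a single character long, which should be close to 1.0 when every letter comes out as its own word.

```java
import java.util.Arrays;

public class SingleLetterCheck {

    /** Returns the fraction of whitespace-separated tokens that are exactly one character long. */
    public static double singleCharTokenRatio(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0.0; // no tokens at all, so nothing to flag
        }
        String[] tokens = trimmed.split("\\s+");
        long single = Arrays.stream(tokens)
                .filter(t -> t.length() == 1)
                .count();
        return (double) single / tokens.length;
    }

    /** Flags text whose single-character token ratio reaches the given threshold (an arbitrary cut-off). */
    public static boolean looksLetterSpaced(String text, double threshold) {
        return singleCharTokenRatio(text) >= threshold;
    }

    public static void main(String[] args) {
        // Letter-by-letter output, as described in this report
        System.out.println(singleCharTokenRatio("T h i s  i s  t e x t"));
        // Ordinary extracted text for comparison
        System.out.println(singleCharTokenRatio("This is a normal sentence"));
    }
}
```

Passing the result of `sb.toString()` from the method above to `looksLetterSpaced` would give a quick yes/no per document. The right threshold depends on the language and content, so treat any particular value as a guess.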

Discussion

  • D G - 2009-09-29

    Extraction summary

  • mtraut - 2009-09-30

    duplicate of 2869991

  • mtraut - 2009-09-30
    • assigned_to: nobody --> mtraut
    • status: open --> closed-duplicate

  • mtraut - 2009-09-30

    This artifact has been marked as a duplicate of artifact 2869991 with reason:
    No explanation provided.