jPod intarsys PDF library / Bugs / #2 jPod5 extracts text each letter separately, on some files

#2 jPod5 extracts text each letter separately, on some files

Status: closed-duplicate

Owner: mtraut

Labels: None

Priority: 5

Updated: 2009-09-30

Created: 2009-09-29

Creator: D G

Private: No

I am using the following code to extract text from PDF documents:

protected void extractText(PDPageTree pageTree, StringBuilder sb) {
    // Walk the page tree; leaf nodes are pages, inner nodes are subtrees.
    for (Iterator it = pageTree.getKids().iterator(); it.hasNext();) {
        PDPageNode node = (PDPageNode) it.next();
        if (node.isPage()) {
            try {
                // Use a fresh extractor per page and append its text to sb.
                CSTextExtractor extractor = new CSTextExtractor();
                PDPage page = (PDPage) node;
                CSDeviceBasedInterpreter interpreter = new CSDeviceBasedInterpreter(null, extractor);
                interpreter.process(page.getContentStream(), page.getResources());
                sb.append(extractor.getContent());
            } catch (CSException ex) {
                log.warn(TextExtractionError.NON_FATAL_ERROR, ex, "Error while extracting text from PDF document.");
            }
        } else {
            // Not a page, so recurse into the nested page tree.
            extractText((PDPageTree) node, sb);
        }
    }
}

When I run it against my PDF document, each letter is extracted as if it were a separate word. Is this a bug, or is there a setting that prevents this from happening? Thanks.

I can provide the PDF document separately. This UI won't let me attach docs > 256K.
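
To check which files show the symptom described above, one option is to measure how many whitespace-separated tokens in the extracted string are a single character long. The following is only a rough sketch: the class LetterSpacingCheck, its method names, and the threshold value are hypothetical and are not part of jPod. It works on the plain string returned by extractor.getContent() and does not change how jPod extracts text.

```java
import java.util.Arrays;

/** Hypothetical helper (not part of jPod): flags text that looks letter-spaced. */
final class LetterSpacingCheck {

    private LetterSpacingCheck() {
    }

    /** Returns the fraction of whitespace-separated tokens that are one character long. */
    static double singleCharTokenRatio(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0.0; // no tokens at all, so nothing to flag
        }
        String[] tokens = trimmed.split("\\s+");
        long single = Arrays.stream(tokens).filter(t -> t.length() == 1).count();
        return (double) single / tokens.length;
    }

    /** True when at least the given fraction of tokens are single characters. */
    static boolean looksLetterSpaced(String text, double threshold) {
        return singleCharTokenRatio(text) >= threshold;
    }
}
```

For example, looksLetterSpaced("U S C h a r t p a c k", 0.8) returns true, while looksLetterSpaced("US Chart pack, latest version", 0.8) returns false. The 0.8 threshold is an arbitrary assumption and would need tuning against real documents.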

Discussion

D G - 2009-09-29

Extraction summary

1.4.3+US+Chart+pack+latest+version+2.pdf.xml

mtraut - 2009-09-30

duplicate of 2869991

mtraut - 2009-09-30

assigned_to: nobody --> mtraut

status: open --> closed-duplicate

mtraut - 2009-09-30

This artifact has been marked as a duplicate of artifact 2869991 with reason:
No explanation provided.
