jPod intarsys PDF library / Support Requests / #2 CSTextExtract does not report visible text

#2 CSTextExtract does not report visible text

Status: closed

Owner: nobody

Labels: None

Priority: 5

Updated: 2009-07-13

Created: 2009-07-10

Creator: AwayFromTheSun

Private: No

The given PDF contains text that Adobe Acrobat calls "hidden text". When using CSTextExtract, this hidden text is returned instead of the visible text. I'd like to get both the hidden and the visible text.

Discussion

AwayFromTheSun - 2009-07-10

Hidden Text

HiddenText.pdf

mtraut - 2009-07-10

Hmmm, I still can't get it (but I didn't compare character by character). What text do you see that is not extracted?

On a first inspection, there is no Tr 3 (invisible) text in the document. The "ghost" text may stem from text that is moved out of the visible area or is behind some other object.
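As a rough illustration of the "Tr 3" check mentioned above, here is a minimal sketch that scans an already decoded content stream for text rendering mode operators. It uses no PDF library: the class and method names are hypothetical, the input is assumed to be a plain decoded content stream string, and the whitespace tokenizer is naive (it would be fooled by "3 Tr" appearing inside a string literal).

```java
import java.util.ArrayList;
import java.util.List;

public class InvisibleTextScan {

    /** Collects the operand of every Tr operator, in order of appearance. */
    static List<Integer> renderModes(String contentStream) {
        List<Integer> modes = new ArrayList<>();
        // Naive tokenization: split on whitespace only (assumption, see lead-in).
        String[] tokens = contentStream.trim().split("\\s+");
        for (int i = 1; i < tokens.length; i++) {
            if (tokens[i].equals("Tr")) {
                try {
                    modes.add(Integer.parseInt(tokens[i - 1]));
                } catch (NumberFormatException e) {
                    // Operand is not a plain integer; ignore this occurrence.
                }
            }
        }
        return modes;
    }

    /** True if any Tr operator selects mode 3 (neither fill nor stroke). */
    static boolean hasInvisibleText(String contentStream) {
        return renderModes(contentStream).contains(3);
    }

    public static void main(String[] args) {
        String visible = "BT /F1 12 Tf 0 Tr (Hello) Tj ET";
        String hidden = "BT /F1 12 Tf 3 Tr (Ghost) Tj ET";
        System.out.println(hasInvisibleText(visible));
        System.out.println(hasInvisibleText(hidden));
    }
}
```

A stream with no mode 3 operator, as in this document, means the "ghost" text has to be hidden some other way, for example by lying outside the visible area or behind another object.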

AwayFromTheSun - 2009-07-13

Hi,

Sorry, Acrobat Standard misled me on this. I compared the extracted text and found that the PDF for this single page contains both the text from the previous page and the text from the current page.

So this isn't a jPod problem at all. I guess the problem is caused by iText, which is used to split the document into single pages. Depending on how a PDF is built, it might put invisible or cropped text and graphics into the file.

However, thanks for the fast feedback,
best regards
Andy
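
For text that ends up cropped rather than marked invisible, one way to tell the two sets apart is to compare each text origin against the page's crop box. The sketch below only shows that comparison. The class name, the box layout `{llx, lly, urx, ury}`, and the sample coordinates are assumptions for illustration, and obtaining text positions from a particular library is not shown.

```java
public class CropBoxCheck {

    /**
     * Returns true if the point (x, y) lies inside the box.
     * box = {llx, lly, urx, ury}, all in the same coordinate space as x and y.
     */
    static boolean isInside(double x, double y, double[] box) {
        return x >= box[0] && x <= box[2] && y >= box[1] && y <= box[3];
    }

    public static void main(String[] args) {
        // Hypothetical US Letter crop box in default user space units.
        double[] cropBox = {0, 0, 612, 792};
        System.out.println(isInside(72, 720, cropBox)); // text origin on the page
        System.out.println(isInside(72, 900, cropBox)); // text origin above the page
    }
}
```

Text whose origin falls outside the box would still be returned by a plain extraction pass, which matches the behaviour described in this thread.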

AwayFromTheSun - 2009-07-13

status: open --> closed
