#2 CSTextExtract does not report visible text

Status: closed
Owner: nobody
Labels: None
Priority: 5
Updated: 2009-07-13
Created: 2009-07-10
Creator: AwayFromTheSun
Private: No

The given PDF contains text that Adobe Acrobat calls "hidden text". When using CSTextExtract, this hidden text is returned instead of the visible text. I'd like to get both the hidden and the visible text.

Discussion

  • AwayFromTheSun
    2009-07-10

    Hidden Text

     
    Attachments
  • mtraut
    2009-07-10

    Hmm, I still don't get it (though I didn't compare character by character). What text do you see that is not extracted?

    On first inspection, there is no Tr 3 (invisible) text in the document. The "ghost" text may stem from text that has been moved out of the visible area or that lies behind some other object.
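
    A rough sketch of that distinction, in plain Java and deliberately not using jPod's API: it assumes each extracted run carries its text rendering mode and an origin in page coordinates, and that the page's crop box is known. The names HiddenTextSketch, TextRun, Box and visibleRuns are hypothetical and only serve the illustration. It treats a run as visible when its mode is not 3 and its origin lies inside the crop box; text hidden behind another object is not detected.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: none of these types exist in jPod or iText. */
final class HiddenTextSketch {

    /** PDF text rendering mode 3: neither fill nor stroke (invisible). */
    static final int INVISIBLE = 3;

    /** Hypothetical model of one extracted text run. */
    static final class TextRun {
        final String text;
        final double x;          // x of the run's origin, page coordinates
        final double y;          // y of the run's origin, page coordinates
        final int renderingMode; // operand of the Tr operator

        TextRun(String text, double x, double y, int renderingMode) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.renderingMode = renderingMode;
        }
    }

    /** Axis-aligned rectangle standing in for the page's crop box. */
    static final class Box {
        final double llx, lly, urx, ury;

        Box(double llx, double lly, double urx, double ury) {
            this.llx = llx;
            this.lly = lly;
            this.urx = urx;
            this.ury = ury;
        }

        boolean contains(double x, double y) {
            return x >= llx && x <= urx && y >= lly && y <= ury;
        }
    }

    /** True if the run is painted (mode is not 3) and starts inside the crop box. */
    static boolean isVisible(TextRun run, Box cropBox) {
        return run.renderingMode != INVISIBLE && cropBox.contains(run.x, run.y);
    }

    /** Keeps only the runs that isVisible accepts, in their original order. */
    static List<TextRun> visibleRuns(List<TextRun> runs, Box cropBox) {
        List<TextRun> result = new ArrayList<>();
        for (TextRun run : runs) {
            if (isVisible(run, cropBox)) {
                result.add(run);
            }
        }
        return result;
    }
}
```

    Checking only the origin of a run is a simplification; a run that starts inside the crop box and extends beyond it would still count as visible here.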

     
  • AwayFromTheSun
    2009-07-13

    Hi,

    Sorry, Acrobat Standard misled me on this. I compared the extracted text and found that both the text from the previous page and the text from the current page are in the PDF for this single page.

    So this isn't a jPod problem at all. I guess the problem is caused by iText, which is used to split the document into single pages. Depending on how a PDF is built, it might put invisible or cropped text and graphics into the file.

    However, thanks for the fast feedback,
    best regards
    Andy

     
  • AwayFromTheSun
    2009-07-13

    • status: open --> closed