#2 CSTextExtract does not report visible text

Status: closed
Owner: nobody
Labels: None
Priority: 5
Updated: 2009-07-13
Created: 2009-07-10
Private: No

The given PDF contains text that Adobe Acrobat calls "hidden text". When using CSTextExtract, this hidden text is returned instead of the visible text. I'd like to get both the hidden and the visible text.

Discussion

  • AwayFromTheSun

    AwayFromTheSun - 2009-07-10

    Hidden Text

     
  • mtraut

    mtraut - 2009-07-10

    Hmmm, I still can't get it (but I didn't compare character by character). What text do you see that is not extracted?

    On first inspection, there is no Tr 3 (invisible) text in the document. The "ghost" text may stem from text that has been moved out of the visible area or that lies behind some other object.
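
    For readers who want to repeat the "Tr 3" check, here is a minimal sketch in plain Java. It assumes you already have the page's decoded content stream as a string (how you obtain it depends on your PDF library and is not shown). The class and method names are hypothetical and not part of jPod. In a content stream the operand precedes the operator, so invisible text shows up as "3 Tr". The scan is naive: it does not skip string literals, so a literal that happens to contain "3 Tr" gives a false positive.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of jPod or iText.
final class InvisibleTextCheck {
    // Matches the text rendering mode operator "Tr" with operand 3.
    // The lookbehind rejects operands such as "13" or "0.3".
    private static final Pattern INVISIBLE_MODE =
            Pattern.compile("(?<![\\w.+-])3\\s+Tr(?!\\w)");

    // Returns true if the decoded content stream switches to rendering mode 3.
    static boolean usesInvisibleRenderingMode(String decodedContentStream) {
        return INVISIBLE_MODE.matcher(decodedContentStream).find();
    }
}
```

    A stream such as "BT /F1 12 Tf 3 Tr (ghost) Tj ET" is flagged, while one that only uses "0 Tr" is not. A negative result does not rule out text that is merely off-page or covered by another object.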

     
  • AwayFromTheSun

    AwayFromTheSun - 2009-07-13

    Hi,

    Sorry, Acrobat Standard tricked me on this. I compared the extracted text and found that both the text from the previous page and the text from the current page are in the PDF for this single page.

    So this isn't a jPod problem at all. I guess the problem is caused by iText, which is used to split the document into single pages. Depending on how a PDF is built, it might put invisible or cropped text and graphics into the file.
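
    One possible workaround on the extraction side, sketched under assumptions: if the extractor can report a position for each piece of text, fragments whose origin lies outside the page's crop box can be dropped. The Fragment class and the insideBox method below are hypothetical illustrations, not jPod or iText API, and the box is passed as plain lower-left and upper-right coordinates in page units. Whether the stray text in this file actually lies outside the crop box was not verified.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of jPod or iText.
final class CropBoxFilter {
    /** Hypothetical holder: a piece of extracted text and its origin in page coordinates. */
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Keeps the text of fragments whose origin lies inside [llx, urx] x [lly, ury]. */
    static List<String> insideBox(List<Fragment> fragments,
                                  double llx, double lly, double urx, double ury) {
        List<String> kept = new ArrayList<>();
        for (Fragment f : fragments) {
            boolean inside = f.x >= llx && f.x <= urx && f.y >= lly && f.y <= ury;
            if (inside) {
                kept.add(f.text);
            }
        }
        return kept;
    }
}
```

    With a US Letter crop box (0, 0, 612, 792), a fragment at (72, 700) is kept and one at (72, 1500) is dropped. Testing only the origin is a simplification, since a long run of text can start inside the box and extend beyond it.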

    However, thanks for the fast feedback,
    best regards
    Andy

     
  • AwayFromTheSun

    AwayFromTheSun - 2009-07-13
    • status: open --> closed
     