Skip hidden content

WitkOO
2011-08-19
2013-05-28
  • WitkOO
    2011-08-19

    Hi,
    is there a way to skip hidden text when reading a document? I'm doing PDF text extraction, and the hidden text shows up in the output. I want to get rid of it.

    Thank you very much!

     
  • mtraut
    2011-08-19

    There's no predefined extraction parameter for this, but jPod is a programming library anyway.

    So the way to do it (although I haven't tested this in the real world) should be an (anonymous) subclass of CSTextExtractor with "onCharacterFound" redefined like this:

    <code>
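    protected void onCharacterFound(PDGlyphs glyphs, Rectangle2D rect) {
        // Skip glyphs whose text rendering mode paints nothing (hidden text).
        if (textState.renderingMode == TextState.RENDERING_MODE_NONE) {
            return;
        }
        // Everything else is extracted as usual.
        super.onCharacterFound(glyphs, rect);
    }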
    </code>
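
    For reference, the PDF text rendering mode (set with the Tr operator) takes the values 0 to 7, and only modes 3 and 7 paint nothing. Here is a small standalone sketch of that table, independent of jPod. The enum and its names are hypothetical and only illustrate the values from the PDF specification; how they map to the jPod constants is not checked here.

```java
/** PDF text rendering modes as set by the Tr operator (hypothetical helper, not part of jPod). */
enum TextRenderingMode {
    FILL(0, true),
    STROKE(1, true),
    FILL_STROKE(2, true),
    INVISIBLE(3, false),        // neither fill nor stroke
    FILL_CLIP(4, true),
    STROKE_CLIP(5, true),
    FILL_STROKE_CLIP(6, true),
    CLIP(7, false);             // only added to the clipping path

    final int code;
    final boolean painted;

    TextRenderingMode(int code, boolean painted) {
        this.code = code;
        this.painted = painted;
    }

    /** Looks up the mode for a raw Tr operand; rejects values outside 0 to 7. */
    static TextRenderingMode fromCode(int code) {
        for (TextRenderingMode mode : values()) {
            if (mode.code == code) {
                return mode;
            }
        }
        throw new IllegalArgumentException("Unknown text rendering mode: " + code);
    }
}
```

    With something like this, a filter can drop every glyph for which TextRenderingMode.fromCode(mode).painted is false, instead of comparing against a single constant.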

     
  • WitkOO
    2011-08-22

    Hi,
    thanks, but this does not work. Maybe I'm a bit off here: the data I don't want to show looks like edit notes etc., e.g. "Not Printed" or similar.
    Apart from that, I tried collecting the rendering modes from various PDFs: most of them had 0, one had 0 and 2, one had only 3 and one had none, which is a bit surprising to me. Maybe you have some other ideas?

    Thanks a lot!

     
  • mtraut
    2011-08-22

    Well, there's no lack of ideas, only time.

    Without the document there's no way to determine the specific way to "hide" the text. Common options are:

    - White text on a white background
    - Text printed outside of the CropBox or some clipping region
    - Text printed and then covered by another opaque object

    Examine your content. The first two should be easy to filter in the same way as the example given above: check the text state when the character found event fires for the current character.

    The last one will give you a harder time…
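
    For what it's worth, here is a minimal sketch of the checks for the first two cases, in plain Java with no jPod calls. The class and method names are made up for illustration. How you get the fill color, the background color and the visible region (CropBox or clip bounds) out of your extractor is left open, and the tolerance of 8 is an arbitrary guess.

```java
import java.awt.Color;
import java.awt.geom.Rectangle2D;

/** Hypothetical helper for deciding whether a glyph is likely visible on the page. */
final class GlyphVisibility {
    private GlyphVisibility() {
    }

    /**
     * True if the glyph box overlaps the visible region at least partly.
     * Note: Rectangle2D.intersects returns false for empty (zero-area) boxes.
     */
    static boolean insideVisibleRegion(Rectangle2D glyphBox, Rectangle2D visibleRegion) {
        return glyphBox.intersects(visibleRegion);
    }

    /** True if every RGB channel of the fill is within the tolerance of the background. */
    static boolean blendsIntoBackground(Color fill, Color background, int tolerance) {
        return Math.abs(fill.getRed() - background.getRed()) <= tolerance
                && Math.abs(fill.getGreen() - background.getGreen()) <= tolerance
                && Math.abs(fill.getBlue() - background.getBlue()) <= tolerance;
    }

    /** Combines both checks; the tolerance of 8 per channel is an assumption. */
    static boolean isProbablyVisible(Rectangle2D glyphBox, Rectangle2D visibleRegion,
                                     Color fill, Color background) {
        return insideVisibleRegion(glyphBox, visibleRegion)
                && !blendsIntoBackground(fill, background, 8);
    }
}
```

    The idea would be to call isProbablyVisible from the overridden onCharacterFound with the glyph rectangle and return early when it yields false. It does nothing for the third case, since that needs knowledge of everything painted after the text.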