  • WitkOO

    WitkOO - 2011-08-19

    Is there a way to skip hidden text when reading a document? I am doing PDF text extraction, and the hidden content shows up in the output. I want to get rid of it.

    Thank you very much!

  • mtraut

    mtraut - 2011-08-19

    There's no predefined extraction parameter for this, but jPod is a programming library anyway.

    So the way to do it (I haven't tested this in the real world) should be an (anonymous) subclass of CSTextExtractor with "onCharacterFound" overridden like this:

    protected void onCharacterFound(PDGlyphs glyphs, Rectangle2D rect) {
        // RENDERING_MODE_NONE means the glyph is not painted: skip it, pass everything else on.
        if (textState.renderingMode != TextState.RENDERING_MODE_NONE) {
            super.onCharacterFound(glyphs, rect);
        }
    }
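
    For reference, the PDF specification defines text rendering modes 0 to 7, and only mode 3 (invisible) and mode 7 (add to clipping path only) paint nothing. A minimal standalone sketch of that rule follows. The class and method names are hypothetical and not part of jPod, and it assumes that RENDERING_MODE_NONE corresponds to mode 3:

```java
/** Hypothetical helper (not a jPod class) encoding the PDF text rendering modes. */
final class TextRenderingModes {
    private TextRenderingModes() {
    }

    /**
     * Returns true if glyphs shown in the given rendering mode are painted.
     * Modes per the PDF specification: 0 fill, 1 stroke, 2 fill and stroke,
     * 3 invisible, 4 to 6 like 0 to 2 plus clipping, 7 clipping only.
     */
    static boolean paintsGlyph(int mode) {
        if (mode < 0 || mode > 7) {
            throw new IllegalArgumentException("unknown text rendering mode: " + mode);
        }
        // Only modes 3 and 7 leave no visible marks on the page.
        return mode != 3 && mode != 7;
    }
}
```

    A document that uses only mode 0 or 2 therefore has no text hidden by rendering mode, and any hiding has to come from something else.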

  • WitkOO

    WitkOO - 2011-08-22

    Thanks, but this does not work. Maybe I was a bit unclear: the data I don't want to show looks like edit notes etc., e.g. "Not Printed" or such.
    I also tried collecting the rendering modes from various PDFs: most of them had only 0, one had 0 and 2, one had only 3, and one had none at all, which is a bit surprising to me. Maybe you have some other ideas?

    Thanks a lot!

  • mtraut

    mtraut - 2011-08-22

    Well, there's no lack of ideas, only time.

    Without the document there's no way to determine the specific way to "hide" the text. Common options are:

    - White text on white background
    - Print text outside of CropBox or some clipping region
    - Print text and put another opaque object on it

    Examine your content. The first two should be easy to filter in the same way as the example given above: check the current text state when the character-found event fires, and only pass the character on if it would be visible.

    The last one will give you a harder time…
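
    The first two checks can be sketched with plain JDK classes. This is only an illustration under assumptions: HiddenTextFilter is a hypothetical helper, not a jPod class, and it assumes you can obtain the glyph rectangle, the current fill color, and the CropBox in the same coordinate space from your extractor.

```java
import java.awt.Color;
import java.awt.geom.Rectangle2D;

/** Hypothetical helper (not part of jPod): decides whether a glyph would be visible. */
final class HiddenTextFilter {
    // Visible region, e.g. the page CropBox, in the same coordinates as the glyph rectangles.
    private final Rectangle2D visibleArea;
    // Assumed page background color; white for most documents.
    private final Color background;

    HiddenTextFilter(Rectangle2D visibleArea, Color background) {
        this.visibleArea = visibleArea;
        this.background = background;
    }

    /** True if the glyph box overlaps the visible area (empty boxes never overlap). */
    boolean isInsideVisibleArea(Rectangle2D glyphRect) {
        return visibleArea.intersects(glyphRect);
    }

    /** True if the fill color differs from the assumed background color. */
    boolean contrastsWithBackground(Color fill) {
        return !fill.equals(background);
    }

    /** Combines both checks; call this before passing a character on. */
    boolean isVisible(Rectangle2D glyphRect, Color fill) {
        return isInsideVisibleArea(glyphRect) && contrastsWithBackground(fill);
    }
}
```

    Inside an overridden onCharacterFound you would call isVisible(rect, currentFillColor) and only invoke super when it returns true. The exact color comparison is deliberately strict; near-white text would need a tolerance.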


