How to detect that a PDF is not searchable?

  • mato

    mato - 2012-10-31

    Some PDF files are not searchable. When I process such a file with JPOD, I get unreadable output (strange characters with ASCII codes below 32, or output that is otherwise completely unreadable).

    I was able to detect some of these documents with this piece of code:

            protected void onCharacterFound(PDGlyphs glyphs, Rectangle2D rect) {
                // Without a ToUnicode CMap, fall back to the font's encoding.
                CMap toUnicode = glyphs.getFont().getToUnicode();
                if (toUnicode == null) {
                    int decoded = glyphs.getFont().getEncoding().getDecoded(glyphs.getCodepoint());
                    // Treat -1 as "this code point cannot be decoded".
                    if (decoded == -1) throw new InvalidCharacterException();
                }
            }

    But some documents pass this check.
    Is there an effective, guaranteed way to find out that a PDF file is not searchable?

  • mtraut

    mtraut - 2012-10-31

    IMHO there's no reliable way to detect non-searchable PDFs. In the most basic case, code points are pointers into a font program. By reordering / subsetting the font program you can obfuscate anything without being noticed.
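
    That said, a rough heuristic can still catch the obvious cases described in the question (control characters with codes below 32). The sketch below is not JPOD code: it works on whatever string your extractor returns, and the class name `ExtractedTextCheck`, its methods, and the threshold are assumptions made up for illustration.

```java
public final class ExtractedTextCheck {

    private ExtractedTextCheck() {}

    /**
     * Returns the fraction of code points that look like decoding garbage.
     * Hypothetical helper: the criteria below are a guess, not a standard.
     */
    public static double suspiciousRatio(String text) {
        if (text == null || text.isEmpty()) {
            return 1.0; // no extractable text at all is itself a warning sign
        }
        int total = 0;
        int suspicious = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            total++;
            if (isSuspicious(cp)) {
                suspicious++;
            }
        }
        return (double) suspicious / total;
    }

    private static boolean isSuspicious(int cp) {
        // Ordinary whitespace control characters are fine.
        if (cp == '\t' || cp == '\n' || cp == '\r') {
            return false;
        }
        // Other control characters below 32, as reported in the question.
        if (cp < 32) {
            return true;
        }
        // U+FFFD is the Unicode replacement character.
        if (cp == 0xFFFD) {
            return true;
        }
        int type = Character.getType(cp);
        return type == Character.PRIVATE_USE || type == Character.UNASSIGNED;
    }

    /** The threshold is an assumption; tune it on your own documents. */
    public static boolean looksUnsearchable(String text, double threshold) {
        return suspiciousRatio(text) > threshold;
    }
}
```

    This only flags text that decodes to obviously invalid characters. A font whose code points have been reordered to map onto ordinary letters produces text that passes this check, so it remains a heuristic and not a guarantee.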

