How to detect that a PDF is not searchable?

Help
mato
2012-10-31
2013-05-28
  • mato

    mato - 2012-10-31

    Some PDF files are not searchable. When I run such a file through JPOD, I get unreadable output (strange ASCII characters with codes below 32, or text that is completely unreadable in general).

    I was able to detect one group of these documents with this piece of code:

        @Override
        protected void onCharacterFound(PDGlyphs glyphs, Rectangle2D rect) {
            // Only fonts without a ToUnicode CMap are checked here.
            CMap toUnicode = glyphs.getFont().getToUnicode();
            if (toUnicode == null) {
                int decoded = glyphs.getFont().getEncoding().getDecoded(glyphs.getCodepoint());
                // Treat -1 as "the encoding cannot decode this code point".
                if (decoded == -1) {
                    throw new InvalidCharacterException();
                }
            }
        }
    

    But some documents pass this check even though they are not searchable.
    Is there an effective, guaranteed way to find out that a PDF file is not searchable?

     
  • mtraut

    mtraut - 2012-10-31

    IMHO there's no reliable way to detect non-searchable PDFs. In the most basic case, code points are pointers into a font program. By reordering or subsetting the font program you can obfuscate anything without being noticed.
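
Since no check is guaranteed, a heuristic on the extracted text is about the best one can do. The following is a minimal sketch, not part of JPOD: the class `TextSanityCheck`, its method names, and the idea of a ratio threshold are all assumptions made for illustration. It counts how many extracted characters are control characters (the "codes below 32" symptom from the question), private-use or unassigned code points, or the Unicode replacement character.

```java
/**
 * Hypothetical helper (not part of JPOD): estimates whether text extracted
 * from a PDF looks like garbage rather than readable content.
 */
public final class TextSanityCheck {

    private TextSanityCheck() {}

    /** Returns the fraction of code points that are unlikely in readable text. */
    public static double suspiciousRatio(String text) {
        if (text == null || text.isEmpty()) {
            return 1.0; // nothing extractable: treat as not searchable
        }
        int total = 0;
        int suspicious = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            total++;
            if (isSuspicious(cp)) {
                suspicious++;
            }
        }
        return (double) suspicious / total;
    }

    /** Control, private-use, unassigned, lone-surrogate, or U+FFFD code points. */
    static boolean isSuspicious(int cp) {
        if (cp == '\t' || cp == '\n' || cp == '\r') {
            return false; // ordinary whitespace is fine
        }
        int type = Character.getType(cp);
        return type == Character.CONTROL
                || type == Character.PRIVATE_USE
                || type == Character.UNASSIGNED
                || type == Character.SURROGATE
                || cp == 0xFFFD;
    }

    /** True if the suspicious share exceeds the caller-chosen threshold. */
    public static boolean looksUnreadable(String text, double threshold) {
        return suspiciousRatio(text) > threshold;
    }
}
```

One would collect the decoded characters during extraction and then call something like `TextSanityCheck.looksUnreadable(extractedText, 0.3)`; the value 0.3 is an arbitrary starting point, not a recommendation. As the answer above implies, a font whose code points are remapped to other ordinary letters still passes this check, so it can only catch the obvious cases.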

     
