PDF parsing: any volunteers?

2007-07-06
2013-04-15
  • John Graham-Cumming

    Folks,

    Does anyone feel like writing a basic PDF parser so that we can do a better job on PDF spam?  It turns out that many of these PDF spams have a bad XREF (cross-reference table) in them, which leads me to think that a simple PDF parser with some associated pseudowords, like the ones we generate for images, would help.
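
To make the XREF idea concrete, here is a minimal Python sketch, not a real parser. It assumes the whole attachment is available as bytes, and the pseudoword names (`pdf:xref-ok`, `pdf:xref-bad`, `pdf:xref-missing`) are made up for illustration. It only checks that the offset after the last `startxref` keyword points at either a classic `xref` table or an object header (as used by cross-reference streams); it does not validate the table entries themselves.

```python
import re

def xref_pseudoword(data: bytes) -> str:
    """Return a hypothetical pseudoword describing the startxref pointer."""
    # The last 'startxref' keyword is followed by the byte offset
    # of the cross-reference section.
    matches = re.findall(rb"startxref\s+(\d+)", data)
    if not matches:
        return "pdf:xref-missing"
    offset = int(matches[-1])
    if offset >= len(data):
        return "pdf:xref-bad"
    target = data[offset:offset + 32]
    # A classic table starts with 'xref'; a cross-reference stream
    # starts with an object header such as '12 0 obj'.
    if target.startswith(b"xref") or re.match(rb"\d+\s+\d+\s+obj", target):
        return "pdf:xref-ok"
    return "pdf:xref-bad"
```

The returned string could then be fed to the classifier like any other word in the message.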

    John.

     
    • Texas Fett

      Texas Fett - 2007-07-14

      This article talks about PDF spam.  It brings up a good point: PDFs can be huge, so parsing the whole thing is really going to slow things down.

      http://www.internetnews.com/security/article.php/3688636

      Looking for bad XREFs might work for a while, but spammers will fix that.  I think we can presume they are using free PDF creation software.  Unless they are breaking the XREF on purpose, they are probably not the only ones creating PDFs with bad XREFs.

      I think we could use a pseudoword for the number of pages in the PDF, bucketed as 1, <=5, <=100, >100.  All PDF spam I have seen so far is a single page.  Spammers must keep the spam small, so sending more than a couple of pages seems very unlikely.  But the page count of large PDFs may help in other classifications.
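
The page-count buckets could be sketched roughly like this in Python. The pseudoword names are hypothetical, and the page counter is deliberately crude: it just counts `/Type /Page` entries in the raw bytes, so it will undercount PDFs that keep their page objects inside compressed object streams.

```python
import re

def count_pages(data: bytes) -> int:
    """Rough page count: number of '/Type /Page' entries (not '/Pages')."""
    return len(re.findall(rb"/Type\s*/Page(?![A-Za-z])", data))

def page_count_pseudoword(pages: int) -> str:
    """Map a page count onto the buckets 1, <=5, <=100, >100."""
    if pages <= 0:
        return "pdf:pages-unknown"
    if pages == 1:
        return "pdf:pages-1"
    if pages <= 5:
        return "pdf:pages-le5"
    if pages <= 100:
        return "pdf:pages-le100"
    return "pdf:pages-gt100"
```

Keeping the bucketing separate from the counting means the counter can be swapped for a proper parser later without changing the pseudowords.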

      We should look at the page size or aspect ratio as well.  Much of the image-only PDF spam has had page dimensions like 7x3" or 10x2".  Those are unlikely to be legitimate PDFs.  If we look for sizes that come close to standard paper sizes, we should be able to tell spam from real PDFs more easily.
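
A rough Python sketch of the page-size check follows. It assumes the page dimensions can be read from the first `/MediaBox` array in the raw bytes (PDF units are points, 72 per inch), compares them against a small, illustrative list of standard sizes with a 3% tolerance, and ignores orientation. The size list, the tolerance and the pseudoword names are all assumptions.

```python
import re

# Standard page sizes in points (1/72 inch), short side first.
STANDARD_SIZES_PT = {"letter": (612, 792), "a4": (595, 842), "legal": (612, 1008)}

MEDIABOX_RE = re.compile(
    rb"/MediaBox\s*\[\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)"
    rb"\s+(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\]"
)

def page_size_pseudoword(data: bytes, tolerance: float = 0.03) -> str:
    """Classify the first MediaBox as a standard size, odd, or unknown."""
    m = MEDIABOX_RE.search(data)
    if not m:
        return "pdf:size-unknown"
    x0, y0, x1, y1 = (float(v.decode("ascii")) for v in m.groups())
    # Sort the sides so landscape and portrait pages compare the same way.
    short, long_ = sorted((abs(x1 - x0), abs(y1 - y0)))
    for name, (w, h) in STANDARD_SIZES_PT.items():
        if abs(short - w) <= tolerance * w and abs(long_ - h) <= tolerance * h:
            return "pdf:size-" + name
    return "pdf:size-odd"
```

A 7x3" spam page is 504x216 points, which is nowhere near any entry in the list, so it would come out as `pdf:size-odd`.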

      PDFs have a bunch of other metadata to look at: the program that created it, the author, embedded fonts, embedded images, security restrictions, etc.  Parsing the text would certainly be helpful in classifying for different buckets, but for now just dealing with the attributes is probably enough to catch most PDF spam.
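
As an illustration of the attribute-only approach, here is a small Python sketch that turns a few of these attributes into pseudowords without parsing any page content. It assumes the Info dictionary values are stored as plain literal strings in uncompressed, unencrypted bytes (hex strings and escaped parentheses are not handled), and every pseudoword name is invented for the example.

```python
import re

def metadata_pseudowords(data: bytes) -> list:
    """Return hypothetical pseudowords for a few PDF attributes."""
    words = []
    for key in (b"Producer", b"Creator", b"Author"):
        # Matches e.g. '/Producer (Some Tool 1.0)'.
        m = re.search(rb"/" + key + rb"\s*\(([^)]*)\)", data)
        if m:
            value = m.group(1).decode("latin-1").strip().lower()
            token = re.sub(r"[^a-z0-9]+", "-", value).strip("-")
            words.append(f"pdf:{key.decode().lower()}-{token or 'empty'}")
    if b"/Encrypt" in data:
        words.append("pdf:encrypted")  # security restrictions present
    if re.search(rb"/Subtype\s*/Image", data):
        words.append("pdf:has-image")
    if re.search(rb"/FontFile[23]?\b", data):
        words.append("pdf:has-embedded-font")
    return words
```

The absence of a word can matter as much as its presence: an image-only spam PDF would typically yield `pdf:has-image` with no `pdf:has-embedded-font`.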

       
