POPFile - Automatic Email Classification / Discussion / Bleeding Edge

PDF parsing: any volunteers

John Graham-Cumming - 2007-07-06

Folks,

Does anyone feel like writing a basic PDF parser so that we can do a better job on PDF spam? It turns out that many of these PDF spams have a bad XREF in them, which leads me to think that a simple PDF parser with some associated pseudowords, like the ones we use for images, would help.

John.

- Texas Fett - 2007-07-14
  
  This article talks about PDF spam. It brings up a good point: PDFs can be huge, so parsing the whole thing is really going to slow things down.
  
  http://www.internetnews.com/security/article.php/3688636
  
  Looking for bad XREFs might work for a while, but spammers will fix that. I think we can presume they are using free PDF creation software. Unless they are doing it on purpose, they might not be the only ones creating PDFs with bad XREFs.
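
  For reference, a bad XREF check can be done without parsing the whole file. The following is only a rough sketch in Python, not POPFile code: it reads the offset after the last `startxref` keyword and checks whether that offset points at an `xref` table or at an object header (which is what a PDF 1.5+ cross-reference stream looks like). The function name and the `pdf:xref-*` pseudoword names are made up for illustration.

```python
import re


def xref_pseudoword(data: bytes) -> str:
    """Return a hypothetical pseudoword describing the startxref pointer."""
    # The last "startxref" keyword is followed by the byte offset of the
    # cross-reference section.
    pos = data.rfind(b"startxref")
    if pos == -1:
        return "pdf:xref-missing"
    m = re.match(rb"startxref\s+(\d+)", data[pos:])
    if m is None:
        return "pdf:xref-missing"
    offset = int(m.group(1))
    # An offset past the end of the file yields an empty slice, which
    # falls through to "bad" below.
    target = data[offset:offset + 32]
    # A classic table starts with "xref"; a cross-reference stream starts
    # with an object header such as "12 0 obj".
    if target.startswith(b"xref") or re.match(rb"\d+\s+\d+\s+obj", target):
        return "pdf:xref-ok"
    return "pdf:xref-bad"
```

  This only touches the tail of the file plus a few bytes at the offset, so it stays cheap even for huge PDFs. It is deliberately strict (no leading whitespace allowed at the offset), which a real implementation might want to relax.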
  
  I think we could use a pseudoword for the PDF's number of pages, bucketed as 1, <=5, <=100, >100. All the PDF spam I have seen so far is a single page. They must keep the spam small, so sending more than a couple of pages seems very unlikely. But the size of large PDFs may help in other classifications.
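
  To make the page-count buckets concrete, here is a small Python sketch. The bucket boundaries are the ones above; the pseudoword names and both function names are invented for illustration. The page counter is a crude assumption: it just counts `/Type /Page` entries in the raw bytes, so it will return 0 for PDFs that keep their page objects inside compressed object streams.

```python
import re


def count_pages(data: bytes) -> int:
    """Crudely count page objects in raw PDF bytes."""
    # Match "/Type /Page" but not "/Type /Pages" (the page-tree node).
    return len(re.findall(rb"/Type\s*/Page(?![A-Za-z])", data))


def page_count_pseudoword(pages: int) -> str:
    """Map a page count onto the buckets 1, <=5, <=100, >100."""
    if pages < 1:
        return "pdf:pages-unknown"
    if pages == 1:
        return "pdf:pages-1"
    if pages <= 5:
        return "pdf:pages-le5"
    if pages <= 100:
        return "pdf:pages-le100"
    return "pdf:pages-gt100"
```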
  
  We should look at the relative page size or aspect ratio as well. Much of the image-only PDF spam has had measurements like 7x3" or 10x2". Those are unlikely to be legitimate PDFs. If we look for sizes that come close to standard paper sizes, we should be able to tell spam from real PDFs more easily.
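
  A sketch of the page size idea in Python, under some assumptions: the size is taken from the first `/MediaBox` found in the raw bytes, PDF units are points (72 per inch), and "close to a standard size" means within 5% of Letter, Legal or A4 in either orientation. The list of sizes, the tolerance, and all names here are illustrative choices, not anything POPFile defines.

```python
import re

# Standard sizes in points as (short side, long side). Illustrative subset.
STANDARD_SIZES_PT = {
    "letter": (612, 792),
    "legal": (612, 1008),
    "a4": (595, 842),
}

_NUM = rb"(-?\d+(?:\.\d+)?)"
_MEDIA_BOX = re.compile(
    rb"/MediaBox\s*\[\s*" + _NUM + rb"\s+" + _NUM + rb"\s+" + _NUM + rb"\s+" + _NUM + rb"\s*\]"
)


def media_box_size(data: bytes):
    """Return (width, height) in points from the first /MediaBox, or None."""
    m = _MEDIA_BOX.search(data)
    if m is None:
        return None
    x0, y0, x1, y1 = (float(v.decode("ascii")) for v in m.groups())
    return abs(x1 - x0), abs(y1 - y0)


def page_size_pseudoword(width: float, height: float, tolerance: float = 0.05) -> str:
    """Classify a page as near a standard paper size or not."""
    # Sort the sides so portrait and landscape pages compare the same way.
    short, long_ = sorted((width, height))
    for std_short, std_long in STANDARD_SIZES_PT.values():
        if (abs(short - std_short) <= tolerance * std_short
                and abs(long_ - std_long) <= tolerance * std_long):
            return "pdf:size-standard"
    return "pdf:size-odd"
```

  With these numbers a 7x3" page (504x216 points) comes out as `pdf:size-odd`, while Letter and A4 in either orientation come out as `pdf:size-standard`.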
  
  PDFs have a bunch of other metadata to look at: the program that created it, the author, embedded fonts, embedded images, security restrictions, etc. Parsing the text would certainly be helpful in classifying for different buckets, but for now just dealing with the attributes is probably enough to catch most PDF spam.
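
  As a rough illustration of attribute-only parsing, the Python sketch below pulls `/Producer`, `/Creator` and `/Author` out of the raw bytes and flags the presence of an `/Encrypt` entry. It assumes the values are plain literal strings in parentheses; hex strings, UTF-16 strings, escaped parentheses and metadata stored in compressed streams are not handled. The pseudoword format is hypothetical.

```python
import re


def metadata_pseudowords(data: bytes) -> list:
    """Return hypothetical pseudowords for a few document attributes."""
    words = []
    for key in (b"Producer", b"Creator", b"Author"):
        # Only handles simple literal strings such as "/Producer (Foo 1.0)".
        m = re.search(rb"/" + key + rb"\s*\(([^)]*)\)", data)
        if m is None:
            continue
        value = m.group(1).decode("latin-1").lower()
        # Collapse anything that is not a letter or digit into a hyphen.
        token = re.sub(r"[^a-z0-9]+", "-", value).strip("-")
        if token:
            words.append("pdf:" + key.decode("ascii").lower() + ":" + token)
    # An /Encrypt entry in the trailer indicates security restrictions.
    if b"/Encrypt" in data:
        words.append("pdf:encrypted")
    return words
```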
  