
#64 note runs of short words

open
None
5
2006-04-23
2006-04-23
No

I've recently been seeing a lot of pharmacy spam with few, if any, clues.
The message bodies look like this:

X j A m N j A d X h
M k E z R d I p D u I m A c
C o I d A t L j I v S j
A j M w B p I q E s N p
V a I a A g G z R j A f
S b O n M u A p
V f A g L m I h U q M b
S u A u V o E q n O t V y E n R d r 7 b 0 k % d x W c I n T p H d u O s
U h R i b S s H q O p P h ! k

http://www.chilreanno.com

This is followed by some drivel meant to boost "good" words. The URL
changes frequently and, like most spam, it seems to come from all over
the place, so there are very few clues present for SpamBayes to munch on.

The attached patch pays attention to runs of words which are too
short for other consideration and emits a token that's the base 2
log of the longest run of such words seen in the message. The result
seems to add an extra useful structural token to the mix and makes
these particular types of spam less likely to score unsure.
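A minimal sketch of the idea, not the attached patch itself. It assumes that "too short" means fewer than three characters and that words are split on whitespace; the function name and the `min_word_len` parameter are made up for illustration.

```python
def short_run_token(text, min_word_len=3):
    """Return a 'short:N' token for the longest run of short words.

    N is the floor of the base 2 log of the longest run of consecutive
    words shorter than min_word_len. Returns None if the text has no
    short words at all. The cutoff of 3 is an assumption.
    """
    longest = 0
    current = 0
    for word in text.split():
        if len(word) < min_word_len:
            # Still inside a run of short words.
            current += 1
            longest = max(longest, current)
        else:
            # A normal-length word ends the run.
            current = 0
    if longest == 0:
        return None
    # For a positive int n, n.bit_length() - 1 == floor(log2(n)).
    return "short:%d" % (longest.bit_length() - 1)


print(short_run_token("X j A m N j A d X h"))
print(short_run_token("a perfectly ordinary sentence"))
```

Taking the log keeps the number of distinct tokens small, so a run of 70 short words and a run of 100 both land in the same bucket.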

I didn't just check it in for a couple of reasons. One, I was targeting
just a single kind of message. I'm not anxious to get into a
SpamAssassin-type escalation of the "hey, this kind of message does
this, let's try that" sort. I'd prefer it if the concept were
applicable to a broader variety of spams. Two, I no longer have any
sort of test database other than my current personal collection of ham
and spam (between 300 and 400 messages), so I can't really test it
properly to see if it's a net win.

Like I said, it seemed to help in this instance. Here's my collection
of short:* tokens:

token,nspam,nham,spam prob
short:7,3,0,0.934782608696
short:6,6,0,0.96511627907
short:5,2,1,0.5
short:4,3,2,0.366449889676
short:3,3,1,0.5
short:2,19,15,0.319154484346
short:1,196,69,0.5
short:0,63,25,0.5

My database is currently a bit unbalanced (5 spams for every 2 hams),
hence (I think) the unusual spamprobs.
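For reference, a rough sketch of how per-token counts turn into a spam probability. It assumes the usual Robinson-style adjustment with a strength of 0.45 and a prior of 0.5, which I believe are the SpamBayes defaults. The function name and the message totals (250 spam, 100 ham) are hypothetical, chosen only to match the 5:2 ratio.

```python
def token_spam_prob(spam_count, ham_count, total_spam, total_ham,
                    strength=0.45, prior=0.5):
    """Estimate a token's spam probability from its counts.

    strength and prior are assumed defaults; total_spam and total_ham
    are the number of trained spam and ham messages.
    """
    # Normalise by corpus size so an unbalanced database is accounted for.
    spam_ratio = spam_count / total_spam
    ham_ratio = ham_count / total_ham
    raw = spam_ratio / (spam_ratio + ham_ratio)
    # Pull the raw estimate towards the prior when evidence is thin.
    n = spam_count + ham_count
    return (strength * prior + n * raw) / (strength + n)


# Hypothetical totals in a 5:2 ratio.
print(token_spam_prob(3, 0, total_spam=250, total_ham=100))
print(token_spam_prob(6, 0, total_spam=250, total_ham=100))
```

With no ham occurrences the totals drop out, and the two calls come out at about 0.9348 and 0.9651, in line with the short:7 and short:6 rows above.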

Assigning to Tony just so someone has a chance to give it the once
over during the 1.1 alpha phase.

Skip

Discussion

  • Skip Montanaro

    Skip Montanaro - 2006-04-23
     
  • Tony Meyer

    Tony Meyer - 2006-05-05

    Logged In: YES
    user_id=552329

    Just so you know - I am looking at this. I have three
    corpora that I am testing with at the moment: the (fairly
    old and small) SpamAssassin Public Corpus, the 2005 TREC
    Corpus, and about 50,000 messages of my own mail from a few
    months back.

    I did a 'timcv.py -n5' with these three. There was
    basically no change with either the SA or the TREC corpus. I
    suspect that's because those are made up of fairly old
    messages and this trick wasn't used much until recently.
    There was a very small win with my messages.

    My plan is to get a more recent collection of my mail and
    try it on that, as I've seen these much more in recent
    times, too, and I suspect it'll do better then.

     
