#64 note runs of short words

Status: open

Owner: Tony Meyer

Labels: None

Priority: 5

Updated: 2006-04-23

Created: 2006-04-23

Creator: Skip Montanaro

Private: No

I've recently been seeing a lot of pharmacy spam with few, if any, clues.
The message bodies look like this:

X j A m N j A d X h
M k E z R d I p D u I m A c
C o I d A t L j I v S j
A j M w B p I q E s N p
V a I a A g G z R j A f
S b O n M u A p
V f A g L m I h U q M b
S u A u V o E q n O t V y E n R d r 7 b 0 k % d x W c I n T p H d u O s
U h R i b S s H q O p P h ! k

http://www.chilreanno.com

Followed by some drivel meant to boost "good" words. The URL changes
frequently, and like most spam, it seems to come from all over the place,
so there are very few clues present for SpamBayes to munch on.

The attached patch pays attention to runs of words which are too
short for other consideration and emits a token that's the base 2
log of the longest run of such words seen in the message. The result
seems to add an extra useful structural token to the mix and makes
these particular types of spam less likely to score unsure.
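As a rough illustration of the idea (not the attached tokenizer.diff itself), the token could be generated along these lines. The two-character cutoff for a "short" word and the function names are assumptions made for this sketch.

```python
import math

# Assumption for this sketch: a word of one or two characters is "short",
# i.e. too short to be emitted as an ordinary token.
SHORT_WORD_MAX_LEN = 2


def longest_short_run(text, max_len=SHORT_WORD_MAX_LEN):
    """Return the length of the longest run of consecutive short words."""
    longest = current = 0
    for word in text.split():
        if len(word) <= max_len:
            current += 1
            longest = max(longest, current)
        else:
            # A normal-length word ends the current run.
            current = 0
    return longest


def short_run_token(text):
    """Return a 'short:N' token, N being the floor of log2 of the longest run.

    Returns None when the text contains no short words at all.
    """
    longest = longest_short_run(text)
    if longest == 0:
        return None
    return "short:%d" % int(math.log2(longest))


body = "X j A m N j A d X h\nM k E z R d I p D u I m A c"
print(short_run_token(body))
```

Taking the log keeps the number of distinct tokens small: every run length from 64 to 127 maps to the same short:6 token, for example.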

I didn't just check it in for a couple of reasons. One, I was targeting just
a single kind of message. I'm not anxious to get into a
SpamAssassin-type escalation of, "hey, this kind of message does
this, let's try that", sort of thing. I'd prefer it if the concept were
applicable to a broader variety of spams. Two, I no longer have any
sort of test database other than my current personal collection of ham
and spam (between 300 and 400 messages), so I can't really test it
properly to see if it's a net win.

Like I said, it seemed to help in this instance. Here's my collection
of short:* tokens:

token,nspam,nham,spam prob
short:7,3,0,0.934782608696
short:6,6,0,0.96511627907
short:5,2,1,0.5
short:4,3,2,0.366449889676
short:3,3,1,0.5
short:2,19,15,0.319154484346
short:1,196,69,0.5
short:0,63,25,0.5

My database is currently a bit unbalanced (5 spams for every 2 hams),
hence (I think) the unusual spamprobs.
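For reference, the two rows with nham = 0 (short:7 and short:6) are consistent with the Robinson-style adjustment of a raw token probability, using a prior of 0.5 and a prior strength of 0.45. The sketch below is only an illustration of that arithmetic. The function name and the corpus totals (250 spam, 100 ham) are hypothetical, chosen to match the stated 5:2 ratio. The other rows depend on the real totals and possibly on other options, so they are not reproduced here.

```python
def adjusted_spamprob(nspam, nham, total_spam, total_ham, s=0.45, x=0.5):
    """Robinson-style adjusted spam probability for a single token.

    nspam, nham  -- number of spam / ham messages containing the token
    total_spam, total_ham -- number of messages trained of each kind
    s -- strength given to the prior (assumed 0.45 here)
    x -- prior probability for a token with no evidence (assumed 0.5 here)
    """
    n = nspam + nham
    if n == 0:
        return x  # no evidence at all: fall back to the prior
    spam_ratio = nspam / total_spam
    ham_ratio = nham / total_ham
    p = spam_ratio / (spam_ratio + ham_ratio)  # raw probability
    return (s * x + n * p) / (s + n)


# Hypothetical totals (5 spams for every 2 hams). With nham == 0 the raw
# probability is 1.0 whatever the totals are, so they do not affect these rows.
print(adjusted_spamprob(3, 0, 250, 100))  # compare with the short:7 row
print(adjusted_spamprob(6, 0, 250, 100))  # compare with the short:6 row
```

With only three or six messages of evidence, the prior pulls the probability visibly away from 1.0, which is why short:6 scores higher than short:7 here.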

Assigning to Tony just so someone has a chance to give it the once
over during the 1.1 alpha phase.

Skip

Discussion

Skip Montanaro - 2006-04-23

tokenizer.diff

Tony Meyer - 2006-05-05

Just so you know - I am looking at this. I have three
corpora that I am testing with at the moment: the (fairly
old and small) SpamAssassin Public Corpus, the 2005 TREC
Corpus, and about 50,000 messages of my own mail from a few
months back.

I did a 'timcv.py -n5' with these three. There was
basically no change with either the SA or the TREC corpus. I
suspect that's because those are made up of fairly old
messages and this trick wasn't used much until recently.
There was a very small win with my messages.

My plan is to get a more recent collection of my mail and
try it on that, as I've seen these much more in recent
times, too, and I suspect it'll do better then.
