> just to chip in, training on other's spam is a BAD idea.
> on my system i had a central database trained after my spam/ham. when i
> first put the system in place EVERYONE (50-100 users) were getting 100%
> of their emails flagged as spam because of that database.
> just don't do it :)
I have an existence proof that says otherwise :-)
You do have to be careful when you do this. For example, I trained
once with a pile of mail collected in one month that was all spam,
and all it did was cause any mail received in that month to be marked
as spam.
When I trained my university feed with the data from my home system,
I first edited all "gtoal.com" headers to read "panam.edu", then all
instances of my username "gtoal" to be "zzzzzzz" - and likewise all
occurrences of my proper name in text.
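
For what it's worth, here is roughly how that rewriting could be scripted.
This is only a sketch: the function name, the sample message and the
blanket search-and-replace over the whole message (rather than headers
only) are my own simplifications, not what I actually ran. The order
matters: domains go first, otherwise "gtoal.com" turns into "zzzzzzz.com".

```python
import re

def sanitise(text, domain_map, name_map):
    """Rewrite site-specific domains and names before training elsewhere."""
    # Domains first, so the username pass cannot mangle them.
    for old, new in domain_map.items():
        text = re.sub(re.escape(old), lambda m: new, text, flags=re.IGNORECASE)
    # Whole-word match so a short name does not clobber longer words.
    for old, new in name_map.items():
        pattern = r"\b" + re.escape(old) + r"\b"
        text = re.sub(pattern, lambda m: new, text, flags=re.IGNORECASE)
    return text

# Made-up message, purely for illustration.
sample = "From: gtoal@gtoal.com\nReceived: from mail.gtoal.com\n\nHi gtoal, see you Monday."
domains = {"gtoal.com": "panam.edu"}
names = {"gtoal": "zzzzzzz"}  # a proper name would be added the same way

cleaned = sanitise(sample, domains, names)
print(cleaned)
```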
I've been toying with this idea for bulk pre-training: first, collect
a few days of mail and feed it *all* to spamprobe as good mail - regardless
of whether it is good or bad - then feed the pre-existing database of
bad mail to it, to correct the errors from training as good. However
the truly good mail will remain in the database scored appropriately.
I think this will rely on having a large enough sample of new mail,
but a *much* larger sample of spam with which to override the initial
training as good.
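
A toy model of why I think the pre-training idea could work. Note this
is NOT how spamprobe scores mail; it is just per-token counts with a
made-up spamminess ratio, and the messages are invented. It only shows
that a big spam corpus swamps the few "good" counts that spam tokens
picked up in step 1, while tokens from genuine mail stay clean.

```python
from collections import Counter

good, spam = Counter(), Counter()

def train(tokens, as_spam):
    """Count each distinct token once per message in the chosen class."""
    (spam if as_spam else good).update(set(tokens))

def spamminess(token):
    """Fraction of sightings that were spam; 0.5 if never seen."""
    g, s = good[token], spam[token]
    return s / (g + s) if g + s else 0.5

# Step 1: a few days of unsorted mail, all fed in as good.
new_mail = [["meeting", "agenda"], ["cheap", "pills"], ["lecture", "notes"]]
for msg in new_mail:
    train(msg, as_spam=False)

# Step 2: the much larger pre-existing spam corpus, fed in as spam.
old_spam = [["cheap", "pills"]] * 20
for msg in old_spam:
    train(msg, as_spam=True)

print(spamminess("pills"), spamminess("meeting"))
```

With only a handful of old spam messages the correction would be much
weaker, which is why the spam sample has to be the larger one.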
One thing about this lark, it's easy to experiment once you have a
large enough corpus of known good and known bad mails. Also keeping
everything like this helps significantly with regression testing, which
is an absolute must. I regularly now look for 'good' items in my
bad database, then feed them through as spam again to correct any
errors; and likewise for good mails. I guess theoretically I could
end up with an unstable situation flipping from one to the other
but it hasn't happened yet.
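
The regression check I mean is nothing fancy. A sketch, assuming you
have some classify(message) function that returns True for spam (here a
stand-in lambda; in practice it would wrap whatever filter you run) and
two lists of messages you have sorted by hand:

```python
def regression_report(classify, known_good, known_spam):
    """Run the filter over both corpora and collect the misclassified mail."""
    false_pos = [m for m in known_good if classify(m)]      # good marked spam
    false_neg = [m for m in known_spam if not classify(m)]  # spam marked good
    total = len(known_good) + len(known_spam)
    errors = len(false_pos) + len(false_neg)
    return {
        "false_pos": false_pos,
        "false_neg": false_neg,
        "accuracy": (total - errors) / total if total else 1.0,
    }

# Stand-in classifier and invented corpora, for illustration only.
classify = lambda msg: "pills" in msg
known_good = ["lecture notes attached", "did you take your pills today?"]
known_spam = ["cheap pills online", "you have won a prize"]

report = regression_report(classify, known_good, known_spam)
print(report)
```

The two error lists are exactly the messages to feed back through for
retraining, and running the report before and after each retrain shows
whether the filter is flip-flopping.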
A related idea: passing around trained data dumped from the database.
Does anyone have any good suggestions for sanitising this data to make
Another trick I've tried recently is dumping the database and
extracting common spam domains from it. This has to be done
carefully to eliminate any domains which have legitimate senders.
(I'm using this list in my rule-based pre-training system)
Note that if a spammer uses some names like mx01.spamdomain.com,
mx02.spamdomain.com etc etc, spamprobe will not train on spamdomain.com
itself - it uses a greedy algorithm to define FQDNs as a single word.
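
A sketch of the domain extraction, including the roll-up from
mx01/mx02 hostnames to the parent domain. Assumptions: the dump has
already been parsed into (term, good_count, spam_count) tuples (I am not
reproducing the real dump format here), the thresholds are arbitrary,
and "parent domain" is naively the last two labels, which is wrong for
things like .co.uk. The good_count == 0 test is the "no legitimate
senders" check.

```python
import re
from collections import defaultdict

# Terms that look like a bare hostname, e.g. mx01.spamdomain.com
HOST_RE = re.compile(r"^(?:[a-z0-9-]+\.)+[a-z]{2,}$")

def parent_domain(host):
    """Naive roll-up: keep only the last two labels of the hostname."""
    return ".".join(host.lower().split(".")[-2:])

def spam_domains(terms, min_spam=5):
    """Return parent domains seen only in spam, at least min_spam times."""
    totals = defaultdict(lambda: [0, 0])  # domain -> [good, spam]
    for term, good_count, spam_count in terms:
        if HOST_RE.match(term.lower()):
            domain = parent_domain(term)
            totals[domain][0] += good_count
            totals[domain][1] += spam_count
    return sorted(d for d, (g, s) in totals.items() if g == 0 and s >= min_spam)

# Invented sample data in the assumed (term, good, spam) shape.
terms = [
    ("mx01.spamdomain.com", 0, 4),
    ("mx02.spamdomain.com", 0, 3),
    ("mail.example.org", 12, 2),
    ("viagra", 0, 40),
]

print(spam_domains(terms))
```

Neither mx01 nor mx02 reaches the threshold alone here; only the summed
parent domain does.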