Re: [Dspam-user] high level of missed ham, but all factors at 0.01000


On Tue, Aug 25, 2009 at 10:27 PM, Steve<ste...@gm...> wrote:
>
> -------- Original Message --------
>> Date: Tue, 25 Aug 2009 21:33:19 +0200
>> From: Sven Karlsson <kar...@gm...>
>> To: Dsp...@li...
>> Subject: [Dspam-user] high level of missed ham, but all factors at 0.01000

>>         X-NS-Message-Id*BD74, 0.01000
>>
> Uhh.. bad, bad, bad! I see too many HTML tags there. This is surely not DSPAM 3.9.0, right?

No, 3.6.8 as you noted below. But isn't it strange that the most
significant tokens are at 0.01, and it is still considered spam?

>> Other strangeness: most factors displayed seem to be from the header,
>> such as month*day pairs (although not in this example). I would assume
>> that the email content would give a better indication of
>> ham/spam.
>>
> That is certainly true, but you probably use one of the Bayesian algorithms, and they only use the most significant tokens (15 tokens and up, but not without limit). If you want all tokens to be considered, you should use naïve, as it processes all tokens.

Ok.
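
To check my understanding of "most significant tokens", here is a minimal sketch of the selection idea: rank tokens by how far their probability is from the neutral 0.5 and keep only the top few. This is not DSPAM's actual code; the function name, the token values and the limit of 15 are assumptions for illustration only.

```python
def most_significant(token_probs, limit=15):
    """Return up to `limit` (token, probability) pairs, ordered by how far
    each probability lies from the neutral value 0.5 (hypothetical helper)."""
    ranked = sorted(
        token_probs.items(),
        key=lambda item: abs(item[1] - 0.5),
        reverse=True,
    )
    return ranked[:limit]


# Made-up example values, not taken from a real DSPAM database.
example = {
    "X-NS-Message-Id*BD74": 0.01,
    "hello": 0.45,
    "v*gra": 0.96,
    "the": 0.50,
}
top_two = most_significant(example, limit=2)
```

With such a cut-off, a handful of extreme header tokens can crowd out body tokens whose probabilities are closer to 0.5, which would match what I see in the factors output.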

>> Even more strangeness: The "improbability drive" shows "1 in 151
>> chance of being ham" or "1 in 151 chance of being spam" in 95% of the
>> cases (of 2146 examined emails). I would expect a lot more variation
>> here. Does this indicate a problem?
>>
> YES! Something is not right with the statistical counters. Is that issue only on your setup or do you have other users having the same issue?

This was for all users.

>
>
>> The setup scenario is for about 1000 mailboxes, using a global user,
>> TOE training and initial corpus of about 5000 manually sorted
>> spam/ham. There is a central periodic TOE training done about once a
>> week for a sample of all messages, training the globaluser.
>>
> I don't understand this. What are you training once a week? A new, fresh set of HAM/SPAM, or the same manually sorted 5000 HAM/SPAM messages?

New email; one admin goes through a global mailbox and retrains the
obvious missed spam and ham. This means that not all FP/FN are
retrained, but it should be OK since it's TOE training (even though
some accuracy is lost). It also means that training may be focused on,
for example, certain days of the week (the admin doing the training is
more alert when starting at the Monday emails, but may stop training
at the Wednesday emails, leaving Thursday-Sunday untrained). This may
give an unfair balance, I assume.

>
>
>> Algorithm graham burton
>>
> AHA! So there we are. That's the reason for the reduced number of tokens in the show factors output. This is nothing bad, btw. It's not necessary to process all tokens to get a good result.

Ok.

>
>
>> PValue graham
>>
> Uhh... if you have that in PValue then this must be DSPAM 3.6.8 or less. Am I right?
>
>
>> libmysql_drv storage driver
>>
>> Using dspam 3.6.8 shipped with Debian.
>>
> Aha. Yes. I was right. DSPAM 3.6.8. Have you considered updating your DSPAM setup? To 3.8.0 at least. DSPAM 3.6.8 does not offer you much to improve the situation you are currently facing.

Can 3.8.0 be used in production? I was thinking of moving directly to
3.9.0, but I'm unsure about the stability.... Users are already
calling and complaining about ham ending up in the spamboxes :)

> Besides the 3.6.8 version of DSPAM? Not much (if anything at all). From what I see above, you can't improve your situation much with 3.6.8.
>
>
>> Any way to debug the factors/tokens?
>>
> Debug in what way?

Such as why tokens with 0.01 probability end up as spam (or maybe I
don't understand this correctly, but I've seen the v*gra tokens having
something like 0.96 probability, which is more understandable).

Maybe there is some problem with the global group/user handling? (i.e.
users are normally not training themselves.)
Should retraining be done with dspam --user globaluser, or with no user
setting at all (only using the uid in the signature)?

I have also tried to first do a reclassification with source=error,
and also tried retraining them instead as corpus, after removing the
previous dspam header and signature data. Maybe this has a negative
impact on the statistics?

BR,
 Sven