From: Bill Y. <ws...@me...> - 2004-11-20 12:54:48
From: cr...@ne...

> I'm using version 20040627-BlameSeifkes with procmail and nmh/mh-e on my own Linux box. I recently had a big burst of misclassified mail; accuracy declined dramatically, though with manual (TOE?) retraining it seems to be improving. This leads me to a few questions:
>
> 1) Any ideas what might have caused this problem? I'm wondering if I perhaps manually misclassified a single message; would that dramatically decrease accuracy?

Possibly. It depends on the prior background.

But you may also be seeing what I call an "error storm": after weeks to months of perfect performance, I get an error. I train the error, and then for a week I get greatly decreased accuracy, which slowly recovers, and then I'm back to months of perfection. Then the process repeats.

I *think* it's due to a relaxation effect in the classifiers. I am NOT sure about this, but I can say that the first through third error storms are the worst, and after a dozen of them they seem to have stopped.

> 2) Can I improve my accuracy by switching to a later (and presumably scarier) version of CRM?

Not really... unless you want to really go bleeding-edge and give Fidelis' patch for OSBF classification a try. THAT might get you quite an accuracy boost, but be warned: you MUST use thickness-based training with the Fidelis OSBF classifier. Basically, if any message (good _or_ spam) scores within +/- 20 pR points of 0, you force-train the message anyway, even if it was classified correctly. Mailfilter.crm is not set up to do this yet.

> 3) Should I be doing some sort of periodic bulk retraining every evening on known spam and known ham? If so, does anyone have any suggestions as how best to do this?

That's called "TUNE" training; it will increase your accuracy if you want to try it.

What's your goal? Are you just looking for usable email, or are you a "numbers runner", trying to beat four-nines performance? :-)

-Bill Yerazunis
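
For readers who want to see the thickness-based training rule spelled out, here is a minimal Python sketch of the decision only. It is not CRM114 code and not part of mailfilter.crm. The function name `needs_training`, the constant `THICKNESS`, and the sign convention (positive pR means "good", negative means "spam") are assumptions made for illustration. Whether a score of exactly 20 counts as inside the margin is also an assumption; the sketch treats it as outside.

```python
# Hypothetical illustration of thickness-based training.
# Assumption: positive pR = classified as good, negative pR = classified as spam.

THICKNESS = 20.0  # the +/- 20 pR margin mentioned in the reply


def needs_training(pr_score: float, is_spam: bool,
                   thickness: float = THICKNESS) -> bool:
    """Return True if the message should be trained into its true class."""
    classified_as_spam = pr_score < 0
    # Ordinary train-on-error: the classifier got it wrong.
    misclassified = classified_as_spam != is_spam
    # Thickness rule: the score is too close to 0 to trust,
    # so train even when the classification was correct.
    inside_margin = abs(pr_score) < thickness
    return misclassified or inside_margin
```

With this rule, a correctly classified good message scoring 5 pR is still trained, while a correctly classified spam scoring -150 pR is left alone.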