From: Bill Y <ws...@me...> - 2006-02-03 16:03:28
Here's probably the best-tested CRM114 rev yet: 20060118-BlameTheReavers

Because there were some big changes, I ran it myself for two weeks before going public with it.

http://crm114.sourceforge.net/crm114-20060118-BlameTheReavers.src.tar.gz
http://crm114.sourceforge.net/crm114-20060118-BlameTheReavers.i386.tar.gz

Here's the update:

"This is a big new functionality release: we include mailtrainer.crm, and we change the default mailfilter.crm from Markovian to OSB. The new mailtrainer program is fed directories of example texts (one example per file) and produces optimized statistics files matched to your particular mailfilter.cf setup (each megabyte of example text takes about a minute of CPU time). It even does N-fold validation.

"Default training is 5-pass DSTTTTR (a Fidelis-inspired improvement of TUNE) with a thick threshold of 5.0 pR units. Worst-offender DSTTTTR training is available as a (very slow) option. There are also speedups and bugfixes throughout the code.

"Unless you really like Markovian, now is a good time to think about saving your old .css files and switching over to the new default mailfilter.crm config, which uses OSB unique microgroom. Then run mailtrainer.crm on your saved spam and good mail files, and see how your accuracy jumps. I'm seeing about a four-fold increase in accuracy on the TREC SA corpus; this is hot stuff indeed.

"HOWEVER, the downside is that mailtrainer.crm expects to see your spam and good training data files in a maildir-like format (one dir for spam, the other for good). This isn't directly supported by mailfilter.crm yet, so unless your mailer supports maildirs, you will need to write a little script to build your training data directories."

MD5 checksums:

825f8c83cd1a5a83d4ae14db7c163368  crm114-20060118-BlameTheReavers.i386.tar.gz
2ce8d8483c844d51d45f127895bcc89f  crm114-20060118-BlameTheReavers.src.tar.gz

-Bill Yerazunis
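
For anyone whose mailer does not keep maildirs, the "little script" for building the training directories could look roughly like the sketch below. It is only an illustration and not part of the CRM114 distribution: it assumes your saved spam and good mail each live in a standard mbox file, and the function name `split_mbox`, the output file naming, and the directory names in the usage comment are all made up for this example. It uses Python's standard `mailbox` module to write one message per file, which is the layout mailtrainer.crm is described as expecting.

```python
import mailbox
from pathlib import Path


def split_mbox(mbox_path, out_dir):
    """Write each message of an mbox file to its own file in out_dir.

    Hypothetical helper, not shipped with CRM114. Returns the number
    of messages written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # create=False: fail loudly if the mbox path is wrong instead of
    # silently creating an empty mailbox.
    box = mailbox.mbox(str(mbox_path), create=False)
    count = 0
    try:
        for key in box.iterkeys():
            count += 1
            # get_bytes() returns the message as stored, without the
            # mbox "From " separator line.
            raw = box.get_bytes(key)
            # Zero-padded names keep the files in mailbox order.
            (out / f"msg{count:06d}.txt").write_bytes(raw)
    finally:
        box.close()
    return count


# Example usage (paths are placeholders for your own files):
#   split_mbox("saved-spam.mbox", "training/spam")
#   split_mbox("saved-good.mbox", "training/good")
```

Run it once per mailbox so that spam and good mail end up in separate directories, then point mailtrainer.crm at those two directories as described in its own documentation.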