From: Bill Y. <ws...@me...> - 2004-06-26 01:01:58
New sekrit version on the website.

http://crm114.sourceforge.net/crm114-20040625-BlameSiefkes

MD5sum: 692c410ddb40b525bafb05b91ee4ba2d  crm114-20040625-BlameSiefkes.src.tar.gz

OK, I think I've gotten most of the bugs out of OSB-Winnow (thanks,
Christian Siefkes, for coming up with the OSB-Winnow abomination, which
earned you the blame for this release).

The stats are pretty durn good. Using the standard test set (the
SpamAssassin one):

    SBPH-Markovian gets 54 errors in TUNE, 56 in TOE.
    OSB-Winnow gets 41 errors in TOE.
      (promote 1.23, demote 0.83, thickness 0.05)

It seems to be winning. :)

Interesting side bit- somehow I'm tickling a bug in normalizemime, as it
greatly _increases_ errors when I use it. Maybe I've got a bad distro of
it, because when I feed a mime'd mail into it, it does NOT undo the
base64 contents, even when it's just plain text inside the base64.
Jaakko, should it do that? This is version 2004-02-04.

---------------------------------------------------

Warning- OSB-Winnow data files are NOT COMPATIBLE with Markovian files,
and cssutil/cssdiff/cssmerge do NOT work on them, although they think
they do!

This is all very experimental, but I'm going to put it up anyway so
people can beat upon it. If you want to use it, you should also learn
about "thickness learning", where you learn into one file or the other
proportionally based on an error criterion. More on that when I figure
it out better and can put it into a script that anyone can use.

     -Bill Yerazunis

----------------------------------------

Here's the notes:

Besides the usual minor bugfixes (thanks!), there are two big new
features in this revision:

1) We now test against (and ship with) TRE version 0.6.8. Better,
   faster, all that. :)

2) A fourth new classifier with very impressive statistics is now
   available. This is the OSB-Winnow classifier, originally designed by
   Christian Siefkes. It combines the OSB frontend with a balanced
   Winnow backend.
But it may well be twice as accurate as SBPH-Markovian and four times
more accurate than Bayesian. Like correlative matching, it does NOT
produce a direct probability, but it does produce a pR, and it's
integrated into the CLASSIFY statement. You invoke it with the <winnow>
flag:

    classify <winnow> (file1.owf | file2.owf) /token_regex/

and

    learn <winnow> (file1.owf) /token_regex/
    learn <winnow refute> (file2.owf) /token_regex/

Note that you MUST do two learns on Winnow .owf files- one "positive"
learn on the correct class, and a "refute" learn on the incorrect class.
(Actually, it's more complicated than that, and I'm still working out
the details and optimal settings.)

Being experimental, the OSB-Winnow file format is NOT compatible with
Markovian, OSB, or correlator matching, and there's no functional
checking mechanism to verify you haven't mixed up a .owf file with a
.css file. Cssutil, cssdiff, and cssmerge think they can handle the new
format- but they can't.

Further, you currently have to train it in a two-step process, learning
it into one file and refuting it in all other files:

    LEARN <winnow> (file1.owf) /regex/

then

    LEARN <winnow refute> (file2.owf) /regex/

which will do the right thing. If the OSB-Winnow system works as well
as we hope, we may put in the work to add CLASSIFY-like multifile
syntax to the LEARN statement, so you don't have to do this two-step
dance.
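For readers unfamiliar with the Winnow family, here is a minimal toy
sketch of the balanced promote/demote/thickness scheme the numbers above
refer to. This is an illustration of the general algorithm only, NOT
crm114's actual OSB-Winnow or .owf implementation; the class names,
feature representation, and margin formula are all simplifications I've
chosen for the sketch, with the promote/demote/thickness values taken
from the stats quoted above.

```python
# Toy balanced-Winnow classifier sketching promote/demote/thickness.
# Parameter values are the ones quoted in the stats above; everything
# else (class names, margin definition) is a simplifying assumption.

PROMOTE = 1.23    # scale up the correct class's weights on an update
DEMOTE = 0.83     # scale down the wrong class's weights on an update
THICKNESS = 0.05  # also train when the winning margin is this thin

class ToyWinnow:
    def __init__(self):
        # one weight table per class; unseen features count as 1.0
        self.weights = {"good": {}, "spam": {}}

    def score(self, cls, features):
        w = self.weights[cls]
        return sum(w.get(f, 1.0) for f in features)

    def classify(self, features):
        g = self.score("good", features)
        s = self.score("spam", features)
        return ("good" if g >= s else "spam"), g, s

    def train(self, features, correct):
        # "thickness learning": update not only on outright errors,
        # but whenever the decision was inside the thickness margin.
        wrong = "spam" if correct == "good" else "good"
        guess, g, s = self.classify(features)
        total = g + s
        margin = abs(g - s) / total if total else 0.0
        if guess != correct or margin < THICKNESS:
            wc, ww = self.weights[correct], self.weights[wrong]
            for f in features:
                wc[f] = wc.get(f, 1.0) * PROMOTE  # positive learn
                ww[f] = ww.get(f, 1.0) * DEMOTE   # the "refute" side
```

Note how every update touches both weight tables: the multiplicative
promote on the correct class and demote on the wrong class is the same
pairing as the positive learn plus "refute" learn described above.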