From: Bill Y. <ws...@me...> - 2004-06-26 01:01:58
New sekrit version on the website.

http://crm114.sourceforge.net/crm114-20040625-BlameSiefkes

MD5sum: 692c410ddb40b525bafb05b91ee4ba2d  crm114-20040625-BlameSiefkes.src.tar.gz

OK, I think I've gotten most of the bugs out of OSB-Winnow (thanks,
Christian Siefkes, for coming up with the OSB-Winnow abomination, which
earned you the blame for this release).

The stats are pretty durn good. Using the standard test set (the
SpamAssassin one):

    SBPH-Markovian gets 54 errors in TUNE, 56 in TOE.
    OSB-Winnow gets 41 errors in TOE.
      (promote 1.23, demote 0.83, thickness 0.05)

It seems to be winning. :)

Interesting side bit- somehow I'm tickling a bug in normalizemime, as it
greatly _increases_ errors when I use it. Maybe I've got a bad distro of
it, because when I feed a mime'd mail into it, it does NOT undo the
base64 contents, even when it's just plain text inside the base64.
Jaakko, should it do that? This is version 2004-02-04.

---------------------------------------------------

Warning- OSB-Winnow data files are NOT COMPATIBLE with Markovian files,
and cssutil/cssdiff/cssmerge do NOT work on them, although they think
they do!

This is all very experimental, but I'm going to put it up anyway so
people can beat upon it. If you want to use it, you should also learn
about "thickness learning", where you learn into one file or the other
proportionally based on an error criterion. More on that when I figure
it out better and can put it into a script that anyone can use.

     -Bill Yerazunis

----------------------------------------

Here's the notes:

Besides the usual minor bugfixes (thanks!), there are two big new
features in this revision:

1) We now test against (and ship with) TRE version 0.6.8. Better,
   faster, all that. :)

2) A fourth new classifier with very impressive statistics is now
   available. This is the OSB-Winnow classifier, originally designed by
   Christian Siefkes. It combines the OSB frontend with a balanced
   Winnow backend.
But it may well be twice as accurate as SBPH-Markovian and four times
more accurate than Bayesian. Like correlative matching, it does NOT
produce a direct probability, but it does produce a pR, and it's
integrated into the CLASSIFY statement. You invoke it with the <winnow>
flag:

    classify <winnow> (file1.owf | file2.owf) /token_regex/

and

    learn <winnow> (file1.owf) /token_regex/
    learn <winnow refute> (file2.owf) /token_regex/

Note that you MUST do two learns on Winnow .owf files- one "positive"
learn on the correct class, and a "refute" learn on the incorrect class.
(Actually, it's more complicated than that, and I'm still working out
the details and optimal settings.)

Being experimental, the OSB-Winnow file format is NOT compatible with
Markovian, OSB, or correlator matching, and there's no functional
checking mechanism to verify you haven't mixed up a .owf file with a
.css file. Cssutil, cssdiff, and cssmerge think they can handle the new
format- but they can't.

Further, you currently have to train it in a two-step process, learning
it into one file and refuting it in all other files:

    LEARN <winnow> (file1.owf) /regex/

then

    LEARN <winnow refute> (file2.owf) /regex/

which will do the right thing. If the OSB-Winnow system works as well
as we hope, we may put in the work to add CLASSIFY-like multifile
syntax to the LEARN statement, so you don't have to do this two-step
dance.
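For readers unfamiliar with the Winnow family, here is a minimal toy
sketch of the balanced promote/demote/thickness scheme the numbers above
refer to. This is an illustration of the general algorithm only, NOT
crm114's actual OSB-Winnow or .owf implementation; the class names,
feature representation, and margin formula are all simplifications I've
chosen for the sketch, with the promote/demote/thickness values taken
from the stats quoted above.

```python
# Toy balanced-Winnow classifier sketching promote/demote/thickness.
# Parameter values are the ones quoted in the stats above; everything
# else (class names, margin definition) is a simplifying assumption.

PROMOTE = 1.23    # scale up the correct class's weights on an update
DEMOTE = 0.83     # scale down the wrong class's weights on an update
THICKNESS = 0.05  # also train when the winning margin is this thin

class ToyWinnow:
    def __init__(self):
        # one weight table per class; unseen features count as 1.0
        self.weights = {"good": {}, "spam": {}}

    def score(self, cls, features):
        w = self.weights[cls]
        return sum(w.get(f, 1.0) for f in features)

    def classify(self, features):
        g = self.score("good", features)
        s = self.score("spam", features)
        return ("good" if g >= s else "spam"), g, s

    def train(self, features, correct):
        # "thickness learning": update not only on outright errors,
        # but whenever the decision was inside the thickness margin.
        wrong = "spam" if correct == "good" else "good"
        guess, g, s = self.classify(features)
        total = g + s
        margin = abs(g - s) / total if total else 0.0
        if guess != correct or margin < THICKNESS:
            wc, ww = self.weights[correct], self.weights[wrong]
            for f in features:
                wc[f] = wc.get(f, 1.0) * PROMOTE  # positive learn
                ww[f] = ww.get(f, 1.0) * DEMOTE   # the "refute" side
```

Note how every update touches both weight tables: the multiplicative
promote on the correct class and demote on the wrong class is the same
pairing as the positive learn plus "refute" learn described above.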