Re: [Crm114-general] 9-27 progress report and two logistical questions

On Sun, 28 Sep 2003 19:09:52 -0500, Jason Aycock wrote
> On Sun, 28 Sep 2003 15:31:12 -0400, Bill Yerazunis wrote
> > From: "Jason Aycock" <ja...@st...>
> > 
> >    1. I got the 9-27-PreBeta4 installed late last night and it's
> >    working very nicely. Four errors in my first 55 mails, and I
> >    started this time with a much smaller corpus than I did in July
> >    (about 120K, rather than the 800K I used in the summer).
> > 
> > Are you using TOE, or training entire corpi?
> 
> Well ... uh, both. I'm only training errors at this point (instead 
> of also training near misses), but when you say "training entire 
> corpi" ... I started with that base of 120K each to build initial 
> CSS. Of course, I was certain that the starting spamtext was spam 
> and that starting nonspamtext was nonspam. Are you saying a better 
> result would be to start with empty .CSS and train all errors from 
> there? My packing ratio at this point is just about 20 percent.

Ah, when I wrote this I was thinking of TOE (train only errors) only as an
approach to new incoming mail. Until now I have always trained on entire corpi;
Jaakko clued me in to his TOE scripts for iterative learning. CRM114's
decision-making has been so solid that I may just follow this path for a while.
I was at roughly 98 percent success over a month's use this summer, after
training on 800K corpi all at once.
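
For anyone following along who hasn't seen the difference spelled out, here is
a rough Python sketch of the train-only-errors loop as I understand it. To be
clear, this is not Jaakko's script and not CRM114's actual matching algorithm:
ToyClassifier and train_on_errors are hypothetical names I made up, and the
word-count scoring is only a stand-in so the loop has something to call.

```python
from collections import Counter


class ToyClassifier:
    """Hypothetical stand-in for a spam filter; NOT CRM114's algorithm."""

    def __init__(self):
        # One bag of word counts per class.
        self.counts = {"spam": Counter(), "good": Counter()}

    def classify(self, text):
        tokens = text.lower().split()
        scores = {
            label: sum(seen[t] for t in tokens)
            for label, seen in self.counts.items()
        }
        # Ties (including an untrained classifier) fall back to "good".
        return "spam" if scores["spam"] > scores["good"] else "good"

    def train(self, text, label):
        self.counts[label].update(text.lower().split())


def train_on_errors(clf, messages):
    """Train only on misclassified messages; return how many were trained.

    messages is an iterable of (text, true_label) pairs.
    """
    errors = 0
    for text, label in messages:
        if clf.classify(text) != label:
            clf.train(text, label)
            errors += 1
    return errors
```

Training an entire corpus would instead call train() on every message whether
or not it was classified correctly, which is what I did with my 800K corpi in
the summer.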

Incidentally, processing time has always been much better than predicted for
me. I'm running on a multiuser Athlon XP 1800+ with 512MB of PC2700 DDR. When I
trained on a corpus of nearly 2 megabytes in the summer it took only seconds,
and training the new corpi last night also took only seconds. I also have
rewrites enabled in mailfilterconfig, with no measurable hit to performance.