From: Jason A. <ja...@st...> - 2003-09-29 06:29:39
On Sun, 28 Sep 2003 19:09:52 -0500, Jason Aycock wrote
> On Sun, 28 Sep 2003 15:31:12 -0400, Bill Yerazunis wrote
> > From: "Jason Aycock" <ja...@st...>
> >
> > 1. I got the 9-27-PreBeta4 installed late last night and it's
> > working very nicely. Four errors in my first 55 mails, and I
> > started this time with a much smaller corpus than I did in July
> > (about 120K, rather than the 800K I used in the summer).
> >
> > Are you using TOE, or training entire corpi?
>
> Well ... uh, both. I'm only training errors at this point (instead
> of also training near misses), but when you say "training entire
> corpi" ... I started with that base of 120K each to build initial
> CSS. Of course, I was certain that the starting spamtext was spam
> and that starting nonspamtext was nonspam. Are you saying a better
> result would be to start with empty .CSS and train all errors from
> there? My packing ratio at this point is just about 20 percent.

Ah, when I wrote this I was thinking only of the TOE approach to new
incoming mail. I have always trained on entire corpi. Jaakko clued me
in to his TOE scripts for iterative learning. CRM114's decision-making
has been so solid that I may just follow this path for a while; I was
at roughly 98 percent success in a month's use this summer, based on
training 800K corpi all at once.

Incidentally, I've always had much better performance than predicted
as far as processing time -- running on a multiuser Athlon XP+1800,
512MB PC2700 DDR ... When I trained on nearly 2 megabytes corpus in
the summer it took only seconds. Also only seconds to train the new
corpi last night, and similarly I have rewrites enabled in
mailfilterconfig, with no measurable hit to performance.
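
For anyone following along who hasn't seen the TOE idea spelled out, here is a rough sketch of "train only errors" as I understand it: classify each message first, and learn it only if the filter got it wrong. This is not CRM114 code and not Jaakko's scripts. The `ToyClassifier` class and `train_only_errors` function are hypothetical names made up for illustration, and the word-count scoring is just a stand-in for a real classifier.

```python
from collections import Counter


class ToyClassifier:
    """Hypothetical stand-in for a real filter: scores by per-class word counts."""

    def __init__(self):
        self.counts = {"spam": Counter(), "nonspam": Counter()}

    def classify(self, text):
        words = text.lower().split()
        scores = {
            label: sum(counter[w] for w in words)
            for label, counter in self.counts.items()
        }
        # Ties (including the empty-statistics case) fall to "nonspam".
        return "spam" if scores["spam"] > scores["nonspam"] else "nonspam"

    def train(self, text, label):
        self.counts[label].update(text.lower().split())


def train_only_errors(classifier, messages):
    """Classify each (text, label) pair; train only on misclassified ones.

    Returns the number of errors, i.e. the number of messages trained.
    """
    errors = 0
    for text, label in messages:
        if classifier.classify(text) != label:
            classifier.train(text, label)
            errors += 1
    return errors


if __name__ == "__main__":
    mail = [
        ("buy cheap pills now", "spam"),
        ("meeting at noon", "nonspam"),
        ("cheap pills here", "spam"),
        ("lunch meeting tomorrow", "nonspam"),
    ]
    clf = ToyClassifier()
    print("errors on first pass:", train_only_errors(clf, mail))
    print("errors on second pass:", train_only_errors(clf, mail))
```

Training an entire corpus, by contrast, would call `train` on every message regardless of how it classified. Starting from empty statistics, the loop above learns only what the filter actually gets wrong.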