From: Fidelis A. <fi...@po...> - 2004-04-19 01:00:51
Christian Siefkes wrote:
> Hi,
>
> On Sun, 11 Apr 2004, Bill Yerazunis wrote:
>
>> Using 3-pass TUNE, with 1, 3, 13, 75, 531 as weights, I get (and yes,
>> this is the best ever) 47 errors out of 5000. That handily beats the
>> previous record of 54, which is superexponential weighting on a TUNE.
>>
>> Fascinating - the same set of weights is WORSE in single pass but
>> better in 3-pass.
>
> That's 58 vs. 60 errors in single pass? I suppose it's hard to conclude
> anything from such a small difference except that they perform very
> similarly... Maybe you could look at another/extended set, e.g. the
> last 1000 mails of each batch, to see whether the difference is higher?
>
>> Here's the detailed extended report.... and my laptop is literally too
>> hot to touch near the CPU, so I will let it cool a bit before
>> I run base7 (that being 1, 7, 49, 343, 2401).
>
> ...
>
>> SBPH16EM - denominator 16, Super-Markovian entropic correction
>>     0 1 1125
>>     0 1 56
>> SB16-24SM - denom 16, 24 megabyte 512-chain .css, Super-Markovian
>>     0 1 1132
>>     0 1 58
>> 16-BCS - denom 16, 512-chain .css, Breyer-Chhabra-Siefkes weights
>>     0 1 1135
>>     0 1 60
>
> ...
>
> I've got a new result to add, and I'm quite excited about it:
>
> SBPH + Ultraconservative Winnow, 10% threshold, 1.6M features, LRU
> pruning, single pass:
>     719 (number of errors on all 10x4147 mails)
>      38 (number of errors on the last 10x500 mails)

Very interesting. That number, 38 errors, is really impressive!

I have some results I think are interesting too, but with respect to speed. I tried to speed up classification by removing redundant features in SBPH and got a classification speedup of more than 2x, with 1/4 of the disk space, while keeping the same accuracy. Maybe there's a possibility of merging the results and getting better accuracy together with faster classification.
SBPH generates 16 features for each word in the text, combining it with the 4 previous ones (a 5-word window):

w1 w2 w3 w4 w5 =>

     -  -  -  - w5
     -  -  - w4 w5
     -  - w3  - w5
     -  - w3 w4 w5
     - w2  -  - w5
     - w2  - w4 w5
     - w2 w3  - w5
     - w2 w3 w4 w5
    w1  -  -  - w5
    w1  -  - w4 w5
    w1  - w3  - w5
    w1  - w3 w4 w5
    w1 w2  -  - w5
    w1 w2  - w4 w5
    w1 w2 w3  - w5
    w1 w2 w3 w4 w5

Clearly, many features are contained in others. Bill's intention, as far as I understood it, was exactly that: to get some kind of non-linearity and obtain greater discriminating power. I tried a different approach, since accuracy is already good enough for practical purposes, and considered only word pairs with a common word inside the window, instead of all combinations of the first four words plus the last one. The idea was to gain speed by working only with a kind of "prime" feature inside the window (more independent features), hoping that the chain rule would automatically account for all the combinations and produce the same final effect.

So, I used only Word Pairs with a Common Word (WPCW), as below:

     -  -  - w4 w5   - distance: 0
     -  - w3  - w5   - distance: 1
     - w2  -  - w5   - distance: 2
    w1  -  -  - w5   - distance: 3

You can "or" them to get any of the 16 combinations in SBPH, except for the single-word combination, which I chose not to use. Closer pairs are given more credence, so the weights are based on the distance, instead of on the length as in SBPH/SM.
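To make the two feature generators concrete, here is a minimal sketch in Python. This is purely illustrative, not the actual CRM114 source; the function names and the string '-' placeholder encoding are my own.

```python
# Illustrative sketch of the SBPH and WPCW feature generators described
# above, over a 5-word window. Not CRM114's real implementation.

def sbph_features(window):
    """All 16 combinations of the first 4 words, with the newest word
    always present. Skipped positions are kept as '-' placeholders so
    that different skip patterns stay distinct features."""
    feats = []
    for mask in range(16):  # bit i set => include window[i]
        feat = [window[i] if mask & (1 << i) else '-' for i in range(4)]
        feats.append(' '.join(feat + [window[4]]))
    return feats

def wpcw_features(window):
    """Only the 4 word pairs sharing the common (newest) word; each
    pair carries its distance, which selects the weight."""
    *prefix, last = window
    return [((prefix[3 - d], last), d) for d in range(4)]

# Distance-based weights from the test below (closer pairs weigh more).
WPCW_WEIGHTS = [24, 14, 7, 4]

if __name__ == '__main__':
    window = ['w1', 'w2', 'w3', 'w4', 'w5']
    print(len(sbph_features(window)))  # 16
    for pair, dist in wpcw_features(window):
        print(pair, 'distance:', dist, 'weight:', WPCW_WEIGHTS[dist])
```

Note that WPCW emits 4 features per window position instead of 16, which is where the speed and disk-space savings come from.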
Test results:

I used the same test methodology as described in Bill's Plateau papers:

- 10 shuffles of 4147 messages from the SpamAssassin corpora, classified as:
  - 1397 spam (20030228_spam_2.tar.bz2)
  - 2500 easy ham (20030228_easy_ham.tar.bz2)
  -  250 hard ham (20030228_hard_ham.tar.bz2)

Other information for the test:

- Weights for WPCW: 24, 14, 7, 4
- CSS file size for WPCW:   282,001 buckets ( 3,384,012 bytes)
- CSS file size for SBPH: 1,048,577 buckets (12,582,924 bytes)
- OS: Linux 2.4.20-30.9 - RedHat 9.0
- CPU: AMD Athlon(TM) XP 2400
- CRM version: 20040409-BlameMarys

I repeated the test 5 times, each time with a different set of 10 shuffles.

Errors/5000 msgs - only errors in the last 500 messages of a 10-shuffle set:
------------------------------------------------------------
Algorithm | run 1 | run 2 | run 3 | run 4 | run 5 | Average
------------------------------------------------------------
SBPH      |   69  |   70  |   62  |   71  |   82  |   70.8
------------------------------------------------------------
WPCW      |   62  |   72  |   65  |   80  |   73  |   70.4
------------------------------------------------------------

Errors/41470 msgs - all errors in a 10-shuffle set:
------------------------------------------------------------
Algorithm | run 1 | run 2 | run 3 | run 4 | run 5 | Average
------------------------------------------------------------
SBPH      |  1092 |  1100 |  1115 |  1086 |  1110 |  1100.6
------------------------------------------------------------
WPCW      |  1114 |  1126 |  1099 |  1081 |  1129 |  1109.8
------------------------------------------------------------

Classification time (seconds):
------------------------------------------------------------
Algorithm | run 1 | run 2 | run 3 | run 4 | run 5 | Average
------------------------------------------------------------
SBPH      |  3726 |  3824 |  3902 |  3755 |  3971 |  3836
------------------------------------------------------------
WPCW      |  1653 |  1672 |  1665 |  1694 |  1695 |  1675
------------------------------------------------------------

I considered
the classification time of a message to be the time for "crm --stats_only mailfilter.crm < msg" to complete, measured from within a Perl script.

In the test above, WPCW was 2.29 times faster than default CRM (SBPH with "super-Markovian" weights), while keeping essentially the same accuracy - 70 errors in 5000 messages. It also allows much smaller .css files - a disk space reduction of up to 75% - which, in turn, contributed to the higher classification speed. So I think it's a good option to add to the source code.

These results seem promising, but there's one thing I don't understand (I may have missed something or made some mistake): I was never able to get fewer than 62 errors in the last 500 messages, neither with default CRM nor with WPCW. Yet in Bill's Plateau99 paper he mentions 56 errors for SBPH with Markovian 2^(2N) weighting (which I took to be "super-Markovian") and 69 for SBPH alone. If default CRM uses SBPH with "super-Markovian" weights, I should have averaged around 56 errors, but I got 70, which is similar to what Bill reported for SBPH alone. Bill, can you comment on that and help clarify this point?

I can prepare a patch adding WPCW to 20040409-BlameMarys and send it if you or somebody else wants to give it a try. In any case, since I ran the same test with both default CRM and WPCW, I tend to believe the results show important practical advantages of the latter approach over the former, and I'm planning to use it on my production server. There's also room to explore wider sliding windows for better accuracy, since WPCW is faster.

A note for the more theoretically driven folks on the list: I found those weights empirically, and it's really a pain to look for the best ones when you don't have a model. I might have just found a local minimum, and I'd like a way to do a smarter search or, even better, a model that helps deduce the best weights. Any suggestions or comments would be much appreciated.

-- 
Fidelis Assis