From: Fidelis A. <fi...@po...> - 2004-04-19 01:00:51
Christian Siefkes wrote:
> Hi,
>
> On Sun, 11 Apr 2004, Bill Yerazunis wrote:
>
>> Using 3-pass TUNE, with 1, 3, 13, 75, 531 as weights, I get (and yes,
>> this is the best ever) 47 errors out of 5000. That handily beats the
>> previous record of 54, which is superexponential weighting on a TUNE.
>>
>> Fascinating - the same set of weights is WORSE in single pass but
>> better in 3-pass.
>
> That's 58 vs. 60 errors in single pass? I suppose it's hard to conclude
> anything from such a small difference except that they perform very
> similarly... Maybe you could look at another/extended set, e.g. the
> last 1000 mails of each batch, to see whether the difference is higher?
>
>> Here's the detailed extended report.... and my laptop is literally too
>> hot to touch near the CPU, so I will let it cool a bit before
>> I run base7 (that being 1, 7, 49, 343, 2401).
>
> ...
>
>> SBPH16EM - denominator 16, Super-Markovian entropic correction
>>     0 1 1125
>>     0 1 56
>> SB16-24SM - denom 16, 24 megabyte 512-chain .css, Super-Markovian
>>     0 1 1132
>>     0 1 58
>> 16-BCS - denom 16, 512-chain .css, Breyer-Chhabra-Siefkes weights
>>     0 1 1135
>>     0 1 60
>
> ...
>
> I've got a new result to add, and I'm quite excited about it:
>
> SBPH + Ultraconservative Winnow, 10% threshold, 1.6M features, LRU
> pruning, single pass:
>     719 (number of errors on all 10x4147 mails)
>      38 (number of errors on the last 10x500 mails)

Very interesting. That number, 38 errors, is really impressive!

I have some results I think are interesting too, but with respect to speed. I tried to speed up classification by removing redundant features in SBPH and got a classification speedup of more than 2x, with 1/4 of the disk space, while keeping the same accuracy. Maybe there's a possibility of merging the results and getting better accuracy together with faster classification.
SBPH generates 16 features for each word in the text, combining it with the 4 previous ones (a 5-word window):

w1 w2 w3 w4 w5 =>

     -  -  -  - w5
     -  -  - w4 w5
     -  - w3  - w5
     -  - w3 w4 w5
     - w2  -  - w5
     - w2  - w4 w5
     - w2 w3  - w5
     - w2 w3 w4 w5
    w1  -  -  - w5
    w1  -  - w4 w5
    w1  - w3  - w5
    w1  - w3 w4 w5
    w1 w2  -  - w5
    w1 w2  - w4 w5
    w1 w2 w3  - w5
    w1 w2 w3 w4 w5

Clearly, many features are contained in others. Bill's intention, as far as I understood it, was exactly that: to get some kind of non-linearity and obtain greater discriminating power. I tried a different approach, since accuracy is already good enough for practical purposes, and considered only word pairs with a common word inside the window, instead of all combinations of the first four words plus the last one. The idea was to gain speed by working only with a kind of "prime" feature inside the window (more independent features), hoping that the chain rule would automatically account for all the combinations and produce the same final effect.

So, I used only Word Pairs with a Common Word (WPCW), as below:

     -  -  - w4 w5   - distance: 0
     -  - w3  - w5   - distance: 1
     - w2  -  - w5   - distance: 2
    w1  -  -  - w5   - distance: 3

You can "or" them to get any of the 16 combinations in SBPH, except for the single-word combination, which I chose not to use. Closer pairs are given more credence, so the weights are based on the distance, instead of on the length as in SBPH/SM.
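To make the two feature generators concrete, here is a minimal sketch in Python. This is purely illustrative, not the actual CRM114 source; the function names and the string '-' placeholder encoding are my own.

```python
# Illustrative sketch of the SBPH and WPCW feature generators described
# above, over a 5-word window. Not CRM114's real implementation.

def sbph_features(window):
    """All 16 combinations of the first 4 words, with the newest word
    always present. Skipped positions are kept as '-' placeholders so
    that different skip patterns stay distinct features."""
    feats = []
    for mask in range(16):  # bit i set => include window[i]
        feat = [window[i] if mask & (1 << i) else '-' for i in range(4)]
        feats.append(' '.join(feat + [window[4]]))
    return feats

def wpcw_features(window):
    """Only the 4 word pairs sharing the common (newest) word; each
    pair carries its distance, which selects the weight."""
    *prefix, last = window
    return [((prefix[3 - d], last), d) for d in range(4)]

# Distance-based weights from the test below (closer pairs weigh more).
WPCW_WEIGHTS = [24, 14, 7, 4]

if __name__ == '__main__':
    window = ['w1', 'w2', 'w3', 'w4', 'w5']
    print(len(sbph_features(window)))  # 16
    for pair, dist in wpcw_features(window):
        print(pair, 'distance:', dist, 'weight:', WPCW_WEIGHTS[dist])
```

Note that WPCW emits 4 features per window position instead of 16, which is where the speed and disk-space savings come from.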
Test results:

I used the same test methodology as described in Bill's Plateau papers:

- 10 shuffles of 4147 messages from the SpamAssassin corpora, classified as:
  - 1397 spam (20030228_spam_2.tar.bz2)
  - 2500 easy ham (20030228_easy_ham.tar.bz2)
  -  250 hard ham (20030228_hard_ham.tar.bz2)

Other information for the test:

- Weights for WPCW: 24, 14, 7, 4
- CSS file size for WPCW:   282,001 buckets ( 3,384,012 bytes)
- CSS file size for SBPH: 1,048,577 buckets (12,582,924 bytes)
- OS: Linux 2.4.20-30.9 - RedHat 9.0
- CPU: AMD Athlon(TM) XP 2400
- CRM version: 20040409-BlameMarys

I repeated the test 5 times, each time with a different set of 10 shuffles.

Errors/5000 msgs - only errors in the last 500 messages of a 10-shuffle set:
------------------------------------------------------------
Algorithm | run 1 | run 2 | run 3 | run 4 | run 5 | Average
------------------------------------------------------------
SBPH      |   69  |   70  |   62  |   71  |   82  |   70.8
------------------------------------------------------------
WPCW      |   62  |   72  |   65  |   80  |   73  |   70.4
------------------------------------------------------------

Errors/41470 msgs - all errors in a 10-shuffle set:
------------------------------------------------------------
Algorithm | run 1 | run 2 | run 3 | run 4 | run 5 | Average
------------------------------------------------------------
SBPH      |  1092 |  1100 |  1115 |  1086 |  1110 |  1100.6
------------------------------------------------------------
WPCW      |  1114 |  1126 |  1099 |  1081 |  1129 |  1109.8
------------------------------------------------------------

Classification time (seconds):
------------------------------------------------------------
Algorithm | run 1 | run 2 | run 3 | run 4 | run 5 | Average
------------------------------------------------------------
SBPH      |  3726 |  3824 |  3902 |  3755 |  3971 |  3836
------------------------------------------------------------
WPCW      |  1653 |  1672 |  1665 |  1694 |  1695 |  1675
------------------------------------------------------------

I considered
the classification time of a message to be the time for "crm --stats_only mailfilter.crm < msg" to complete, measured from within a Perl script.

In the test above, WPCW was 2.29 times faster than default CRM (SBPH with "super-Markovian" weights), while keeping essentially the same accuracy - 70 errors in 5000 messages. It also allows much smaller .css files - a disk space reduction of up to 75% - which, in turn, contributed to the higher classification speed. So I think it's a good option to add to the source code.

These results seem promising, but there's one thing I don't understand (I may have missed something or made some mistake): I was never able to get fewer than 62 errors in the last 500 messages, neither with default CRM nor with WPCW. Yet in Bill's Plateau99 paper he mentions 56 errors for SBPH with Markovian 2^(2N) weighting (which I took to be "super-Markovian") and 69 for SBPH alone. If default CRM uses SBPH with "super-Markovian" weights, I should have averaged around 56 errors, but I got 70, which is similar to what Bill reported for SBPH alone. Bill, can you comment on that and help clarify this point?

I can prepare a patch adding WPCW to 20040409-BlameMarys and send it if you or somebody else wants to give it a try. In any case, since I ran the same test with both default CRM and WPCW, I tend to believe the results show important practical advantages of the latter approach over the former, and I'm planning to use it on my production server. There's also room to explore wider sliding windows for better accuracy, since WPCW is faster.

A note for the more theoretically driven folks on the list: I found those weights empirically, and it's really a pain to look for the best ones when you don't have a model. I might have just found a local minimum, and I'd like a way to do a smarter search or, even better, a model that helps deduce the best weights. Any suggestions or comments would be much appreciated.

-- 
Fidelis Assis