From: Christian S. <si...@mi...> - 2004-04-21 14:59:58
Bill:

On Wed, 21 Apr 2004, Bill Yerazunis wrote:

> Last things first: Is the WPCW speedup due to fewer features generated, or a smaller .css file fitting into cache?

I'm quite sure it's mainly due to the number of generated features. WPCW generates only 1/4 of the features of SBPH (for a window size of 5), and run time should be roughly linear in the number of features, because one feature is processed after the other. I don't think the size of the feature cache matters that much, because the algorithm still has to look up each feature (and if it finds it isn't there, it has to add it and discard another). I tried WPCW with 400 to 700K features, and it took 26min for 400, 500, and 700K (strangely, for 600K it was 32min, but that was probably on another computer -- I'm using two for my test runs).

> I'm going ahead and implementing the word-pairs features. However, I'm a bit troubled by the name <wpcw>
>
> I seem to recall that Walsh-Hadamard transforms looked sorta like those features (specifically, diagonalized Walsh-Hadamard).
>
> Does that make any sense? Currently I'm using the classify/learn keyword <walsh> to indicate "not <markov>" but I'm not sure if that's accurate enough or not... any preferable suggestions? Should I stay with <walsh> or use <wpcw> (Word Pairs Context Window)? Easy to change now... hard to change later.

I suppose "Word Pairs Context Window" is clearer than "Word Pairs with a Common Word", but both sound somewhat strange. I thought about "sparse bigrams", since bigram is the scientific term for word pair (but then we still need a handy acronym). What do you think, Fidelis?

> But what further can be done? Well, clearly you can add layers to the perceptron network; a particularly nice implementation is the Hopcroft network which is a rectangular system that allows feedback from arbitrary summations back into the network.

Can you provide pointers to literature about Hopcroft networks and diagonalized Walsh-Hadamard? I need to catch up before I can discuss this...

> If I drop down to 128 slots on the input of the Hopcroft, then it's only 31 megs... close enough to the current 24 that I can handwave it away. :)

Aren't the current CSS files 12MB each (24MB for both spam+nonspam, but not for one)? So is this for the whole network or for a single class?

Bye
Christian

------------ Christian Siefkes -----------------------------------------
| Email: chr...@si... | Web: http://www.siefkes.net/
| Graduate School in Distributed IS: http://www.wiwi.hu-berlin.de/gkvi/
-------------------- Offline P2P: http://www.leihnetzwerk.de/ ----------
What chaos is left in modern society is a precious commodity. We have to be careful to conserve it...
-- Tom DeMarco and Timothy Lister, Peopleware
From: Bill Y. <ws...@me...> - 2004-04-21 14:01:39
Fidelis, Christian, Shalen:

[excuse the rambling - I'm doing a brain dump so that I can clarify things to both myself and others]

Last things first: Is the WPCW speedup due to fewer features generated, or a smaller .css file fitting into cache? It does matter (in an engineering sense), so is there any way to determine this? If you ran with a 1-megaslot (12-megabyte) .css file, do you still have the same speedup in WPCW?

-----

I'm going ahead and implementing the word-pairs features. However, I'm a bit troubled by the name <wpcw>

I seem to recall that Walsh-Hadamard transforms looked sorta like those features (specifically, diagonalized Walsh-Hadamard).

Does that make any sense? Currently I'm using the classify/learn keyword <walsh> to indicate "not <markov>" but I'm not sure if that's accurate enough or not... any preferable suggestions? Should I stay with <walsh> or use <wpcw> (Word Pairs Context Window)? Easy to change now... hard to change later.

In any case, since there _are_ people who need more blazing speed than they need absolute accuracy, <walsh> or <wpcw> or whatever will be in the next major release. :)

-----

As to Winnow... I was doing some reading on it, and it's a perceptron-based algorithm. See:

www.cs.princeton.edu/courses/archive/spring03/cs511/scribe_notes/0325.ps

for a nice bit of description of Winnow. Note that if your features are WPCW, SBPH, or other, and you then feed that into Winnow, you now have a multilayer network and so the perceptron linearity theorem no longer applies, at least if the window in the first layer can bridge the tokens you're trying to XOR.

But what further can be done? Well, clearly you can add layers to the perceptron network; a particularly nice implementation is the Hopcroft network, which is a rectangular system that allows feedback from arbitrary summations back into the network.

-----

Now, the bizarre thing was that on Saturday I was talking to Penney about neural networks and brainstorming how we could maybe shoehorn the feature set into the front end of a Hopcroft-style recursive neural network.

Now, <walsh> can cut the feature set down, maybe even to 1/10th the size of full Markov. That means I have 10 times the slot space available (40 bytes instead of 4). Almost enough...

If I bite the bullet and go with a window length of 3 tokens, then I use 250K slots, and give each slot 256 bytes of weights (using signed char weights, probably a-law or mu-law encoded) to each of 256 inputs of a 256-by-256 (64 Kbyte) Hopcroft neural network; then I need 62 megs per .neu file.

If I drop down to 128 slots on the input of the Hopcroft, then it's only 31 megs... close enough to the current 24 that I can handwave it away. :)

Instead of one mu-law weight per neuron, we could have two bytes - a neuron number, and a mu-law weight. The advantage of this is that it scales - we can have 64 freely allocatable neurons in 31 megs. Or we could have 32 neurons in 16 megs, 16 neurons in 8 megs, 8 neurons in 4 megs....

Or we can get funky - if we use the H1 hash values as part of the neuron addressing scheme, we lose flexibility but may gain in the number of neurons we can cross-access. For example, if the feature hash H1 is 0xDEADBEEF, then this feature has weights for neurons 0xDE, 0xAD, 0xBE, and 0xEF, and the XORs of those, to wit:

0xDE xor 0xAD --> 1101 1110 xor 1010 1101 --> 0111 0011 --> 0x73

which gives us 16 arbitrarily and randomly chosen (but fixed) neurons per feature.

This gives a per-slot memory usage of 4 bytes (H1) + 16 weights + optional H2 = 20 or 24 bytes, compared to the 12 bytes we use right now. And if we only need 250K slots, then we're down to a 6-megabyte data file.

Opinions? Rants?

-Bill Yerazunis
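[The hash-derived neuron addressing above is easy to sketch. A minimal Java version follows; note that the four bytes of H1 plus their pairwise XORs yield only 10 distinct indices, so exactly how the count of 16 neurons per feature is reached is an open detail of the mail -- the sketch stops at the 10 that the 0xDEADBEEF example pins down.]

    public class NeuronAddressing {
        // Neuron indices touched by a feature with 32-bit hash h1: the four
        // hash bytes themselves, plus the XOR of each byte pair.
        static int[] neuronsForFeature(int h1) {
            int[] b = { (h1 >>> 24) & 0xFF, (h1 >>> 16) & 0xFF,
                        (h1 >>> 8) & 0xFF,  h1 & 0xFF };
            int[] out = new int[10];
            int k = 0;
            for (int x : b) out[k++] = x;          // 0xDE, 0xAD, 0xBE, 0xEF
            for (int i = 0; i < 4; i++)
                for (int j = i + 1; j < 4; j++)
                    out[k++] = b[i] ^ b[j];        // e.g. 0xDE ^ 0xAD == 0x73
            return out;
        }

        public static void main(String[] args) {
            for (int n : neuronsForFeature(0xDEADBEEF))
                System.out.printf("0x%02X ", n);
        }
    }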
From: Christian S. <si...@mi...> - 2004-04-21 12:15:50
On Wed, 21 Apr 2004, Christian Siefkes wrote:

> About the tokenization: The pattern I called "simplified tokenization" was as follows:
>
> - In CRM syntax:
>   [[:graph:]][-[:alnum:]]*[[:graph:]]?
>
> - In Java syntax (Unicode-based):
>   [^\p{Z}\p{C}][-\p{L}\p{M}\p{N}]*[^\p{Z}\p{C}]

Sorry, the trailing '?' got lost. This must read:

[^\p{Z}\p{C}][-\p{L}\p{M}\p{N}]*[^\p{Z}\p{C}]?

Bye
Christian

------------ Christian Siefkes -----------------------------------------
| Email: chr...@si... | Web: http://www.siefkes.net/
| Graduate School in Distributed IS: http://www.wiwi.hu-berlin.de/gkvi/
-------------------- Offline P2P: http://www.leihnetzwerk.de/ ----------
All I want is more than I deserve.
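[Since the pattern is given in Java syntax, the corrected version (trailing '?' restored) can be sanity-checked directly with java.util.regex; the sample text below is made up for illustration.]

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class TokenDemo {
        public static void main(String[] args) {
            // Simplified tokenization, with the trailing '?' restored:
            // one non-space/non-control char, then letters/marks/digits/
            // hyphens, then optionally one more non-space/non-control char.
            Pattern token = Pattern.compile(
                "[^\\p{Z}\\p{C}][-\\p{L}\\p{M}\\p{N}]*[^\\p{Z}\\p{C}]?");
            Matcher m = token.matcher("Voilà, e-mail tokens: foo-bar42.");
            while (m.find())
                System.out.println(m.group());
            // Prints: Voilà,  e-mail  tokens:  foo-bar42.
        }
    }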
From: Christian S. <si...@mi...> - 2004-04-21 12:08:57
Hi Fidelis:

On Mon, 19 Apr 2004, Fidelis Assis wrote:

> > I'd really like to try your idea with my algorithm so I'm interested in your code as well. Would you allow me to incorporate it into my software <http://www.inf.fu-berlin.de/inst/ag-db/software/ties/> using the LGPL license? (I cannot use the GPL due to projects I'm involved with.)
>
> For sure! You can use it as you like. I'll be glad if it helps in any way. Patch attached.

Thanks! In the end I didn't use any of your code because it would have been harder to translate it to my setup and Java than to recode. But I used your idea and it works just beautifully!!

Previously I had brought the number of errors from 19 to 18 by using a refined XML+header-aware tokenization pattern (more below). And by switching from SBPH with 1,600,000 features to your WPCW with 600,000 features I got further down to 16! (For comparison: a full one-million-bucket CSS file holds about 2.5M features; your 280K buckets should translate to 700K features, so that's roughly comparable.)

600K vs. 1.6M features might seem somewhat unfair to SBPH, because WPCW generates just 1/4 of the features. But actually it isn't, since I also tried SBPH with more [2M or 2.4M] features and that didn't increase performance; indeed the peak was at 1.6M.

Just as important, it brought the run time over all 41,500 mails down to about 30min on an old Pentium III 1266MHz (invoked using nice, i.e. with low priority, so as not to disturb the user working on the machine), while previously it was at least 90-120 minutes!

Shalen, Bill, and I are just writing a paper about the Winnow thing, and that's a great opportunity for presenting your WPCW as well. Since you invented this, I would propose that you become a co-author as well. What do you think?

----

Here are the old and new results:

SBPH+Winnow, 5% threshold thickness, 1.23 promotion, 0.83 demotion, simplified tokenization (1.6M features, LRU pruning, single pass):
  592 (number of errors on all 10x4147 mails)
   43 (number of errors on the last 10x1000 mails)
   20 (number of errors on the last 10x500 mails)

SBPH+Winnow on normalized mails (normalizemime), 5% threshold thickness, 1.23 promotion, 0.83 demotion, simplified tokenization (1.6M features, LRU pruning, single pass):
  562 (number of errors on all 10x4147 mails)
   39 (number of errors on the last 10x1000 mails)
   19 (number of errors on the last 10x500 mails)

SBPH+Winnow on normalized mails (normalizemime), 5% threshold thickness, 1.23 promotion, 0.83 demotion, XML+header-aware tokenization (1.6M features, LRU pruning, single pass):
  531 (number of errors on all 10x4147 mails)
   34 (number of errors on the last 10x1000 mails)
   18 (number of errors on the last 10x500 mails)

WPCW+Winnow on normalized mails (normalizemime), 5% threshold thickness, 1.23 promotion, 0.83 demotion, XML+header-aware tokenization (600K features, LRU pruning, single pass):
  514 (number of errors on all 10x4147 mails)
   33 (number of errors on the last 10x1000 mails)
   16 (number of errors on the last 10x500 mails)

---

About the tokenization: The pattern I called "simplified tokenization" was as follows:

- In CRM syntax:
  [[:graph:]][-[:alnum:]]*[[:graph:]]?

- In Java syntax (Unicode-based):
  [^\p{Z}\p{C}][-\p{L}\p{M}\p{N}]*[^\p{Z}\p{C}]

\p{?} are Unicode categories: [^\p{Z}\p{C}] means everything except whitespace and control chars; \p{L}, \p{M}, \p{N} are letters, marks [don't ask me what that means], and digits, respectively.

I modified this to make it more aware of XML/HTML markup + mail headers (the line breaks do not really exist, I only inserted them for readability):

[^\\p{Z}\\p{C}]
[/!?#]?
[-\\p{L}\\p{M}\\p{N}]*
(?:["'=;]|/?>|:/*)?

The purpose of the optional second group is to allow matching XML/HTML end tags ( </tag> ), Doctype declarations ( <!DOCTYPE ), processing instructions ( <?xml-stylesheet ), and character references ( &#8211; ) in a token. The last group has been modified to allow matching the ':' after mail headers, the '=' after XML/HTML attributes, the quotes around attribute values, the ';' terminating character + entity references, XML tags incl. empty tags ( <tag> <br/> ), and protocols such as "http://". On the other hand, punctuation marks such as '.' and ',' are consciously omitted, so normal words will be recognized no matter where in a sentence they occur; they are not "contaminated" by trailing punctuation.

Bye
Christian

------------ Christian Siefkes -----------------------------------------
| Email: chr...@si... | Web: http://www.siefkes.net/
| Graduate School in Distributed IS: http://www.wiwi.hu-berlin.de/gkvi/
-------------------- Offline P2P: http://www.leihnetzwerk.de/ ----------
There isn't an Internet company in the world that's going to fail because of mistakes. Internet companies make thousands of mistakes every week.
-- Candice Carpenter, cofounder of iVillage, February 1998
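[Joining the four pieces of the pattern back into one line, a quick Java check shows the kinds of tokens it was designed to capture; the sample text is made up for illustration.]

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class XmlAwareTokenDemo {
        public static void main(String[] args) {
            // The XML/HTML + mail-header aware pattern from above, on one line.
            Pattern token = Pattern.compile(
                "[^\\p{Z}\\p{C}][/!?#]?[-\\p{L}\\p{M}\\p{N}]*(?:[\"'=;]|/?>|:/*)?");
            Matcher m = token.matcher("Subject: see <br/> and </tag> at http://example.org");
            while (m.find())
                System.out.println(m.group());
            // Prints: Subject:  see  <br/>  and  </tag>  at  http://  example  .org
        }
    }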
From: Fidelis A. <fi...@po...> - 2004-04-19 14:07:52
Christian Siefkes wrote:

[...]

> SBPH+Winnow, 5% threshold thickness, 1.23 promotion, 0.83 demotion, simplified tokenization (1.6M features, LRU pruning, single pass):
>   592 (number of errors on all 10x4147 mails)
>    43 (number of errors on the last 10x1000 mails)
>    20 (number of errors on the last 10x500 mails)
>
> SBPH+Winnow on normalized mails (normalizemime), 5% threshold thickness, 1.23 promotion, 0.83 demotion, simplified tokenization (1.6M features, LRU pruning, single pass):
>   562 (number of errors on all 10x4147 mails)
>    39 (number of errors on the last 10x1000 mails)
>    19 (number of errors on the last 10x500 mails)

Fantastic, I want to see your code! Definitely, I need to learn Java and that Winnow algorithm...

[...]

> I'd really like to try your idea with my algorithm so I'm interested in your code as well. Would you allow me to incorporate it into my software <http://www.inf.fu-berlin.de/inst/ag-db/software/ties/> using the LGPL license? (I cannot use the GPL due to projects I'm involved with.)

For sure! You can use it as you like. I'll be glad if it helps in any way. Patch attached.

-- Fidelis
From: Christian S. <si...@mi...> - 2004-04-19 09:24:39
Hi Fidelis:

On Sun, 18 Apr 2004, Fidelis Assis wrote:

> > SBPH+Ultraconservative Winnow, 10% threshold, 1.6M features, LRU pruning, single pass:
> >   719 (number of errors on all 10x4147 mails)
> >    38 (number of errors on the last 10x500 mails)
>
> Very interesting. That number, 38 errors, is really impressive!

And it's possible to do better! Meanwhile I tried another variant of the algorithm, tuned some of the parameters, and slightly modified the tokenization patterns (using /[[:graph:]][-[:alnum:]]*[[:graph:]]?/ instead of /[[:graph:]][-.,:[:alnum:]]*[[:graph:]]?/ so domains will be split at dots etc.). This brought me down to 20 errors. That's on the raw mails, without decoding Base64 or anything like that. Using Jaakko Hyvatti's normalizemime for preprocessing brings a small further improvement, reducing the errors from 20 to 19. Now I think I've reached a point where further improvements will be very hard to achieve, but that number is quite impressive as it is...

SBPH+Winnow, 5% threshold thickness, 1.23 promotion, 0.83 demotion, simplified tokenization (1.6M features, LRU pruning, single pass):
  592 (number of errors on all 10x4147 mails)
   43 (number of errors on the last 10x1000 mails)
   20 (number of errors on the last 10x500 mails)

SBPH+Winnow on normalized mails (normalizemime), 5% threshold thickness, 1.23 promotion, 0.83 demotion, simplified tokenization (1.6M features, LRU pruning, single pass):
  562 (number of errors on all 10x4147 mails)
   39 (number of errors on the last 10x1000 mails)
   19 (number of errors on the last 10x500 mails)

> some results I think are interesting too, but with respect to speed. I tried to speed up classification by removing redundant features in SBPH and got a classification speedup of more than 2 times, with 1/4 of the disk space, while keeping the same accuracy. Maybe there's a possibility to merge the results and have better accuracy with faster classification speed.

...

> So, I used only Word Pairs with a Common Word (WPCW) as below:
>
>  -  -  - w4 w5  - distance: 0
>  -  - w3  - w5  - distance: 1
>  - w2  -  - w5  - distance: 2
> w1  -  -  - w5  - distance: 3

That's an amazing idea! Your results are fascinating indeed...

> Bill, can you comment on that and help me to clarify that point? I can prepare a patch to add WPCW to 20040409-BlameMarys and send it if you or somebody else wants to give it a try.

I'd really like to try your idea with my algorithm so I'm interested in your code as well. Would you allow me to incorporate it into my software <http://www.inf.fu-berlin.de/inst/ag-db/software/ties/> using the LGPL license? (I cannot use the GPL due to projects I'm involved with.)

Bye
Christian

------------ Christian Siefkes -----------------------------------------
| Email: chr...@si... | Web: http://www.siefkes.net/
| Graduate School in Distributed IS: http://www.wiwi.hu-berlin.de/gkvi/
-------------------- Offline P2P: http://www.leihnetzwerk.de/ ----------
From the point of view of a meaningful life, the entire work/leisure duality must be abandoned. As long as we are living our work or our leisure, we are not even truly living. Meaning cannot be found in work or leisure but has to arise out of the nature of the activity itself. Out of passion. Social value. Creativity.
-- Pekka Himanen, The Hacker Ethic
From: Fidelis A. <fi...@po...> - 2004-04-19 03:24:37
On Sun, 18 Apr 2004 22:10:11 -0400, Bill Yerazunis wrote:

> Fidelis:
>
> *fascinating*....
>
> I was talking today with Penney (my better half, etc) about how to trim the feature space down enough so that it could be the inputs to a neural net. Right now there's a million features, and that's far far too many.
>
> I thought that getting rid of the intermediate features would cause too much loss of accuracy. There just isn't space in a feature slot of reasonable size to do what's needed.
>
> But if you really _can_ ignore a lot of the polynomials, then that drops the dimensionality down enough to be reasonable as the first stage of a Hopcroft-style (feedback-capable) neural net.

Oh, neural nets... 18 years ago I was getting my master's degree in AI and used to play with Prolog, Lisp, natural language understanding, etc. At that time, some co-students were playing with neural nets and I was very impressed with the possibilities, but life took me in other directions and I had to concentrate on my job as a telecomm engineer and later as an internet servers support engineer. It's interesting to hear about neural nets again.

> And _that_ will kick butt. :)

I hope so :)

> -----
>
> Now, as to the difference between 56 and 70, I have no clue. But your set shows that you have a large standard deviation per run; I suppose I hit a good set.
>
> Did you try to replicate the run with the ten shuffles I sent out (or did I send them to you at all? Maybe not...)

No, you didn't. But I'd like to run the test with it.

> But your results vary all the way from 62 to 82 out of a 10-shuffle set.

I believe that it is an indication that the learning step is greater than necessary, which may cause unlearning of things already learned and greater oscillation/delay until convergence, if it converges at all. Some sequences may be more sensitive to that effect.

> Did you use the defaults for the build (same :lcr:, same default number of slots in the .css files?)

Yes, I used all the defaults. The only change, applied to the WPCW version, was smaller css files. I used the prime 282001 for the number of buckets - near 1/4 of the default size - because WPCW produces 4 times fewer features than SBPH.

> Fascinating, either way though.

I'm also excited about the possible results... :)

-- Fidelis Assis
From: Bill Y. <ws...@me...> - 2004-04-19 02:10:21
Fidelis:

*fascinating*....

I was talking today with Penney (my better half, etc) about how to trim the feature space down enough so that it could be the inputs to a neural net. Right now there's a million features, and that's far far too many.

I thought that getting rid of the intermediate features would cause too much loss of accuracy. There just isn't space in a feature slot of reasonable size to do what's needed.

But if you really _can_ ignore a lot of the polynomials, then that drops the dimensionality down enough to be reasonable as the first stage of a Hopcroft-style (feedback-capable) neural net.

And _that_ will kick butt. :)

-----

Now, as to the difference between 56 and 70, I have no clue. But your set shows that you have a large standard deviation per run; I suppose I hit a good set.

Did you try to replicate the run with the ten shuffles I sent out (or did I send them to you at all? Maybe not...)

But your results vary all the way from 62 to 82 out of a 10-shuffle set.

Did you use the defaults for the build (same :lcr:, same default number of slots in the .css files?)

Fascinating, either way though.

-Bill Yerazunis
From: Fidelis A. <fi...@po...> - 2004-04-19 01:00:51
Christian Siefkes wrote:

> Hi,
>
> On Sun, 11 Apr 2004, Bill Yerazunis wrote:
>
> > Using 3-pass TUNE, with 1, 3, 13, 75, 531 as weights, I get (and yes, this is the best ever) 47 errors out of 5000. That handily beats the previous record of 54, which is superexponential weighting on a TUNE.
> >
> > Fascinating- the same set of weights is WORSE in single pass but better in 3-pass.
>
> That's 58 vs. 60 errors in single pass? I suppose it's hard to conclude anything from such a small difference except that they perform very similarly... Maybe you could look at another/extended set, e.g. the last 1000 mails of each batch, to see whether the difference will be higher?
>
> > Here's the detailed extended report.... and my laptop is literally too hot to touch near the CPU, so I will let it cool a bit before I run base7 (that being 1, 7, 49, 343, 2401).
>
> ...
>
> > SBPH16EM - denominator 16, Super-Markovian entropic correction
> >   0 1 1125
> >   0 1 56
> > SB16-24SM - denom 16, 24 megabyte 512-chain .css, Super-Markovian
> >   0 1 1132
> >   0 1 58
> > 16-BCS - denom 16, 512-chain .css, Breyer-Chhabra-Siefkes weights
> >   0 1 1135
> >   0 1 60
>
> ...
>
> I've got a new result to add, and I'm quite excited about it:
>
> SBPH+Ultraconservative Winnow, 10% threshold, 1.6M features, LRU pruning, single pass:
>   719 (number of errors on all 10x4147 mails)
>    38 (number of errors on the last 10x500 mails)

Very interesting. That number, 38 errors, is really impressive!

I have some results I think are interesting too, but with respect to speed. I tried to speed up classification by removing redundant features in SBPH and got a classification speedup of more than 2 times, with 1/4 of the disk space, while keeping the same accuracy. Maybe there's a possibility to merge the results and have better accuracy with faster classification speed.

SBPH generates 16 features for each word in the text, combining it with the 4 previous ones (5-word window):

w1 w2 w3 w4 w5 =>

 -  -  -  - w5
 -  -  - w4 w5
 -  - w3  - w5
 -  - w3 w4 w5
 - w2  -  - w5
 - w2  - w4 w5
 - w2 w3  - w5
 - w2 w3 w4 w5
w1  -  -  - w5
w1  -  - w4 w5
w1  - w3  - w5
w1  - w3 w4 w5
w1 w2  -  - w5
w1 w2  - w4 w5
w1 w2 w3  - w5
w1 w2 w3 w4 w5

Clearly, there are many features that are contained in others. Bill's intention, as far as I understood, was exactly that: to get some kind of non-linearity and obtain greater discriminating power. I tried a different approach, since accuracy is already good enough for practical purposes, and considered only word pairs with a common word inside the window, instead of all combinations of the first four plus the last one. The idea behind this approach was to gain speed by working only with kind of "prime" features inside the window (more independent features), hoping that the chain rule would automatically consider all combinations and produce the same final effect. So, I used only Word Pairs with a Common Word (WPCW) as below:

 -  -  - w4 w5  - distance: 0
 -  - w3  - w5  - distance: 1
 - w2  -  - w5  - distance: 2
w1  -  -  - w5  - distance: 3

You can "or" them to get any of the 16 combinations in SBPH, except for the single-word combination, which I chose not to use. Closer pairs are given more credence, so the weights are based on the distance, instead of on the length as in SBPH/SM.

Test results: I used the same test methodology as described in Bill's Plateau papers:

- 10 shuffles of 4147 messages from the SpamAssassin corpora, classified as:
  - 1397 spam (20030228_spam_2.tar.bz2)
  - 2500 easy ham (20030228_easy_ham.tar.bz2)
  - 250 hard ham (20030228_hard_ham.tar.bz2)

Other information for the test:

- Weights for WPCW: 24, 14, 7, 4
- CSS file size for WPCW: 282,001 buckets (3,384,012 bytes)
- CSS file size for SBPH: 1,048,577 buckets (12,582,924 bytes)
- OS: Linux 2.4.20-30.9 - RedHat 9.0
- CPU: AMD Athlon(TM) XP 2400
- CRM version: 20040409-BlameMarys

I repeated the test 5 times, each time with a different set of 10 shuffles:

Errors/5000 msgs - only errors in the last 500 messages of a 10-shuffle set:
------------------------------------------------------------
Algorithm | run 1 | run 2 | run 3 | run 4 | run 5 | Average
------------------------------------------------------------
SBPH      |   69  |   70  |   62  |   71  |   82  |   70.8
------------------------------------------------------------
WPCW      |   62  |   72  |   65  |   80  |   73  |   70.4
------------------------------------------------------------

Errors/41470 msgs - all errors in a 10-shuffle set:
------------------------------------------------------------
Algorithm | run 1 | run 2 | run 3 | run 4 | run 5 | Average
------------------------------------------------------------
SBPH      |  1092 |  1100 |  1115 |  1086 |  1110 | 1100.6
------------------------------------------------------------
WPCW      |  1114 |  1126 |  1099 |  1081 |  1129 | 1109.8
------------------------------------------------------------

Classification time (seconds):
------------------------------------------------------------
Algorithm | run 1 | run 2 | run 3 | run 4 | run 5 | Average
------------------------------------------------------------
SBPH      |  3726 |  3824 |  3902 |  3755 |  3971 |  3836
------------------------------------------------------------
WPCW      |  1653 |  1672 |  1665 |  1694 |  1695 |  1675
------------------------------------------------------------

I considered the classification time of a message to be the time for "crm --stats_only mailfilter.crm < msg" to complete from within a Perl script.

In the test above, WPCW was 2.29 times faster than default CRM (SBPH with "super markovian" weights), while keeping basically the same accuracy - 70 errors in 5000 messages. Also, it allows us to use much smaller css files - a disk space reduction of up to 75% - which, in turn, contributed to the increase in classification speed. So, I think it's a good option to be added to the source code.

These results seem promising, but there's one thing I don't understand (I may have missed something or made some mistake): I was never able to get fewer than 62 errors in the last 500 messages, neither with default CRM nor with WPCW, but in Bill's Plateau99 he mentions 56 errors with SBPH and markovian 2^(2N) weights (I took that as "super markovian") and 69 for SBPH only. If default CRM uses SBPH with "super markovian", I should have had an average around 56 errors, but I got 70, which is similar to what Bill had for SBPH only. Bill, can you comment on that and help me to clarify that point?

I can prepare a patch to add WPCW to 20040409-BlameMarys and send it if you or somebody else wants to give it a try. Nevertheless, as I ran the same test with default CRM and with WPCW, I tend to believe that the results show important practical advantages of the latter approach over the former, and I'm planning on using it on my production server.

Also, there's room for exploring wider sliding windows, for better accuracy, since it's faster.

Note for the more theoretically driven guys on the list: I found those weights empirically and it's really a pain to look for the best ones when you don't have a model. I might have just found a local minimum and would like to have a way to do a smarter search or, even better, to have a model that helped in deducing the best weights. Any suggestions or comments would be much appreciated.

-- Fidelis Assis
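[The two feature sets are compact enough to generate in a few lines. Below is an illustrative Java sketch of feature generation over one 5-word window, mirroring the tables above; real implementations hash these patterns into buckets rather than keeping strings.]

    import java.util.ArrayList;
    import java.util.List;

    public class WindowFeatures {
        // SBPH: every subset of {w1..w4} combined with w5 -> 16 features,
        // including the single-word feature "- - - - w5" (empty subset).
        static List<String> sbph(String[] w) {
            List<String> feats = new ArrayList<>();
            for (int mask = 0; mask < 16; mask++) {
                StringBuilder f = new StringBuilder();
                for (int i = 0; i < 4; i++)
                    f.append((mask & (8 >> i)) != 0 ? w[i] : "-").append(' ');
                feats.add(f.append(w[4]).toString());
            }
            return feats;
        }

        // WPCW: only the four pairs (w4,w5) .. (w1,w5), one per distance
        // 0..3; the single-word feature is deliberately omitted.
        static List<String> wpcw(String[] w) {
            List<String> feats = new ArrayList<>();
            for (int dist = 0; dist < 4; dist++)
                feats.add(w[3 - dist] + " " + w[4] + " (distance " + dist + ")");
            return feats;
        }

        public static void main(String[] args) {
            String[] w = {"w1", "w2", "w3", "w4", "w5"};
            sbph(w).forEach(System.out::println);  // 16 features per window
            wpcw(w).forEach(System.out::println);  //  4 features per window
        }
    }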
From: Christian S. <si...@mi...> - 2004-04-13 14:33:29
Hi,

On Sun, 11 Apr 2004, Bill Yerazunis wrote:

> Using 3-pass TUNE, with 1, 3, 13, 75, 531 as weights, I get (and yes, this is the best ever) 47 errors out of 5000. That handily beats the previous record of 54, which is superexponential weighting on a TUNE.
>
> Fascinating- the same set of weights is WORSE in single pass but better in 3-pass.

That's 58 vs. 60 errors in single pass? I suppose it's hard to conclude anything from such a small difference except that they perform very similarly... Maybe you could look at another/extended set, e.g. the last 1000 mails of each batch, to see whether the difference will be higher?

> Here's the detailed extended report.... and my laptop is literally too hot to touch near the CPU, so I will let it cool a bit before I run base7 (that being 1, 7, 49, 343, 2401).

...

> SBPH16EM - denominator 16, Super-Markovian entropic correction
>   0 1 1125
>   0 1 56
> SB16-24SM - denom 16, 24 megabyte 512-chain .css, Super-Markovian
>   0 1 1132
>   0 1 58
> 16-BCS - denom 16, 512-chain .css, Breyer-Chhabra-Siefkes weights
>   0 1 1135
>   0 1 60

...

I've got a new result to add, and I'm quite excited about it:

SBPH+Ultraconservative Winnow, 10% threshold, 1.6M features, LRU pruning, single pass:
  719 (number of errors on all 10x4147 mails)
   38 (number of errors on the last 10x500 mails)

This is single-pass (as TOE), not multi-pass (as TUNE), thus it's a decrease in the error rate by about 1/3 (compared to the 56 errors reported by you for Super-Markovian entropic correction)!

I did not strictly use CRM114 for these results, but my own Java-based package which I'm developing mainly for information extraction <http://en.wikipedia.org/wiki/Information_extraction> (I now put a preliminary version at http://www.inf.fu-berlin.de/inst/ag-db/software/ties/ ). I started using CRM for my work, but I also looked at other incremental (i.e. single-pass) classification methods and finally developed a combination of the Winnow algorithm (cf. e.g. http://citeseer.ist.psu.edu/article/dagan97mistakedriven.html ) and the "ultraconservative" algorithms introduced in http://citeseer.ist.psu.edu/crammer01ultraconservative.html . I'm also using the "thick threshold" heuristic of dagan97mistakedriven, i.e. I train not only on errors but also on "almost-errors", when the scores of other classes are only slightly lower than the true score.

For feature preprocessing, I combined this classifier with (a primitive but usable re-implementation of) CRM's sparse binary polynomial hashing. For tokenization I used the same pattern as CRM. For the reported result I've used 1.6M features. CRM's CSS files contain 1M buckets by default, but if I can trust cssutil and cssdiff, a full (.097%) CSS file stores about 2.5-2.6M hashed datums = features -- so it should be a fair comparison, or did I get this wrong?

Another important difference is that I use LRU (least recently used) pruning instead of microgrooming. While microgrooming more-or-less randomly deletes a feature if the store is full (or so I understand), I delete the least recently seen feature (all other features were encountered after the victim).

Bye
Christian

------------ Christian Siefkes -----------------------------------------
| Email: chr...@si... | Web: http://www.siefkes.net/
| Graduate School in Distributed IS: http://www.wiwi.hu-berlin.de/gkvi/
-------------------- Offline P2P: http://www.leihnetzwerk.de/ ----------
Those who would give up essential liberty, to purchase a little temporary safety, deserve neither liberty nor safety.
-- Benjamin Franklin
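[A minimal sketch of the Winnow variant described above -- multiplicative weight updates with a thick threshold -- not Christian's actual TIES code. The promotion/demotion factors come from the runs reported in this thread; the class names, the margin test, and the default weight of 1 are illustrative assumptions.]

    import java.util.HashMap;
    import java.util.Map;

    public class ThickWinnow {
        static final double PROMOTE = 1.23, DEMOTE = 0.83, THICKNESS = 0.05;
        final Map<String, Double> spam = new HashMap<>(), ham = new HashMap<>();

        // Score of a feature set under one class; unseen weights default to 1.
        static double score(Map<String, Double> w, String[] feats) {
            double s = 0;
            for (String f : feats) s += w.getOrDefault(f, 1.0);
            return s;
        }

        void learn(String[] feats, boolean isSpam) {
            double sTrue  = score(isSpam ? spam : ham, feats);
            double sOther = score(isSpam ? ham : spam, feats);
            // Thick threshold: update not only on errors but also on
            // "almost-errors", i.e. when the true class fails to win by
            // a relative margin of at least THICKNESS.
            if (sTrue <= sOther * (1 + THICKNESS)) {
                Map<String, Double> up   = isSpam ? spam : ham;
                Map<String, Double> down = isSpam ? ham : spam;
                for (String f : feats) {
                    // merge() starts an absent weight at 1.0 * factor.
                    up.merge(f, PROMOTE, (a, b) -> a * b);
                    down.merge(f, DEMOTE, (a, b) -> a * b);
                }
            }
        }

        boolean classifyAsSpam(String[] feats) {
            return score(spam, feats) > score(ham, feats);
        }
    }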
From: Laird B. <lb...@us...> - 2004-03-31 05:49:40
On Mar 31 2004, Raul Miller wrote:

> However, being blunt -- if you can't explain a concept in simple english you do not understand it.
>
> Concepts you understand are simple. Concepts you do not really understand are complex.

That's ok, I thought I could explain that particular issue without resorting to technical language, but I now realize I failed. I hope I didn't confuse you too much, but hopefully if you take the time to read some of that book, something useful can still come out of this.

It was a good discussion though.

Laird.
From: Raul M. <mo...@ma...> - 2004-03-31 05:09:26
On Tue, Mar 30, 2004 at 11:25:37PM -0500, I wrote:

> More specifically, at time T0 it's not known whether a document will be classified as spam or non-spam. At time T1 this is known. At time T0, we can assign a probability to the chance that it will be classified into one of these two sets.

I should probably add, in case it's not obvious: the probability P(class|feature) is the classification probability which would be assigned by a hypothetical observer who is only allowed to know the presence or absence of that specific feature and (eventually) the document's classification.

- Raul
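[For reference, the dependency between the two families of conditionals that is argued back and forth in this thread is just Bayes' rule, in the thread's own notation:

    P(class|feature) = P(feature|class) * P(class) / P(feature)

so fixing P(feature|class) together with P(class) and P(feature) determines P(class|feature), and vice versa.]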
From: Raul M. <mo...@ma...> - 2004-03-31 04:26:24
> > No, that was never established. I've repeatedly asked you to prove this assertion, and you've always evaded the point.

On Wed, Mar 31, 2004 at 01:45:32PM +1000, Laird Breyer wrote:

> Some miscommunication! I've repeatedly explained that if you pick them independently, then you can violate the ordering P(A) <= P(B) when A is included in B. This is a fundamental property. No further proof is needed.

In the last few messages, B has referred to the set a document is classified into, and A has referred to some feature of such a document. My guess, however, is that here you are referring to the probability that some feature A and some feature B will appear on some arbitrary document within some set. In either case, the relationship you've just specified is not a relationship between probabilities I've stated are independent.

> > The closest you've come to proving this is your assertion that P(B|A) depends on P(A|B), but that is blatantly false, since there is information needed for P(A|B) which is not needed for P(B|A).
>
> So you're saying that because some extra information is needed for the full relationship, the two are therefore not related?

That's just wrong. I never claimed that they were not related.

> This is just so much waffle. If you do this and do that, and pretend this is that, then it works!
>
> I've given you a specific set of requirements, e.g. Prob(cat|class) >= Prob(cat sees|class). Go sit down with a piece of paper and come up with a specific probability measure on documents which satisfies those requirements and represents accurately what crm114 is doing. You have all the time in the world.

I'm not claiming Prob(cat|class) is independent of Prob(cat sees|class). I'm claiming Prob(class|cat) is independent of Prob(class|cat sees).

> > > You can pick all the P(B|A) together according to carefully picked formulas in terms of the number of documents and whatever else is relevant. Only if you dump multiword features are the restrictions weak enough to pick the P(B|A) effectively independently as you're trying to do.
> >
> > I'm still waiting for proof. All I've seen so far is hand waving.
>
> I don't need to prove anything. It's up to you to pick a suitable measure P. The ones you've come up with so far failed the consistency test. If you can come up with a consistent probability on documents, you've won.

You've yet to prove that there's anything wrong with Prob(class|feature).

> > > (*) otherwise, let's agree to disagree and stop this line of discussion.
> >
> > You've got three choices:
> >
> > [a] Prove this point
> > [b] Stop making the point
> > [c] Realize that I'm going to object every time you make this point.
> >
> > My preference would be to achieve some kind of mutual understanding (leading to either [a] or [b]), however if you insist on mutual disagreement (leading to either [b] or [c]) I'm not going to be able to stop you.
>
> We've argued this to death. I'm going to take option c, and stop arguing in circles.

Ok, I'm going to take this as a statement that you have no mathematical basis for your assertion.

> You want me to provide an example of a nonexistent probability measure which is inconsistent? I won't.

You've yet to show that these are not probabilities. These probabilities are easily measurable, and easily describable in English. Waving your hands at them does not make them go away.

More specifically, at time T0 it's not known whether a document will be classified as spam or non-spam. At time T1 this is known. At time T0, we can assign a probability to the chance that it will be classified into one of these two sets.

> If it is impractical to take such a course for various reasons, perhaps you will enjoy learning from a book. Here's a free one you can download:
>
> http://www.dartmouth.edu/~chance/teaching_aids/books_articles/probability_book/book.html
>
> As I said, the important thing is to wrestle with as many exercises as possible. Only this way will you obtain a feel for how it works.
>
> Sorry to be so blunt, but this question is no longer worth spending time on, at least for me. If you feel that I'm just bullshitting, that's fine too. We'll disagree on this particular topic and that's that.

I will read the book you've given a reference to.

However, being blunt -- if you can't explain a concept in simple english you do not understand it.

Concepts you understand are simple. Concepts you do not really understand are complex.

-- Raul
From: Laird B. <lb...@us...> - 2004-03-31 03:45:40
On Mar 31 2004, Raul Miller wrote:

> On Tue, Mar 30, 2004 at 04:50:22PM +1000, Laird Breyer wrote:
>
> > Not again! We've established(*) that if you pick the P(B|A) independently, you don't generally obtain a probability.
>
> No, that was never established. I've repeatedly asked you to prove this assertion, and you've always evaded the point.

Some miscommunication! I've repeatedly explained that if you pick them independently, then you can violate the ordering P(A) <= P(B) when A is included in B. This is a fundamental property. No further proof is needed.

> The closest you've come to proving this is your assertion that P(B|A) depends on P(A|B), but that is blatantly false, since there is information needed for P(A|B) which is not needed for P(B|A).

So you're saying that because some extra information is needed for the full relationship, the two are therefore not related? That's just wrong.

> > You must pick the family P(B|A), where B and A vary appropriately, in a consistent way. Picking the P(B|A) independently does not guarantee consistency.
>
> You need to be consistent with is the choice of spam vs. not spam. You have yet to prove that there needs to be any consistency between P(B|A1) and P(B|A2).
>
> Hmm... ok, I will grant that if feature A1 contains A2 and A1 has a non-zero probability then A2 must also have a non-zero probability. However, A2 can be within epsilon of zero, which I consider to be close enough. More generally, crm114 works with probabilities in the open interval (0,1) -- it's an explicit requirement that 1 and 0 are never valid probabilities. You can never know that a certain word or phrase will never appear in spam, nor can you ever know that it will never appear in non-spam.
>
> Or, if you declare that there are such words, you filter documents which contain them using your white list or black list, and don't calculate invalid probabilities for such documents.

This is just so much waffle. If you do this and do that, and pretend this is that, then it works!

I've given you a specific set of requirements, e.g. Prob(cat|class) >= Prob(cat sees|class). Go sit down with a piece of paper and come up with a specific probability measure on documents which satisfies those requirements and represents accurately what crm114 is doing. You have all the time in the world.

> > You can pick all the P(B|A) together according to carefully picked formulas in terms of the number of documents and whatever else is relevant. Only if you dump multiword features are the restrictions weak enough to pick the P(B|A) effectively independently as you're trying to do.
>
> I'm still waiting for proof. All I've seen so far is hand waving.

I don't need to prove anything. It's up to you to pick a suitable measure P. The ones you've come up with so far failed the consistency test. If you can come up with a consistent probability on documents, you've won.

> > (*) otherwise, let's agree to disagree and stop this line of discussion.
>
> You've got three choices:
>
> [a] Prove this point
> [b] Stop making the point
> [c] Realize that I'm going to object every time you make this point.
>
> My preference would be to achieve some kind of mutual understanding (leading to either [a] or [b]), however if you insist on mutual disagreement (leading to either [b] or [c]) I'm not going to be able to stop you.

We've argued this to death. I'm going to take option c, and stop arguing in circles.

> > > It's true that with this information, and some additional information (the cardinality of the B set), we could find P(A|B), but I don't see that this imposes any kind of circular consistency requirement.
> >
> > Perhaps circular is the wrong word. You cannot pick the P(B|A) unless you know that their effect on subsequently calculated P(A|B) is acceptable. Moreover, using the rules of probability, you can take those calculated P(A|B) and recalculate the P(B|A) from them in many different ways, and the result must always be the same. This is not true if you pick the P(B|A) independently, unless you're exceedingly lucky in your choice.
>
> If this is the case, your proof should be trivial -- provide an example of some set of P(B|Ax) which does not have a corresponding set of P(Ax|B) which satisfy the constraints on P(Ax|B). [Clearly, you'll also need to list the appropriate constraints, since from my point of view that's arbitrary.] Note that I'll grant your point for some examples involving the probabilities 1 or 0, but I've explicitly rejected your point for all other probabilities.

You want me to provide an example of a nonexistent probability measure which is inconsistent? I won't.

> If I grant that 1 and 0 are invalid probabilities by definition (they require certainty about what spammers will do and/or about what information you'll need to hear about), will you grant my point?

I think what needs to happen is that you take a basic course on probability and statistics. We can't continue to argue about absolutely fundamental issues. Such a course will help in the following way: you'll do lots of varied assignment problems, and you'll start to get a feel for how it works. Most importantly, you'll become familiar with the standard definitions of relevant concepts, and if you'd like to discuss this again, we can then use the proper terminology rather than arguing endlessly about what the basics mean.

If it is impractical to take such a course for various reasons, perhaps you will enjoy learning from a book. Here's a free one you can download:

http://www.dartmouth.edu/~chance/teaching_aids/books_articles/probability_book/book.html

As I said, the important thing is to wrestle with as many exercises as possible. Only this way will you obtain a feel for how it works.

Sorry to be so blunt, but this question is no longer worth spending time on, at least for me. If you feel that I'm just bullshitting, that's fine too. We'll disagree on this particular topic and that's that.

-- Laird Breyer.
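[A concrete instance of the requirement Prob(cat|class) >= Prob(cat sees|class) invoked above: every document containing the phrase "cat sees" also contains the word "cat", so the corresponding event sets are nested, and for any class

    {d : "cat sees" in d} is a subset of {d : "cat" in d}
    ==> Prob(cat sees | class) <= Prob(cat | class).

Estimates chosen independently for each feature need not respect this inequality, which is the consistency failure being argued about.]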
From: Laird B. <lb...@us...> - 2004-03-31 03:05:46
On Mar 30 2004, Christian Siefkes wrote:

> Hi Laird,
>
> On Tue, 30 Mar 2004, Laird Breyer wrote:
>
> > (*) you might want to read the exchange I'm having with Christian Siefkes, who is trying to get around the difficulties in a promising way. He builds a parallel universe D_5 in which features are independent exactly as you assume, but (I'm fairly certain) he's in trouble when he wants to bring back the result into the real universe D.
>
> Maybe you're right and it's not possible to consider the probabilities calculated in D_5 (after the SBPH transformation \phi) as probabilities on D (prior to this transformation i.e. with single-word features only).

=======PEDANTISM=============

Just to be pedantic, for clarity's sake, the probability measure P on D_5 makes all features independent, where a feature is an n-gram for n <= 5. If you accept my proof, then you do not obtain a probability Prob on D by simply taking the restriction of P to the subset \phi(D), i.e. what you wanted to do was

Prob(A) = P(A \intersect \phi(D)),
Prob(A|B) = P(A \intersect \phi(D) | B \intersect \phi(D))

for all possible sets A, B of interest, e.g. A = { d is spam } etc. It is entirely possible that a more complex transformation Prob = some transformation of P could work. But I don't have one handy.

=======PEDANTISM=============

> But it might be enough to stay in D_5 and use the probabilities as calculated there. After all each document has a unique representation in D_5 just as in D, and even the untransformed feature vector in D is not the "real document" because prior to that you decided how to tokenize, what to discard (e.g. whitespace) etc...

The formalism D -> \phi -> D_5 is also suitable for any other transformation of documents. Let's see how it would work for discarding whitespace.

Let \psi(t) = t', where t' is the same string as t, but with consecutive whitespace removed. Now write R for the set of real documents, and R' for the documents with whitespace removed, so \psi: R -> R'.

What is the difference from the situation \phi: D -> D_5? A crucial step in my proof is that there exists T \in D_5 such that T = \phi(t) has no solution for t in D. For example, the document T = (a,b,c#b) is not invertible into a document in D.

However, for \psi, all documents in R' have an antecedent in R. So the memory leak trick I used cannot be used, and in fact because \psi(R) = R', if you look at the pedantic section above you see that you can use the formula as you wanted. So there is a probability Prob on R which can be obtained from P on R', but of course you must construct such a probability P on R' first.

So the moral of the story is that some transformations work, and some don't.

> OK, the documents we'll encounter in D_5 are sparse in the sense that there are documents we'll never see because they cannot be generated by the SBPH transformation (\phi(D) is a strict subset of D_5). So the Naive Bayes classifier thinks it is operating on D_5 while truly it is operating on the subset \phi(D) only. That's bad but I think it's not entirely unreasonable to hope that the effects of this wrong assumption will roughly cancel each other out because all classes are affected the same way. I'm an optimist.

This is a good hope, but only a hope. The proof I gave you shows that there is no theoretical justification based on probability theory for making this claim. So yes, we all hope it's true but we cannot strictly argue authoritatively that it is so by invoking Bayesian statistics.

> Here is a paper that seems to confirm this hope: http://www.intellektik.informatik.tu-darmstadt.de/~tom/IJCAI01/Rish.pdf . They state that Naive Bayes works best if _either_ the features are really independent (that's obvious) _or_ if the dependencies between features are functional (deterministic). Now the feature dependencies introduced by SBPH are completely deterministic -- so this might help to understand why CRM can introduce these dependencies and still proceed as if they didn't exist without suffering a performance drop.

The general approach of this paper is the right direction, I think, but we have to be careful about generalizing its results to the present case.

The relevant bit is Theorem 3, which is proven in a tech report by Rish, Hellerstein and Thathachar. If you look at the proof they give (it's Theorem 6), then you see that a crucial assumption is that f(a) == f(b) implies a == b, and in fact if this fails, then their proof breaks down easily. Just think for example of the case where f(x) = constant1 for x \in A, f(x) = constant2 for x \notin A.

I'm not convinced this theorem is strong enough to be generalized to the case D -> D_5. Consider the simpler case D -> D_2, and consider the natural corresponding definition of \phi_2:

\phi_2(a,b,c,d) =(def) (a,b,c,d,a#b,b#c,c#d).

We can write this in the form required by their Theorem: let x = (a,b,c,d), then

\phi_2(x) = (x, f_1(x), f_2(x), f_3(x))

But their theorem doesn't apply, because f_1(x) = f_1(y) does not imply x = y, etc. There could be a variation of their theorem, and a variation on \phi_2, which together would work, but it is just so much speculation. Intuitively, I don't believe it, at least for now.

-- Laird Breyer.
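[The map \phi_2 is small enough to write out. An illustrative Java sketch follows; the comment in main() restates the non-invertibility point the proof above relies on.]

    import java.util.ArrayList;
    import java.util.List;

    public class Phi2 {
        // phi_2(a,b,c,d) = (a, b, c, d, a#b, b#c, c#d):
        // the unigrams followed by each adjacent bigram.
        static List<String> phi2(List<String> doc) {
            List<String> out = new ArrayList<>(doc);
            for (int i = 0; i + 1 < doc.size(); i++)
                out.add(doc.get(i) + "#" + doc.get(i + 1));
            return out;
        }

        public static void main(String[] args) {
            // Prints [a, b, c, d, a#b, b#c, c#d]
            System.out.println(phi2(List.of("a", "b", "c", "d")));
            // The image phi2(D) is a strict subset of the target space:
            // a tuple whose bigram slot reads c#b while its unigram slots
            // read (a, b), such as (a, b, c#b), cannot be phi2 of any
            // document -- the bigram slot would have to read a#b.
        }
    }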
From: Christian S. <si...@mi...> - 2004-03-30 16:49:34
Hi Laird,

On Tue, 30 Mar 2004, Laird Breyer wrote:

> (*) you might want to read the exchange I'm having with Christian Siefkes, who is trying to get around the difficulties in a promising way. He builds a parallel universe D_5 in which features are independent exactly as you assume, but (I'm fairly certain) he's in trouble when he wants to bring back the result into the real universe D.

Maybe you're right and it's not possible to consider the probabilities calculated in D_5 (after the SBPH transformation \phi) as probabilities on D (prior to this transformation i.e. with single-word features only). But it might be enough to stay in D_5 and use the probabilities as calculated there. After all each document has a unique representation in D_5 just as in D, and even the untransformed feature vector in D is not the "real document" because prior to that you decided how to tokenize, what to discard (e.g. whitespace) etc...

OK, the documents we'll encounter in D_5 are sparse in the sense that there are documents we'll never see because they cannot be generated by the SBPH transformation (\phi(D) is a strict subset of D_5). So the Naive Bayes classifier thinks it is operating on D_5 while truly it is operating on the subset \phi(D) only. That's bad but I think it's not entirely unreasonable to hope that the effects of this wrong assumption will roughly cancel each other out because all classes are affected the same way.

Here is a paper that seems to confirm this hope: http://www.intellektik.informatik.tu-darmstadt.de/~tom/IJCAI01/Rish.pdf . They state that Naive Bayes works best if _either_ the features are really independent (that's obvious) _or_ if the dependencies between features are functional (deterministic). Now the feature dependencies introduced by SBPH are completely deterministic -- so this might help to understand why CRM can introduce these dependencies and still proceed as if they didn't exist without suffering a performance drop.

Bye
Christian

------------ Christian Siefkes -----------------------------------------
| Email: chr...@si... | Web: http://www.siefkes.net/
| Graduate School in Distributed IS: http://www.wiwi.hu-berlin.de/gkvi/
-------------------- Offline P2P: http://www.leihnetzwerk.de/ ----------
Freedom is being able to make decisions that affect mainly you. Power is being able to make decisions that affect others more than you. If we confuse power with freedom, we will fail to uphold real freedom.
-- Bradley M. Kuhn and Richard M. Stallman, Freedom Or Power?
From: Raul M. <mo...@ma...> - 2004-03-30 14:38:24
On Tue, Mar 30, 2004 at 04:50:22PM +1000, Laird Breyer wrote:

> Not again! We've established(*) that if you pick the P(B|A) independently, you don't generally obtain a probability.

No, that was never established. I've repeatedly asked you to prove this assertion, and you've always evaded the point.

The closest you've come to proving this is your assertion that P(B|A) depends on P(A|B), but that is blatantly false, since there is information needed for P(A|B) which is not needed for P(B|A).

> You must pick the family P(B|A), where B and A vary appropriately, in a consistent way. Picking the P(B|A) independently does not guarantee consistency.

All you need to be consistent with is the choice of spam vs. not spam. You have yet to prove that there needs to be any consistency between P(B|A1) and P(B|A2).

Hmm... ok, I will grant that if feature A1 contains A2 and A1 has a non-zero probability then A2 must also have a non-zero probability. However, A2 can be within epsilon of zero, which I consider to be close enough. More generally, crm114 works with probabilities in the open interval (0,1) -- it's an explicit requirement that 1 and 0 are never valid probabilities. You can never know that a certain word or phrase will never appear in spam, nor can you ever know that it will never appear in non-spam.

Or, if you declare that there are such words, you filter documents which contain them using your white list or black list, and don't calculate invalid probabilities for such documents.

> You can pick all the P(B|A) together according to carefully picked formulas in terms of the number of documents and whatever else is relevant. Only if you dump multiword features are the restrictions weak enough to pick the P(B|A) effectively independently as you're trying to do.

I'm still waiting for proof. All I've seen so far is hand waving.

> (*) otherwise, let's agree to disagree and stop this line of discussion.

You've got three choices:

[a] Prove this point
[b] Stop making the point
[c] Realize that I'm going to object every time you make this point.

My preference would be to achieve some kind of mutual understanding (leading to either [a] or [b]), however if you insist on mutual disagreement (leading to either [b] or [c]) I'm not going to be able to stop you.

> > It's true that with this information, and some additional information (the cardinality of the B set), we could find P(A|B), but I don't see that this imposes any kind of circular consistency requirement.
>
> Perhaps circular is the wrong word. You cannot pick the P(B|A) unless you know that their effect on subsequently calculated P(A|B) is acceptable. Moreover, using the rules of probability, you can take those calculated P(A|B) and recalculate the P(B|A) from them in many different ways, and the result must always be the same. This is not true if you pick the P(B|A) independently, unless you're exceedingly lucky in your choice.

If this is the case, your proof should be trivial -- provide an example of some set of P(B|Ax) which does not have a corresponding set of P(Ax|B) which satisfy the constraints on P(Ax|B). [Clearly, you'll also need to list the appropriate constraints, since from my point of view that's arbitrary.] Note that I'll grant your point for some examples involving the probabilities 1 or 0, but I've explicitly rejected your point for all other probabilities.

In my opinion, it will be simple for me to show this is not the case for any such set. [Though, example sets which have high cardinality will tend to require more calculations than examples involving smaller sets of Ax.]

Is there an example involving P(B|A1) and P(B|A2) you can show me?

> > More generally, if "knowing something about" a dependent variable meant that some other variable couldn't be independent, then we'd never be able to have any independent variables. It's always the case that when you are able to find the value of an independent variable you know something about any associated dependent variables.
>
> This has nothing to do with our problem. In our problem, the P(B|A) depend on the P(A|B), just as the P(A|B) depend on the P(B|A). Give me any one family and the other family can be deduced to some extent.

P(B|A1): 0.75
P(B|A2): 0.25

I'll grant that you can provide non-empty sets of potentially valid answers to the questions:

A1 is a feature which contains A2 -- what's P(A1|B) and P(A2|B)?
A2 is a feature which contains A1 -- what's P(A1|B) and P(A2|B)?

This is because I see P(Ax|B) as dependent on P(B|Ax). What I don't see is how this makes P(B|Ax) not be independent variables.

If I grant that 1 and 0 are invalid probabilities by definition (they require certainty about what spammers will do and/or about what information you'll need to hear about), will you grant my point?

-- Raul
From: Laird B. <lb...@us...> - 2004-03-30 06:50:27
|
On Mar 30 2004, Raul Miller wrote:
> > It's a circular consistency requirement. You have got to know
> > something about the P(A|B) before you can choose the P(B|A), and
> > you've got to know the P(B|A) to verify that the P(A|B) are consistent.
> > It's a fact of life (or rather, probability theory).
>
> We don't have to know P(A|B) to find P(B|A), we just have to know
> the number of documents in each set which contain the relevant feature.

Not again! We've established(*) that if you pick the P(B|A)
independently, you don't generally obtain a probability.

You must pick the family P(B|A), where B and A vary appropriately,
in a consistent way. Picking the P(B|A) independently does not guarantee
consistency.

You can pick all the P(B|A) together according to carefully picked
formulas in terms of the number of documents and whatever else is
relevant. Only if you dump multiword features are the restrictions
weak enough to pick the P(B|A) effectively independently as you're
trying to do.

(*) otherwise, let's agree to disagree and stop this line of discussion.

> It's true that with this information, and some additional information
> (the cardinality of the B set), we could find P(A|B), but I don't see
> that this imposes any kind of circular consistency requirement.

Perhaps circular is the wrong word. You cannot pick the P(B|A) unless
you know that their effect on the subsequently calculated P(A|B) is
acceptable. Moreover, using the rules of probability, you can take
those calculated P(A|B) and recalculate the P(B|A) from them in many
different ways, and the result must always be the same. This is not true
if you pick the P(B|A) independently, unless you're exceedingly lucky in
your choice.

> More generally, if "knowing something about" a dependent variable meant
> that some other variable couldn't be independent, then we'd never be able
> to have any independent variables. It's always the case that when you
> are able to find the value of an independent variable you know something
> about any associated dependent variables.

This has nothing to do with our problem. In our problem, the P(B|A)
depend on the P(A|B), just as the P(A|B) depend on the P(B|A). Give me
any one family and the other family can be deduced to some extent.

--
Laird Breyer. |
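To make the consistency worry concrete, here is a minimal sketch (Python;
all numbers invented, and plain co-occurrence of two words stands in for
crm114's windowed multiword features). With a 0.5 prior and conditional
independence assumed, the single-word choices already force the value of
P(spam | both words present), so that value cannot also be picked freely:

    # Toy version of the consistency problem (all numbers invented).
    # Two single-word features w1, w2 with independently picked spam
    # probabilities, a 0.5 prior, and conditional independence assumed.
    p_spam_w1 = 0.8   # picked independently
    p_spam_w2 = 0.8   # picked independently

    # With a 0.5 prior, P(spam|w)/P(notspam|w) equals the likelihood
    # ratio P(w|spam)/P(w|notspam), so the chain rule fixes the posterior
    # for the co-occurrence feature "w1 and w2":
    odds = (p_spam_w1 / (1 - p_spam_w1)) * (p_spam_w2 / (1 - p_spam_w2))
    forced = odds / (1 + odds)
    print(forced)  # ~0.9412 -- already determined by the single-word picks

    # Independently imposing some other value, say 0.5, for the multiword
    # feature would contradict the number above: the kind of restriction
    # the thread is arguing about.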
From: Raul M. <mo...@ma...> - 2004-03-30 05:59:03
|
> > The set of P(B|A) represents independent probability variables for
> > arbitrary B and A. The set of P(A|B) represents probability variables
> > which are not independent of each other. P(A|B) is derivable knowing
> > P(B|A) and P(B).

On Tue, Mar 30, 2004 at 10:50:01AM +1000, Laird Breyer wrote:
> You are still supposing that the numbers P(B|A) of interest can be
> chosen independently first. But if you do, then the family P(A|B) derived
> from them is broken. The only way to ensure the P(A|B) are not broken is
> to *not* take the P(B|A) independently.
>
> It's a circular consistency requirement. You have got to know
> something about the P(A|B) before you can choose the P(B|A), and
> you've got to know the P(B|A) to verify that the P(A|B) are consistent.
> It's a fact of life (or rather, probability theory).

We don't have to know P(A|B) to find P(B|A), we just have to know
the number of documents in each set which contain the relevant feature.

It's true that with this information, and some additional information
(the cardinality of the B set), we could find P(A|B), but I don't see
that this imposes any kind of circular consistency requirement.

More generally, if "knowing something about" a dependent variable meant
that some other variable couldn't be independent, then we'd never be able
to have any independent variables. It's always the case that when you
are able to find the value of an independent variable you know something
about any associated dependent variables.

--
Raul |
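The count-based recipe above, as a minimal sketch (Python; the counts are
hypothetical, and the corpus class mix is assumed to stand in for the
prior):

    # Estimating P(spam | feature) from document counts alone.
    # The counts are hypothetical.
    n_spam_docs_with_feature = 120
    n_ham_docs_with_feature  = 40

    p_spam_given_feature = n_spam_docs_with_feature / (
        n_spam_docs_with_feature + n_ham_docs_with_feature)
    print(p_spam_given_feature)  # 0.75

    # Deriving P(feature | spam) additionally needs the total size of
    # the spam corpus -- the "cardinality of the B set" in the thread:
    n_spam_docs_total = 600
    p_feature_given_spam = n_spam_docs_with_feature / n_spam_docs_total
    print(p_feature_given_spam)  # 0.2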
From: Laird B. <lb...@us...> - 2004-03-30 00:50:07
|
> Here's the way I'm seeing it:
>
> The set of P(B|A) represents independent probability variables for
> arbitrary B and A. The set of P(A|B) represents probability variables
> which are not independent of each other. P(A|B) is derivable knowing
> P(B|A) and P(B).

You are still supposing that the numbers P(B|A) of interest can be
chosen independently first. But if you do, then the family P(A|B) derived
from them is broken. The only way to ensure the P(A|B) are not broken is
to *not* take the P(B|A) independently.

It's a circular consistency requirement. You have got to know
something about the P(A|B) before you can choose the P(B|A), and
you've got to know the P(B|A) to verify that the P(A|B) are consistent.
It's a fact of life (or rather, probability theory).

Note again that the inconsistency only shows up if you deal with
features consisting of several words. If all you ever have are one-word
features, then the P(A|B) are not broken and you can indeed choose your
P(B|A) in complete freedom.

> You've been saying that because the variables associated with P(A|B)
> are not independent that there is no underlying probability model.
> I've been saying that because the variables associated with P(B|A)
> are independent, there is an underlying probability model.

That's the wrong way to put it. Once you have an underlying probability
model, you can derive from *it* all the P(A|B), P(B|A) you ever care to
know. The quantities are completely fixed by the model. If you don't
have an underlying model, then the quantities P(A|B) and P(B|A) obtained
from Bayesian formulas are meaningless.

model --> P(A|B), P(B|A).

What you're trying to do is to say:

impose P(B|A) --> discover model consistent with imposed P(B|A)
--> P(A|B), P(B|A).

This works, but only if you restrict yourself when imposing P(B|A).
Independence is not a sufficient restriction for multiword features.

> More specifically, the initial P(B) is 0.5, P(B|A) are the independent
> variables, while P(A|B) and P(B)' are derived.
>
> Do you see some flaw in this point of view?

Yes. There are no initially independent variables. This is an
inconsistent assumption which causes trouble down the road.

The iterative formula used by crm114 is supposed to be used only if a
model exists, but no model (*) is consistent with the assumption that
the P(B|A) are all independent. So the iterative formula is applied
without justification, and probably leads who knows where. It might
work in a couple of cases for completely unrelated reasons, but it
won't work consistently in all cases.

(*) you might want to read the exchange I'm having with Christian
Siefkes, who is trying to get around the difficulties in a promising
way. He builds a parallel universe D_5 in which features are independent
exactly as you assume, but (I'm fairly certain) he's in trouble when he
wants to bring the result back into the real universe D.

> I think you've said that the chain rule would have to be applied
> differently if this were the case? If that's the issue, could you
> elaborate a bit?

It's not the issue; it was a remark about the way you were interpreting
the Plocalspam(w) probabilities. I was stating that they have the form
Plocalspam(feature) = P(feature|spam), while you seemed to say
Plocalspam(feature) = P(spam|feature). If you had been right, then the
formula

P'(spam) = Plocalspam(feat)P(spam)/(...)

would simply be wrong. Only by choosing
Plocalspam(feature) = P(feature|spam) do you get the stated form.
--
Laird Breyer. |
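For reference, a minimal sketch of the iterative per-feature update being
debated (Python; the local probabilities are invented, and this compresses
crm114's actual machinery into the bare two-class chain rule):

    # One Bayesian chain-rule pass of the kind discussed in the thread:
    # start from P(spam) = 0.5 and fold in one feature at a time.
    # The local probabilities below are invented; real ones would come
    # from the .css statistics files.
    def update(p_spam, p_feat_spam, p_feat_ham):
        """P'(spam) = P(f|spam)P(spam) /
                      (P(f|spam)P(spam) + P(f|ham)P(ham))."""
        num = p_feat_spam * p_spam
        den = num + p_feat_ham * (1.0 - p_spam)
        return num / den

    # (P(f|spam), P(f|ham)) for three features
    features = [(0.30, 0.05), (0.10, 0.20), (0.25, 0.02)]
    p = 0.5
    for p_fs, p_fh in features:
        p = update(p, p_fs, p_fh)
    print(p)  # ~0.974 after all three features

The step treats each updated posterior as the prior for the next feature;
whether that chain is justified without an underlying model is precisely
what the thread is contesting.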
From: Laird B. <lb...@us...> - 2004-03-30 00:08:46
|
On Mar 29 2004, Christian Siefkes wrote:
> > Let M be the 2x|D| matrix, where the row label represents
> > {spam,notspam} and the column label represents an enumeration of the
> > documents t \in D.
> >
> > M_st =(def) Prob(d is "s" | d = t) =(def) P(T is "s" | T = \phi(t)),
> >
> > Let N be the |D|x2 matrix, where the row label is a document and the
> > column label is in {spam,notspam}:
> >
> > N_tr =(def) Prob(d = t | d is "r") = \prod_k P(T_k = \phi(t)_k | T is "r")
>
> This looks more or less as if you're applying the transformation function
> \phi to each feature t_k in the document t? But you can only apply this
> function to the whole document t, otherwise you'll get wrong results...

Sorry, this was meant to be as succinct as possible, but it is what you
expect. If \phi(t) = U, then \phi(t)_k = U_k where U = (U_1,...,U_|D_5|).
Each element of the matrix N_{tr} has a fixed full document t, and a
fixed label r. I could have written this as

N_{tr} = \prod_k P( G_k = T_k | G is "r"), where T = \phi(t) = (T_1,...,T_n)

> > The last product is because on D_5, the terms T_k are independent,
> > conditionally on the spam/notspam class. I'm only defining these
> > matrices so I don't have to write long formulas.
> >
> > If the measure Prob on D exists and is a probability, then we must
> > certainly have
> >
> > Prob(d is "s") = \sum_t Prob(d is "s"| d = t) Prob(d = t)
>
> What exactly is it you're postulating here?

I am saying: if a probability measure Prob exists on D, then such a
measure follows all the usual rules of probability theory. The above is
sometimes called the law of total probability, and has the form

Prob(A) = \sum_i Prob(A|B_i)Prob(B_i).

This equation is true for all probabilities on D, whenever the B_i form
an exhaustive partition of D. This is the case with the choice
B_t = {d = t}.

> > = \sum_{t,r} Prob(d is "s"| d = t)Prob(d = t|d is "r")Prob(d is "r")
> >
> > v = MNv
> >
> > where v is the column vector v = (Prob(d is "spam"), Prob(d is "notspam"))^T
>
> What's T here? The set of features T_k in the document T = \phi(t) ?

Oops, that T is for "transpose", because v is a column vector and I
can't write columns in an email. Sorry, you can probably ignore that.
I'm also using a latex convention that _ indicates a subscript, ^
indicates a superscript. Writing mathematical notation in emails is
terribly inefficient.

--
Laird Breyer. |
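The identity being invoked, as a minimal numeric sketch (Python; the
document probabilities are invented):

    # Law of total probability on a toy document space: Prob(spam) must
    # equal the average of Prob(spam | d = t) weighted by Prob(d = t),
    # since the events {d = t} partition the space. Numbers invented.
    docs = {  # t: (Prob(d = t), Prob(spam | d = t))
        "t1": (0.5, 0.9),
        "t2": (0.3, 0.2),
        "t3": (0.2, 0.4),
    }
    p_spam = sum(p_t * p_s for p_t, p_s in docs.values())
    print(p_spam)  # 0.59 -- any genuine measure must satisfy this identity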
From: Raul M. <mo...@ma...> - 2004-03-29 14:59:58
|
> > Ok, well, I'm pretty sure that's backwards.

On Mon, Mar 29, 2004 at 04:07:55PM +1000, Laird Breyer wrote:
> I don't think so. If the local probabilities have another
> interpretation, then the Bayesian chain rule would have to be applied
> differently.
>
> But you can also check the plateau paper again, on p. 2 the
> expression is
>
> P(in class|feature) = P(feat|in class)P(in class) /
>   ( P(feat|in class)P(in class) + P(feat|not in class)P(not in class) )
>
> and if you compare with the code in crm114.c, lines 9141 and 9073,
> you see that
>
> P(feat|in class) = Plocal-spam(feat),
>
> which is what I'm saying at the top of this email.

Hmm... ok, I didn't pay enough attention to the paper.

> > Either that, or I don't understand the P(|) notation.
>
> Possible. Here's the normal definition: P(A|B) = P(A and B)/P(B), and
> note that this isn't symmetrical in A and B.

That's not the problem -- that's what I thought it meant.

> It seems that you've simply misunderstood the notation, which would
> explain our difficulties communicating. It looks like you've
> interpreted P(A|B) to mean something like P(B|A).

It's more subtle than that -- I think I understand the P(|) notation.
Here's the way I'm seeing it:

The set of P(B|A) represents independent probability variables for
arbitrary B and A. The set of P(A|B) represents probability variables
which are not independent of each other. P(A|B) is derivable knowing
P(B|A) and P(B).

You've been saying that because the variables associated with P(A|B)
are not independent, there is no underlying probability model.
I've been saying that because the variables associated with P(B|A)
are independent, there is an underlying probability model.

More specifically, the initial P(B) is 0.5, P(B|A) are the independent
variables, while P(A|B) and P(B)' are derived.

Do you see some flaw in this point of view?

I think you've said that the chain rule would have to be applied
differently if this were the case? If that's the issue, could you
elaborate a bit?

--
Raul |
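A minimal sketch of the inversion Raul describes (Python; numbers
invented). Note that going from P(B|A) to P(A|B) this way also requires
P(A), e.g. from document counts, not just P(B):

    # Bayes inversion: from P(spam|feature) back to P(feature|spam).
    # All values invented for illustration.
    p_b = 0.5            # prior: P(spam)
    p_b_given_a = 0.75   # independently picked: P(spam | feature)
    p_a = 0.12           # P(feature), e.g. from document counts

    p_a_and_b = p_b_given_a * p_a   # P(feature and spam)
    p_a_given_b = p_a_and_b / p_b   # P(feature | spam)
    print(p_a_given_b)              # 0.18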
From: Christian S. <si...@mi...> - 2004-03-29 14:55:06
|
Hi Laird,

On Sun, 28 Mar 2004, Laird Breyer wrote:
> Here's the proof (a bit long, so I didn't want to include it earlier).

Hm, there are some points where I'm not sure what you mean. Not sure if
we'll find the time to sort this out, but maybe...

> Let's first assume that D has finite cardinality |D|. This is the case
> if the vocabulary is finite and documents have a maximum size. We can
> take the limit afterwards.
>
> Claim: Assume |D| is finite. Define for any t \in D
>
> Prob( t is spam | t = (t_1,...,t_n) ) =(def) P( T is spam | T = \phi(t_1,...,t_n) )
>
> Then there exists no probability measure "Prob" defined on D such
> that the above are conditional probabilities for all documents t \in D.
>
> Proof:
>
> Let M be the 2x|D| matrix, where the row label represents
> {spam,notspam} and the column label represents an enumeration of the
> documents t \in D.
>
> M_st =(def) Prob(d is "s" | d = t) =(def) P(T is "s" | T = \phi(t)),
>
> Let N be the |D|x2 matrix, where the row label is a document and the
> column label is in {spam,notspam}:
>
> N_tr =(def) Prob(d = t | d is "r") = \prod_k P(T_k = \phi(t)_k | T is "r")

This looks more or less as if you're applying the transformation function
\phi to each feature t_k in the document t? But you can only apply this
function to the whole document t, otherwise you'll get wrong results...

> The last product is because on D_5, the terms T_k are independent,
> conditionally on the spam/notspam class. I'm only defining these
> matrices so I don't have to write long formulas.
>
> If the measure Prob on D exists and is a probability, then we must
> certainly have
>
> Prob(d is "s") = \sum_t Prob(d is "s"| d = t) Prob(d = t)

What exactly is it you're postulating here?

> = \sum_{t,r} Prob(d is "s"| d = t)Prob(d = t|d is "r")Prob(d is "r")
>
> v = MNv
>
> where v is the column vector v = (Prob(d is "spam"), Prob(d is "notspam"))^T

What's T here? The set of features T_k in the document T = \phi(t) ?

> and we are summing over the real documents t \in D only.
>
> Let A = MN, then v is an eigenvector with eigenvalue 1, and v has the
> special form v = (a, 1 - a)^T with 0 <= a <= 1. Now
>
> v_i = \sum_j A_ij v_j, and \sum_i v_i = 1 gives
>
> \sum_j (\sum_i A_ij) v_j = 1, which simplifies to
>
> (*) a (\sum_i A_{i,spam}) + (1-a) (\sum_i A_{i,notspam}) = 1.
>
> The elements of A are of course all nonnegative, and I claim there is
> no solution for 0 <= a <= 1. If no such a exists, then there is no
> measure Prob on D.
>
> Here is why no such a exists: Let M_5 and N_5 be defined exactly like
> M and N, but on D_5, with respect to the probability measure P which
> we agree exists. The only difference is that on D_5, there are many
> more documents, but still a finite number.
>
> Let A_5 = (M_5)(N_5). You can see that \sum_i (A_5)_{i,spam} = 1,
> because
>
> \sum_i (A_5)_{i,spam} =
>   \sum_{T \in D_5} [ P(spam|T)P(T|spam) + P(notspam|T)P(T|spam) ]
>   = \sum_{T \in D_5} P(T|spam) = 1.
>
> where we use P(notspam|T) = 1 - P(spam|T).
> Similarly, you can show that \sum_i (A_5)_{i,notspam} = 1.
>
> But now, notice that
>
> (**) 1 = \sum_i (A_5)_{i,r} > \sum_i A_{i,r},
>
> so in (*), both sums are < 1. It follows that no a with 0 <= a <= 1
> can exist. []
>
> Now that we know that no Prob measure which equals P can exist for D
> finite, consider the infinite case. Notice that the proof does not
> use the finite nature of D at all. The finite sums on t can be
> replaced by infinite sums, and can be interchanged by Fubini's
> theorem, because all terms are nonnegative.
> All we need is to be able to enumerate the t's on D and D_5. Moreover,
> the inequality in (**) is strict even in the infinite case (easy, just
> use a nonreal document in D_5 with positive P probability). So the
> same proof works in the infinite case.

------------ Christian Siefkes -----------------------------------------
| Email: chr...@si... | Web: http://www.siefkes.net/
| Graduate School in Distributed IS: http://www.wiwi.hu-berlin.de/gkvi/
-------------------- Offline P2P: http://www.leihnetzwerk.de/ ----------
The documentation said "For Windows 95 or better", so I used Linux... ;-) |
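The key inequality (**) can be checked numerically. A minimal sketch
(Python; two invented binary features instead of a window of five, and one
feature vector arbitrarily declared "not a real document" to play the role
of D_5 \ D):

    from itertools import product

    # P(T_k = 1 | class) for two conditionally independent binary
    # features; values invented for illustration.
    p_feat = {"spam": (0.7, 0.2), "notspam": (0.1, 0.4)}
    classes = ("spam", "notspam")

    def likelihood(t, c):
        """N_tr = prod_k P(T_k = t_k | class c)."""
        out = 1.0
        for k, bit in enumerate(t):
            p1 = p_feat[c][k]
            out *= p1 if bit else (1.0 - p1)
        return out

    def posterior(t, c):
        """M_st = P(class c | T = t), with a 0.5/0.5 prior."""
        num = 0.5 * likelihood(t, c)
        den = sum(0.5 * likelihood(t, r) for r in classes)
        return num / den

    D5 = list(product((0, 1), repeat=2))  # all four feature vectors
    D  = D5[:3]                           # pretend (1,1) is not a real document

    for domain, name in ((D5, "D_5"), (D, "D")):
        for r in classes:
            # column sum \sum_i A_{i,r} = \sum_t \sum_s P(s|t) P(t|r)
            colsum = sum(posterior(t, s) * likelihood(t, r)
                         for t in domain for s in classes)
            print(name, r, round(colsum, 4))
    # Over D_5 both column sums are exactly 1; over the strict subset D
    # they drop below 1 (0.86 and 0.96 here), which is inequality (**).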
From: Laird B. <lb...@us...> - 2004-03-29 06:08:00
|
On Mar 29 2004, Raul Miller wrote:
> > crm114 says:
> > Assume Plocal-spam(w) to mean that
> >
> > Plocal-spam(w) = P({document d contains w} | document d is spam)
>
> Ok, well, I'm pretty sure that's backwards.

I don't think so. If the local probabilities have another
interpretation, then the Bayesian chain rule would have to be applied
differently.

But you can also check the plateau paper again, on p. 2 the
expression is

P(in class|feature) = P(feat|in class)P(in class) /
  ( P(feat|in class)P(in class) + P(feat|not in class)P(not in class) )

and if you compare with the code in crm114.c, lines 9141 and 9073,
you see that

P(feat|in class) = Plocal-spam(feat),

which is what I'm saying at the top of this email.

> Either that, or I don't understand the P(|) notation.

Possible. Here's the normal definition: P(A|B) = P(A and B)/P(B), and
note that this isn't symmetrical in A and B.

> > left. In words, Plocal-spam(w) means "if a document is spam, then
> > what's its probability of containing w?".
>
> That may be what the documentation says, but examination of the
> associated formula shows that Plocal-spam(w) really means "if a
> document contains w, what is the probability that it is spam?"

It seems that you've simply misunderstood the notation, which would
explain our difficulties communicating. It looks like you've
interpreted P(A|B) to mean something like P(B|A).

--
Laird Breyer. |
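A minimal numeric sketch of that asymmetry (Python; the joint
probabilities are invented):

    # P(A|B) = P(A and B)/P(B) is not symmetric in A and B.
    # Toy joint distribution, numbers invented:
    p_a_and_b = 0.06   # P(doc contains w AND doc is spam)
    p_a = 0.10         # P(doc contains w)
    p_b = 0.40         # P(doc is spam)

    print(p_a_and_b / p_b)  # P(A|B) = P(contains w | spam) = 0.15
    print(p_a_and_b / p_a)  # P(B|A) = P(spam | contains w) = 0.6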
From: Raul M. <mo...@ma...> - 2004-03-29 04:54:42
|
On Mon, Mar 29, 2004 at 02:40:53PM +1000, Laird Breyer wrote:
> crm114 says:
> ------------
> Assume Plocal-spam(w) to mean that
>
> Plocal-spam(w) = P({document d contains w} | document d is spam)

Ok, well, I'm pretty sure that's backwards. Either that, or I don't
understand the P(|) notation.

> left. In words, Plocal-spam(w) means "if a document is spam, then
> what's its probability of containing w?".

That may be what the documentation says, but examination of the
associated formula shows that Plocal-spam(w) really means "if a document
contains w, what is the probability that it is spam?"

--
Raul |