From: Christian S. <si...@mi...> - 2004-04-21 14:59:58
Bill:

On Wed, 21 Apr 2004, Bill Yerazunis wrote:

> Last things first: Is the WPCW speedup due to fewer features
> generated, or a smaller .css file fitting into cache?

I'm quite sure it's mainly due to the number of generated features.
WPCW generates only 1/4 of the features of SBPH (for a window size of
5), and run time should be roughly linear in the number of features
because one feature is processed after the other.

I don't think the size of the feature cache matters that much, because
the algorithm still has to look up each feature (and if it finds it
isn't there, it has to add it and discard another). I tried WPCW with
400 to 700K features, and it took 26min for 400, 500, and 700K
(strangely, for 600K it was 32min, but that was probably on another
computer -- I'm using two for my test runs).

> I'm going ahead and implementing the word-pairs features. However,
> I'm a bit troubled by the name <wpcw>
>
> I seem to recall that Walsh-Hadamard transforms looked sorta like
> those features (specifically, diagonalized Walsh-Hadamard).
>
> Does that make any sense? Currently I'm using the classify/learn
> keyword <walsh> to indicate "not <markov>" but I'm not sure if that's
> accurate enough or not... any preferable suggestions? Should I stay
> with <walsh> or use <wpcw> (Word Pairs Context Window)? Easy to
> change now... hard to change later.

I suppose "Word Pairs Context Window" is clearer than "Word Pairs with
a Common Word", but both sound somewhat strange. I thought about
"sparse bigrams", since bigram is the scientific term for word pair
(but then we still need a handy acronym). What do you think, Fidelis?

> But what further can be done? Well, clearly you can add layers to the
> perceptron network; a particularly nice implementation is the Hopcroft
> network which is a rectangular system that allows feedback from
> arbitrary summations back into the network.

Can you provide pointers to literature about Hopcroft networks and
diagonalized Walsh-Hadamard? Need to catch up before I can discuss
this...

> If I drop down to 128 slots on the input of the Hopcroft, then
> it's only 31 megs... close enough to the current 24 that I can
> handwave it away. :)

Aren't the current CSS files 12MB each (24MB for both spam+nonspam, but
not for one)? So is this for the whole network or for a single class?

Bye
Christian

------------ Christian Siefkes -----------------------------------------
| Email: chr...@si... | Web: http://www.siefkes.net/
| Graduate School in Distributed IS: http://www.wiwi.hu-berlin.de/gkvi/
-------------------- Offline P2P: http://www.leihnetzwerk.de/ ----------
What chaos is left in modern society is a precious commodity. We have
to be careful to conserve it...
-- Tom DeMarco and Timothy Lister, Peopleware
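
Addendum: the 1/4 feature ratio claimed above for a window size of 5
can be checked with a small sketch. It assumes that WPCW pairs the
current word with each of the up to four preceding words (recording
the gap), and that SBPH combines the current word with every subset of
those preceding words. The function names and the tuple layout are
hypothetical and chosen only for illustration; this is not the CRM114
implementation.

```python
from itertools import combinations


def wpcw_features(tokens, window=5):
    """Sparse word pairs (assumed WPCW): current word plus one earlier word."""
    feats = []
    for i, tok in enumerate(tokens):
        for dist in range(1, window):
            j = i - dist
            if j < 0:
                break  # ran off the start of the text
            # (earlier word, gap between the two words, current word)
            feats.append((tokens[j], dist, tok))
    return feats


def sbph_features(tokens, window=5):
    """Assumed SBPH: current word plus every subset of the earlier words."""
    feats = []
    for i, tok in enumerate(tokens):
        prev = range(max(0, i - window + 1), i)
        for r in range(len(prev) + 1):
            for subset in combinations(prev, r):
                # Each entry is (distance back from current word, word).
                context = tuple((i - j, tokens[j]) for j in subset)
                feats.append(context + ((0, tok),))
    return feats


def per_token_count(feature_fn, tokens, window=5):
    """Features contributed by the last token once the window is full."""
    full = len(feature_fn(tokens, window))
    without_last = len(feature_fn(tokens[:-1], window))
    return full - without_last


if __name__ == "__main__":
    words = "the quick brown fox jumps over".split()
    print("WPCW per token:", per_token_count(wpcw_features, words))
    print("SBPH per token:", per_token_count(sbph_features, words))
```

Under these assumptions each token with a full window contributes
window - 1 pair features versus 2^(window - 1) SBPH features, which for
a window of 5 matches the 1/4 ratio and, if run time is roughly linear
in the feature count, the observed speedup.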