From: Christian S. <si...@mi...> - 2004-04-21 14:59:58
Bill:

On Wed, 21 Apr 2004, Bill Yerazunis wrote:

> Last things first: Is the WPCW speedup due to fewer features generated, or a smaller .css file fitting into cache?

I'm quite sure it's mainly due to the number of generated features. WPCW generates only 1/4 of the features of SBPH (for a window size of 5), and run time should be roughly linear in the number of features, because one feature is processed after the other. I don't think the size of the feature cache matters that much, because the algorithm still has to look up each feature (and if it finds it isn't there, it has to add it and discard another). I tried WPCW with 400 to 700K features, and it took 26min for 400, 500, and 700K (strangely, for 600K it was 32min, but that was probably on another computer -- I'm using two for my test runs).

> I'm going ahead and implementing the word-pairs features. However, I'm a bit troubled by the name <wpcw>
>
> I seem to recall that Walsh-Hadamard transforms looked sorta like those features (specifically, diagonalized Walsh-Hadamard).
>
> Does that make any sense? Currently I'm using the classify/learn keyword <walsh> to indicate "not <markov>" but I'm not sure if that's accurate enough or not... any preferable suggestions? Should I stay with <walsh> or use <wpcw> (Word Pairs Context Window)? Easy to change now... hard to change later.

I suppose "Word Pairs Context Window" is clearer than "Word Pairs with a Common Word", but both sound somewhat strange. I thought about "sparse bigrams", since bigram is the scientific term for word pair (but then we still need a handy acronym). What do you think, Fidelis?

> But what further can be done? Well, clearly you can add layers to the perceptron network; a particularly nice implementation is the Hopcroft network which is a rectangular system that allows feedback from arbitrary summations back into the network.

Can you provide pointers to literature about Hopcroft networks and diagonalized Walsh-Hadamard? I need to catch up before I can discuss this...

> If I drop down to 128 slots on the input of the Hopcroft, then it's only 31 megs... close enough to the current 24 that I can handwave it away. :)

Aren't the current CSS files 12MB each (24MB for both spam+nonspam, but not for one)? So is this for the whole network or for a single class?

Bye
Christian

------------ Christian Siefkes -----------------------------------------
| Email: chr...@si... | Web: http://www.siefkes.net/
| Graduate School in Distributed IS: http://www.wiwi.hu-berlin.de/gkvi/
-------------------- Offline P2P: http://www.leihnetzwerk.de/ ----------
What chaos is left in modern society is a precious commodity. We have to be careful to conserve it...
-- Tom DeMarco and Timothy Lister, Peopleware
From: Bill Y. <ws...@me...> - 2004-04-21 14:01:39
Fidelis, Christian, Shalen:

[excuse the rambling - I'm doing a brain dump so that I can clarify things to both myself and others]

Last things first: Is the WPCW speedup due to fewer features generated, or a smaller .css file fitting into cache? It does matter (in an engineering sense), so is there any way to determine this? If you ran with a 1-megaslot (12-megabyte) .css file, do you still have the same speedup in WPCW?

-----

I'm going ahead and implementing the word-pairs features. However, I'm a bit troubled by the name <wpcw>

I seem to recall that Walsh-Hadamard transforms looked sorta like those features (specifically, diagonalized Walsh-Hadamard).

Does that make any sense? Currently I'm using the classify/learn keyword <walsh> to indicate "not <markov>" but I'm not sure if that's accurate enough or not... any preferable suggestions? Should I stay with <walsh> or use <wpcw> (Word Pairs Context Window)? Easy to change now... hard to change later.

In any case, since there _are_ people who need more blazing speed than they need absolute accuracy, <walsh> or <wpcw> or whatever will be in the next major release. :)

-----

As to Winnow... I was doing some reading on it, and it's a perceptron-based algorithm. See:

www.cs.princeton.edu/courses/archive/spring03/cs511/scribe_notes/0325.ps

for a nice bit of description of Winnow. Note that if your features are WPCW, SBPH, or other, and you then feed that into Winnow, you now have a multilayer network and so the perceptron linearity theorem no longer applies, at least if the window in the first layer can bridge the tokens you're trying to XOR.

But what further can be done? Well, clearly you can add layers to the perceptron network; a particularly nice implementation is the Hopcroft network, which is a rectangular system that allows feedback from arbitrary summations back into the network.

-----

Now, the bizarre thing was that on Saturday I was talking to Penney about neural networks and brainstorming how we could maybe shoehorn the feature set into the front end of a Hopcroft-style recursive neural network.

Now, <walsh> can cut the feature set down, maybe even to 1/10th the size of full Markov. That means I have 10 times the slot space available (40 bytes instead of 4). Almost enough...

If I bite the bullet and go with a window length of 3 tokens, then I use 250K slots, and give each slot 256 bytes of weights (using signed char weights, probably a-law or mu-law encoded) to each of 256 inputs of a 256-by-256 (64 Kbyte) Hopcroft neural network; then I need 62 megs per .neu file.

If I drop down to 128 slots on the input of the Hopcroft, then it's only 31 megs... close enough to the current 24 that I can handwave it away. :)

Instead of one mu-law weight per neuron, we could have two bytes - a neuron number, and a mu-law weight. The advantage of this is that it scales - we can have 64 freely allocatable neurons in 31 megs. Or we could have 32 neurons in 16 megs, 16 neurons in 8 megs, 8 neurons in 4 megs....

Or we can get funky - if we use the H1 hash values as part of the neuron addressing scheme, we lose flexibility but may gain in the number of neurons we can cross-access. For example, if the feature hash H1 is 0xDEADBEEF, then this feature has weights for neurons 0xDE, 0xAD, 0xBE, and 0xEF, and the XORs of those, to wit:

0xDE xor 0xAD --> 1101 1110 xor 1010 1101 --> 0111 0011 --> 0x73

which gives us 16 arbitrarily and randomly chosen (but fixed) neurons per feature.

This gives a per-slot memory usage of 4 bytes (H1) + 16 weights + optional H2 = 20 or 24 bytes, compared to the 12 bytes we use right now. And if we only need 250K slots, then we're down to a 6-megabyte data file.

Opinions? Rants?

-Bill Yerazunis
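[The hash-derived neuron addressing above is easy to sketch. A minimal Java version follows; note that the four bytes of H1 plus their pairwise XORs yield only 10 distinct indices, so exactly how the count of 16 neurons per feature is reached is an open detail of the mail -- the sketch stops at the 10 that the 0xDEADBEEF example pins down.]

    public class NeuronAddressing {
        // Neuron indices touched by a feature with 32-bit hash h1: the four
        // hash bytes themselves, plus the XOR of each byte pair.
        static int[] neuronsForFeature(int h1) {
            int[] b = { (h1 >>> 24) & 0xFF, (h1 >>> 16) & 0xFF,
                        (h1 >>> 8) & 0xFF,  h1 & 0xFF };
            int[] out = new int[10];
            int k = 0;
            for (int x : b) out[k++] = x;          // 0xDE, 0xAD, 0xBE, 0xEF
            for (int i = 0; i < 4; i++)
                for (int j = i + 1; j < 4; j++)
                    out[k++] = b[i] ^ b[j];        // e.g. 0xDE ^ 0xAD == 0x73
            return out;
        }

        public static void main(String[] args) {
            for (int n : neuronsForFeature(0xDEADBEEF))
                System.out.printf("0x%02X ", n);
        }
    }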
From: Christian S. <si...@mi...> - 2004-04-21 12:15:50
On Wed, 21 Apr 2004, Christian Siefkes wrote:

> About the tokenization: The pattern I called "simplified tokenization" was as follows:
>
> - In CRM syntax:
>   [[:graph:]][-[:alnum:]]*[[:graph:]]?
>
> - In Java syntax (Unicode-based):
>   [^\p{Z}\p{C}][-\p{L}\p{M}\p{N}]*[^\p{Z}\p{C}]

Sorry, the trailing '?' got lost. This must read:

[^\p{Z}\p{C}][-\p{L}\p{M}\p{N}]*[^\p{Z}\p{C}]?

Bye
Christian

------------ Christian Siefkes -----------------------------------------
| Email: chr...@si... | Web: http://www.siefkes.net/
| Graduate School in Distributed IS: http://www.wiwi.hu-berlin.de/gkvi/
-------------------- Offline P2P: http://www.leihnetzwerk.de/ ----------
All I want is more than I deserve.
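[Since the pattern is given in Java syntax, the corrected version (trailing '?' restored) can be sanity-checked directly with java.util.regex; the sample text below is made up for illustration.]

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class TokenDemo {
        public static void main(String[] args) {
            // Simplified tokenization, with the trailing '?' restored:
            // one non-space/non-control char, then letters/marks/digits/
            // hyphens, then optionally one more non-space/non-control char.
            Pattern token = Pattern.compile(
                "[^\\p{Z}\\p{C}][-\\p{L}\\p{M}\\p{N}]*[^\\p{Z}\\p{C}]?");
            Matcher m = token.matcher("Voilà, e-mail tokens: foo-bar42.");
            while (m.find())
                System.out.println(m.group());
            // Prints: Voilà,  e-mail  tokens:  foo-bar42.
        }
    }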
From: Christian S. <si...@mi...> - 2004-04-21 12:08:57
Hi Fidelis:

On Mon, 19 Apr 2004, Fidelis Assis wrote:

> > I'd really like to try your idea with my algorithm so I'm interested in your code as well. Would you allow me to incorporate it into my software <http://www.inf.fu-berlin.de/inst/ag-db/software/ties/> using the LGPL license? (I cannot use the GPL due to projects I'm involved with.)
>
> For sure! You can use it as you like. I'll be glad if it helps in any way. Patch attached.

Thanks! In the end I didn't use any of your code because it would have been harder to translate it to my setup and Java than to recode. But I used your idea and it works just beautifully!!

Previously I had brought the number of errors from 19 to 18 by using a refined XML+header-aware tokenization pattern (more below). And by switching from SBPH with 1,600,000 features to your WPCW with 600,000 features I got further down to 16! (For comparison: a full one-million-bucket CSS file holds about 2.5M features; your 280K buckets should translate to 700K features, so that's roughly comparable.)

600K vs. 1.6M features might seem somewhat unfair to SBPH, because WPCW generates just 1/4 of the features. But actually it isn't, since I also tried SBPH with more [2M or 2.4M] features and that didn't increase performance; indeed the peak was at 1.6M.

Just as important, it brought the run time over all 41,500 mails down to about 30min on an old Pentium III 1266MHz (invoked using nice, i.e. with low priority, so as not to disturb the user working on the machine), while previously it was at least 90-120 minutes!

Shalen, Bill, and I are just writing a paper about the Winnow thing, and that's a great opportunity for presenting your WPCW as well. Since you invented this, I would propose that you become a co-author as well. What do you think?

----

Here are the old and new results:

SBPH+Winnow, 5% threshold thickness, 1.23 promotion, 0.83 demotion, simplified tokenization (1.6M features, LRU pruning, single pass):
  592 (number of errors on all 10x4147 mails)
   43 (number of errors on the last 10x1000 mails)
   20 (number of errors on the last 10x500 mails)

SBPH+Winnow on normalized mails (normalizemime), 5% threshold thickness, 1.23 promotion, 0.83 demotion, simplified tokenization (1.6M features, LRU pruning, single pass):
  562 (number of errors on all 10x4147 mails)
   39 (number of errors on the last 10x1000 mails)
   19 (number of errors on the last 10x500 mails)

SBPH+Winnow on normalized mails (normalizemime), 5% threshold thickness, 1.23 promotion, 0.83 demotion, XML+header-aware tokenization (1.6M features, LRU pruning, single pass):
  531 (number of errors on all 10x4147 mails)
   34 (number of errors on the last 10x1000 mails)
   18 (number of errors on the last 10x500 mails)

WPCW+Winnow on normalized mails (normalizemime), 5% threshold thickness, 1.23 promotion, 0.83 demotion, XML+header-aware tokenization (600K features, LRU pruning, single pass):
  514 (number of errors on all 10x4147 mails)
   33 (number of errors on the last 10x1000 mails)
   16 (number of errors on the last 10x500 mails)

---

About the tokenization: The pattern I called "simplified tokenization" was as follows:

- In CRM syntax:
  [[:graph:]][-[:alnum:]]*[[:graph:]]?

- In Java syntax (Unicode-based):
  [^\p{Z}\p{C}][-\p{L}\p{M}\p{N}]*[^\p{Z}\p{C}]

\p{?} are Unicode categories: [^\p{Z}\p{C}] means everything except whitespace and control chars; \p{L}, \p{M}, \p{N} are letters, marks [don't ask me what that means], and digits, respectively.

I modified this to make it more aware of XML/HTML markup + mail headers (the line breaks do not really exist, I only inserted them for readability):

[^\\p{Z}\\p{C}]
[/!?#]?
[-\\p{L}\\p{M}\\p{N}]*
(?:["'=;]|/?>|:/*)?

The purpose of the optional second group is to allow matching XML/HTML end tags ( </tag> ), Doctype declarations ( <!DOCTYPE ), processing instructions ( <?xml-stylesheet ), and character references ( &#8211; ) in a token. The last group has been modified to allow matching the ':' after mail headers, the '=' after XML/HTML attributes, the quotes around attribute values, the ';' terminating character + entity references, XML tags incl. empty tags ( <tag> <br/> ), and protocols such as "http://". On the other hand, punctuation marks such as '.' and ',' are consciously omitted, so normal words will be recognized no matter where in a sentence they occur; they are not "contaminated" by trailing punctuation.

Bye
Christian

------------ Christian Siefkes -----------------------------------------
| Email: chr...@si... | Web: http://www.siefkes.net/
| Graduate School in Distributed IS: http://www.wiwi.hu-berlin.de/gkvi/
-------------------- Offline P2P: http://www.leihnetzwerk.de/ ----------
There isn't an Internet company in the world that's going to fail because of mistakes. Internet companies make thousands of mistakes every week.
-- Candice Carpenter, cofounder of iVillage, February 1998
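[Joining the four pieces of the pattern back into one line, a quick Java check shows the kinds of tokens it was designed to capture; the sample text is made up for illustration.]

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class XmlAwareTokenDemo {
        public static void main(String[] args) {
            // The XML/HTML + mail-header aware pattern from above, on one line.
            Pattern token = Pattern.compile(
                "[^\\p{Z}\\p{C}][/!?#]?[-\\p{L}\\p{M}\\p{N}]*(?:[\"'=;]|/?>|:/*)?");
            Matcher m = token.matcher("Subject: see <br/> and </tag> at http://example.org");
            while (m.find())
                System.out.println(m.group());
            // Prints: Subject:  see  <br/>  and  </tag>  at  http://  example  .org
        }
    }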
From: Fidelis A. <fi...@po...> - 2004-04-19 14:07:52
Christian Siefkes wrote:

[...]

> SBPH+Winnow, 5% threshold thickness, 1.23 promotion, 0.83 demotion, simplified tokenization (1.6M features, LRU pruning, single pass):
>   592 (number of errors on all 10x4147 mails)
>    43 (number of errors on the last 10x1000 mails)
>    20 (number of errors on the last 10x500 mails)
>
> SBPH+Winnow on normalized mails (normalizemime), 5% threshold thickness, 1.23 promotion, 0.83 demotion, simplified tokenization (1.6M features, LRU pruning, single pass):
>   562 (number of errors on all 10x4147 mails)
>    39 (number of errors on the last 10x1000 mails)
>    19 (number of errors on the last 10x500 mails)

Fantastic, I want to see your code! Definitely, I need to learn Java and that Winnow algorithm...

[...]

> I'd really like to try your idea with my algorithm so I'm interested in your code as well. Would you allow me to incorporate it into my software <http://www.inf.fu-berlin.de/inst/ag-db/software/ties/> using the LGPL license? (I cannot use the GPL due to projects I'm involved with.)

For sure! You can use it as you like. I'll be glad if it helps in any way. Patch attached.

-- Fidelis
From: Christian S. <si...@mi...> - 2004-04-19 09:24:39
Hi Fidelis:

On Sun, 18 Apr 2004, Fidelis Assis wrote:

> > SBPH+Ultraconservative Winnow, 10% threshold, 1.6M features, LRU pruning, single pass:
> >   719 (number of errors on all 10x4147 mails)
> >    38 (number of errors on the last 10x500 mails)
>
> Very interesting. That number, 38 errors, is really impressive!

And it's possible to do better! Meanwhile I tried another variant of the algorithm, tuned some of the parameters, and slightly modified the tokenization patterns (using /[[:graph:]][-[:alnum:]]*[[:graph:]]?/ instead of /[[:graph:]][-.,:[:alnum:]]*[[:graph:]]?/ so domains will be split at dots etc.). This brought me down to 20 errors. That's on the raw mails, without decoding Base64 or anything like that. Using Jaakko Hyvatti's normalizemime for preprocessing brings a small further improvement, reducing the errors from 20 to 19. Now I think I've reached a point where further improvements will be very hard to achieve, but that number is quite impressive as it is...

SBPH+Winnow, 5% threshold thickness, 1.23 promotion, 0.83 demotion, simplified tokenization (1.6M features, LRU pruning, single pass):
  592 (number of errors on all 10x4147 mails)
   43 (number of errors on the last 10x1000 mails)
   20 (number of errors on the last 10x500 mails)

SBPH+Winnow on normalized mails (normalizemime), 5% threshold thickness, 1.23 promotion, 0.83 demotion, simplified tokenization (1.6M features, LRU pruning, single pass):
  562 (number of errors on all 10x4147 mails)
   39 (number of errors on the last 10x1000 mails)
   19 (number of errors on the last 10x500 mails)

> some results I think are interesting too, but with respect to speed. I tried to speed up classification by removing redundant features in SBPH and got a classification speedup of more than 2 times, with 1/4 of the disk space, while keeping the same accuracy. Maybe there's a possibility to merge the results and have better accuracy with faster classification speed.

...

> So, I used only Word Pairs with a Common Word (WPCW) as below:
>
>  -  -  - w4 w5  - distance: 0
>  -  - w3  - w5  - distance: 1
>  - w2  -  - w5  - distance: 2
> w1  -  -  - w5  - distance: 3

That's an amazing idea! Your results are fascinating indeed...

> Bill, can you comment on that and help me to clarify that point? I can prepare a patch to add WPCW to 20040409-BlameMarys and send it if you or somebody else wants to give it a try.

I'd really like to try your idea with my algorithm so I'm interested in your code as well. Would you allow me to incorporate it into my software <http://www.inf.fu-berlin.de/inst/ag-db/software/ties/> using the LGPL license? (I cannot use the GPL due to projects I'm involved with.)

Bye
Christian

------------ Christian Siefkes -----------------------------------------
| Email: chr...@si... | Web: http://www.siefkes.net/
| Graduate School in Distributed IS: http://www.wiwi.hu-berlin.de/gkvi/
-------------------- Offline P2P: http://www.leihnetzwerk.de/ ----------
From the point of view of a meaningful life, the entire work/leisure duality must be abandoned. As long as we are living our work or our leisure, we are not even truly living. Meaning cannot be found in work or leisure but has to arise out of the nature of the activity itself. Out of passion. Social value. Creativity.
-- Pekka Himanen, The Hacker Ethic
From: Fidelis A. <fi...@po...> - 2004-04-19 03:24:37
On Sun, 18 Apr 2004 22:10:11 -0400, Bill Yerazunis wrote:

> Fidelis:
>
> *fascinating*....
>
> I was talking today with Penney (my better half, etc) about how to trim the feature space down enough so that it could be the inputs to a neural net. Right now there's a million features, and that's far far too many.
>
> I thought that getting rid of the intermediate features would cause too much loss of accuracy. There just isn't space in a feature slot of reasonable size to do what's needed.
>
> But if you really _can_ ignore a lot of the polynomials, then that drops the dimensionality down enough to be reasonable as the first stage of a Hopcroft-style (feedback-capable) neural net.

Oh, neural nets... 18 years ago I was getting my master's degree in AI and used to play with Prolog, Lisp, natural language understanding, etc. At that time, some co-students were playing with neural nets and I was very impressed with the possibilities, but life took me in other directions and I had to concentrate on my job as a telecomm engineer and later as an internet servers support engineer. It's interesting to hear about neural nets again.

> And _that_ will kick butt. :)

I hope so :)

> -----
>
> Now, as to the difference between 56 and 70, I have no clue. But your set shows that you have a large standard deviation per run; I suppose I hit a good set.
>
> Did you try to replicate the run with the ten shuffles I sent out (or did I send them to you at all? Maybe not...)

No, you didn't. But I'd like to run the test with it.

> But your results vary all the way from 62 to 82 out of a 10-shuffle set.

I believe that it is an indication that the learning step is greater than necessary, which may cause unlearning of things already learned and greater oscillation/delay until convergence, if it converges at all. Some sequences may be more sensitive to that effect.

> Did you use the defaults for the build (same :lcr:, same default number of slots in the .css files?)

Yes, I used all the defaults. The only change, applied to the WPCW version, was smaller css files. I used the prime 282001 for the number of buckets - near 1/4 of the default size - because WPCW produces 4 times fewer features than SBPH.

> Fascinating, either way though.

I'm also excited about the possible results... :)

-- Fidelis Assis
From: Bill Y. <ws...@me...> - 2004-04-19 02:10:21
Fidelis:

*fascinating*....

I was talking today with Penney (my better half, etc) about how to trim the feature space down enough so that it could be the inputs to a neural net. Right now there's a million features, and that's far far too many.

I thought that getting rid of the intermediate features would cause too much loss of accuracy. There just isn't space in a feature slot of reasonable size to do what's needed.

But if you really _can_ ignore a lot of the polynomials, then that drops the dimensionality down enough to be reasonable as the first stage of a Hopcroft-style (feedback-capable) neural net.

And _that_ will kick butt. :)

-----

Now, as to the difference between 56 and 70, I have no clue. But your set shows that you have a large standard deviation per run; I suppose I hit a good set.

Did you try to replicate the run with the ten shuffles I sent out (or did I send them to you at all? Maybe not...)

But your results vary all the way from 62 to 82 out of a 10-shuffle set.

Did you use the defaults for the build (same :lcr:, same default number of slots in the .css files?)

Fascinating, either way though.

-Bill Yerazunis
From: Fidelis A. <fi...@po...> - 2004-04-19 01:00:51
Christian Siefkes wrote:

> Hi,
>
> On Sun, 11 Apr 2004, Bill Yerazunis wrote:
>
> > Using 3-pass TUNE, with 1, 3, 13, 75, 531 as weights, I get (and yes, this is the best ever) 47 errors out of 5000. That handily beats the previous record of 54, which is superexponential weighting on a TUNE.
> >
> > Fascinating- the same set of weights is WORSE in single pass but better in 3-pass.
>
> That's 58 vs. 60 errors in single pass? I suppose it's hard to conclude anything from such a small difference except that they perform very similarly... Maybe you could look at another/extended set, e.g. the last 1000 mails of each batch, to see whether the difference will be higher?
>
> > Here's the detailed extended report.... and my laptop is literally too hot to touch near the CPU, so I will let it cool a bit before I run base7 (that being 1, 7, 49, 343, 2401).
>
> ...
>
> > SBPH16EM - denominator 16, Super-Markovian entropic correction
> >   0 1 1125
> >   0 1 56
> > SB16-24SM - denom 16, 24 megabyte 512-chain .css, Super-Markovian
> >   0 1 1132
> >   0 1 58
> > 16-BCS - denom 16, 512-chain .css, Breyer-Chhabra-Siefkes weights
> >   0 1 1135
> >   0 1 60
>
> ...
>
> I've got a new result to add, and I'm quite excited about it:
>
> SBPH+Ultraconservative Winnow, 10% threshold, 1.6M features, LRU pruning, single pass:
>   719 (number of errors on all 10x4147 mails)
>    38 (number of errors on the last 10x500 mails)

Very interesting. That number, 38 errors, is really impressive!

I have some results I think are interesting too, but with respect to speed. I tried to speed up classification by removing redundant features in SBPH and got a classification speedup of more than 2 times, with 1/4 of the disk space, while keeping the same accuracy. Maybe there's a possibility to merge the results and have better accuracy with faster classification speed.

SBPH generates 16 features for each word in the text, combining it with the 4 previous ones (5-word window):

w1 w2 w3 w4 w5 =>

 -  -  -  - w5
 -  -  - w4 w5
 -  - w3  - w5
 -  - w3 w4 w5
 - w2  -  - w5
 - w2  - w4 w5
 - w2 w3  - w5
 - w2 w3 w4 w5
w1  -  -  - w5
w1  -  - w4 w5
w1  - w3  - w5
w1  - w3 w4 w5
w1 w2  -  - w5
w1 w2  - w4 w5
w1 w2 w3  - w5
w1 w2 w3 w4 w5

Clearly, there are many features that are contained in others. Bill's intention, as far as I understood, was exactly that: to get some kind of non-linearity and obtain greater discriminating power. I tried a different approach, since accuracy is already good enough for practical purposes, and considered only word pairs with a common word inside the window, instead of all combinations of the first four plus the last one. The idea behind this approach was to gain speed by working only with kind of "prime" features inside the window (more independent features), hoping that the chain rule would automatically consider all combinations and produce the same final effect. So, I used only Word Pairs with a Common Word (WPCW) as below:

 -  -  - w4 w5  - distance: 0
 -  - w3  - w5  - distance: 1
 - w2  -  - w5  - distance: 2
w1  -  -  - w5  - distance: 3

You can "or" them to get any of the 16 combinations in SBPH, except for the single-word combination, which I chose not to use. Closer pairs are given more credence, so the weights are based on the distance, instead of on the length as in SBPH/SM.

Test results: I used the same test methodology as described in Bill's Plateau papers:

- 10 shuffles of 4147 messages from the SpamAssassin corpora, classified as:
  - 1397 spam (20030228_spam_2.tar.bz2)
  - 2500 easy ham (20030228_easy_ham.tar.bz2)
  - 250 hard ham (20030228_hard_ham.tar.bz2)

Other information for the test:

- Weights for WPCW: 24, 14, 7, 4
- CSS file size for WPCW: 282,001 buckets (3,384,012 bytes)
- CSS file size for SBPH: 1,048,577 buckets (12,582,924 bytes)
- OS: Linux 2.4.20-30.9 - RedHat 9.0
- CPU: AMD Athlon(TM) XP 2400
- CRM version: 20040409-BlameMarys

I repeated the test 5 times, each time with a different set of 10 shuffles:

Errors/5000 msgs - only errors in the last 500 messages of a 10-shuffle set:
------------------------------------------------------------
Algorithm | run 1 | run 2 | run 3 | run 4 | run 5 | Average
------------------------------------------------------------
SBPH      |   69  |   70  |   62  |   71  |   82  |   70.8
------------------------------------------------------------
WPCW      |   62  |   72  |   65  |   80  |   73  |   70.4
------------------------------------------------------------

Errors/41470 msgs - all errors in a 10-shuffle set:
------------------------------------------------------------
Algorithm | run 1 | run 2 | run 3 | run 4 | run 5 | Average
------------------------------------------------------------
SBPH      |  1092 |  1100 |  1115 |  1086 |  1110 | 1100.6
------------------------------------------------------------
WPCW      |  1114 |  1126 |  1099 |  1081 |  1129 | 1109.8
------------------------------------------------------------

Classification time (seconds):
------------------------------------------------------------
Algorithm | run 1 | run 2 | run 3 | run 4 | run 5 | Average
------------------------------------------------------------
SBPH      |  3726 |  3824 |  3902 |  3755 |  3971 |  3836
------------------------------------------------------------
WPCW      |  1653 |  1672 |  1665 |  1694 |  1695 |  1675
------------------------------------------------------------

I considered the classification time of a message to be the time for "crm --stats_only mailfilter.crm < msg" to complete from within a Perl script.

In the test above, WPCW was 2.29 times faster than default CRM (SBPH with "super markovian" weights), while keeping basically the same accuracy - 70 errors in 5000 messages. Also, it allows us to use much smaller css files - a disk space reduction of up to 75% - which, in turn, contributed to the increase in classification speed. So, I think it's a good option to be added to the source code.

These results seem promising, but there's one thing I don't understand (I may have missed something or made some mistake): I was never able to get fewer than 62 errors in the last 500 messages, neither with default CRM nor with WPCW, but in Bill's Plateau99 he mentions 56 errors with SBPH and markovian 2^(2N) weights (I took that as "super markovian") and 69 for SBPH only. If default CRM uses SBPH with "super markovian", I should have had an average around 56 errors, but I got 70, which is similar to what Bill had for SBPH only. Bill, can you comment on that and help me to clarify that point?

I can prepare a patch to add WPCW to 20040409-BlameMarys and send it if you or somebody else wants to give it a try. Nevertheless, as I ran the same test with default CRM and with WPCW, I tend to believe that the results show important practical advantages of the latter approach over the former, and I'm planning on using it on my production server.

Also, there's room for exploring wider sliding windows, for better accuracy, since it's faster.

Note for the more theoretically driven guys on the list: I found those weights empirically and it's really a pain to look for the best ones when you don't have a model. I might have just found a local minimum and would like to have a way to do a smarter search or, even better, to have a model that helped in deducing the best weights. Any suggestions or comments would be much appreciated.

-- Fidelis Assis
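[The two feature sets are compact enough to generate in a few lines. Below is an illustrative Java sketch of feature generation over one 5-word window, mirroring the tables above; real implementations hash these patterns into buckets rather than keeping strings.]

    import java.util.ArrayList;
    import java.util.List;

    public class WindowFeatures {
        // SBPH: every subset of {w1..w4} combined with w5 -> 16 features,
        // including the single-word feature "- - - - w5" (empty subset).
        static List<String> sbph(String[] w) {
            List<String> feats = new ArrayList<>();
            for (int mask = 0; mask < 16; mask++) {
                StringBuilder f = new StringBuilder();
                for (int i = 0; i < 4; i++)
                    f.append((mask & (8 >> i)) != 0 ? w[i] : "-").append(' ');
                feats.add(f.append(w[4]).toString());
            }
            return feats;
        }

        // WPCW: only the four pairs (w4,w5) .. (w1,w5), one per distance
        // 0..3; the single-word feature is deliberately omitted.
        static List<String> wpcw(String[] w) {
            List<String> feats = new ArrayList<>();
            for (int dist = 0; dist < 4; dist++)
                feats.add(w[3 - dist] + " " + w[4] + " (distance " + dist + ")");
            return feats;
        }

        public static void main(String[] args) {
            String[] w = {"w1", "w2", "w3", "w4", "w5"};
            sbph(w).forEach(System.out::println);  // 16 features per window
            wpcw(w).forEach(System.out::println);  //  4 features per window
        }
    }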
From: Christian S. <si...@mi...> - 2004-04-13 14:33:29
Hi,

On Sun, 11 Apr 2004, Bill Yerazunis wrote:

> Using 3-pass TUNE, with 1, 3, 13, 75, 531 as weights, I get (and yes, this is the best ever) 47 errors out of 5000. That handily beats the previous record of 54, which is superexponential weighting on a TUNE.
>
> Fascinating- the same set of weights is WORSE in single pass but better in 3-pass.

That's 58 vs. 60 errors in single pass? I suppose it's hard to conclude anything from such a small difference except that they perform very similarly... Maybe you could look at another/extended set, e.g. the last 1000 mails of each batch, to see whether the difference will be higher?

> Here's the detailed extended report.... and my laptop is literally too hot to touch near the CPU, so I will let it cool a bit before I run base7 (that being 1, 7, 49, 343, 2401).

...

> SBPH16EM - denominator 16, Super-Markovian entropic correction
>   0 1 1125
>   0 1 56
> SB16-24SM - denom 16, 24 megabyte 512-chain .css, Super-Markovian
>   0 1 1132
>   0 1 58
> 16-BCS - denom 16, 512-chain .css, Breyer-Chhabra-Siefkes weights
>   0 1 1135
>   0 1 60

...

I've got a new result to add, and I'm quite excited about it:

SBPH+Ultraconservative Winnow, 10% threshold, 1.6M features, LRU pruning, single pass:
  719 (number of errors on all 10x4147 mails)
   38 (number of errors on the last 10x500 mails)

This is single-pass (as TOE), not multi-pass (as TUNE), thus it's a decrease in the error rate by about 1/3 (compared to the 56 errors reported by you for Super-Markovian entropic correction)!

I did not strictly use CRM114 for these results, but my own Java-based package which I'm developing mainly for information extraction <http://en.wikipedia.org/wiki/Information_extraction> (I now put a preliminary version at http://www.inf.fu-berlin.de/inst/ag-db/software/ties/ ). I started using CRM for my work, but I also looked at other incremental (i.e. single-pass) classification methods and finally developed a combination of the Winnow algorithm (cf. e.g. http://citeseer.ist.psu.edu/article/dagan97mistakedriven.html ) and the "ultraconservative" algorithms introduced in http://citeseer.ist.psu.edu/crammer01ultraconservative.html . I'm also using the "thick threshold" heuristic of dagan97mistakedriven, i.e. I train not only on errors but also on "almost-errors", when the scores of other classes are only slightly lower than the true score.

For feature preprocessing, I combined this classifier with (a primitive but usable re-implementation of) CRM's sparse binary polynomial hashing. For tokenization I used the same pattern as CRM. For the reported result I've used 1.6M features. CRM's CSS files contain 1M buckets by default, but if I can trust cssutil and cssdiff, a full (.097%) CSS file stores about 2.5-2.6M hashed datums = features -- so it should be a fair comparison, or did I get this wrong?

Another important difference is that I use LRU (least recently used) pruning instead of microgrooming. While microgrooming more-or-less randomly deletes a feature if the store is full (or so I understand), I delete the least recently seen feature (all other features were encountered after the victim).

Bye
Christian

------------ Christian Siefkes -----------------------------------------
| Email: chr...@si... | Web: http://www.siefkes.net/
| Graduate School in Distributed IS: http://www.wiwi.hu-berlin.de/gkvi/
-------------------- Offline P2P: http://www.leihnetzwerk.de/ ----------
Those who would give up essential liberty, to purchase a little temporary safety, deserve neither liberty nor safety.
-- Benjamin Franklin
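[A minimal sketch of the Winnow variant described above -- multiplicative weight updates with a thick threshold -- not Christian's actual TIES code. The promotion/demotion factors come from the runs reported in this thread; the class names, the margin test, and the default weight of 1 are illustrative assumptions.]

    import java.util.HashMap;
    import java.util.Map;

    public class ThickWinnow {
        static final double PROMOTE = 1.23, DEMOTE = 0.83, THICKNESS = 0.05;
        final Map<String, Double> spam = new HashMap<>(), ham = new HashMap<>();

        // Score of a feature set under one class; unseen weights default to 1.
        static double score(Map<String, Double> w, String[] feats) {
            double s = 0;
            for (String f : feats) s += w.getOrDefault(f, 1.0);
            return s;
        }

        void learn(String[] feats, boolean isSpam) {
            double sTrue  = score(isSpam ? spam : ham, feats);
            double sOther = score(isSpam ? ham : spam, feats);
            // Thick threshold: update not only on errors but also on
            // "almost-errors", i.e. when the true class fails to win by
            // a relative margin of at least THICKNESS.
            if (sTrue <= sOther * (1 + THICKNESS)) {
                Map<String, Double> up   = isSpam ? spam : ham;
                Map<String, Double> down = isSpam ? ham : spam;
                for (String f : feats) {
                    // merge() starts an absent weight at 1.0 * factor.
                    up.merge(f, PROMOTE, (a, b) -> a * b);
                    down.merge(f, DEMOTE, (a, b) -> a * b);
                }
            }
        }

        boolean classifyAsSpam(String[] feats) {
            return score(spam, feats) > score(ham, feats);
        }
    }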
From: Laird B. <lb...@us...> - 2004-03-31 05:49:40
On Mar 31 2004, Raul Miller wrote:

> However, being blunt -- if you can't explain a concept in simple english you do not understand it.
>
> Concepts you understand are simple. Concepts you do not really understand are complex.

That's ok, I thought I could explain that particular issue without resorting to technical language, but I now realize I failed. I hope I didn't confuse you too much, but hopefully if you take the time to read some of that book, something useful can still come out of this.

It was a good discussion though.

Laird.
From: Raul M. <mo...@ma...> - 2004-03-31 05:09:26
On Tue, Mar 30, 2004 at 11:25:37PM -0500, I wrote:

> More specifically, at time T0 it's not known whether a document will be classified as spam or non-spam. At time T1 this is known. At time T0, we can assign a probability to the chance that it will be classified into one of these two sets.

I should probably add, in case it's not obvious: the probability P(class|feature) is the classification probability which would be assigned by a hypothetical observer who is only allowed to know the presence or absence of that specific feature and (eventually) the document's classification.

- Raul
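[For reference, the dependency between the two families of conditionals that is argued back and forth in this thread is just Bayes' rule, in the thread's own notation:

    P(class|feature) = P(feature|class) * P(class) / P(feature)

so fixing P(feature|class) together with P(class) and P(feature) determines P(class|feature), and vice versa.]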
From: Raul M. <mo...@ma...> - 2004-03-31 04:26:24
> > No, that was never established. I've repeatedly asked you to prove this assertion, and you've always evaded the point.

On Wed, Mar 31, 2004 at 01:45:32PM +1000, Laird Breyer wrote:

> Some miscommunication! I've repeatedly explained that if you pick them independently, then you can violate the ordering P(A) <= P(B) when A is included in B. This is a fundamental property. No further proof is needed.

In the last few messages, B has referred to the set a document is classified into, and A has referred to some feature of such a document. My guess, however, is that here you are referring to the probability that some feature A and some feature B will appear on some arbitrary document within some set. In either case, the relationship you've just specified is not a relationship between probabilities I've stated are independent.

> > The closest you've come to proving this is your assertion that P(B|A) depends on P(A|B), but that is blatantly false, since there is information needed for P(A|B) which is not needed for P(B|A).
>
> So you're saying that because some extra information is needed for the full relationship, the two are therefore not related?

That's just wrong. I never claimed that they were not related.

> This is just so much waffle. If you do this and do that, and pretend this is that, then it works!
>
> I've given you a specific set of requirements, e.g. Prob(cat|class) >= Prob(cat sees|class). Go sit down with a piece of paper and come up with a specific probability measure on documents which satisfies those requirements and represents accurately what crm114 is doing. You have all the time in the world.

I'm not claiming Prob(cat|class) is independent of Prob(cat sees|class). I'm claiming Prob(class|cat) is independent of Prob(class|cat sees).

> > > You can pick all the P(B|A) together according to carefully picked formulas in terms of the number of documents and whatever else is relevant. Only if you dump multiword features are the restrictions weak enough to pick the P(B|A) effectively independently as you're trying to do.
> >
> > I'm still waiting for proof. All I've seen so far is hand waving.
>
> I don't need to prove anything. It's up to you to pick a suitable measure P. The ones you've come up with so far failed the consistency test. If you can come up with a consistent probability on documents, you've won.

You've yet to prove that there's anything wrong with Prob(class|feature).

> > > (*) otherwise, let's agree to disagree and stop this line of discussion.
> >
> > You've got three choices:
> >
> > [a] Prove this point
> > [b] Stop making the point
> > [c] Realize that I'm going to object every time you make this point.
> >
> > My preference would be to achieve some kind of mutual understanding (leading to either [a] or [b]), however if you insist on mutual disagreement (leading to either [b] or [c]) I'm not going to be able to stop you.
>
> We've argued this to death. I'm going to take option c, and stop arguing in circles.

Ok, I'm going to take this as a statement that you have no mathematical basis for your assertion.

> You want me to provide an example of a nonexistent probability measure which is inconsistent? I won't.

You've yet to show that these are not probabilities. These probabilities are easily measurable, and easily describable in English. Waving your hands at them does not make them go away.

More specifically, at time T0 it's not known whether a document will be classified as spam or non-spam. At time T1 this is known. At time T0, we can assign a probability to the chance that it will be classified into one of these two sets.

> If it is impractical to take such a course for various reasons, perhaps you will enjoy learning from a book. Here's a free one you can download:
>
> http://www.dartmouth.edu/~chance/teaching_aids/books_articles/probability_book/book.html
>
> As I said, the important thing is to wrestle with as many exercises as possible. Only this way will you obtain a feel for how it works.
>
> Sorry to be so blunt, but this question is no longer worth spending time on, at least for me. If you feel that I'm just bullshitting, that's fine too. We'll disagree on this particular topic and that's that.

I will read the book you've given a reference to.

However, being blunt -- if you can't explain a concept in simple english you do not understand it.

Concepts you understand are simple. Concepts you do not really understand are complex.

-- Raul
From: Laird B. <lb...@us...> - 2004-03-31 03:45:40
On Mar 31 2004, Raul Miller wrote:

> On Tue, Mar 30, 2004 at 04:50:22PM +1000, Laird Breyer wrote:
>
> > Not again! We've established(*) that if you pick the P(B|A) independently, you don't generally obtain a probability.
>
> No, that was never established. I've repeatedly asked you to prove this assertion, and you've always evaded the point.

Some miscommunication! I've repeatedly explained that if you pick them independently, then you can violate the ordering P(A) <= P(B) when A is included in B. This is a fundamental property. No further proof is needed.

> The closest you've come to proving this is your assertion that P(B|A) depends on P(A|B), but that is blatantly false, since there is information needed for P(A|B) which is not needed for P(B|A).

So you're saying that because some extra information is needed for the full relationship, the two are therefore not related? That's just wrong.

> > You must pick the family P(B|A), where B and A vary appropriately, in a consistent way. Picking the P(B|A) independently does not guarantee consistency.
>
> You need to be consistent with is the choice of spam vs. not spam. You have yet to prove that there needs to be any consistency between P(B|A1) and P(B|A2).
>
> Hmm... ok, I will grant that if feature A1 contains A2 and A1 has a non-zero probability then A2 must also have a non-zero probability. However, A2 can be within epsilon of zero, which I consider to be close enough. More generally, crm114 works with probabilities in the open interval (0,1) -- it's an explicit requirement that 1 and 0 are never valid probabilities. You can never know that a certain word or phrase will never appear in spam, nor can you ever know that it will never appear in non-spam.
>
> Or, if you declare that there are such words, you filter documents which contain them using your white list or black list, and don't calculate invalid probabilities for such documents.

This is just so much waffle. If you do this and do that, and pretend this is that, then it works!

I've given you a specific set of requirements, e.g. Prob(cat|class) >= Prob(cat sees|class). Go sit down with a piece of paper and come up with a specific probability measure on documents which satisfies those requirements and represents accurately what crm114 is doing. You have all the time in the world.

> > You can pick all the P(B|A) together according to carefully picked formulas in terms of the number of documents and whatever else is relevant. Only if you dump multiword features are the restrictions weak enough to pick the P(B|A) effectively independently as you're trying to do.
>
> I'm still waiting for proof. All I've seen so far is hand waving.

I don't need to prove anything. It's up to you to pick a suitable measure P. The ones you've come up with so far failed the consistency test. If you can come up with a consistent probability on documents, you've won.

> > (*) otherwise, let's agree to disagree and stop this line of discussion.
>
> You've got three choices:
>
> [a] Prove this point
> [b] Stop making the point
> [c] Realize that I'm going to object every time you make this point.
>
> My preference would be to achieve some kind of mutual understanding (leading to either [a] or [b]), however if you insist on mutual disagreement (leading to either [b] or [c]) I'm not going to be able to stop you.

We've argued this to death. I'm going to take option c, and stop arguing in circles.

> > > It's true that with this information, and some additional information (the cardinality of the B set), we could find P(A|B), but I don't see that this imposes any kind of circular consistency requirement.
> >
> > Perhaps circular is the wrong word. You cannot pick the P(B|A) unless you know that their effect on subsequently calculated P(A|B) is acceptable. Moreover, using the rules of probability, you can take those calculated P(A|B) and recalculate the P(B|A) from them in many different ways, and the result must always be the same. This is not true if you pick the P(B|A) independently, unless you're exceedingly lucky in your choice.
>
> If this is the case, your proof should be trivial -- provide an example of some set of P(B|Ax) which does not have a corresponding set of P(Ax|B) which satisfy the constraints on P(Ax|B). [Clearly, you'll also need to list the appropriate constraints, since from my point of view that's arbitrary.] Note that I'll grant your point for some examples involving the probabilities 1 or 0, but I've explicitly rejected your point for all other probabilities.

You want me to provide an example of a nonexistent probability measure which is inconsistent? I won't.

> If I grant that 1 and 0 are invalid probabilities by definition (they require certainty about what spammers will do and/or about what information you'll need to hear about), will you grant my point?

I think what needs to happen is that you take a basic course on probability and statistics. We can't continue to argue about absolutely fundamental issues. Such a course will help in the following way: you'll do lots of varied assignment problems, and you'll start to get a feel for how it works. Most importantly, you'll become familiar with the standard definitions of relevant concepts, and if you'd like to discuss this again, we can then use the proper terminology rather than arguing endlessly about what the basics mean.

If it is impractical to take such a course for various reasons, perhaps you will enjoy learning from a book. Here's a free one you can download:

http://www.dartmouth.edu/~chance/teaching_aids/books_articles/probability_book/book.html

As I said, the important thing is to wrestle with as many exercises as possible. Only this way will you obtain a feel for how it works.

Sorry to be so blunt, but this question is no longer worth spending time on, at least for me. If you feel that I'm just bullshitting, that's fine too. We'll disagree on this particular topic and that's that.

-- Laird Breyer.
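[A concrete instance of the requirement Prob(cat|class) >= Prob(cat sees|class) invoked above: every document containing the phrase "cat sees" also contains the word "cat", so the corresponding event sets are nested, and for any class

    {d : "cat sees" in d} is a subset of {d : "cat" in d}
    ==> Prob(cat sees | class) <= Prob(cat | class).

Estimates chosen independently for each feature need not respect this inequality, which is the consistency failure being argued about.]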
From: Laird B. <lb...@us...> - 2004-03-31 03:05:46
On Mar 30 2004, Christian Siefkes wrote:

> Hi Laird,
>
> On Tue, 30 Mar 2004, Laird Breyer wrote:
>
> > (*) you might want to read the exchange I'm having with Christian Siefkes, who is trying to get around the difficulties in a promising way. He builds a parallel universe D_5 in which features are independent exactly as you assume, but (I'm fairly certain) he's in trouble when he wants to bring back the result into the real universe D.
>
> Maybe you're right and it's not possible to consider the probabilities calculated in D_5 (after the SBPH transformation \phi) as probabilities on D (prior to this transformation i.e. with single-word features only).

=======PEDANTISM=============

Just to be pedantic, for clarity's sake, the probability measure P on D_5 makes all features independent, where a feature is an n-gram for n <= 5. If you accept my proof, then you do not obtain a probability Prob on D by simply taking the restriction of P to the subset \phi(D), i.e. what you wanted to do was

Prob(A) = P(A \intersect \phi(D)),
Prob(A|B) = P(A \intersect \phi(D) | B \intersect \phi(D))

for all possible sets A, B of interest, e.g. A = { d is spam } etc. It is entirely possible that a more complex transformation Prob = some transformation of P could work. But I don't have one handy.

=======PEDANTISM=============

> But it might be enough to stay in D_5 and use the probabilities as calculated there. After all each document has a unique representation in D_5 just as in D, and even the untransformed feature vector in D is not the "real document" because prior to that you decided how to tokenize, what to discard (e.g. whitespace) etc...

The formalism D -> \phi -> D_5 is also suitable for any other transformation of documents. Let's see how it would work for discarding whitespace.

Let \psi(t) = t', where t' is the same string as t, but with consecutive whitespace removed. Now write R for the set of real documents, and R' for the documents with whitespace removed, so \psi: R -> R'.

What is the difference from the situation \phi: D -> D_5? A crucial step in my proof is that there exists T \in D_5 such that T = \phi(t) has no solution for t in D. For example, the document T = (a,b,c#b) is not invertible into a document in D.

However, for \psi, all documents in R' have an antecedent in R. So the memory leak trick I used cannot be used, and in fact because \psi(R) = R', if you look at the pedantic section above you see that you can use the formula as you wanted. So there is a probability Prob on R which can be obtained from P on R', but of course you must construct such a probability P on R' first.

So the moral of the story is that some transformations work, and some don't.

> OK, the documents we'll encounter in D_5 are sparse in the sense that there are documents we'll never see because they cannot be generated by the SBPH transformation (\phi(D) is a strict subset of D_5). So the Naive Bayes classifier thinks it is operating on D_5 while truly it is operating on the subset \phi(D) only. That's bad but I think it's not entirely unreasonable to hope that the effects of this wrong assumption will roughly cancel each other out because all classes are affected the same way. I'm an optimist.

This is a good hope, but only a hope. The proof I gave you shows that there is no theoretical justification based on probability theory for making this claim. So yes, we all hope it's true but we cannot strictly argue authoritatively that it is so by invoking Bayesian statistics.

> Here is a paper that seems to confirm this hope: http://www.intellektik.informatik.tu-darmstadt.de/~tom/IJCAI01/Rish.pdf . They state that Naive Bayes works best if _either_ the features are really independent (that's obvious) _or_ if the dependencies between features are functional (deterministic). Now the feature dependencies introduced by SBPH are completely deterministic -- so this might help to understand why CRM can introduce these dependencies and still proceed as if they didn't exist without suffering a performance drop.

The general approach of this paper is the right direction, I think, but we have to be careful about generalizing its results to the present case.

The relevant bit is Theorem 3, which is proven in a tech report by Rish, Hellerstein and Thathachar. If you look at the proof they give (it's Theorem 6), then you see that a crucial assumption is that f(a) == f(b) implies a == b, and in fact if this fails, then their proof breaks down easily. Just think for example of the case where f(x) = constant1 for x \in A, f(x) = constant2 for x \notin A.

I'm not convinced this theorem is strong enough to be generalized to the case D -> D_5. Consider the simpler case D -> D_2, and consider the natural corresponding definition of \phi_2:

\phi_2(a,b,c,d) =(def) (a,b,c,d,a#b,b#c,c#d).

We can write this in the form required by their Theorem: let x = (a,b,c,d), then

\phi_2(x) = (x, f_1(x), f_2(x), f_3(x))

But their theorem doesn't apply, because f_1(x) = f_1(y) does not imply x = y, etc. There could be a variation of their theorem, and a variation on \phi_2, which together would work, but it is just so much speculation. Intuitively, I don't believe it, at least for now.

-- Laird Breyer.
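[The map \phi_2 is small enough to write out. An illustrative Java sketch follows; the comment in main() restates the non-invertibility point the proof above relies on.]

    import java.util.ArrayList;
    import java.util.List;

    public class Phi2 {
        // phi_2(a,b,c,d) = (a, b, c, d, a#b, b#c, c#d):
        // the unigrams followed by each adjacent bigram.
        static List<String> phi2(List<String> doc) {
            List<String> out = new ArrayList<>(doc);
            for (int i = 0; i + 1 < doc.size(); i++)
                out.add(doc.get(i) + "#" + doc.get(i + 1));
            return out;
        }

        public static void main(String[] args) {
            // Prints [a, b, c, d, a#b, b#c, c#d]
            System.out.println(phi2(List.of("a", "b", "c", "d")));
            // The image phi2(D) is a strict subset of the target space:
            // a tuple whose bigram slot reads c#b while its unigram slots
            // read (a, b), such as (a, b, c#b), cannot be phi2 of any
            // document -- the bigram slot would have to read a#b.
        }
    }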
From: Christian S. <si...@mi...> - 2004-03-30 16:49:34
Hi Laird,

On Tue, 30 Mar 2004, Laird Breyer wrote:

> (*) you might want to read the exchange I'm having with Christian Siefkes, who is trying to get around the difficulties in a promising way. He builds a parallel universe D_5 in which features are independent exactly as you assume, but (I'm fairly certain) he's in trouble when he wants to bring back the result into the real universe D.

Maybe you're right and it's not possible to consider the probabilities calculated in D_5 (after the SBPH transformation \phi) as probabilities on D (prior to this transformation i.e. with single-word features only). But it might be enough to stay in D_5 and use the probabilities as calculated there. After all each document has a unique representation in D_5 just as in D, and even the untransformed feature vector in D is not the "real document" because prior to that you decided how to tokenize, what to discard (e.g. whitespace) etc...

OK, the documents we'll encounter in D_5 are sparse in the sense that there are documents we'll never see because they cannot be generated by the SBPH transformation (\phi(D) is a strict subset of D_5). So the Naive Bayes classifier thinks it is operating on D_5 while truly it is operating on the subset \phi(D) only. That's bad but I think it's not entirely unreasonable to hope that the effects of this wrong assumption will roughly cancel each other out because all classes are affected the same way.

Here is a paper that seems to confirm this hope: http://www.intellektik.informatik.tu-darmstadt.de/~tom/IJCAI01/Rish.pdf . They state that Naive Bayes works best if _either_ the features are really independent (that's obvious) _or_ if the dependencies between features are functional (deterministic). Now the feature dependencies introduced by SBPH are completely deterministic -- so this might help to understand why CRM can introduce these dependencies and still proceed as if they didn't exist without suffering a performance drop.

Bye
Christian

------------ Christian Siefkes -----------------------------------------
| Email: chr...@si... | Web: http://www.siefkes.net/
| Graduate School in Distributed IS: http://www.wiwi.hu-berlin.de/gkvi/
-------------------- Offline P2P: http://www.leihnetzwerk.de/ ----------
Freedom is being able to make decisions that affect mainly you. Power is being able to make decisions that affect others more than you. If we confuse power with freedom, we will fail to uphold real freedom.
-- Bradley M. Kuhn and Richard M. Stallman, Freedom Or Power?
From: Raul M. <mo...@ma...> - 2004-03-30 14:38:24
On Tue, Mar 30, 2004 at 04:50:22PM +1000, Laird Breyer wrote:

> Not again! We've established(*) that if you pick the P(B|A) independently, you don't generally obtain a probability.

No, that was never established. I've repeatedly asked you to prove this assertion, and you've always evaded the point.

The closest you've come to proving this is your assertion that P(B|A) depends on P(A|B), but that is blatantly false, since there is information needed for P(A|B) which is not needed for P(B|A).

> You must pick the family P(B|A), where B and A vary appropriately, in a consistent way. Picking the P(B|A) independently does not guarantee consistency.

All you need to be consistent with is the choice of spam vs. not spam. You have yet to prove that there needs to be any consistency between P(B|A1) and P(B|A2).

Hmm... ok, I will grant that if feature A1 contains A2 and A1 has a non-zero probability then A2 must also have a non-zero probability. However, A2 can be within epsilon of zero, which I consider to be close enough. More generally, crm114 works with probabilities in the open interval (0,1) -- it's an explicit requirement that 1 and 0 are never valid probabilities. You can never know that a certain word or phrase will never appear in spam, nor can you ever know that it will never appear in non-spam.

Or, if you declare that there are such words, you filter documents which contain them using your white list or black list, and don't calculate invalid probabilities for such documents.

> You can pick all the P(B|A) together according to carefully picked formulas in terms of the number of documents and whatever else is relevant. Only if you dump multiword features are the restrictions weak enough to pick the P(B|A) effectively independently as you're trying to do.

I'm still waiting for proof. All I've seen so far is hand waving.

> (*) otherwise, let's agree to disagree and stop this line of discussion.

You've got three choices:

[a] Prove this point
[b] Stop making the point
[c] Realize that I'm going to object every time you make this point.

My preference would be to achieve some kind of mutual understanding (leading to either [a] or [b]), however if you insist on mutual disagreement (leading to either [b] or [c]) I'm not going to be able to stop you.

> > It's true that with this information, and some additional information (the cardinality of the B set), we could find P(A|B), but I don't see that this imposes any kind of circular consistency requirement.
>
> Perhaps circular is the wrong word. You cannot pick the P(B|A) unless you know that their effect on subsequently calculated P(A|B) is acceptable. Moreover, using the rules of probability, you can take those calculated P(A|B) and recalculate the P(B|A) from them in many different ways, and the result must always be the same. This is not true if you pick the P(B|A) independently, unless you're exceedingly lucky in your choice.

If this is the case, your proof should be trivial -- provide an example of some set of P(B|Ax) which does not have a corresponding set of P(Ax|B) which satisfy the constraints on P(Ax|B). [Clearly, you'll also need to list the appropriate constraints, since from my point of view that's arbitrary.] Note that I'll grant your point for some examples involving the probabilities 1 or 0, but I've explicitly rejected your point for all other probabilities.

In my opinion, it will be simple for me to show this is not the case for any such set. [Though, example sets which have high cardinality will tend to require more calculations than examples involving smaller sets of Ax.]

Is there an example involving P(B|A1) and P(B|A2) you can show me?

> > More generally, if "knowing something about" a dependent variable meant that some other variable couldn't be independent, then we'd never be able to have any independent variables. It's always the case that when you are able to find the value of an independent variable you know something about any associated dependent variables.
>
> This has nothing to do with our problem. In our problem, the P(B|A) depend on the P(A|B), just as the P(A|B) depend on the P(B|A). Give me any one family and the other family can be deduced to some extent.

P(B|A1): 0.75
P(B|A2): 0.25

I'll grant that you can provide non-empty sets of potentially valid answers to the questions:

A1 is a feature which contains A2 -- what's P(A1|B) and P(A2|B)?
A2 is a feature which contains A1 -- what's P(A1|B) and P(A2|B)?

This is because I see P(Ax|B) as dependent on P(B|Ax). What I don't see is how this makes P(B|Ax) not be independent variables.

If I grant that 1 and 0 are invalid probabilities by definition (they require certainty about what spammers will do and/or about what information you'll need to hear about), will you grant my point?

-- Raul
From: Laird B. <lb...@us...> - 2004-03-30 06:50:27
|
On Mar 30 2004, Raul Miller wrote:
> > It's a circular consistency requirement. You have got to know
> > something about the P(A|B) before you can choose the P(B|A), and
> > you've got to know the P(B|A) to verify that the P(A|B) are consistent.
> > It's a fact of life (or rather, probability theory).
>
> We don't have to know P(A|B) to find P(B|A), we just have to know
> the number of documents in each set which contain the relevant feature.

Not again! We've established(*) that if you pick the P(B|A)
independently, you don't generally obtain a probability.

You must pick the family P(B|A), where B and A vary appropriately,
in a consistent way. Picking the P(B|A) independently does not guarantee
consistency.

You can pick all the P(B|A) together according to carefully picked
formulas in terms of the number of documents and whatever else is
relevant. Only if you dump multiword features are the restrictions
weak enough to pick the P(B|A) effectively independently as you're
trying to do.

(*) otherwise, let's agree to disagree and stop this line of discussion.

> It's true that with this information, and some additional information
> (the cardinality of the B set), we could find P(A|B), but I don't see
> that this imposes any kind of circular consistency requirement.

Perhaps circular is the wrong word. You cannot pick the P(B|A) unless
you know that their effect on the subsequently calculated P(A|B) is
acceptable. Moreover, using the rules of probability, you can take
those calculated P(A|B) and recalculate the P(B|A) from them in many
different ways, and the result must always be the same. This is not true
if you pick the P(B|A) independently, unless you're exceedingly lucky in
your choice.

> More generally, if "knowing something about" a dependent variable meant
> that some other variable couldn't be independent, then we'd never be able
> to have any independent variables. It's always the case that when you
> are able to find the value of an independent variable you know something
> about any associated dependent variables.

This has nothing to do with our problem. In our problem, the P(B|A)
depend on the P(A|B), just as the P(A|B) depend on the P(B|A). Give me
any one family and the other family can be deduced to some extent.

--
Laird Breyer. |
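To make the consistency worry concrete, here is a minimal sketch (Python;
all numbers invented, and plain co-occurrence of two words stands in for
crm114's windowed multiword features). With a 0.5 prior and conditional
independence assumed, the single-word choices already force the value of
P(spam | both words present), so that value cannot also be picked freely:

    # Toy version of the consistency problem (all numbers invented).
    # Two single-word features w1, w2 with independently picked spam
    # probabilities, a 0.5 prior, and conditional independence assumed.
    p_spam_w1 = 0.8   # picked independently
    p_spam_w2 = 0.8   # picked independently

    # With a 0.5 prior, P(spam|w)/P(notspam|w) equals the likelihood
    # ratio P(w|spam)/P(w|notspam), so the chain rule fixes the posterior
    # for the co-occurrence feature "w1 and w2":
    odds = (p_spam_w1 / (1 - p_spam_w1)) * (p_spam_w2 / (1 - p_spam_w2))
    forced = odds / (1 + odds)
    print(forced)  # ~0.9412 -- already determined by the single-word picks

    # Independently imposing some other value, say 0.5, for the multiword
    # feature would contradict the number above: the kind of restriction
    # the thread is arguing about.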
From: Raul M. <mo...@ma...> - 2004-03-30 05:59:03
|
> > The set of P(B|A) represents independent probability variables for
> > arbitrary B and A. The set of P(A|B) represents probability variables
> > which are not independent of each other. P(A|B) is derivable knowing
> > P(B|A) and P(B).

On Tue, Mar 30, 2004 at 10:50:01AM +1000, Laird Breyer wrote:
> You are still supposing that the numbers P(B|A) of interest can be
> chosen independently first. But if you do, then the family P(A|B) derived
> from them is broken. The only way to ensure the P(A|B) are not broken is
> to *not* take the P(B|A) independently.
>
> It's a circular consistency requirement. You have got to know
> something about the P(A|B) before you can choose the P(B|A), and
> you've got to know the P(B|A) to verify that the P(A|B) are consistent.
> It's a fact of life (or rather, probability theory).

We don't have to know P(A|B) to find P(B|A), we just have to know
the number of documents in each set which contain the relevant feature.

It's true that with this information, and some additional information
(the cardinality of the B set), we could find P(A|B), but I don't see
that this imposes any kind of circular consistency requirement.

More generally, if "knowing something about" a dependent variable meant
that some other variable couldn't be independent, then we'd never be able
to have any independent variables. It's always the case that when you
are able to find the value of an independent variable you know something
about any associated dependent variables.

--
Raul |
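The count-based recipe above, as a minimal sketch (Python; the counts are
hypothetical, and the corpus class mix is assumed to stand in for the
prior):

    # Estimating P(spam | feature) from document counts alone.
    # The counts are hypothetical.
    n_spam_docs_with_feature = 120
    n_ham_docs_with_feature  = 40

    p_spam_given_feature = n_spam_docs_with_feature / (
        n_spam_docs_with_feature + n_ham_docs_with_feature)
    print(p_spam_given_feature)  # 0.75

    # Deriving P(feature | spam) additionally needs the total size of
    # the spam corpus -- the "cardinality of the B set" in the thread:
    n_spam_docs_total = 600
    p_feature_given_spam = n_spam_docs_with_feature / n_spam_docs_total
    print(p_feature_given_spam)  # 0.2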
From: Laird B. <lb...@us...> - 2004-03-30 00:50:07
|
> Here's the way I'm seeing it:
>
> The set of P(B|A) represents independent probability variables for
> arbitrary B and A. The set of P(A|B) represents probability variables
> which are not independent of each other. P(A|B) is derivable knowing
> P(B|A) and P(B).

You are still supposing that the numbers P(B|A) of interest can be
chosen independently first. But if you do, then the family P(A|B) derived
from them is broken. The only way to ensure the P(A|B) are not broken is
to *not* take the P(B|A) independently.

It's a circular consistency requirement. You have got to know
something about the P(A|B) before you can choose the P(B|A), and
you've got to know the P(B|A) to verify that the P(A|B) are consistent.
It's a fact of life (or rather, probability theory).

Note again that the inconsistency only shows up if you deal with
features consisting of several words. If all you ever have are one-word
features, then the P(A|B) are not broken and you can indeed choose your
P(B|A) in complete freedom.

> You've been saying that because the variables associated with P(A|B)
> are not independent that there is no underlying probability model.
> I've been saying that because the variables associated with P(B|A)
> are independent, there is an underlying probability model.

That's the wrong way to put it. Once you have an underlying probability
model, you can derive from *it* all the P(A|B), P(B|A) you ever care to
know. The quantities are completely fixed by the model. If you don't
have an underlying model, then the quantities P(A|B) and P(B|A) obtained
from Bayesian formulas are meaningless.

model --> P(A|B), P(B|A).

What you're trying to do is to say:

impose P(B|A) --> discover model consistent with imposed P(B|A)
--> P(A|B), P(B|A).

This works, but only if you restrict yourself when imposing P(B|A).
Independence is not a sufficient restriction for multiword features.

> More specifically, the initial P(B) is 0.5, P(B|A) are the independent
> variables, while P(A|B) and P(B)' are derived.
>
> Do you see some flaw in this point of view?

Yes. There are no initially independent variables. This is an
inconsistent assumption which causes trouble down the road.

The iterative formula used by crm114 is supposed to be used only if a
model exists, but no model (*) is consistent with the assumption that
the P(B|A) are all independent. So the iterative formula is applied
without justification, and probably leads who knows where. It might
work in a couple of cases for completely unrelated reasons, but it
won't work consistently in all cases.

(*) you might want to read the exchange I'm having with Christian
Siefkes, who is trying to get around the difficulties in a promising
way. He builds a parallel universe D_5 in which features are independent
exactly as you assume, but (I'm fairly certain) he's in trouble when he
wants to bring the result back into the real universe D.

> I think you've said that the chain rule would have to be applied
> differently if this were the case? If that's the issue, could you
> elaborate a bit?

It's not the issue; it was a remark about the way you were interpreting
the Plocalspam(w) probabilities. I was stating that they have the form
Plocalspam(feature) = P(feature|spam), while you seemed to say
Plocalspam(feature) = P(spam|feature). If you had been right, then the
formula

P'(spam) = Plocalspam(feat)P(spam)/(...)

would simply be wrong. Only by choosing
Plocalspam(feature) = P(feature|spam) do you get the stated form.
--
Laird Breyer. |
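For reference, a minimal sketch of the iterative per-feature update being
debated (Python; the local probabilities are invented, and this compresses
crm114's actual machinery into the bare two-class chain rule):

    # One Bayesian chain-rule pass of the kind discussed in the thread:
    # start from P(spam) = 0.5 and fold in one feature at a time.
    # The local probabilities below are invented; real ones would come
    # from the .css statistics files.
    def update(p_spam, p_feat_spam, p_feat_ham):
        """P'(spam) = P(f|spam)P(spam) /
                      (P(f|spam)P(spam) + P(f|ham)P(ham))."""
        num = p_feat_spam * p_spam
        den = num + p_feat_ham * (1.0 - p_spam)
        return num / den

    # (P(f|spam), P(f|ham)) for three features
    features = [(0.30, 0.05), (0.10, 0.20), (0.25, 0.02)]
    p = 0.5
    for p_fs, p_fh in features:
        p = update(p, p_fs, p_fh)
    print(p)  # ~0.974 after all three features

The step treats each updated posterior as the prior for the next feature;
whether that chain is justified without an underlying model is precisely
what the thread is contesting.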
From: Laird B. <lb...@us...> - 2004-03-30 00:08:46
|
On Mar 29 2004, Christian Siefkes wrote:
> > Let M be the 2x|D| matrix, where the row label represents
> > {spam,notspam} and the column label represents an enumeration of the
> > documents t \in D.
> >
> > M_st =(def) Prob(d is "s" | d = t) =(def) P(T is "s" | T = \phi(t)),
> >
> > Let N be the |D|x2 matrix, where the row label is a document and the
> > column label is in {spam,notspam}:
> >
> > N_tr =(def) Prob(d = t | d is "r") = \prod_k P(T_k = \phi(t)_k | T is "r")
>
> This looks more or less as if you're applying the transformation function
> \phi to each feature t_k in the document t? But you can only apply this
> function to the whole document t, otherwise you'll get wrong results...

Sorry, this was meant to be as succinct as possible, but it is what you
expect. If \phi(t) = U, then \phi(t)_k = U_k where U = (U_1,...,U_|D_5|).
Each element of the matrix N_{tr} has a fixed full document t, and a
fixed label r. I could have written this as

N_{tr} = \prod_k P( G_k = T_k | G is "r"), where T = \phi(t) = (T_1,...,T_n)

> > The last product is because on D_5, the terms T_k are independent,
> > conditionally on the spam/notspam class. I'm only defining these
> > matrices so I don't have to write long formulas.
> >
> > If the measure Prob on D exists and is a probability, then we must
> > certainly have
> >
> > Prob(d is "s") = \sum_t Prob(d is "s"| d = t) Prob(d = t)
>
> What exactly is it you're postulating here?

I am saying: if a probability measure Prob exists on D, then such a
measure follows all the usual rules of probability theory. The above is
sometimes called the law of total probability, and has the form

Prob(A) = \sum_i Prob(A|B_i)Prob(B_i).

This equation is true for all probabilities on D, whenever the B_i form
an exhaustive partition of D. This is the case with the choice
B_t = {d = t}.

> > = \sum_{t,r} Prob(d is "s"| d = t)Prob(d = t|d is "r")Prob(d is "r")
> >
> > v = MNv
> >
> > where v is the column vector v = (Prob(d is "spam"), Prob(d is "notspam"))^T
>
> What's T here? The set of features T_k in the document T = \phi(t) ?

Oops, that T is for "transpose", because v is a column vector and I
can't write columns in an email. Sorry, you can probably ignore that.
I'm also using a latex convention that _ indicates a subscript, ^
indicates a superscript. Writing mathematical notation in emails is
terribly inefficient.

--
Laird Breyer. |
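The identity being invoked, as a minimal numeric sketch (Python; the
document probabilities are invented):

    # Law of total probability on a toy document space: Prob(spam) must
    # equal the average of Prob(spam | d = t) weighted by Prob(d = t),
    # since the events {d = t} partition the space. Numbers invented.
    docs = {  # t: (Prob(d = t), Prob(spam | d = t))
        "t1": (0.5, 0.9),
        "t2": (0.3, 0.2),
        "t3": (0.2, 0.4),
    }
    p_spam = sum(p_t * p_s for p_t, p_s in docs.values())
    print(p_spam)  # 0.59 -- any genuine measure must satisfy this identity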
From: Raul M. <mo...@ma...> - 2004-03-29 14:59:58
|
> > Ok, well, I'm pretty sure that's backwards.

On Mon, Mar 29, 2004 at 04:07:55PM +1000, Laird Breyer wrote:
> I don't think so. If the local probabilities have another
> interpretation, then the Bayesian chain rule would have to be applied
> differently.
>
> But you can also check the plateau paper again, on p. 2 the
> expression is
>
> P(in class|feature) = P(feat|in class)P(in class) /
>   ( P(feat|in class)P(in class) + P(feat|not in class)P(not in class) )
>
> and if you compare with the code in crm114.c, lines 9141 and 9073,
> you see that
>
> P(feat|in class) = Plocal-spam(feat),
>
> which is what I'm saying at the top of this email.

Hmm... ok, I didn't pay enough attention to the paper.

> > Either that, or I don't understand the P(|) notation.
>
> Possible. Here's the normal definition: P(A|B) = P(A and B)/P(B), and
> note that this isn't symmetrical in A and B.

That's not the problem -- that's what I thought it meant.

> It seems that you've simply misunderstood the notation, which would
> explain our difficulties communicating. It looks like you've
> interpreted P(A|B) to mean something like P(B|A).

It's more subtle than that -- I think I understand the P(|) notation.
Here's the way I'm seeing it:

The set of P(B|A) represents independent probability variables for
arbitrary B and A. The set of P(A|B) represents probability variables
which are not independent of each other. P(A|B) is derivable knowing
P(B|A) and P(B).

You've been saying that because the variables associated with P(A|B)
are not independent, there is no underlying probability model.
I've been saying that because the variables associated with P(B|A)
are independent, there is an underlying probability model.

More specifically, the initial P(B) is 0.5, P(B|A) are the independent
variables, while P(A|B) and P(B)' are derived.

Do you see some flaw in this point of view?

I think you've said that the chain rule would have to be applied
differently if this were the case? If that's the issue, could you
elaborate a bit?

--
Raul |
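A minimal sketch of the inversion Raul describes (Python; numbers
invented). Note that going from P(B|A) to P(A|B) this way also requires
P(A), e.g. from document counts, not just P(B):

    # Bayes inversion: from P(spam|feature) back to P(feature|spam).
    # All values invented for illustration.
    p_b = 0.5            # prior: P(spam)
    p_b_given_a = 0.75   # independently picked: P(spam | feature)
    p_a = 0.12           # P(feature), e.g. from document counts

    p_a_and_b = p_b_given_a * p_a   # P(feature and spam)
    p_a_given_b = p_a_and_b / p_b   # P(feature | spam)
    print(p_a_given_b)              # 0.18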
From: Christian S. <si...@mi...> - 2004-03-29 14:55:06
|
Hi Laird,

On Sun, 28 Mar 2004, Laird Breyer wrote:
> Here's the proof (a bit long, so I didn't want to include it earlier).

Hm, there are some points where I'm not sure what you mean. Not sure if
we'll find the time to sort this out, but maybe...

> Let's first assume that D has finite cardinality |D|. This is the case
> if the vocabulary is finite and documents have a maximum size. We can
> take the limit afterwards.
>
> Claim: Assume |D| is finite. Define for any t \in D
>
> Prob( t is spam | t = (t_1,...,t_n) ) =(def) P( T is spam | T = \phi(t_1,...,t_n) )
>
> Then there exists no probability measure "Prob" defined on D such
> that the above are conditional probabilities for all documents t \in D.
>
> Proof:
>
> Let M be the 2x|D| matrix, where the row label represents
> {spam,notspam} and the column label represents an enumeration of the
> documents t \in D.
>
> M_st =(def) Prob(d is "s" | d = t) =(def) P(T is "s" | T = \phi(t)),
>
> Let N be the |D|x2 matrix, where the row label is a document and the
> column label is in {spam,notspam}:
>
> N_tr =(def) Prob(d = t | d is "r") = \prod_k P(T_k = \phi(t)_k | T is "r")

This looks more or less as if you're applying the transformation function
\phi to each feature t_k in the document t? But you can only apply this
function to the whole document t, otherwise you'll get wrong results...

> The last product is because on D_5, the terms T_k are independent,
> conditionally on the spam/notspam class. I'm only defining these
> matrices so I don't have to write long formulas.
>
> If the measure Prob on D exists and is a probability, then we must
> certainly have
>
> Prob(d is "s") = \sum_t Prob(d is "s"| d = t) Prob(d = t)

What exactly is it you're postulating here?

> = \sum_{t,r} Prob(d is "s"| d = t)Prob(d = t|d is "r")Prob(d is "r")
>
> v = MNv
>
> where v is the column vector v = (Prob(d is "spam"), Prob(d is "notspam"))^T

What's T here? The set of features T_k in the document T = \phi(t) ?

> and we are summing over the real documents t \in D only.
>
> Let A = MN, then v is an eigenvector with eigenvalue 1, and v has the
> special form v = (a, 1 - a)^T with 0 <= a <= 1. Now
>
> v_i = \sum_j A_ij v_j, and \sum_i v_i = 1 gives
>
> \sum_j (\sum_i A_ij) v_j = 1, which simplifies to
>
> (*) a (\sum_i A_{i,spam}) + (1-a) (\sum_i A_{i,notspam}) = 1.
>
> The elements of A are of course all nonnegative, and I claim there is
> no solution for 0 <= a <= 1. If no such a exists, then there is no
> measure Prob on D.
>
> Here is why no such a exists: Let M_5 and N_5 be defined exactly like
> M and N, but on D_5, with respect to the probability measure P which
> we agree exists. The only difference is that on D_5, there are many
> more documents, but still a finite number.
>
> Let A_5 = (M_5)(N_5). You can see that \sum_i (A_5)_{i,spam} = 1,
> because
>
> \sum_i (A_5)_{i,spam} =
>   \sum_{T \in D_5} [ P(spam|T)P(T|spam) + P(notspam|T)P(T|spam) ]
>   = \sum_{T \in D_5} P(T|spam) = 1.
>
> where we use P(notspam|T) = 1 - P(spam|T).
> Similarly, you can show that \sum_i (A_5)_{i,notspam} = 1.
>
> But now, notice that
>
> (**) 1 = \sum_i (A_5)_{i,r} > \sum_i A_{i,r},
>
> so in (*), both sums are < 1. It follows that no a with 0 <= a <= 1
> can exist. []
>
> Now that we know that no Prob measure which equals P can exist for D
> finite, consider the infinite case. Notice that the proof does not
> use the finite nature of D at all. The finite sums on t can be
> replaced by infinite sums, and can be interchanged by Fubini's
> theorem, because all terms are nonnegative.
> All we need is to be able to enumerate the t's on D and D_5. Moreover,
> the inequality in (**) is strict even in the infinite case (easy, just
> use a nonreal document in D_5 with positive P probability). So the
> same proof works in the infinite case.

------------ Christian Siefkes -----------------------------------------
| Email: chr...@si... | Web: http://www.siefkes.net/
| Graduate School in Distributed IS: http://www.wiwi.hu-berlin.de/gkvi/
-------------------- Offline P2P: http://www.leihnetzwerk.de/ ----------
The documentation said "For Windows 95 or better", so I used Linux... ;-) |
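The key inequality (**) can be checked numerically. A minimal sketch
(Python; two invented binary features instead of a window of five, and one
feature vector arbitrarily declared "not a real document" to play the role
of D_5 \ D):

    from itertools import product

    # P(T_k = 1 | class) for two conditionally independent binary
    # features; values invented for illustration.
    p_feat = {"spam": (0.7, 0.2), "notspam": (0.1, 0.4)}
    classes = ("spam", "notspam")

    def likelihood(t, c):
        """N_tr = prod_k P(T_k = t_k | class c)."""
        out = 1.0
        for k, bit in enumerate(t):
            p1 = p_feat[c][k]
            out *= p1 if bit else (1.0 - p1)
        return out

    def posterior(t, c):
        """M_st = P(class c | T = t), with a 0.5/0.5 prior."""
        num = 0.5 * likelihood(t, c)
        den = sum(0.5 * likelihood(t, r) for r in classes)
        return num / den

    D5 = list(product((0, 1), repeat=2))  # all four feature vectors
    D  = D5[:3]                           # pretend (1,1) is not a real document

    for domain, name in ((D5, "D_5"), (D, "D")):
        for r in classes:
            # column sum \sum_i A_{i,r} = \sum_t \sum_s P(s|t) P(t|r)
            colsum = sum(posterior(t, s) * likelihood(t, r)
                         for t in domain for s in classes)
            print(name, r, round(colsum, 4))
    # Over D_5 both column sums are exactly 1; over the strict subset D
    # they drop below 1 (0.86 and 0.96 here), which is inequality (**).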
From: Laird B. <lb...@us...> - 2004-03-29 06:08:00
|
On Mar 29 2004, Raul Miller wrote:
> > crm114 says:
> > Assume Plocal-spam(w) to mean that
> >
> > Plocal-spam(w) = P({document d contains w} | document d is spam)
>
> Ok, well, I'm pretty sure that's backwards.

I don't think so. If the local probabilities have another
interpretation, then the Bayesian chain rule would have to be applied
differently.

But you can also check the plateau paper again, on p. 2 the
expression is

P(in class|feature) = P(feat|in class)P(in class) /
  ( P(feat|in class)P(in class) + P(feat|not in class)P(not in class) )

and if you compare with the code in crm114.c, lines 9141 and 9073,
you see that

P(feat|in class) = Plocal-spam(feat),

which is what I'm saying at the top of this email.

> Either that, or I don't understand the P(|) notation.

Possible. Here's the normal definition: P(A|B) = P(A and B)/P(B), and
note that this isn't symmetrical in A and B.

> > left. In words, Plocal-spam(w) means "if a document is spam, then
> > what's its probability of containing w?".
>
> That may be what the documentation says, but examination of the
> associated formula shows that Plocal-spam(w) really means "if a
> document contains w, what is the probability that it is spam?"

It seems that you've simply misunderstood the notation, which would
explain our difficulties communicating. It looks like you've
interpreted P(A|B) to mean something like P(B|A).

--
Laird Breyer. |
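A minimal numeric sketch of that asymmetry (Python; the joint
probabilities are invented):

    # P(A|B) = P(A and B)/P(B) is not symmetric in A and B.
    # Toy joint distribution, numbers invented:
    p_a_and_b = 0.06   # P(doc contains w AND doc is spam)
    p_a = 0.10         # P(doc contains w)
    p_b = 0.40         # P(doc is spam)

    print(p_a_and_b / p_b)  # P(A|B) = P(contains w | spam) = 0.15
    print(p_a_and_b / p_a)  # P(B|A) = P(spam | contains w) = 0.6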
From: Raul M. <mo...@ma...> - 2004-03-29 04:54:42
|
On Mon, Mar 29, 2004 at 02:40:53PM +1000, Laird Breyer wrote:
> crm114 says:
> ------------
> Assume Plocal-spam(w) to mean that
>
> Plocal-spam(w) = P({document d contains w} | document d is spam)

Ok, well, I'm pretty sure that's backwards. Either that, or I don't
understand the P(|) notation.

> left. In words, Plocal-spam(w) means "if a document is spam, then
> what's its probability of containing w?".

That may be what the documentation says, but examination of the
associated formula shows that Plocal-spam(w) really means "if a document
contains w, what is the probability that it is spam?"

--
Raul |