From: Bill Yerazunis <wsy@me...>  2004-03-18 21:36:16

> From: dummy@...
> Hi all,
> I read on the webpage of CRM114 that it can be used as a logfile
> analysis tool. Is there a howto or some guidelines available? I want
> to analyse Linux message log files and firewall iptables logs.

(sorry for the delays in responding; I've been offline in Dublin)

I've played with logfile analysis, but haven't written up anything; my
logfiles are not sufficiently interesting for such work. ;)

     Bill Yerazunis
From: Raul Miller <moth@ma...>  2004-03-18 21:28:45

On Tue, Mar 16, 2004 at 06:20:29PM -0500, chrislistcrm114@... wrote:
> but I can't find anywhere these limits can be set.

Look for ulimit in the script which starts qmail.

Also note that it's likely you can raise the limit just for crm114.

-- 
Raul
From: Raul Miller <moth@ma...>  2004-03-18 16:44:33

On Thu, Mar 18, 2004 at 02:18:53PM +1000, Laird Breyer wrote:
> For example, in the Naive Bayes family of models, the random rules
> are assumed to be such that individual words are picked randomly.

Of course, Naive Bayes is more general than that: "the random rules
are assumed to be such that individual phrases are picked randomly" is
also an example of Naive Bayes.

FYI,

-- 
Raul
From: Leonard Lin <lhl@us...>  2004-03-18 05:44:55

On Mar 14, 2004, at 8:08 PM, Dan Parsons wrote:
> Has anyone here happened to write/come across a set of AppleScripts
> that aid in using CRM114? Like automatically inserting your command
> line, including the email with headers, etc...? Just curious. If not,
> I'll probably write some myself.

I wrote a couple a while back that I've been using with good success:
http://randomfoo.net/code/Mail.app/

-l
From: Laird Breyer <lbreyer@us...>  2004-03-18 04:19:00

On Mar 17 2004, Shalen wrote:
> Hey there,
>
> I have been off this list for a while. Can someone point out to me
> what the current model in the .css file in the Williams CRM114 is?
> Is it Bayesian or SuperIncreasing or Markovian? I remember there was
> a heated debate and we came to the conclusion that the word
> "markovian" is misleading.

Bill should be able to describe the most up-to-date choice of model
used in the current version. But IIRC, there's a text file which is
distributed with the crm source, and that should be up to date.

> Can someone correct me/elaborate on this? I also remember Laird gave
> plenty of good explanations, but I still have this doubt lurking
> that the Williams paper does not give a justification for the use of
> the Markovian Model. In short, can someone point out to me how one
> can apply a Markovian Model to Spam Filtering, and the theoretical
> justifications?

To summarize what I said at the time (and I don't think this has
changed so far): Markovian can mean many things. Academically, it is
used when the actual observed state (ie the words in the email)
follows a Markov chain (http://mathworld.wolfram.com/MarkovChain.html).

IIRC, Bill was arguing that he has a Hidden Markov Model (HMM in
academic speak), ie where the state is not observed, but exists hidden
away like the proverbial éminence grise. In such a model, the Markov
chain evolves and the words in the email are created indirectly, by
reference to the hidden chain's current state. To estimate the state,
the words are analysed in groups.

A generalization of these models is the Markov Random Field, which I'm
pretty sure Bill's models all belong to (random fields are extremely
general, so this isn't saying much: all bounded-range interaction
models in statistical physics are of this type). But I'm digressing
from your point.
To apply a statistical model to spam filtering, you do the same as you
do for text classification, except you need to decide what words to
use, which words to throw away, which to convert, etc. in the context
of a mail message.

So what do you do for text classification? Formally, you need to
imagine that all possible text documents can be constructed from a few
random rules with parameters to be chosen. You write down the rules
and invert them to obtain a probability likelihood. Then if you're a
Bayesian, you apply Bayes' rule to obtain probabilities for the
parameters, given the sample documents. If you're not Bayesian (ie
what's known as a classical statistician, like Fisher was), then you
simply choose the parameters to get the biggest probability likelihood
for all the sample documents at once.

For example, in the Naive Bayes family of models, the random rules
are assumed to be such that individual words are picked randomly. To
obtain a likelihood function of a document, you suppose p_1,...,p_k
are (unknown) probabilities of picking the 1st,...,kth word. The
likelihood of the document (word1,word8,word4,word234) is the function

  L(word1,word8,word4,word234) = p_1 * p_8 * p_4 * p_234.

Now you write this down for all the documents you have, and compute
the full likelihood of all the documents. You simplify the equations,
and you search for the combination of p_1,...,p_k which either
maximizes the full likelihood, or represents the Bayesian rule.
Because in this model the words are chosen independently, when you
simplify, you see that you can basically choose

  p_1 = (number of times word1 seen)/(total number of words seen), etc.

If you're Bayesian, the formula is slightly different. If you drop the
Naive Bayes assumption, then the likelihood L is no longer a simple
product, and the formula for p_1 is wrong too. A straight Markov
sequence model has a likelihood of the form

  L(word1,word8,word4,word234) = p_1,8 * p_8,4 * p_4,234,
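[As an editorial aside, not part of the original thread: Laird's Naive
Bayes recipe can be sketched in a few lines of Python. The toy corpus
and names are hypothetical, and the straight product is used exactly
as in the formulas above; real filters work in log space to avoid
underflow, and use smarter smoothing than a flat floor for unseen
words.]

```python
from collections import Counter

def train_naive_bayes(documents):
    """Maximum-likelihood estimate: p_i = (times word_i seen) / (total words seen)."""
    counts = Counter(word for doc in documents for word in doc)
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}

def likelihood(doc, p, unseen=1e-6):
    """L(doc) = product of the p_i over the document's words, as in the text.
    Words never seen in training get a tiny floor probability."""
    result = 1.0
    for word in doc:
        result *= p.get(word, unseen)
    return result

# Hypothetical toy corpus of "spam" documents (word lists).
spam_docs = [["buy", "now", "cheap"], ["buy", "cheap", "pills"]]
p_spam = train_naive_bayes(spam_docs)

print(p_spam["buy"])                       # "buy" is 2 of 6 words -> 1/3
print(likelihood(["buy", "now"], p_spam))  # (2/6) * (1/6)
```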
and instead of needing values for p_1,...,p_k, you need values for
p_1,1,...,p_1,k, p_2,1,...,p_2,k, ..., p_k,1,...,p_k,k. But otherwise,
it's the same idea.

> (Pointers, tutorials, anything).

Depending on how much mathematics you have in your background, the
first step is to take an introduction to statistical inference. If you
dive right into classification techniques, then you will often only
see the final formula, without explanation of where it comes from.

Oh, I forgot to tell you what to do with the model. Once you've used
the sample documents to compute the values for p_1,...,p_k in the
example above, then you can use the function L to calculate the
likelihood of *any* document, not just the documents you've got in
your corpus. So when you see a new document, you calculate L on that
new document, and that gives you the probability that the document
comes from the model associated with L. If you have several models, ie
one model for spam, one model for ham, then you can see which
likelihood is the biggest. That's the best category for your document,
in a nutshell.

> I wonder whether Markovian Model would be useful, if yes how can
> we prove it? Regardless of the independence assumption (we use

That's easy. You don't need models for that, unless you want to prove
it theoretically for all possible corpora. Simply do predictions on
the same datasets, using different classifier implementations. I'm
glossing over some technical details, like significance measures.

Theoretically, the Naive Bayesian models are a special case of Markov
models, so it's obvious that Markov is an improvement, because it can
adapt at least as well as Naive Bayes. But that's missing the
practical aspects, because you need bigger corpora etc. Shannon was
the first to show visually how good Markov models are.
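[Another editorial sketch, not from the thread: the bigram Markov
likelihood and the pick-the-biggest-likelihood classification step,
with hypothetical toy training data. An unseen floor probability
stands in for proper smoothing.]

```python
from collections import Counter

def train_bigram(documents):
    """Transition estimates: p_(i,j) = count(word_i then word_j) / count(word_i then anything)."""
    pair_counts = Counter()
    left_counts = Counter()
    for doc in documents:
        for a, b in zip(doc, doc[1:]):
            pair_counts[(a, b)] += 1
            left_counts[a] += 1
    return {pair: c / left_counts[pair[0]] for pair, c in pair_counts.items()}

def bigram_likelihood(doc, p, unseen=1e-6):
    """L(w1,...,wn) = p_(w1,w2) * p_(w2,w3) * ..., matching the Markov formula in the text."""
    result = 1.0
    for pair in zip(doc, doc[1:]):
        result *= p.get(pair, unseen)
    return result

# One model per category; the best category is the one with the biggest likelihood.
spam = train_bigram([["buy", "cheap", "pills"], ["buy", "cheap", "now"]])
ham = train_bigram([["meeting", "at", "noon"], ["lunch", "at", "noon"]])

msg = ["buy", "cheap", "pills"]
category = "spam" if bigram_likelihood(msg, spam) > bigram_likelihood(msg, ham) else "ham"
print(category)  # spam
```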
Here's a nice nontechnical explanation:
http://www.cs.bell-labs.com/cm/cs/pearls/sec153.html

> "Naive Bayes" instead of "Bayes" so that's fine) Bayesian Model gives
> pretty accurate results - the confidence is evident from the fact
> that the community has largely accepted Bayesian Style Spam Filters,
> which are over 30 by today, and designing and building a Markovian
> Style Spam filter would look a little odd and out of the way unless
> it has substantial improvements, of which there is little scope
> since Bayesian Filters report accuracy greater than 97%

There are some filters out there which do nth-order Markov sequences.
I've written one called dbacl, but it does more than you want, because
the estimates use maximum entropy techniques. See the sourceforge
homepage for it if you're interested. It also includes a postscript
paper describing all the mathematical details, but don't read it
unless you like formulas. For other filters, look for ones which use
"bigram" in their description.

Because all such filters have many points of discrepancy (ie different
tokenizers, different smoothing parameters, slightly different
models), it is going to be difficult to pinpoint exactly which aspect
is responsible for performance, or lack thereof. For example, with
dbacl in bigram mode, the headers are parsed as bigrams, and that's
not ideal. I'll need to find the time to improve this (instead of
posting novels on mailing lists).

> If the current model in CRM114 is not Bayesian, then how can one
> compare the performance of the Bayesian model and the current model
> (whatever it is) using crm114?

Compile an old version of crm114, compile a new version. Run
prediction tests on the same set of emails, using each version.
Compare the number of errors, and if there is a big difference, then
it's conclusive. If it's a small difference, it might mean nothing.

> Of course the training database has to be new and one has to do TOE
> in order to achieve the accuracy mentioned.
Personally, I don't like TOE because it's not independent of the
stream ordering. But people must use what works for them.

> Shalendra Chhabra,
> Graduate Student in Computer Science,
> University of California, Riverside
> Riverside, CA, USA

Oh, I didn't realize you're a computer scientist. In that case, you
shouldn't be afraid of formulas. Here's a couple of papers I haven't
gotten around to reading yet. The titles look interesting, and they
probably have lots of references.

http://www.cs.tcd.ie/Cormac.OBrien/papers.html

-- 
Laird Breyer.
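[Editorial sketch: the comparison procedure Laird describes above (run
two classifier versions on the same labeled emails and count errors)
looks like this. The stand-in classifiers and toy corpus are
hypothetical, not anything from crm114.]

```python
def error_count(classify, labeled_messages):
    """Count how often a classifier disagrees with the known labels."""
    return sum(1 for msg, truth in labeled_messages if classify(msg) != truth)

# Hypothetical stand-in classifiers: one flags "cheap", the other flags "buy".
old_version = lambda msg: "spam" if "cheap" in msg else "ham"
new_version = lambda msg: "spam" if "buy" in msg else "ham"

# Toy labeled corpus: (set of words in the message, true category).
corpus = [
    ({"buy", "cheap", "pills"}, "spam"),
    ({"meeting", "at", "noon"}, "ham"),
    ({"buy", "lunch"}, "ham"),
]

print(error_count(old_version, corpus))  # 0
print(error_count(new_version, corpus))  # 1 ("buy lunch" is misflagged)
```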
From: Dan Parsons <dan@st...>  2004-03-18 01:26:41

I'm getting a number of erroneous spams or non-spams that CRM114
reports as having already been learned correctly, despite the fact
that it classified them incorrectly. A number of my friends are having
these issues, too. Is there some address I should send examples of
these emails to, so the developers can see why CRM114 doesn't grok
them properly? Any other way I can help make CRM114 better by not
doing this? :)

Dan Parsons
http://androidslibrary.com/