RE: [Classifier4j-devel] Bayesian Classification


None of this may be new information, but maybe it could be useful.

Bayes applied to multiple "buckets"
http://sourceforge.net/docman/display_doc.php?docid=13648&group_id=63137

Symptom Model
http://sourceforge.net/docman/display_doc.php?docid=16368&group_id=63137

These are from the Technical section in the POPFile Docs. at:
http://sourceforge.net/docman/?group_id=63137

Matt Collier
RemoteIT
mco...@my...
877-4-NEW-LAN

-----Original Message-----
From: Nick Lothian <nl...@es...>
To: "'cla...@li...'" <classifier4j-
de...@li...>
Date: Thu, 13 Nov 2003 14:34:15 +1030
Subject: RE: [Classifier4j-devel] Bayesian Classification

> > 
> > BayesianClassifier classifier = new BayesianClassifier(wds);
> > double probability = classifier.classify("category", "text to be classified");
> > 
> 
> Yep, looks good.
> 
> > This is functioning fine but most of my probabilities are 
> > either 0.01 or 0.99.
> > 
> 
> Yes, that is pretty much the way it works. See
> <http://sourceforge.net/docman/display_doc.php?docid=13648&group_id=63137>
> for why this is.
> 
> POPFile uses logarithms to get around this (which is actually quite a good
> idea). Classifier4J uses cut-offs to avoid underflow and overflow.
> 
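To make the logarithm point concrete, here is a minimal sketch in Java. It is not taken from POPFile or Classifier4J; the class name, method names, and the 0.01/0.99 clamp bounds are assumptions chosen for illustration. It compares the direct product form of the naive Bayes combination with the same calculation done as a sum of log-odds.

```java
final class LogSpaceBayes {
    // Hypothetical cut-offs; they keep log(0) out of the log-odds sum.
    static final double MIN_P = 0.01;
    static final double MAX_P = 0.99;

    /** Direct form: prod(p_i) / (prod(p_i) + prod(1 - p_i)). */
    static double combineDirect(double[] wordProbs) {
        double match = 1.0;
        double nonMatch = 1.0;
        for (double p : wordProbs) {
            match *= p;
            nonMatch *= (1.0 - p);
        }
        // With enough words both products underflow to 0.0 and this is 0/0.
        return match / (match + nonMatch);
    }

    /** Same quantity computed as a sum of per-word log-odds. */
    static double combineLog(double[] wordProbs) {
        double logOdds = 0.0;
        for (double p : wordProbs) {
            double clamped = Math.min(MAX_P, Math.max(MIN_P, p));
            logOdds += Math.log(clamped) - Math.log(1.0 - clamped);
        }
        // Logistic function turns the summed log-odds back into a probability.
        return 1.0 / (1.0 + Math.exp(-logOdds));
    }

    public static void main(String[] args) {
        double[] few = {0.8, 0.8, 0.8};
        System.out.println(combineDirect(few));
        System.out.println(combineLog(few));

        double[] many = new double[2000];
        java.util.Arrays.fill(many, 0.4);
        System.out.println(combineDirect(many));
        System.out.println(combineLog(many));
    }
}
```

Three words at 0.8 each already combine to 64/65, roughly 0.985, which is why the results cluster near 0.01 and 0.99. With 2000 words at 0.4 the direct form ends up dividing zero by zero, while the log-odds form still returns a usable value.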
> > I saw somewhere in the source that the algorithm is choosing X most 
> > significant words.  Is there an easy way for me to determine 
> > what these words 
> > are on a category by category basis?
> > 
> 
> No, you can't do this on a category by category basis. I haven't found it
> makes a big difference anyway, so I'd test changing this setting on a single
> category before you spend a lot of time changing this.
> 
> > I think the fact that I still have html in my training data 
> > is causing me 
> > difficulty.  Does this sound right?
> > 
> 
> It depends on the task. In your typical Spam classifier HTML is an important
> indicator. I use Classifier4J for classifying RSS feeds and I don't strip
> HTML (That's not to say it wouldn't work better if I did - I just haven't
> tried it).
> 
> In a lot of cases the HTML supplies a surprising amount of useful data which
> Classifier4J can use.
> 
> 
> > As for the issues with the Bayesian tokenizer:
> > ---- correspondence between Pete and Nick on 2003-08-09
> > Pete> Look into the current Tokenizer - For example, "1.4" 
> > currently gets 
> > split into "1" and "4". Shouldn't it just be "1.4"? Also 
> > "peter's" is split 
> > into "peter" and "s". Shouldn't this be "peter's"? It's 
> > probably worth coming
> > up with a set of test cases.
> >  
> > Nick> Yes, that needs fixing. Also, I'm not sure about how to 
> > deal with URLs: 
> > at the moment http://www.google.com/something gets split up, 
> > but I think it 
> > probably shouldn't (?)
> > ----
> > 
> > Is this still outstanding?
> > 
> 
> Yes, these points are outstanding.
> 
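As a starting point for the test cases Pete suggested, here is a rough regex-based sketch in Java. It is not the Classifier4J tokenizer; the class name and the pattern are assumptions. It keeps decimal numbers, simple possessives, and http/https URLs together as single tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SketchTokenizer {
    // Alternatives are tried in order: URL, dotted number, then plain word
    // with an optional apostrophe suffix (e.g. "peter's").
    private static final Pattern TOKEN = Pattern.compile(
            "https?://\\S+"
            + "|\\d+(?:\\.\\d+)+"
            + "|[A-Za-z0-9]+(?:'[A-Za-z]+)?");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize(
                "Peter's JDK 1.4 notes: see http://www.google.com/something now"));
    }
}
```

One known weakness of this pattern: a URL followed directly by punctuation (a trailing full stop, say) keeps that punctuation, because the URL alternative runs to the next whitespace. Whether URLs should stay whole at all is still the open question Nick raised.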
> > How sophisticated is this Bayesian classifier when compared with 
> > POPFile or 
> > SpamAssassin?  There is some interesting reading about the 
> > POPFile engine at :
> > http://sourceforge.net/docman/?group_id=63137
> > 
> 
> SpamAssassin isn't a Bayesian filter - it runs rules on mail headers and
> filters mail like that (AFAIK?)
> 
> Classifier4J is an (almost) pure, naive Bayesian classifier (POPFile is also
> "naive" in the technical meaning of the word - it treats each word as being
> independent). It implements Bayes' theorem with very little variation (you
> can specify to only use the X most significant words, and it uses cut-offs
> to avoid arithmetic underflow).
> 
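For readers wondering what "the X most significant words" could look like in practice, here is a small Java sketch. It assumes that "significant" means farthest from the neutral probability 0.5, which is a common convention but not something confirmed from the Classifier4J source; the class and method names are made up for the example.

```java
import java.util.Arrays;
import java.util.Comparator;

final class SignificantWords {
    /** Keep the x probabilities farthest from 0.5 (assumed meaning of "significant"). */
    static double[] mostSignificant(double[] wordProbs, int x) {
        return Arrays.stream(wordProbs)
                .boxed()
                .sorted(Comparator.comparingDouble((Double p) -> -Math.abs(p - 0.5)))
                .limit(x)
                .mapToDouble(Double::doubleValue)
                .toArray();
    }

    /** Naive Bayes combination: prod(p) / (prod(p) + prod(1 - p)). */
    static double combine(double[] wordProbs) {
        double match = 1.0;
        double nonMatch = 1.0;
        for (double p : wordProbs) {
            match *= p;
            nonMatch *= (1.0 - p);
        }
        return match / (match + nonMatch);
    }

    public static void main(String[] args) {
        double[] probs = {0.5, 0.99, 0.45, 0.02, 0.6};
        double[] top = mostSignificant(probs, 2);
        System.out.println(Arrays.toString(top));
        System.out.println(combine(top));
    }
}
```

Words near 0.5 barely move the combined result, so dropping them mostly saves work and keeps the product shorter, which would fit Nick's remark that changing the setting does not make a big difference.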
> I'm more interested in investigating a Vector-Space classifier than
> investing a huge amount of time modifying the Bayesian algorithm, especially
> since the modifications most people do are aimed at detecting Spam, which
> isn't my core goal with C4J. OTOH, if someone can suggest a change that
> improves performance or gives some other tangible gain then I'm interested.
> 
> 
> > POPFile has been designed to classify emails into "buckets" 
> > or categories.  
> > Evidently, there are some mathematical shortcuts if you're 
> > trying to classify 
> > a message against several different categories.
> > 
> 
> Can you point me at them? I didn't see them in that doco in the POPFile
> project.
> 
> > One critical point made in POPFile is that words NOT in a 
> > document may be as 
> > important as the words that ARE in a document.  Does C4J 
> > take this into 
> > account?
> > 
> 
> I'm not quite sure what you (or they?) mean here. Can you point me at what
> they say?
> 
> Classifier4J does (kind of) take words that are not in a document into
> account. If a particular word (say "Java") isn't in a document then the
> document won't get the score-boost of having that word in there.
> 
> Nick
> 
> 
> _______________________________________________
> Classifier4j-devel mailing list
> Cla...@li...
> https://lists.sourceforge.net/lists/listinfo/classifier4j-devel