RE: [Classifier4j-devel] Bayesian Classification
From: Matt C. <MCo...@my...> - 2003-11-13 04:30:35
None of this may be new information, but maybe it could be useful.

Bayes applied to multiple "buckets"
http://sourceforge.net/docman/display_doc.php?docid=13648&group_id=63137

Symptom Model
http://sourceforge.net/docman/display_doc.php?docid=16368&group_id=63137

These are from the Technical section in the POPFile Docs. at:
http://sourceforge.net/docman/?group_id=63137

Matt Collier
RemoteIT
mco...@my...
877-4-NEW-LAN

-----Original Message-----
From: Nick Lothian <nl...@es...>
To: "'cla...@li...'" <classifier4j-de...@li...>
Date: Thu, 13 Nov 2003 14:34:15 +1030
Subject: RE: [Classifier4j-devel] Bayesian Classification

> > BayesianClassifier classifier = new BayesianClassifier(wds);
> > double probability = classifier.classify("category", "text to be classified");
>
> Yep, looks good.
>
> > This is functioning fine but most of my probabilities are
> > either 0.01 or 0.99.
>
> Yes, that is pretty much the way it works. See
> <http://sourceforge.net/docman/display_doc.php?docid=13648&group_id=63137>
> for why this is.
>
> POPFile uses logarithms to get around this (which is actually quite a good
> idea). Classifier4J uses cut-offs to avoid underflow and overflow.
>
> > I saw somewhere in the source that the algorithm is choosing X most
> > significant words. Is there an easy way for me to determine what these
> > words are on a category by category basis?
>
> No, you can't do this on a category by category basis. I haven't found it
> makes a big difference anyway, so I'd test changing this setting on a single
> category before you spend a lot of time changing this.
>
> > I think the fact that I still have html in my training data is causing
> > me difficulty. Does this sound right?
>
> It depends on the task. In your typical Spam classifier HTML is an important
> indicator. I use Classifier4J for classifying RSS feeds and I don't strip
> HTML (That's not to say it wouldn't work better if I did - I just haven't
> tried it).
>
> In a lot of cases the HTML supplies a surprising amount of useful data which
> Classifier4J can use.
>
> > As for the issues with the Bayesian tokenizer:
> > ---- correspondence between Pete and Nick on 2003-08-09
> > Pete> Look into the current Tokenizer - For example, "1.4" currently gets
> > split into "1" and "4". Shouldn't it just be "1.4"? Also "peter's" is
> > split into "peter" and "s". Shouldn't this be "peter's"? It's probably
> > worth coming up with a set of test cases.
> >
> > Nick> Yes, that needs fixing. Also, I'm not sure about how to deal with
> > URLs: at the moment http://www.google.com/something gets split up, but I
> > think it probably shouldn't (?)
> > ----
> >
> > Is this still outstanding?
>
> Yes, these points are outstanding.
>
> > How sophisticated is this Bayesian classifier when compared with POPFile
> > or SpamAssassin? There is some interesting reading about the POPFile
> > engine at:
> > http://sourceforge.net/docman/?group_id=63137
>
> SpamAssassin isn't a Bayesian filter - it runs rules on mail headers and
> filters mail like that (AFAIK?)
>
> Classifier4J is an (almost) pure, naive Bayesian classifier (POPFile is also
> "naive" in the technical meaning of the word - it treats each word as being
> independent). It implements Bayes' theorem with very little variation (you
> can specify to only use the X most significant words, and it uses cut-offs
> to avoid arithmetic underflow).
>
> I'm more interested in investigating a Vector-Space classifier than
> investing a huge amount of time modifying the Bayesian algorithm, especially
> since the modifications most people do are aimed at detecting Spam, which
> isn't my core goal with C4J. OTOH, if someone can suggest a change that
> improves performance or gives some other tangible gain then I'm interested.
>
> > POPFile has been designed to classify emails into "buckets" or categories.
> > Evidently, there are some mathematical shortcuts if you're trying to
> > classify a message against several different categories.
>
> Can you point me at them? I didn't see them in that doco in the POPFile
> project.
>
> > One critical point made in POPFile is that words NOT in a document may be
> > as important as the words that ARE in a document. Does C4J take this into
> > account?
>
> I'm not quite sure what you (or they?) mean here. Can you point me at what
> they say?
>
> Classifier4J does (kind of) take words that are not in a document into
> account. If a particular word (say "Java") isn't in a document then the
> document won't get the score-boost of having that word in there.
>
> Nick
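
The contrast between logarithms and cut-offs discussed in the thread can be made concrete with a minimal sketch. It assumes the common combining rule prod(p) / (prod(p) + prod(1 - p)) over per-word probabilities, and cut-offs of 0.01 and 0.99 inferred from the values reported above. The class and method names are hypothetical; this is not taken from the Classifier4J or POPFile source.

```java
/** Sketch: combining per-word probabilities directly vs. via summed log-odds. */
final class BayesCombineSketch {

    // Assumed cut-offs, inferred from the 0.01 / 0.99 results reported in the thread.
    static final double LOWER = 0.01;
    static final double UPPER = 0.99;

    /** Clamp a probability into [LOWER, UPPER]. NaN passes through unchanged. */
    static double clamp(double p) {
        return Math.max(LOWER, Math.min(UPPER, p));
    }

    /** Direct form: prod(p) / (prod(p) + prod(1 - p)), with cut-offs applied. */
    static double combineDirect(double[] wordProbs) {
        double yes = 1.0;
        double no = 1.0;
        for (double p : wordProbs) {
            double c = clamp(p);
            yes *= c;        // may underflow to 0.0 for long inputs
            no *= 1.0 - c;   // may underflow to 0.0 for long inputs
        }
        return clamp(yes / (yes + no));
    }

    /** Same quantity computed from a sum of log-odds; no final cut-off applied. */
    static double combineLog(double[] wordProbs) {
        double logOdds = 0.0;
        for (double p : wordProbs) {
            double c = clamp(p); // keeps log() away from 0 and 1
            logOdds += Math.log(c) - Math.log(1.0 - c);
        }
        // yes / (yes + no) == 1 / (1 + exp(-(log yes - log no)))
        return 1.0 / (1.0 + Math.exp(-logOdds));
    }
}
```

With three words at 0.9 each, combineDirect already saturates at the 0.99 cut-off, which is consistent with results clustering at the extremes. With, for example, 200 words at 0.01 and 200 at 0.99, both direct products underflow to zero and the ratio becomes NaN, while the log-odds sum stays finite and combineLog returns a value close to 0.5.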
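
The outstanding tokenizer points ("1.4", "peter's", and URLs being split) could be approached with a single regular expression, sketched below. This is only an illustration under stated assumptions: the class name, the pattern, and the decision to treat anything starting with http:// or https:// as one token are all hypothetical, and none of it is the Classifier4J tokenizer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch of a tokenizer that keeps URLs, decimals and apostrophes intact. */
final class TokenizerSketch {

    // Alternatives are tried left to right, so a URL wins over a plain word.
    private static final Pattern TOKEN = Pattern.compile(
            "https?://[^\\s<>\"]+"                 // URL kept as one token
          + "|\\d+(?:\\.\\d+)+"                    // decimals or versions such as 1.4
          + "|[\\p{L}\\p{N}]+(?:'[\\p{L}]+)?");    // words, optionally with 's

    /** Returns the tokens of the text in order of appearance. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }
}
```

The three cases from the thread make natural first test cases, as Pete suggested. One known limitation of this pattern is that punctuation directly following a URL (a trailing comma or full stop) is kept as part of the URL token.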