[Classifier4j-devel] Bayesian Classification
From: Matt C. <MCo...@my...> - 2003-11-13 03:13:04
First, to make sure I'm implementing this properly, here's what I'm doing:

    BayesianClassifier classifier = new BayesianClassifier(wds);
    double probability = classifier.classify("category", "text to be classified");

This is functioning fine, but most of my probabilities are either 0.01 or 0.99. I saw somewhere in the source that the algorithm chooses the X most significant words. Is there an easy way for me to determine what these words are on a category-by-category basis? I think the fact that I still have HTML in my training data is causing me difficulty. Does this sound right?

As for the issues with the Bayesian tokenizer:

---- correspondence between Pete and Nick on 2003-08-09
Pete> Look into the current Tokenizer - For example, "1.4" currently gets split into "1" and "4". Shouldn't it just be "1.4"? Also "peter's" is split into "peter" and "s". Shouldn't this be "peter's"? It's probably worth coming up with a set of test cases.
Nick> Yes, that needs fixing. Also, I'm not sure about how to deal with URLs: at the moment http://www.google.com/something gets split up, but I think it probably shouldn't (?)
----

Is this still outstanding?

How sophisticated is this Bayesian classifier compared with POPFile or SpamAssassin? There is some interesting reading about the POPFile engine at: http://sourceforge.net/docman/?group_id=63137

POPFile has been designed to classify emails into "buckets" or categories. Evidently, there are some mathematical shortcuts if you're trying to classify a message against several different categories. One critical point made in POPFile is that words NOT in a document may be as important as the words that ARE in a document. Does C4J take this into account?

Matt Collier
RemoteIT
mco...@my...
877-4-NEW-LAN
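
A possible starting point for the tokenizer test cases Pete suggests, offered only as a sketch: the class name TokenizerSketch and the regular expression below are assumptions made for illustration and are not part of Classifier4J. It uses only java.util.regex and keeps "1.4", "peter's", and whole URLs as single tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a Classifier4J class.
class TokenizerSketch {

    // Alternatives are tried left to right at each position:
    //   1. an http/https URL, taken up to the next whitespace
    //   2. a number with optional dotted parts, e.g. "1.4"
    //   3. a word with optional internal apostrophes, e.g. "peter's"
    private static final Pattern TOKEN = Pattern.compile(
        "https?://\\S+"
        + "|\\d+(?:\\.\\d+)*"
        + "|[A-Za-z]+(?:'[A-Za-z]+)*");

    // Returns the tokens of the input in order; anything that matches
    // none of the alternatives (spaces, punctuation) is skipped.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<String>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }
}
```

Calling TokenizerSketch.tokenize("version 1.4 of peter's code") should give the tokens version, 1.4, of, peter's, code. One known limitation of this sketch: a URL followed directly by sentence punctuation keeps that punctuation, since the URL alternative runs to the next whitespace.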