[Classifier4j-devel] Bayesian Classification

First, to make sure I'm implementing this properly, here's what I'm doing:

BayesianClassifier classifier = new BayesianClassifier(wds);
double probability = classifier.classify("category", "text to be classified");

This works, but most of my probabilities come out as either 0.01 or 0.99.
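
My guess at why the values pile up at the extremes: if the per-word
probabilities are combined Graham-style and the result is clamped, a handful
of moderately strong words is enough to hit the bound. Here is a minimal
sketch of that idea. It is not taken from the Classifier4J source; the class
name, the method names, and the 0.01/0.99 bounds are my assumptions.

```java
// Sketch only: Graham-style combination of per-word probabilities.
// The bounds below are assumed values, chosen to match the results I see.
class CombineSketch {
    static final double LOWER = 0.01;
    static final double UPPER = 0.99;

    // Keep a probability inside [LOWER, UPPER].
    static double clamp(double p) {
        return Math.max(LOWER, Math.min(UPPER, p));
    }

    // combined = prod(p_i) / (prod(p_i) + prod(1 - p_i)), then clamped.
    static double combine(double[] wordProbabilities) {
        double match = 1.0;
        double nonMatch = 1.0;
        for (double p : wordProbabilities) {
            double c = clamp(p);
            match *= c;
            nonMatch *= (1.0 - c);
        }
        return clamp(match / (match + nonMatch));
    }

    public static void main(String[] args) {
        // Five words at 0.8 each already push the result to the upper bound.
        System.out.println(combine(new double[] {0.8, 0.8, 0.8, 0.8, 0.8}));
        // Evidence that cancels out stays near the middle.
        System.out.println(combine(new double[] {0.6, 0.4}));
    }
}
```

If something like this is what happens internally, then 0.01 and 0.99 are
simply the floor and ceiling rather than meaningful estimates.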

I saw somewhere in the source that the algorithm chooses the X most
significant words.  Is there an easy way for me to determine what these words
are on a category-by-category basis?
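
To be concrete about what I mean by "most significant": I am assuming the
words are ranked by how far their probability is from the neutral 0.5. The
sketch below does that ranking over a plain map of word to probability. The
map is a hypothetical stand-in for whatever the data source holds for one
category, and none of these names come from Classifier4J.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Sketch only: rank words by distance of their probability from 0.5.
class SignificantWordsSketch {

    // Returns up to 'limit' words, most significant first.
    static List<String> mostSignificant(Map<String, Double> wordProbabilities, int limit) {
        List<Map.Entry<String, Double>> entries =
                new ArrayList<>(wordProbabilities.entrySet());
        // Negate the distance so that the largest distance sorts first.
        entries.sort(Comparator.comparingDouble(
                (Map.Entry<String, Double> e) -> -Math.abs(e.getValue() - 0.5)));

        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(limit, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }
}
```

Something like mostSignificant(probabilitiesForCategory, 15) per category is
the kind of output I would like to inspect.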

I think the HTML that is still in my training data is causing me
difficulty.  Does this sound right?
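
If it is, I could strip the markup before training and classifying. A rough
sketch of what I have in mind follows. It is regex based, so it is only an
approximation (it does not decode entities such as &nbsp;, for example), and
the class and method names are my own.

```java
// Sketch only: crude regex-based HTML removal, not a real HTML parser.
class HtmlStripSketch {

    static String stripTags(String html) {
        // Drop script and style elements together with their contents.
        String noScripts = html.replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ");
        // Replace every remaining tag with a space so words do not run together.
        String noTags = noScripts.replaceAll("(?s)<[^>]*>", " ");
        // Collapse runs of whitespace left behind by the removed tags.
        return noTags.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<p>Hello <b>world</b></p>"));
    }
}
```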

As for the issues with the Bayesian tokenizer:
---- correspondence between Pete and Nick on 2003-08-09
Pete> Look into the current Tokenizer - For example, "1.4" currently gets 
split into "1" and "4". Shouldn't it just be "1.4"? Also "peter's" is split 
into "peter" and "s". Shouldn't this be "peter's"? It's probably worth coming
up with a set of test cases.

Nick> Yes, that needs fixing. Also, I'm not sure about how to deal with URLs: 
at the moment http://www.google.com/something gets split up, but I think it 
probably shouldn't (?)
----

Is this still outstanding?
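
In case it helps as a starting point for test cases, here is a small sketch
of a tokenizer that keeps the three examples above intact. It is not the
current Classifier4J tokenizer; the pattern and the names are assumptions of
mine, and it has known gaps (a URL followed directly by a full stop keeps the
full stop).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: regex tokenizer that keeps "1.4", "peter's" and URLs whole.
class TokenizerSketch {

    // Alternatives are tried left to right at each position.
    private static final Pattern TOKEN = Pattern.compile(
            "https?://\\S+"            // URLs, up to the next whitespace
            + "|\\d+(?:\\.\\d+)+"      // dotted numbers such as 1.4
            + "|\\w+(?:'\\w+)*");      // words, including peter's

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize(
                "Java 1.4 and peter's page at http://www.google.com/something now"));
    }
}
```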

How sophisticated is this Bayesian classifier when compared with POPFile or
SpamAssassin?  There is some interesting reading about the POPFile engine at:
http://sourceforge.net/docman/?group_id=63137

POPFile has been designed to classify emails into "buckets" or categories.  
Evidently, there are some mathematical shortcuts if you're trying to classify 
a message against several different categories.

One critical point made in the POPFile documentation is that words NOT in a
document may be as important as the words that ARE in a document.  Does
Classifier4J take this into account?
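
To illustrate the idea, one textbook way of letting absent words count is a
multivariate Bernoulli model, where every vocabulary word contributes either
log(p) if present or log(1 - p) if absent. I do not know whether this is what
POPFile actually does; the sketch below, including all names, is only my
assumption of how it could look. It expects every probability to be strictly
between 0 and 1.

```java
import java.util.Map;
import java.util.Set;

// Sketch only: Bernoulli-style score where missing words also carry evidence.
class AbsentWordsSketch {

    // presenceProb maps each vocabulary word to P(word appears | category).
    static double logLikelihood(Map<String, Double> presenceProb, Set<String> docWords) {
        double sum = 0.0;
        for (Map.Entry<String, Double> e : presenceProb.entrySet()) {
            double p = e.getValue();
            // Present words add log(p); absent words add log(1 - p).
            sum += docWords.contains(e.getKey()) ? Math.log(p) : Math.log(1.0 - p);
        }
        return sum;
    }
}
```

With this scoring, a document that lacks a word which nearly always appears
in some category is penalised for that category, even if none of the words
it does contain are informative.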

Matt Collier
RemoteIT
mco...@my...
877-4-NEW-LAN