[Classifier4j-devel] Bayesian Classification
From: Matt C. <MCo...@my...> - 2003-11-13 03:13:04
First, to make sure I'm implementing this properly, here's what I'm doing:
BayesianClassifier classifier = new BayesianClassifier(wds);
double probability = classifier.classify("category", "text to be classified");
This is functioning fine but most of my probabilities are either 0.01 or 0.99.
I saw somewhere in the source that the algorithm is choosing X most
significant words. Is there an easy way for me to determine what these words
are on a category by category basis?
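To make the question concrete, here is a minimal sketch of what I understand "X most significant words" to mean, assuming a Paul Graham-style scheme: rank words by how far their per-category probability sits from the neutral 0.5, then combine only the top X. This is not taken from the Classifier4J source; the class name, method names, and sample probabilities are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, NOT Classifier4J's actual implementation.
class SignificantWords {

    // Returns up to `limit` words whose probability lies farthest from the neutral 0.5.
    static List<Map.Entry<String, Double>> mostSignificant(Map<String, Double> wordProbs, int limit) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(wordProbs.entrySet());
        entries.sort(Comparator.comparingDouble(
                (Map.Entry<String, Double> e) -> Math.abs(e.getValue() - 0.5)).reversed());
        return new ArrayList<>(entries.subList(0, Math.min(limit, entries.size())));
    }

    // Graham-style combination (an assumption): prod(p) / (prod(p) + prod(1 - p)).
    static double combine(List<Map.Entry<String, Double>> words) {
        double match = 1.0;
        double nonMatch = 1.0;
        for (Map.Entry<String, Double> e : words) {
            match *= e.getValue();
            nonMatch *= 1.0 - e.getValue();
        }
        return match / (match + nonMatch);
    }

    public static void main(String[] args) {
        // Made-up per-word probabilities for a single category.
        Map<String, Double> probs = new LinkedHashMap<>();
        probs.put("the", 0.50);
        probs.put("invoice", 0.99);
        probs.put("href", 0.90);
        probs.put("meeting", 0.20);

        List<Map.Entry<String, Double>> top = mostSignificant(probs, 2);
        for (Map.Entry<String, Double> e : top) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
        System.out.println("combined: " + combine(top));
    }
}
```

If the real algorithm works along these lines, a handful of strongly skewed words (HTML tag names such as "href" included) would be enough to push the combined score close to 0 or 1, which would match what I'm seeing.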
I think the fact that I still have html in my training data is causing me
difficulty. Does this sound right?
As for the issues with the Bayesian tokenizer:
---- correspondence between Pete and Nick on 2003-08-09
Pete> Look into the current Tokenizer - For example, "1.4" currently gets
split into "1" and "4". Shouldn't it just be "1.4"? Also "peter's" is split
into "peter" and "s". Shouldn't this be "peter's"? It's probably worth coming
up with a set of test cases.
Nick> Yes, that needs fixing. Also, I'm not sure about how to deal with URLs:
at the moment http://www.google.com/something gets split up, but I think it
probably shouldn't (?)
----
Is this still outstanding?
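As a starting point for the test cases Pete mentions, here is a rough regex-based sketch that keeps "1.4", "peter's", and URLs as single tokens. It is only an illustration, not the current Classifier4J Tokenizer; the class name, method name, and pattern are my own assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, NOT the existing Classifier4J Tokenizer.
class SketchTokenizer {

    // Alternatives are tried in order: URLs, then numbers (optionally dotted),
    // then words (optionally with internal apostrophes).
    private static final Pattern TOKEN = Pattern.compile(
            "https?://\\S+|\\d+(?:\\.\\d+)*|[A-Za-z]+(?:'[A-Za-z]+)*");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Note: a URL followed directly by punctuation would keep that punctuation.
        System.out.println(tokenize(
                "Java 1.4 is peter's choice, see http://www.google.com/something"));
    }
}
```

The three cases from the quoted thread then become simple checks: "1.4" stays "1.4", "peter's" stays "peter's", and the Google URL comes back as one token.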
How sophisticated is this Bayesian classifier when compared with POPFile or
SpamAssassin? There is some interesting reading about the POPFile engine at:
http://sourceforge.net/docman/?group_id=63137
POPFile has been designed to classify emails into "buckets" or categories.
Evidently, there are some mathematical shortcuts if you're trying to classify
a message against several different categories.
One critical point made in POPFile is that words NOT in a document may be as
important as the words that ARE in a document. Does c4J take this into
account?
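To show what I mean by absent words mattering, here is a small sketch using a Bernoulli-style naive Bayes score, where every vocabulary word contributes: log(p) if it is present, log(1 - p) if it is absent. I don't know whether POPFile or c4J actually does this; the model choice, names, and probabilities are assumptions for illustration only.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; not POPFile's or Classifier4J's actual algorithm.
class AbsentWordScore {

    // Log-likelihood of a document for one category under a Bernoulli model.
    // presenceProb maps each vocabulary word to P(word appears | category).
    static double logLikelihood(Map<String, Double> presenceProb, Set<String> docWords) {
        double sum = 0.0;
        for (Map.Entry<String, Double> e : presenceProb.entrySet()) {
            double p = e.getValue();
            // Absent words contribute too, via log(1 - p).
            sum += docWords.contains(e.getKey()) ? Math.log(p) : Math.log(1.0 - p);
        }
        return sum;
    }

    public static void main(String[] args) {
        // Made-up presence probabilities for two categories.
        Map<String, Double> work = new LinkedHashMap<>();
        work.put("meeting", 0.8);
        work.put("agenda", 0.6);

        Map<String, Double> personal = new LinkedHashMap<>();
        personal.put("meeting", 0.1);
        personal.put("agenda", 0.05);

        // A document that contains neither vocabulary word.
        Set<String> doc = new HashSet<>();
        doc.add("hello");

        System.out.println("work:     " + logLikelihood(work, doc));
        System.out.println("personal: " + logLikelihood(personal, doc));
    }
}
```

Here the document contains neither "meeting" nor "agenda", yet it scores higher for "personal" simply because those words are usually present in "work" messages.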
Matt Collier
RemoteIT
mco...@my...
877-4-NEW-LAN