RE: [Classifier4j-devel] Bayesian Classification
From: Nick L. <nl...@es...> - 2003-11-13 04:05:49
> BayesianClassifier classifier = new BayesianClassifier(wds);
> double probability = classifier.classify("category", "text to be classified");

Yep, looks good.

> This is functioning fine but most of my probabilities are either 0.01
> or 0.99.

Yes, that is pretty much the way it works. See
<http://sourceforge.net/docman/display_doc.php?docid=13648&group_id=63137>
for why this is. POPFile uses logarithms to get around this (which is
actually quite a good idea). Classifier4J uses cut-offs to avoid underflow
and overflow.

> I saw somewhere in the source that the algorithm is choosing X most
> significant words. Is there an easy way for me to determine what these
> words are on a category by category basis?

No, you can't do this on a category by category basis. I haven't found it
makes a big difference anyway, so I'd test changing this setting on a
single category before you spend a lot of time changing this.

> I think the fact that I still have HTML in my training data is causing
> me difficulty. Does this sound right?

It depends on the task. In your typical spam classifier, HTML is an
important indicator. I use Classifier4J for classifying RSS feeds and I
don't strip HTML (that's not to say it wouldn't work better if I did - I
just haven't tried it). In a lot of cases the HTML supplies a surprising
amount of useful data which Classifier4J can use.

> As for the issues with the Bayesian tokenizer:
> ---- correspondence between Pete and Nick on 2003-08-09
> Pete> Look into the current Tokenizer - for example, "1.4" currently
> gets split into "1" and "4". Shouldn't it just be "1.4"? Also "peter's"
> is split into "peter" and "s". Shouldn't this be "peter's"? It's
> probably worth coming up with a set of test cases.
>
> Nick> Yes, that needs fixing. Also, I'm not sure about how to deal with
> URLs: at the moment http://www.google.com/something gets split up, but
> I think it probably shouldn't (?)
> ----
>
> Is this still outstanding?
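[Editor's note: the tokenizer fix discussed above could be sketched roughly as follows. This is an illustrative sketch only, not Classifier4J's actual tokenizer; the class name `SketchTokenizer` and the exact regular expression are assumptions. The idea is to match URLs, decimal numbers, and apostrophe-words as whole tokens before falling back to plain words.]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only - not Classifier4J's DefaultTokenizer.
public class SketchTokenizer {
    // Alternatives are tried in order, so the longer shapes win
    // over the plain-word fallback.
    private static final Pattern TOKEN = Pattern.compile(
        "https?://\\S+"        // whole URLs, not split on punctuation
        + "|\\d+\\.\\d+"       // decimal numbers such as "1.4"
        + "|\\w+(?:'\\w+)?"    // words, keeping "peter's" together
    );

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }
}
```

A set of test cases along the lines Pete suggested ("1.4", "peter's", a URL) would pin this behaviour down before changing the real tokenizer.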
Yes, these points are outstanding.

> How sophisticated is this Bayesian classifier when compared with
> POPFile or SpamAssassin? There is some interesting reading about the
> POPFile engine at: http://sourceforge.net/docman/?group_id=63137

SpamAssassin isn't a Bayesian filter - it runs rules on mail headers and
filters mail like that (AFAIK?). Classifier4J is an (almost) pure, naive
Bayesian classifier (POPFile is also "naive" in the technical meaning of
the word - it treats each word as being independent). It implements
Bayes' theorem with very little variation (you can specify to only use
the X most significant words, and it uses cut-offs to avoid arithmetic
underflow).

I'm more interested in investigating a vector-space classifier than
investing a huge amount of time modifying the Bayesian algorithm,
especially since the modifications most people make are aimed at
detecting spam, which isn't my core goal with C4J. OTOH, if someone can
suggest a change that improves performance or gives some other tangible
gain then I'm interested.

> POPFile has been designed to classify emails into "buckets" or
> categories. Evidently, there are some mathematical shortcuts if you're
> trying to classify a message against several different categories.

Can you point me at them? I didn't see them in that doco in the POPFile
project.

> One critical point made in POPFile is that words NOT in a document may
> be as important as the words that ARE in a document. Does C4J take
> this into account?

I'm not quite sure what you (or they?) mean here. Can you point me at
what they say? Classifier4J does (kind of) take words that are not in a
document into account. If a particular word (say "Java") isn't in a
document then the document won't get the score boost of having that word
in there.

Nick
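[Editor's note: the two points above - POPFile's logarithm trick for avoiding the 0.01/0.99 cut-offs, and absent words simply not boosting the score - can be shown together in a small sketch. This is not Classifier4J's actual implementation; the class name, the smoothing value 0.01, and the `wordProbs` map are illustrative assumptions.]

```java
import java.util.Map;

// Illustrative sketch of naive Bayes scoring in log space.
// Multiplying many small P(word|category) values underflows a double;
// summing their logarithms does not, which is the POPFile approach
// mentioned above. A word absent from the document contributes no
// term at all - it just fails to boost the score.
public class LogScoreSketch {
    // wordProbs maps word -> assumed P(word|category) from training data
    public static double logScore(String[] docWords,
                                  Map<String, Double> wordProbs) {
        double score = 0.0;
        for (String w : docWords) {
            // words unseen in training get a small smoothing
            // probability (an assumption, not a C4J default)
            double p = wordProbs.getOrDefault(w, 0.01);
            score += Math.log(p);
        }
        return score; // higher (less negative) means a better match
    }
}
```

Comparing the log-scores across categories then picks the winner without ever forming the raw product, so no cut-offs are needed to stop underflow.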