RE: [Classifier4j-devel] Bayesian Classification
From: Nick L. <nl...@es...> - 2003-11-13 04:05:49
> BayesianClassifier classifier = new BayesianClassifier(wds);
> double probability = classifier.classify("category", "text to be classified");

Yep, looks good.

> This is functioning fine but most of my probabilities are either 0.01
> or 0.99.

Yes, that is pretty much the way it works. See
<http://sourceforge.net/docman/display_doc.php?docid=13648&group_id=63137>
for why this is. POPFile uses logarithms to get around this (which is
actually quite a good idea). Classifier4J uses cut-offs to avoid underflow
and overflow.

> I saw somewhere in the source that the algorithm is choosing X most
> significant words. Is there an easy way for me to determine what these
> words are on a category by category basis?

No, you can't do this on a category by category basis. I haven't found it
makes a big difference anyway, so I'd test changing this setting on a
single category before you spend a lot of time changing this.

> I think the fact that I still have HTML in my training data is causing
> me difficulty. Does this sound right?

It depends on the task. In your typical spam classifier, HTML is an
important indicator. I use Classifier4J for classifying RSS feeds and I
don't strip HTML (that's not to say it wouldn't work better if I did - I
just haven't tried it). In a lot of cases the HTML supplies a surprising
amount of useful data which Classifier4J can use.

> As for the issues with the Bayesian tokenizer:
> ---- correspondence between Pete and Nick on 2003-08-09
> Pete> Look into the current Tokenizer - for example, "1.4" currently
> gets split into "1" and "4". Shouldn't it just be "1.4"? Also "peter's"
> is split into "peter" and "s". Shouldn't this be "peter's"? It's
> probably worth coming up with a set of test cases.
>
> Nick> Yes, that needs fixing. Also, I'm not sure about how to deal with
> URLs: at the moment http://www.google.com/something gets split up, but
> I think it probably shouldn't (?)
> ----
>
> Is this still outstanding?
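[Editor's note: the tokenizer fix discussed above could be sketched roughly as follows. This is an illustrative sketch only, not Classifier4J's actual tokenizer; the class name `SketchTokenizer` and the exact regular expression are assumptions. The idea is to match URLs, decimal numbers, and apostrophe-words as whole tokens before falling back to plain words.]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only - not Classifier4J's DefaultTokenizer.
public class SketchTokenizer {
    // Alternatives are tried in order, so the longer shapes win
    // over the plain-word fallback.
    private static final Pattern TOKEN = Pattern.compile(
        "https?://\\S+"        // whole URLs, not split on punctuation
        + "|\\d+\\.\\d+"       // decimal numbers such as "1.4"
        + "|\\w+(?:'\\w+)?"    // words, keeping "peter's" together
    );

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }
}
```

A set of test cases along the lines Pete suggested ("1.4", "peter's", a URL) would pin this behaviour down before changing the real tokenizer.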
Yes, these points are outstanding.

> How sophisticated is this Bayesian classifier when compared with
> POPFile or SpamAssassin? There is some interesting reading about the
> POPFile engine at: http://sourceforge.net/docman/?group_id=63137

SpamAssassin isn't a Bayesian filter - it runs rules on mail headers and
filters mail like that (AFAIK?). Classifier4J is an (almost) pure, naive
Bayesian classifier (POPFile is also "naive" in the technical meaning of
the word - it treats each word as being independent). It implements
Bayes' theorem with very little variation (you can specify to only use
the X most significant words, and it uses cut-offs to avoid arithmetic
underflow).

I'm more interested in investigating a vector-space classifier than
investing a huge amount of time modifying the Bayesian algorithm,
especially since the modifications most people make are aimed at
detecting spam, which isn't my core goal with C4J. OTOH, if someone can
suggest a change that improves performance or gives some other tangible
gain then I'm interested.

> POPFile has been designed to classify emails into "buckets" or
> categories. Evidently, there are some mathematical shortcuts if you're
> trying to classify a message against several different categories.

Can you point me at them? I didn't see them in that doco in the POPFile
project.

> One critical point made in POPFile is that words NOT in a document may
> be as important as the words that ARE in a document. Does C4J take
> this into account?

I'm not quite sure what you (or they?) mean here. Can you point me at
what they say? Classifier4J does (kind of) take words that are not in a
document into account. If a particular word (say "Java") isn't in a
document then the document won't get the score boost of having that word
in there.

Nick
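[Editor's note: the two points above - POPFile's logarithm trick for avoiding the 0.01/0.99 cut-offs, and absent words simply not boosting the score - can be shown together in a small sketch. This is not Classifier4J's actual implementation; the class name, the smoothing value 0.01, and the `wordProbs` map are illustrative assumptions.]

```java
import java.util.Map;

// Illustrative sketch of naive Bayes scoring in log space.
// Multiplying many small P(word|category) values underflows a double;
// summing their logarithms does not, which is the POPFile approach
// mentioned above. A word absent from the document contributes no
// term at all - it just fails to boost the score.
public class LogScoreSketch {
    // wordProbs maps word -> assumed P(word|category) from training data
    public static double logScore(String[] docWords,
                                  Map<String, Double> wordProbs) {
        double score = 0.0;
        for (String w : docWords) {
            // words unseen in training get a small smoothing
            // probability (an assumption, not a C4J default)
            double p = wordProbs.getOrDefault(w, 0.01);
            score += Math.log(p);
        }
        return score; // higher (less negative) means a better match
    }
}
```

Comparing the log-scores across categories then picks the winner without ever forming the raw product, so no cut-offs are needed to stop underflow.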