RE: [Classifier4j-devel] Bayesian Case Study
From: Nick L. <nl...@es...> - 2003-11-14 01:22:40
> 4) By the match_counts on these words, I can see that each occurrence of a
> word in a single document goes to the database. I don't see how this
> behavior is going to produce the desired result, at least in my case. I
> have run across several papers written about the effects of word frequency
> on text classification. Anybody have any experience in this area?

Are you saying that a document that contains the word "tax" twice adds it twice to the database? This is correct. Logically, a document that contains the same word multiple times is "more about" that word.

As a general point, I'm not sure you are really going to find Bayesian classification a great match for deciding what kind of document something is, simply because I don't think you can fairly compare the scores a document gets in various categories and say that a higher score in one category makes it a better match. For instance, if you have two categories (say Tax and Investments), then you can't say that the word "Tax" in a document means that it is not about "Investments".

However, most people use Bayesian classification for simple boolean Match/Not Match (e.g. Spam/Not Spam) matching. In that case there are certain words that you almost never want to see in matching records (e.g. that pill that starts with a V, which I won't name in order to avoid setting off everyone's spam filters).

Have you looked at Vector Space algorithms? See <http://www.mackmo.com/nick/blog/java/?permalink=LatentSemanticIndexing.txt> and <http://www.perl.com/lpt/a/2003/02/19/engine.html>. I'd love to have enough time to implement one of these properly....
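To make the word-frequency point above concrete, here is a minimal sketch of the two counting policies being discussed: adding every occurrence of a word to the match counts, versus counting each distinct word once per document. It is written in Python for brevity and is not Classifier4J's actual code; the names `tokenize`, `train`, and `once_per_document` are hypothetical, and a `Counter` stands in for the database table.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(counts, text, once_per_document=False):
    """Add one document's words to a running match-count table.

    counts: Counter mapping word -> match count (a stand-in for the database).
    once_per_document: hypothetical switch; when True, each distinct word
    is counted once per document instead of once per occurrence.
    """
    tokens = tokenize(text)
    if once_per_document:
        tokens = set(tokens)  # collapse repeated words within this document
    counts.update(tokens)
    return counts


doc = "Tax forms: file your tax return before the tax deadline."

per_occurrence = train(Counter(), doc)
per_document = train(Counter(), doc, once_per_document=True)

print(per_occurrence["tax"])  # 3
print(per_document["tax"])    # 1
```

Under the per-occurrence policy a document that repeats "tax" pushes the count up faster, which matches the "more about that word" reasoning; the once-per-document variant is the alternative the word-frequency papers mentioned in the question tend to compare against.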