RE: [Classifier4j-devel] Bayesian Case Study

> 
> 4) By the match_counts on these words, I can see that each occurrence of a
> word in a single document goes to the database.  I don't see how this
> behavior is going to produce the desired result, at least in my case.  I
> have run across several papers written about the effects of word frequency
> on text classification.  Anybody have any experience in this area?
> 

Are you saying that a document that contains the word "tax" twice adds it
twice to the database?

This is correct. Logically, a document that contains the same word multiple
times is "more about" that word.
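To make the two counting policies concrete, here is a minimal sketch. It is
not Classifier4J's actual code; the class and method names are hypothetical
and the tokenizer is a deliberately naive split on non-word characters.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Classifier4J.
final class WordCountSketch {

    // Per-occurrence counting: "tax tax refund" gives tax=2, refund=1.
    static Map<String, Integer> countOccurrences(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Per-document counting: every distinct word counts once,
    // so "tax tax refund" gives tax=1, refund=1.
    static Map<String, Integer> countPresence(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : countOccurrences(text).keySet()) {
            counts.put(word, 1);
        }
        return counts;
    }
}
```

Which policy works better probably depends on the corpus, so it may be worth
training with both and comparing the results.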

As a general point, I'm not sure Bayesian classification is a great match for
deciding what kind of document something is. I don't think you can fairly
compare the scores a document gets in various categories and conclude that
the category with the higher score is the better match.

For instance, if you have two categories (say Tax and Investments), then you
can't say that the word "Tax" in a document means that it is not about
"Investments".

However, most people use Bayesian classification for simple boolean
Match/Not Match decisions (e.g. Spam/Not Spam). In that case there are certain
words that you almost never want to see in matching records (e.g. that pill
that starts with a V, which I won't name in order to avoid setting off
everyone's spam filters).
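As a rough sketch of what a boolean Match/Not Match decision looks like, the
code below combines per-word probabilities in the style commonly used by spam
filters and compares the result to a cutoff. The names, the 0.5 value for
unseen words, and the 0.9 cutoff are all assumptions made for illustration,
not anything taken from a particular library.

```java
import java.util.Map;

// Hypothetical sketch of a boolean Bayesian-style match; names are invented.
final class MatchSketch {
    static final double NEUTRAL = 0.5; // assumed probability for unseen words
    static final double CUTOFF = 0.9;  // assumed decision threshold

    // Combines per-word match probabilities as
    //   prod(p) / (prod(p) + prod(1 - p)).
    // Assumes every stored probability lies strictly between 0 and 1.
    static double combine(String[] words, Map<String, Double> wordProb) {
        double product = 1.0;
        double inverse = 1.0;
        for (String word : words) {
            double p = wordProb.getOrDefault(word, NEUTRAL);
            product *= p;
            inverse *= (1.0 - p);
        }
        return product / (product + inverse);
    }

    // The document is a match only if the combined score reaches the cutoff.
    static boolean isMatch(String[] words, Map<String, Double> wordProb) {
        return combine(words, wordProb) >= CUTOFF;
    }
}
```

The result is a single yes/no answer per category, which sidesteps the problem
of comparing raw scores between categories.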

Have you looked at Vector Space algorithms?
<http://www.mackmo.com/nick/blog/java/?permalink=LatentSemanticIndexing.txt>
and <http://www.perl.com/lpt/a/2003/02/19/engine.html>.
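For a flavour of the vector space idea, here is a minimal sketch that turns
each document into a term-frequency vector and compares two documents by
cosine similarity. This is the plain vector space model only, without the
dimensionality reduction that Latent Semantic Indexing adds, and all names
are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of plain vector space similarity (no LSI step).
final class VectorSpaceSketch {

    // Builds a term-frequency vector using a naive split on non-word characters.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                tf.merge(word, 1, Integer::sum);
            }
        }
        return tf;
    }

    // Cosine of the angle between two term vectors; 0.0 if either is empty.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> entry : a.entrySet()) {
            Integer other = b.get(entry.getKey());
            if (other != null) {
                dot += (double) entry.getValue() * other;
            }
        }
        double norms = norm(a) * norm(b);
        return norms == 0.0 ? 0.0 : dot / norms;
    }

    // Euclidean length of a term vector.
    static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int count : v.values()) {
            sum += (double) count * count;
        }
        return Math.sqrt(sum);
    }
}
```

Because the vectors are normalised by length, similarities for one document
against several category vectors are on the same 0 to 1 scale, which makes
them easier to compare than raw per-category scores.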

I'd love to have enough time to implement one of these properly....