Thread: [Classifier4j-devel] Bayesian Case Study

Below are the top 20 words for one of my classified categories.

1) "nbsp": this was obviously "&nbsp;" originally, which will be addressed 
when we implement an HTML tokenizer.  In the meantime, however, I believe it 
is skewing my results.

2) "we": there are several occurrences of useless pronouns in this list.  This 
can be addressed by an improved "stop list".  There is evidently an excellent 
paper on the topic of stop lists, aptly named "A stop list for general text", 
by Christopher Fox, published in ACM SIGIR Forum, Volume 24, Issue 2, 1989, 
ISSN 0163-5840.  If anyone has access to this paper, please advise.

3) The dreaded "s": no doubt a result of incorrectly tokenizing possessive 
nouns and pronouns, contractions, etc.  Does anybody have a good algorithm 
for handling this?

4) From the match_counts on these words, I can see that every occurrence of a 
word within a single document is added to the count in the database.  I don't 
see how this behavior is going to produce the desired result, at least in my 
case.  I have run across several papers on the effects of word frequency on 
text classification.  Does anybody have any experience in this area?
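
As a rough illustration of points 1 to 3, here is a minimal tokenizer sketch.
It is written in Python for brevity and is not Classifier4j code: the function
name tokenize and the tiny STOP_WORDS set are made up for this example (the
set is not Fox's list).  It decodes HTML entities before splitting, keeps
apostrophes inside words so that a possessive "'s" can be stripped instead of
becoming a stray "s" token, and then drops stop words.

```python
import html
import re

# Tiny illustrative stop list (an assumption for this sketch only).
STOP_WORDS = {"we", "our", "by", "will", "do", "she", "the", "with", "a", "and", "they"}

# A word, optionally followed by an apostrophe and more letters ("firm's", "don't").
_TOKEN = re.compile(r"[a-z0-9]+(?:'[a-z]+)?")


def tokenize(text):
    """Return lower-cased tokens with entities decoded and stop words removed."""
    # Decode entities first, so "&nbsp;" becomes a non-breaking space
    # (a separator) instead of surviving as the word "nbsp".
    text = html.unescape(text)
    # Treat the typographic apostrophe like the ASCII one, then lower-case.
    text = text.replace("\u2019", "'").lower()

    tokens = []
    for token in _TOKEN.findall(text):
        # Strip a possessive "'s" rather than emitting "s" as its own token.
        if token.endswith("'s"):
            token = token[:-2]
        if token not in STOP_WORDS:
            tokens.append(token)
    return tokens
```

Other contractions such as "don't" are kept as single tokens here, on the
assumption that the stop list is the right place to discard them.  Note that
stripping "'s" also turns "it's" into "it", which is usually harmless because
such words tend to be stop words anyway.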

+-------------+-------------+-------------+
| word        | match_count | description |
+-------------+-------------+-------------+
| nbsp        |        4671 | CPA         |
| we          |         874 | CPA         |
| our         |         595 | CPA         |
| quickbooks  |         478 | CPA         |
| tax         |         417 | CPA         |
| accounting  |         413 | CPA         |
| business    |         346 | CPA         |
| cpa         |         337 | CPA         |
| line        |         320 | CPA         |
| by          |         293 | CPA         |
| olive       |         279 | CPA         |
| year        |         264 | CPA         |
| s           |         255 | CPA         |
| will        |         253 | CPA         |
| help        |         238 | CPA         |
| bookkeeping |         238 | CPA         |
| murphy      |         231 | CPA         |
| do          |         223 | CPA         |
| firm        |         218 | CPA         |
| she         |         216 | CPA         |
+-------------+-------------+-------------+
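
To make point 4 concrete, the sketch below compares the two counting schemes
on made-up data: counting every occurrence of a word versus counting a word at
most once per document.  It is plain Python, not Classifier4j code, and the
function names term_counts and document_counts are hypothetical.

```python
from collections import Counter


def term_counts(docs):
    """Count every occurrence of every word across all documents."""
    counts = Counter()
    for tokens in docs:
        counts.update(tokens)
    return counts


def document_counts(docs):
    """Count each word at most once per document."""
    counts = Counter()
    for tokens in docs:
        # set() collapses repeats within a single document.
        counts.update(set(tokens))
    return counts


# Three made-up documents, already tokenized.
docs = [
    ["nbsp", "nbsp", "nbsp", "nbsp", "nbsp", "tax"],
    ["tax", "cpa"],
    ["cpa", "tax", "tax"],
]

per_occurrence = term_counts(docs)
per_document = document_counts(docs)
```

With this toy input, "nbsp" leads the per-occurrence counts even though it
appears in only one document, while the per-document counts rank "tax" and
"cpa" above it.  Whether that is the better signal for a given corpus is
exactly the open question raised in point 4.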

Matt Collier
RemoteIT
mco...@my...
877-4-NEW-LAN
