[Classifier4j-devel] Bayesian Case Study
From: Matt C. <MCo...@my...> - 2003-11-14 00:55:28
Below are the top 20 words for one of my classified categories.

1) "nbsp": this was obviously originally &nbsp;, which will be addressed when we implement an HTML tokenizer. In the meantime, however, I believe this is skewing my personal results.

2) "we": there are several occurrences of useless pronouns in this list. This can be addressed by an improved "stop list". There is evidently an excellent paper written on the topic of stop lists, aptly named "A stop list for general text", by Christopher Fox, published in ACM SIGIR Forum, Volume 24, Issue 2, 1989, ISSN 0163-5840. If anyone has access to this paper, please advise.

3) The dreaded "s": a result, no doubt, of incorrectly tokenizing possessive nouns and pronouns, contractions, etc. Does anybody have a good algorithm for handling this?

4) By the match_counts on these words, I can see that each occurrence of a word in a single document goes to the database. I don't see how this behavior is going to produce the desired result, at least in my case. I have run across several papers written about the effects of word frequency on text classification. Does anybody have any experience in this area?

+-------------+-------------+-------------+
| word        | match_count | description |
+-------------+-------------+-------------+
| nbsp        |        4671 | CPA         |
| we          |         874 | CPA         |
| our         |         595 | CPA         |
| quickbooks  |         478 | CPA         |
| tax         |         417 | CPA         |
| accounting  |         413 | CPA         |
| business    |         346 | CPA         |
| cpa         |         337 | CPA         |
| line        |         320 | CPA         |
| by          |         293 | CPA         |
| olive       |         279 | CPA         |
| year        |         264 | CPA         |
| s           |         255 | CPA         |
| will        |         253 | CPA         |
| help        |         238 | CPA         |
| bookkeeping |         238 | CPA         |
| murphy      |         231 | CPA         |
| do          |         223 | CPA         |
| firm        |         218 | CPA         |
| she         |         216 | CPA         |
+-------------+-------------+-------------+

Matt Collier
RemoteIT
mco...@my...
877-4-NEW-LAN
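
To make points 1 through 4 concrete, here is a minimal Java sketch of one possible tokenizing pass. It is not Classifier4j code: the class name TokenizerSketch, the method distinctTokens, and the tiny stop list are all hypothetical, and the possessive and contraction handling is deliberately crude. It drops HTML entities such as &nbsp;, strips a trailing 's, filters stop words, and returns each word only once per document, so that a word repeated many times in one page would count a single time toward match_count.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

public class TokenizerSketch {
    // Tiny illustrative stop list (an assumption); a real one would be far longer.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "we", "our", "by", "will", "do", "she", "the", "a", "and", "is"));

    /** Returns the distinct, non-stop-word tokens of one document, in first-seen order. */
    static Set<String> distinctTokens(String text) {
        String cleaned = text
                .replaceAll("&[a-zA-Z]+;|&#[0-9]+;", " ") // drop HTML entities such as &nbsp;
                .toLowerCase(Locale.ROOT)
                .replaceAll("'s\\b", "")                  // strip possessive/contracted 's
                .replaceAll("'", "");                     // join other contractions: don't -> dont
        Set<String> tokens = new LinkedHashSet<>();
        for (String token : cleaned.split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token); // a set keeps one entry per word per document
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [firm, handles, client, tax, filings]
        System.out.println(distinctTokens(
                "Our firm&nbsp;handles the client's tax and tax filings."));
    }
}
```

Counting presence per document rather than every occurrence is only one of the options discussed in the word-frequency literature; whether it helps here would need to be tested against the actual training data.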