Below are the top 20 words for one of my classified categories.
1) nbsp: this was obviously originally &nbsp;, which will be addressed when we
implement an HTML tokenizer. In the meantime, however, I believe this is
skewing my personal results.
2) "we": there are several occurrences of useless pronouns in this list. This
can be addressed by an improved "stop list". There is evidently an excellent
paper written on the topic of stop lists, aptly named "A stop list for general
text" by Christopher Fox, published in ACM SIGIR Forum, Volume 24, Issue 2,
1989, ISSN: 0163-5840. If anyone has access to this paper, please advise.
3) the dreaded "s", a result no doubt of incorrectly tokenizing possessive
nouns and pronouns, contractions, etc. Does anybody have a good algorithm for
handling this?
4) By the match_counts on these words, I can see that each occurrence of a
word in a single document goes to the database. I don't see how this behavior
is going to produce the desired result, at least in my case. I have run across
several papers written about the effects of word frequency on text
classification. Does anybody have any experience in this area?
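For points 1 through 3, something along these lines might serve as a stopgap
until a proper HTML tokenizer exists. This is only a rough Python sketch: the
STOP_WORDS set is a tiny made-up placeholder (not Fox's list), the tag
stripping is a crude regex rather than a real parser, and tokenize() is a
hypothetical name, not anything in the current code.

```python
import html
import re

# Hypothetical, deliberately tiny stop list; a real one would be much longer.
STOP_WORDS = {"we", "our", "by", "will", "do", "she", "the", "a", "of"}

# Crude tag matcher; an assumption standing in for a real HTML tokenizer.
TAG_RE = re.compile(r"<[^>]+>")
# A word is a run of letters/digits, optionally followed by apostrophe
# groups, so "firm's" and "don't" each come out as one token.
WORD_RE = re.compile(r"[a-z0-9]+(?:'[a-z]+)*")


def tokenize(text):
    """Return lowercase tokens with tags and entities removed,
    a trailing 's stripped, and stop words dropped."""
    text = TAG_RE.sub(" ", text)               # drop markup first
    text = html.unescape(text)                 # "&nbsp;" becomes U+00A0
    text = text.replace("\u2019", "'").lower() # normalize curly apostrophes
    tokens = []
    for word in WORD_RE.findall(text):
        # Strips possessives ("firm's" -> "firm"); note this also turns
        # contractions such as "it's" into "it".
        if word.endswith("'s"):
            word = word[:-2]
        if word and word not in STOP_WORDS:
            tokens.append(word)
    return tokens


if __name__ == "__main__":
    print(tokenize("<p>We&nbsp;help with the firm's taxes.</p>"))
```

Because U+00A0 is not matched by the word pattern, a decoded &nbsp; simply
acts as a separator and never reaches the word list. Keeping the apostrophe
inside the token until the last step is what prevents a bare "s" from being
emitted.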
+-------------+-------------+-------------+
| word        | match_count | description |
+-------------+-------------+-------------+
| nbsp        |        4671 | CPA         |
| we          |         874 | CPA         |
| our         |         595 | CPA         |
| quickbooks  |         478 | CPA         |
| tax         |         417 | CPA         |
| accounting  |         413 | CPA         |
| business    |         346 | CPA         |
| cpa         |         337 | CPA         |
| line        |         320 | CPA         |
| by          |         293 | CPA         |
| olive       |         279 | CPA         |
| year        |         264 | CPA         |
| s           |         255 | CPA         |
| will        |         253 | CPA         |
| help        |         238 | CPA         |
| bookkeeping |         238 | CPA         |
| murphy      |         231 | CPA         |
| do          |         223 | CPA         |
| firm        |         218 | CPA         |
| she         |         216 | CPA         |
+-------------+-------------+-------------+
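
To make the concern in point 4 concrete, here is a small Python sketch
contrasting two ways of counting: once per occurrence versus once per
document. The function names and the sample documents are made up for
illustration, and I am not claiming either one is how match_count is actually
computed in the current code.

```python
from collections import Counter


def count_occurrences(docs):
    """Every occurrence of a token adds one to its count."""
    counts = Counter()
    for tokens in docs:
        counts.update(tokens)
    return counts


def count_documents(docs):
    """Each token adds at most one per document, however often it repeats."""
    counts = Counter()
    for tokens in docs:
        counts.update(set(tokens))
    return counts


if __name__ == "__main__":
    # Hypothetical token lists, one per document.
    docs = [["tax", "tax", "tax", "cpa"], ["tax", "firm"]]
    print(count_occurrences(docs))
    print(count_documents(docs))
```

With per-occurrence counting, a single document that repeats a word many
times can dominate a category's totals; per-document counting caps each
document's contribution at one per word.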
Matt Collier
RemoteIT
mco...@my...
877-4-NEW-LAN