classifier4j-devel Mailing List for Classifier4J (Page 10)


Below are the top 20 words for one of my classified categories.

1) "nbsp": this was obviously originally &nbsp;, which will be addressed when
we implement an HTML tokenizer.  In the meantime, however, I believe it is
skewing my results.

2) "we": there are several occurrences of useless pronouns in this list.  This
can be addressed by an improved "stop list".  There is evidently an excellent
paper on the topic of stop lists, aptly named "A stop list for general text",
by Christopher Fox, published in ACM SIGIR Forum, Volume 24, Issue 2, 1989,
ISSN 0163-5840.  If anyone has access to this paper, please advise.

3) the dreaded "s", no doubt a result of incorrectly tokenizing possessive
nouns and pronouns, contractions, etc.  Does anybody have a good algorithm for
handling this?

4) Judging by the match_counts on these words, each occurrence of a word in a
single document goes to the database.  I don't see how this behavior is going
to produce the desired result, at least in my case.  I have run across several
papers on the effects of word frequency on text classification.  Does anybody
have any experience in this area?
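
For points 1 to 3, one possible approach is sketched below.  It is only an
illustration and does not use any Classifier4J API: the class name
SketchTokenizer, the method tokenize and the tiny STOP_WORDS set are all
hypothetical, and the stop list is a handful of words taken from the table,
not the list from the Fox paper.  It assumes entities are written with their
trailing semicolon (&nbsp;) and that dropping a trailing 's is acceptable even
when it belongs to a contraction such as "it's".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Classifier4J.
final class SketchTokenizer {

    // Illustrative stop list only; a real one would be far longer.
    static final Set<String> STOP_WORDS =
            Set.of("we", "our", "by", "will", "do", "she", "the", "a", "and");

    public static List<String> tokenize(String text) {
        String s = text.toLowerCase(Locale.ROOT);

        // Point 1: turn HTML entities such as &nbsp; or &#160; into a space
        // so that "nbsp" never reaches the word list.
        s = s.replaceAll("&#?[a-z0-9]+;", " ");

        // Point 3: treat the typographic apostrophe like the ASCII one,
        // drop a trailing 's (possessives, and contractions like "it's"),
        // then remove any remaining apostrophes ("don't" becomes "dont").
        s = s.replace('\u2019', '\'');
        s = s.replaceAll("'s\\b", "");
        s = s.replace("'", "");

        // Split on anything that is not a letter or digit.
        List<String> tokens = new ArrayList<>();
        for (String t : s.split("[^a-z0-9]+")) {
            // Point 2: skip empty fragments and stop words.
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize(
                "We do the firm's bookkeeping&nbsp;and don't miss tax deadlines."));
        // prints [firm, bookkeeping, dont, miss, tax, deadlines]
    }
}
```

A regular expression is a stopgap here; a proper HTML tokenizer would also
have to deal with tags, scripts and entities that lack the semicolon.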

+-------------+-------------+-------------+
| word        | match_count | description |
+-------------+-------------+-------------+
| nbsp        |        4671 | CPA         |
| we          |         874 | CPA         |
| our         |         595 | CPA         |
| quickbooks  |         478 | CPA         |
| tax         |         417 | CPA         |
| accounting  |         413 | CPA         |
| business    |         346 | CPA         |
| cpa         |         337 | CPA         |
| line        |         320 | CPA         |
| by          |         293 | CPA         |
| olive       |         279 | CPA         |
| year        |         264 | CPA         |
| s           |         255 | CPA         |
| will        |         253 | CPA         |
| help        |         238 | CPA         |
| bookkeeping |         238 | CPA         |
| murphy      |         231 | CPA         |
| do          |         223 | CPA         |
| firm        |         218 | CPA         |
| she         |         216 | CPA         |
+-------------+-------------+-------------+
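
To make point 4 concrete, here is a small sketch that contrasts the two ways
of counting.  It is not Classifier4J code: MatchCountSketch and both method
names are hypothetical, and the documents are made-up word lists.  The first
method adds one per occurrence, which is what the match_count column appears
to reflect; the second adds at most one per document, which is one alternative
worth trying rather than a known fix.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical illustration, not part of Classifier4J.
final class MatchCountSketch {

    // Every occurrence of a word in every document adds 1.
    public static Map<String, Integer> countEveryOccurrence(List<List<String>> docs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> doc : docs) {
            for (String word : doc) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // A word adds at most 1 per document, however often it repeats there.
    public static Map<String, Integer> countOncePerDocument(List<List<String>> docs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> doc : docs) {
            Set<String> distinct = new HashSet<>(doc);
            for (String word : distinct) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up documents, already tokenized.
        List<List<String>> docs = List.of(
                List.of("tax", "tax", "tax", "olive"),
                List.of("tax", "accounting"));

        System.out.println(countEveryOccurrence(docs));
        // prints {accounting=1, olive=1, tax=4}
        System.out.println(countOncePerDocument(docs));
        // prints {accounting=1, olive=1, tax=2}
    }
}
```

With per-document counting, a single page that repeats one word many times
can no longer dominate the totals for a category.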

Matt Collier
RemoteIT
mco...@my...
877-4-NEW-LAN


