RE: [Classifier4j-devel] How to Classify Subject Field with defaultStopWords.txt
From: <br...@bj...> - 2004-07-18 05:13:35
> You MUST teach non matches as well as matches - otherwise you
> will get the results you are currently getting.
>
> With most spam-type filters, you have a set of "spam" (which
> is used to train spam matches), and a set of normal mail (or
> "ham") which is used to train non-matches.

I must say that I ran into this exact problem when I first used C4J. I did my spam classification and was surprised when it marked everything as spam. I sent a message to this list, and someone (most likely Nick.. hehe) informed me that you need to train on both match and non-match messages.

I don't know what project you are using this for. I was working on an email spam classifier, so here's what I did:

I exported two sets of messages: spam and non-spam. Then I wrote a class that reads a directory of mbox-style messages and parses each one, separating out the subject and body. It then tokenizes each message into whitespace-separated words and joins them back into a single string. Finally, I ran teachMatch or teachNonMatch on each string depending on the known message type (spam or not spam).

I'm not sure the tokenizing and rejoining is really needed, since I think C4J does that internally anyway (in some form or fashion).

Anyway, it seems to work pretty well :) It's not as good as SpamBayes, but that's only because I've been teaching SpamBayes much longer than C4J. I'm actually thinking of writing a program to read the SpamBayes database and insert the necessary data into the C4J database. I've just been having problems exporting the SpamBayes database into something usable (damn Python).

- Brent
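
For anyone who wants to try the same thing, here is a rough sketch of the workflow described above. It is not the original class: the names MboxTrainingSketch, Trainable, splitMessages, subjectAndBody and train are made up for illustration, and Trainable is only a stand-in with the two teach methods mentioned above, so a real C4J classifier would have to be plugged in through a small adapter. To stay self-contained it takes the mbox contents as a string rather than reading a directory, and it ignores folded (multi-line) headers.

```java
import java.util.ArrayList;
import java.util.List;

public class MboxTrainingSketch {

    /** Hypothetical stand-in for a classifier that can be taught matches and non-matches. */
    public interface Trainable {
        void teachMatch(String text);
        void teachNonMatch(String text);
    }

    /**
     * Splits mbox text into raw messages. Assumes each message starts with a
     * "From " separator line (body lines starting with "From " are normally
     * escaped as ">From " in mbox files). The separator line itself is dropped.
     */
    public static List<String> splitMessages(String mbox) {
        List<String> messages = new ArrayList<>();
        StringBuilder current = null;
        for (String line : mbox.split("\r?\n")) {
            if (line.startsWith("From ")) {
                if (current != null) {
                    messages.add(current.toString());
                }
                current = new StringBuilder();
            } else if (current != null) {
                current.append(line).append('\n');
            }
        }
        if (current != null) {
            messages.add(current.toString());
        }
        return messages;
    }

    /**
     * Pulls out the Subject header and the body, then splits on whitespace
     * and rejoins with single spaces. Other headers are discarded.
     */
    public static String subjectAndBody(String message) {
        String subject = "";
        StringBuilder body = new StringBuilder();
        boolean inBody = false;
        for (String line : message.split("\n")) {
            if (inBody) {
                body.append(line).append(' ');
            } else if (line.isEmpty()) {
                inBody = true; // the first blank line ends the headers
            } else if (line.regionMatches(true, 0, "Subject:", 0, 8)) {
                subject = line.substring(8);
            }
        }
        String combined = (subject + " " + body).trim();
        return String.join(" ", combined.split("\\s+"));
    }

    /** Teaches every message in the mbox text as a match (spam) or a non-match (ham). */
    public static void train(Trainable classifier, String mbox, boolean isSpam) {
        for (String message : splitMessages(mbox)) {
            String text = subjectAndBody(message);
            if (isSpam) {
                classifier.teachMatch(text);
            } else {
                classifier.teachNonMatch(text);
            }
        }
    }
}
```

The important part is calling train twice, once with the spam export (isSpam = true) and once with the non-spam export (isSpam = false). Teaching only one side is what produces the "everything is spam" result.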