RE: [Classifier4j-devel] How to Classify Subject Field with defaultStopWords.txt

> You MUST teach non matches as well as matches - otherwise you 
> will get the results you are currently getting.
> 
> With most spam-type filters, you have a set of "spam" (which 
> is used to train spam matches), and a set of normal mail (or 
> "ham") which is used to train non-matches.

I must say that I ran into this exact problem when I first used
C4J.  I did my spam classification and was surprised when it marked
everything as spam.  I sent a message to this list, and someone
(most likely Nick, hehe) informed me that you need to train on
both match and non-match messages.

I don't know what project you are using this for.  I was working
on an email spam classifier.  So here's what I did:

I exported two sets of messages: spam and non-spam.  Then
I wrote a class that reads a directory containing a set of
mbox-style messages.  It parsed each message, separating
out the subject and body.  Then it tokenized the text
into whitespace-separated words and rejoined them into a
single string.  I then ran teachMatch or teachNonMatch on each
string, depending on the known message type (spam or not spam).

I'm not sure the tokenizing and rejoining is really needed, since
I think C4J does that internally anyway (in some form or fashion).
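
Roughly, the steps above can be sketched like this.  This is not the
actual class, just a minimal illustration.  The names Teachable,
MboxTrainer, splitMessages, subjectAndBody and train are made up for
this example; Teachable is a stand-in interface with the same two
method names (teachMatch and teachNonMatch), so you would wrap your
real classifier in it.  It also assumes simple mbox input: messages
start at a line beginning with "From ", and folded (multi-line)
Subject headers are not handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the classifier (an assumption for this sketch).
// Only the two method names come from the text above.
interface Teachable {
    void teachMatch(String text);
    void teachNonMatch(String text);
}

final class MboxTrainer {
    // Split raw mbox text into messages; each one starts at a "From " line.
    static List<String> splitMessages(String mbox) {
        List<String> messages = new ArrayList<>();
        StringBuilder current = null;
        for (String line : mbox.split("\r?\n", -1)) {
            if (line.startsWith("From ")) {
                if (current != null) messages.add(current.toString());
                current = new StringBuilder();
                continue; // the envelope line is not part of the message
            }
            if (current != null) current.append(line).append('\n');
        }
        if (current != null) messages.add(current.toString());
        return messages;
    }

    // Return the Subject header plus the body, with runs of whitespace
    // collapsed to single spaces (the "tokenize and rejoin" step).
    static String subjectAndBody(String message) {
        int split = message.indexOf("\n\n"); // blank line ends the headers
        String headers = split >= 0 ? message.substring(0, split) : message;
        String body = split >= 0 ? message.substring(split + 2) : "";
        String subject = "";
        for (String line : headers.split("\n")) {
            if (line.regionMatches(true, 0, "Subject:", 0, 8)) {
                subject = line.substring(8);
                break;
            }
        }
        return (subject + " " + body).trim().replaceAll("\\s+", " ");
    }

    // Teach every message in the mbox as a match (spam) or a non-match.
    // Returns the number of messages that were taught.
    static int train(Teachable classifier, String mbox, boolean isSpam) {
        int count = 0;
        for (String message : splitMessages(mbox)) {
            String text = subjectAndBody(message);
            if (text.isEmpty()) continue; // nothing useful to teach
            if (isSpam) {
                classifier.teachMatch(text);
            } else {
                classifier.teachNonMatch(text);
            }
            count++;
        }
        return count;
    }
}
```

You would call train once per file with isSpam set to true for the
spam export and false for the non-spam export, so that both sides
get taught.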

Anyway, it seems to work pretty well :)  It's not as good as
SpamBayes, but that's only because I've been teaching SpamBayes
much longer than C4J.  I'm actually thinking of writing a program
to read the SpamBayes database and insert the necessary data
into the C4J database.  I've just been having problems exporting
the SpamBayes database into something usable (damn Python).

- Brent