Thread: [Classifier4j-devel] How to Classify Subject Field with defaultStopWords.txt
From: Kashif <ks...@ai...> - 2004-07-16 08:11:23
Hi,

The filter is now working on the black list and white list when I compare the "from" field.

I want to apply the filtering to the "subject" field, but it gives me 0.5 or 0.99 no matter what subject I use.

At the moment I am doing this:

1) Transfer each line (which is a single word) of "defaultStopWords.txt" into an array, stopWordListArray[ ].

2) Create another instance of IWordsDataSource (swds) and of ITrainableClassifier (sclassifier).

3) Use a for loop to teach matches. I know that I should also train non-matches, but I am not sure with what.

4) I was wondering: does C4J use defaultStopWords.txt automatically, or do we have to call the list somehow?

Here's my code:

    IWordsDataSource swds = new SimpleWordsDataSource();
    ITrainableClassifier sclassifier = new BayesianClassifier(swds);

    for (int i=0; i<stopWordListArray.length; i++) {
        sclassifier.teachMatch(stopWordListArray[i]);
    }

    for (int i=0; i<n; i++) {
        double result[] = new double[n];
        result[i] = sclassifier.classify(message[i].getSubject());
        System.out.println("The Probability of the message no. " + i + " is: " + result[i] );
    }

Thanks heaps for your help
From: Nick L. <ni...@ma...> - 2004-07-18 03:59:46
You don't need to do anything with defaultStopWords - it is automatically used.

You MUST teach non matches as well as matches - otherwise you will get the results you are currently getting.

With most spam-type filters, you have a set of "spam" (which is used to train spam matches), and a set of normal mail (or "ham") which is used to train non-matches.

Nick
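A minimal sketch of why match-only training can yield nothing but 0.5 or 0.99. This is not Classifier4J's code: the class name ToyBayes, the neutral score of 0.5 for unseen words, the clamp to [0.01, 0.99], and the combining formula are all assumptions chosen to mirror the numbers reported above. Stop-word filtering is left out.

```java
import java.util.HashMap;
import java.util.Map;

// Toy word-count scorer for illustration only; NOT Classifier4J's implementation.
class ToyBayes {
    // Assumed constants, picked to mirror the 0.5 / 0.99 values seen in the thread.
    static final double NEUTRAL = 0.5, LOWER = 0.01, UPPER = 0.99;

    private final Map<String, Integer> matches = new HashMap<>();
    private final Map<String, Integer> nonMatches = new HashMap<>();

    void teachMatch(String text) { count(matches, text); }
    void teachNonMatch(String text) { count(nonMatches, text); }

    // Add one occurrence per whitespace-separated word.
    private static void count(Map<String, Integer> counts, String text) {
        for (String w : text.toLowerCase().trim().split("\\s+")) {
            if (!w.isEmpty()) counts.merge(w, 1, Integer::sum);
        }
    }

    // Probability that a single word indicates a match.
    double wordProbability(String word) {
        int m = matches.getOrDefault(word, 0);
        int n = nonMatches.getOrDefault(word, 0);
        if (m + n == 0) return NEUTRAL;              // never seen: no evidence either way
        double p = (double) m / (m + n);
        return Math.max(LOWER, Math.min(UPPER, p));  // clamp away from 0 and 1
    }

    // Combine per-word probabilities: prod(p) / (prod(p) + prod(1 - p)).
    double classify(String text) {
        double yes = 1.0, no = 1.0;
        for (String w : text.toLowerCase().trim().split("\\s+")) {
            if (w.isEmpty()) continue;
            double p = wordProbability(w);
            yes *= p;
            no *= 1.0 - p;
        }
        return yes / (yes + no);
    }
}
```

With only teachMatch calls, every word in this model is either unseen (0.5) or seen purely as a match (clamped to 0.99), so a subject can never score below 0.5. Once teachNonMatch is given normal mail, words typical of that mail pull the score down.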
From: <br...@bj...> - 2004-07-18 05:13:35
> You MUST teach non matches as well as matches - otherwise you
> will get the results you are currently getting.
>
> With most spam-type filters, you have a set of "spam" (which
> is used to train spam matches), and a set of normal mail (or
> "ham") which is used to train non-matches.

I must say that I ran into this exact problem when I first used C4J. I did my spam classification and was surprised when it marked everything as spam. I sent a message to this list, and someone (most likely Nick, hehe) informed me that you need to sample both match and non-match messages.

I don't know what project you are using this for. I was working on an email spam classifier, so here's what I did:

I exported two sets of messages, spam and non-spam. Then I wrote a class that reads a directory for a set of mbox-style messages. From there it parsed them, separating out the subject and body. Then it tokenized the messages into whitespace-separated words and re-formed them into a string. I then ran teachMatch or teachNonMatch on them, depending on the known message type (spam or not spam).

I'm not sure tokenizing and re-forming is really needed, since I think C4J does that internally anyway (in some form or fashion).

Anyway, it seems to work pretty well :) It's not as good as SpamBayes, but that's only because I've been teaching SpamBayes much longer than C4J. I'm actually thinking of writing a program to read the SpamBayes database and insert the necessary data into the C4J database. I've just been having problems exporting the SpamBayes database into something usable (damn Python).

- Brent
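The parsing step Brent describes could be sketched roughly as follows. This is not his actual class: MboxSketch, Mail, parse and normalise are hypothetical names, and the parser is deliberately simplified (it treats any line starting with "From " as a message separator and ignores folded headers).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of an mbox reader; names and parsing rules are assumptions.
class MboxSketch {
    // One parsed message: the Subject header plus the body text.
    static final class Mail {
        final String subject, body;
        Mail(String subject, String body) { this.subject = subject; this.body = body; }
    }

    // Split mbox-style text into messages and pull out subject and body.
    static List<Mail> parse(String mbox) {
        List<Mail> mails = new ArrayList<>();
        String subject = null;
        StringBuilder body = null;
        boolean inHeaders = false;
        for (String line : mbox.split("\r?\n", -1)) {
            if (line.startsWith("From ")) {               // simplified message separator
                if (body != null) mails.add(new Mail(subject, normalise(body.toString())));
                subject = "";
                body = new StringBuilder();
                inHeaders = true;
            } else if (body == null) {
                continue;                                  // text before the first separator
            } else if (inHeaders) {
                if (line.isEmpty()) {
                    inHeaders = false;                     // blank line ends the headers
                } else if (line.regionMatches(true, 0, "Subject:", 0, 8)) {
                    subject = normalise(line.substring(8));
                }
            } else {
                body.append(line).append(' ');
            }
        }
        if (body != null) mails.add(new Mail(subject, normalise(body.toString())));
        return mails;
    }

    // Tokenise on whitespace and rejoin the words with single spaces.
    static String normalise(String text) {
        return String.join(" ", text.trim().split("\\s+"));
    }
}
```

Each Mail from the spam export would then be passed to teachMatch and each one from the non-spam export to teachNonMatch, for example as subject plus body in one string.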
From: Nick L. <ni...@ma...> - 2004-07-18 12:02:28
> for (int i=0; i<n; i++) {
>     double result[] = new double[n];
>     result[i] = sclassifier.classify(message[i].getSubject());
>     System.out.println("The Probability of the message no. " + i + " is: " + result[i] );
> }

I suspect this code isn't quite doing what you want it to do, either - the line

    double result[] = new double[n];

should probably be before the loop, since as written a new array is created on every iteration and the earlier results are thrown away.

Nick
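One way the corrected loop might look, with the array allocated once before the loop. The class ClassifyAll and the classifier parameter are hypothetical stand-ins so the sketch compiles on its own; in the original code the scoring call would be sclassifier.classify(...).

```java
import java.util.function.ToDoubleFunction;

// Hypothetical helper; the classifier argument stands in for sclassifier.classify(...).
class ClassifyAll {
    // Score every subject, keeping all results in a single array.
    static double[] classifyAll(String[] subjects, ToDoubleFunction<String> classifier) {
        double[] results = new double[subjects.length];   // allocated once, before the loop
        for (int i = 0; i < subjects.length; i++) {
            results[i] = classifier.applyAsDouble(subjects[i]);
            System.out.println("The probability of message no. " + i + " is: " + results[i]);
        }
        return results;   // every score is still available after the loop
    }
}
```

If the real classify method declares a checked exception, the lambda passed in would need to catch or wrap it.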