Re: [Classifier4j-devel] How to Classify Subject Field with defaultStopWords.txt
From: Nick L. <ni...@ma...> - 2004-07-18 03:59:46
You don't need to do anything with defaultStopWords - it is automatically used.

You MUST teach non matches as well as matches - otherwise you will get the results you are currently getting.

With most spam-type filters, you have a set of "spam" (which is used to train spam matches), and a set of normal mail (or "ham") which is used to train non-matches.

Nick

Kashif wrote:

> Hi
>
> Filter is working now on black list and white list when I compare the
> "from" field.
>
> If I want to apply the filtering on "subject" field (but its giving me
> 0.5 or 0.99 no matter what subject I use)
>
> At the moment I am doing this:
>
> 1) Transfer each line (which is a single word) of
> "defaultStopWords.txt" in an array stopWordListArray[ ]
>
> 2) Then I create another instance of IwordDatasource as (swds) and
> ITrainableClassifier as (sclassifier).
>
> 3) I used a for loop to teach match. I know that I should also train
> non match as well. But not sure with What?
>
> 4) I was wondering with that does the c4J uses defaultStopWords.txt,
> automatically or we have to call the list some how?
>
> Here's my code:
>
> IWordsDataSource swds = new SimpleWordsDataSource();
>
> ITrainableClassifier sclassifier = new BayesianClassifier(swds);
>
> for (int i=0; i<stopWordListArray.length; i++) {
>
>     sclassifier.teachMatch(stopWordListArray[i]);
>
> }
>
> for (int i=0; i<n; i++) {
>
>     double result[] = new double[n];
>
>     result[i] = sclassifier.classify(message[i].getSubject());
>
>     System.out.println("The Probability of the message no. " + i + " is: "
>         + result[i] );
>
> }
>
> Thanks heaps for your help
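
To illustrate why training only matches yields nothing but 0.5 or roughly 0.99, here is a small self-contained sketch. It is not Classifier4J's code: `ToyBayesFilter` is a hypothetical class, and the neutral value 0.5 for unseen words and the 0.01/0.99 clamps are assumptions chosen to mirror the numbers reported in this thread. Only the method names `teachMatch`, `teachNonMatch` and `classify` are borrowed from the discussion above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy filter; a simplified stand-in, not the Classifier4J implementation.
class ToyBayesFilter {
    // word -> {times seen in matches, times seen in non-matches}
    private final Map<String, int[]> counts = new HashMap<>();

    public void teachMatch(String text)    { teach(text, 0); }
    public void teachNonMatch(String text) { teach(text, 1); }

    private void teach(String text, int slot) {
        for (String w : text.toLowerCase().split("\\W+")) {
            if (!w.isEmpty()) {
                counts.computeIfAbsent(w, k -> new int[2])[slot]++;
            }
        }
    }

    // Assumed rule: unseen words are neutral (0.5); seen words are clamped to [0.01, 0.99].
    private double wordProbability(String w) {
        int[] c = counts.get(w);
        if (c == null) {
            return 0.5;
        }
        double p = (double) c[0] / (c[0] + c[1]);
        return Math.max(0.01, Math.min(0.99, p));
    }

    // Combine per-word probabilities: prod(p) / (prod(p) + prod(1 - p)).
    public double classify(String text) {
        double match = 1.0;
        double nonMatch = 1.0;
        for (String w : text.toLowerCase().split("\\W+")) {
            if (w.isEmpty()) {
                continue;
            }
            double p = wordProbability(w);
            match *= p;
            nonMatch *= (1.0 - p);
        }
        return match / (match + nonMatch);
    }

    public static void main(String[] args) {
        // Trained on matches only: every known word is pinned at the upper clamp.
        ToyBayesFilter matchesOnly = new ToyBayesFilter();
        matchesOnly.teachMatch("cheap pills offer");
        System.out.println(matchesOnly.classify("meeting agenda")); // only unseen words: neutral
        System.out.println(matchesOnly.classify("cheap meeting"));  // one known word: near the upper clamp

        // Trained on both sets: normal subjects can now score below 0.5.
        ToyBayesFilter both = new ToyBayesFilter();
        both.teachMatch("cheap pills offer");
        both.teachNonMatch("meeting agenda for monday");
        System.out.println(both.classify("meeting agenda"));
        System.out.println(both.classify("cheap pills"));
    }
}
```

In this sketch a filter that has only seen matches can never push a score below 0.5, because no word ever has a non-match count. Once a set of normal subjects is taught with `teachNonMatch`, words from ordinary mail pull the score down and the results spread out.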