Re: [Classifier4j-devel] How to Classify Subject Field with defaultStopWords.txt
From: Nick L. <ni...@ma...> - 2004-07-18 03:59:46
You don't need to do anything with defaultStopWords - it is
automatically used.
You MUST teach non matches as well as matches - otherwise you will get
the results you are currently getting.
With most spam-type filters, you have a set of "spam" (which is used to
train spam matches), and a set of normal mail (or "ham") which is used
to train non-matches.
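As a rough illustration of why training only matches gives scores like
0.5 or 0.99, here is a small self-contained sketch. It is NOT
Classifier4J's code: the TinyBayes class, its 0.01/0.99 bounds, the 0.5
score for unseen words and the sample subjects are all assumptions made
up for this example. Only the method names teachMatch, teachNonMatch and
classify are borrowed from this thread.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy word-probability classifier (hypothetical stand-in, not Classifier4J). */
class TinyBayes {
    // word -> {times seen in matches, times seen in non-matches}
    private final Map<String, int[]> counts = new HashMap<>();

    void teachMatch(String text)    { teach(text, 0); }
    void teachNonMatch(String text) { teach(text, 1); }

    private void teach(String text, int slot) {
        for (String w : text.toLowerCase().split("\\W+")) {
            if (!w.isEmpty()) {
                counts.computeIfAbsent(w, k -> new int[2])[slot]++;
            }
        }
    }

    // Assumed bounds: no score is ever fully certain.
    private static double clamp(double p) {
        return Math.max(0.01, Math.min(0.99, p));
    }

    private double wordProbability(String w) {
        int[] c = counts.get(w);
        if (c == null) {
            return 0.5; // unseen word: neutral (assumption)
        }
        return clamp((double) c[0] / (c[0] + c[1]));
    }

    /** Combines per-word probabilities as prod(p) / (prod(p) + prod(1 - p)). */
    double classify(String text) {
        double match = 1.0;
        double nonMatch = 1.0;
        for (String w : text.toLowerCase().split("\\W+")) {
            if (w.isEmpty()) {
                continue;
            }
            double p = wordProbability(w);
            match *= p;
            nonMatch *= 1.0 - p;
        }
        return clamp(match / (match + nonMatch));
    }

    public static void main(String[] args) {
        // Trained with matches only: every known word looks like a match.
        TinyBayes onlyMatches = new TinyBayes();
        onlyMatches.teachMatch("cheap pills offer");
        System.out.println(onlyMatches.classify("cheap pills"));
        System.out.println(onlyMatches.classify("monday meeting"));

        // Trained with matches and non-matches: scores can now go low as well.
        TinyBayes both = new TinyBayes();
        both.teachMatch("cheap pills offer");
        both.teachNonMatch("meeting agenda for monday");
        System.out.println(both.classify("cheap offer"));
        System.out.println(both.classify("monday meeting"));
    }
}
```

In this sketch a classifier that has only been taught matches can only
answer "neutral" or "almost certainly a match", because no word has ever
been seen in a non-match. Once some normal subjects are taught with
teachNonMatch, subjects made of those words score near the lower bound.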
Nick
Kashif wrote:
> Hi
>
> Filter is working now on black list and white list when I compare the
> "from" field.
>
> I want to apply the filtering on the "subject" field, but it's giving
> me 0.5 or 0.99 no matter what subject I use.
>
> At the moment I am doing this:
>
> 1) Transfer each line (which is a single word) of
> "defaultStopWords.txt" into an array stopWordListArray[ ]
>
> 2) Then I create another instance of IWordsDataSource (swds) and
> ITrainableClassifier (sclassifier).
>
> 3) I used a for loop to teach matches. I know that I should also train
> non-matches, but I am not sure with what.
>
> 4) I was wondering whether C4J uses defaultStopWords.txt
> automatically, or whether we have to load the list somehow?
>
> Here's my code:
>
> IWordsDataSource swds = new SimpleWordsDataSource();
> ITrainableClassifier sclassifier = new BayesianClassifier(swds);
>
> for (int i=0; i<stopWordListArray.length; i++) {
>     sclassifier.teachMatch(stopWordListArray[i]);
> }
>
> for (int i=0; i<n; i++) {
>     double result[] = new double[n];
>     result[i] = sclassifier.classify(message[i].getSubject());
>     System.out.println("The Probability of the message no. " + i + " is: "
>         + result[i] );
> }
>
> Thanks heaps for your help
>