[Wekalist] (no subject)


Dear Weka list,

I'm using InfoGain attribute selection to keep 1000 features and then doing
text classification through the Java API.
If I run feature selection on the entire corpus, there's no problem. But if
I run the selection on the train set only, I get an
ArrayIndexOutOfBoundsException
when evaluating the classifier model. (I'm guessing it's because the
train and test sets end up with different vocabularies.)
Here's my code:

//split corpus to test and train instances
  ...
//attribute selection just for train set
AttributeSelection attrSelection = new AttributeSelection();
Ranker ranker = new Ranker();
ranker.setNumToSelect(1000);
InfoGainAttributeEval infoGainAttrEval = new InfoGainAttributeEval();
attrSelection.setEvaluator(infoGainAttrEval);
attrSelection.setSearch(ranker);
attrSelection.setInputFormat(train);
train = Filter.useFilter(train, attrSelection);

//doing the classification
Classifier cModel = new NaiveBayes();
cModel.buildClassifier(train);
Evaluation eTest = new Evaluation(train);
eTest.evaluateModel(cModel, test);    //the ArrayIndexOutOfBoundsException error

So how can I fix this?
1. Is there a way to do this without performing the attribute selection
on both the train and test sets?
2. InfoGain is a supervised attribute selection method and uses the class
labels to do its job, so is it all right to perform it on the test set?
(Aren't the class labels of the test set meant only for evaluating the
performance of the classification model?)
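
To make the distinction behind question 2 concrete, here is a minimal sketch in plain Java. It is not Weka code: the class TrainOnlySelection and its methods are hypothetical names made up for illustration, and it assumes binary features and binary class labels. The attribute ranking is computed from the train set and its labels only; the test set is then merely projected onto the attribute indices chosen there, so its labels are never read and both sets end up with the same columns.

```java
import java.util.Arrays;

// Hypothetical illustration, not part of Weka: information-gain ranking is
// learned on the train set only and the chosen columns are reused for test.
public class TrainOnlySelection {

    // Entropy in bits of a two-class distribution given by raw counts.
    static double entropy(int a, int b) {
        int n = a + b;
        double h = 0.0;
        for (int c : new int[] {a, b}) {
            if (c > 0) {
                double p = (double) c / n;
                h -= p * Math.log(p) / Math.log(2);
            }
        }
        return h;
    }

    // Information gain of binary feature j with respect to binary labels y.
    static double infoGain(int[][] x, int[] y, int j) {
        int[][] counts = new int[2][2]; // counts[featureValue][label]
        for (int i = 0; i < x.length; i++) {
            counts[x[i][j]][y[i]]++;
        }
        double before = entropy(counts[0][0] + counts[1][0],
                                counts[0][1] + counts[1][1]);
        double after = 0.0;
        for (int v = 0; v < 2; v++) {
            int nv = counts[v][0] + counts[v][1];
            after += (double) nv / x.length * entropy(counts[v][0], counts[v][1]);
        }
        return before - after;
    }

    // Rank features using train data and train labels only; return the
    // indices of the k features with the highest information gain.
    static int[] selectTopK(int[][] trainX, int[] trainY, int k) {
        int numFeatures = trainX[0].length;
        final double[] gain = new double[numFeatures];
        Integer[] order = new Integer[numFeatures];
        for (int j = 0; j < numFeatures; j++) {
            order[j] = j;
            gain[j] = infoGain(trainX, trainY, j);
        }
        // Sort feature indices by descending gain (stable for ties).
        Arrays.sort(order, (a, b) -> Double.compare(gain[b], gain[a]));
        int[] top = new int[Math.min(k, numFeatures)];
        for (int i = 0; i < top.length; i++) {
            top[i] = order[i];
        }
        return top;
    }

    // Keep only the selected columns; no labels are needed for this step.
    static int[][] project(int[][] x, int[] selected) {
        int[][] out = new int[x.length][selected.length];
        for (int i = 0; i < x.length; i++) {
            for (int s = 0; s < selected.length; s++) {
                out[i][s] = x[i][selected[s]];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[][] trainX = {{1, 0, 1}, {1, 1, 1}, {0, 0, 1}, {0, 1, 1}};
        int[] trainY = {1, 1, 0, 0};
        int[][] testX = {{0, 1, 1}, {1, 0, 1}};

        int[] selected = selectTopK(trainX, trainY, 1); // fitted on train only
        int[][] trainReduced = project(trainX, selected);
        int[][] testReduced = project(testX, selected); // same columns as train

        System.out.println("selected = " + Arrays.toString(selected));
        System.out.println("train columns = " + trainReduced[0].length
                + ", test columns = " + testReduced[0].length);
    }
}
```

In Weka itself, the usual counterpart seems to be passing the test set through the same, already initialized filter object, e.g. Filter.useFilter(test, attrSelection), or wrapping the filter and the classifier in a FilteredClassifier. Treat that as a pointer to check against the Weka documentation rather than a confirmed fix for the code above.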
