|
From: Soheila M. <mas...@gm...> - 2015-05-31 09:15:20
|
Dear Weka list,

I'm using InfoGain attribute selection to pick 1000 features and then doing text classification through the Java API. If I run the feature selection on the entire corpus, there is no problem. But if I run the selection on the training set only, I get an ArrayIndexOutOfBoundsException when evaluating the classifier model. (I'm guessing this is because the test and training sets have different vocabularies.)

Here's my code:

    //split corpus to test and train instances
    ...

    //attribute selection just for train set
    AttributeSelection attrSelection = new AttributeSelection();
    Ranker ranker = new Ranker();
    ranker.setNumToSelect(1000);
    InfoGainAttributeEval infoGainAttrEval = new InfoGainAttributeEval();
    attrSelection.setEvaluator(infoGainAttrEval);
    attrSelection.setSearch(ranker);
    attrSelection.setInputFormat(train);
    train = Filter.useFilter(train, attrSelection);

    //doing the classification
    Classifier cModel = new NaiveBayes();
    cModel.buildClassifier(train);
    Evaluation eTest = new Evaluation(train);
    eTest.evaluateModel(cModel, test); //the ArrayIndexOutOfBoundsException error

So how can I fix this?

1. Is there a way to do this without performing the attribute selection on both the train and the test set?
2. InfoGain is a supervised attribute selection method and uses the class labels to do its job, so is it all right to perform it on the test set? (Aren't the class labels of the test set meant only for evaluating the performance of the classification model?) |
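
To make the concern in question 2 concrete, here is a minimal plain-Java sketch (not Weka code) of the general "select on train, apply to test" idea: the ranking is computed from the training labels only, and the test set is merely projected onto the columns that were already chosen, so its labels are never consulted. The class and method names (TrainOnlySelection, selectTopK, project) are hypothetical, and binary features and binary labels are assumed to keep the information-gain computation short. In Weka, the analogous step would presumably be to pass the test set through the same, already initialised filter (a second Filter.useFilter call on test) rather than configuring a new selection on it.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical illustration only; none of these names come from Weka.
class TrainOnlySelection {

    // Entropy in bits of a binary label distribution (pos positives out of total).
    static double entropy(int pos, int total) {
        if (total == 0 || pos == 0 || pos == total) return 0.0;
        double p = (double) pos / total;
        return -(p * Math.log(p) + (1 - p) * Math.log(1 - p)) / Math.log(2);
    }

    // Information gain of binary feature column f with respect to binary labels y.
    static double infoGain(int[][] x, int[] y, int f) {
        int n = y.length, pos = 0, on = 0, onPos = 0;
        for (int i = 0; i < n; i++) {
            pos += y[i];
            if (x[i][f] == 1) { on++; onPos += y[i]; }
        }
        int off = n - on, offPos = pos - onPos;
        return entropy(pos, n)
             - ((double) on / n) * entropy(onPos, on)
             - ((double) off / n) * entropy(offPos, off);
    }

    // Rank features using the TRAINING data only; return the top-k column indices.
    static int[] selectTopK(int[][] trainX, int[] trainY, int k) {
        int d = trainX[0].length;
        Integer[] idx = new Integer[d];
        double[] gain = new double[d];
        for (int f = 0; f < d; f++) {
            idx[f] = f;
            gain[f] = infoGain(trainX, trainY, f);
        }
        // Sort by descending gain; the sort is stable, so ties keep index order.
        Arrays.sort(idx, Comparator.comparingDouble((Integer f) -> -gain[f]));
        int[] top = new int[Math.min(k, d)];
        for (int i = 0; i < top.length; i++) top[i] = idx[i];
        return top;
    }

    // Project any data set (train or test) onto the already chosen columns.
    static int[][] project(int[][] x, int[] cols) {
        int[][] out = new int[x.length][cols.length];
        for (int i = 0; i < x.length; i++)
            for (int j = 0; j < cols.length; j++)
                out[i][j] = x[i][cols[j]];
        return out;
    }

    public static void main(String[] args) {
        int[][] trainX = { {1, 0, 1}, {1, 1, 0}, {0, 0, 1}, {0, 1, 0} };
        int[] trainY = {1, 1, 0, 0};
        int[][] testX = { {1, 1, 1}, {0, 0, 0} };

        int[] cols = selectTopK(trainX, trainY, 1); // test labels are never used
        int[][] newTrain = project(trainX, cols);
        int[][] newTest = project(testX, cols);     // same columns as the train set

        System.out.println("selected columns: " + Arrays.toString(cols));
        System.out.println("train width = " + newTrain[0].length
                + ", test width = " + newTest[0].length);
    }
}
```

After the projection both sets have the same number of columns in the same order, which is the property a classifier built on the reduced training set would need at evaluation time.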