These instructions explain how to use Simple Classifier from the command line. First, download Simple Classifier from its project page on SourceForge: https://sourceforge.net/p/simpleclassify/. Once you have downloaded the Simple Classifier zip file, extract it into a directory. Let's call the path to that directory $SCWD.
In your terminal or command prompt, change to $SCWD. All commands below are issued from this directory.
For all following steps, you need to add simpleclassifier.jar to your classpath.
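One way to do this on Linux or macOS is to set the CLASSPATH environment variable for the current terminal session. The sketch below assumes that simpleclassifier.jar sits directly inside $SCWD, which may not match the actual layout of the zip file. If the distribution ships Weka or other libraries as separate jar files, they would need to be added in the same way. On Windows the classpath separator is ; rather than :.

```shell
# Assumption: simpleclassifier.jar is located directly inside $SCWD.
# If SCWD is not set, fall back to the current directory.
SCWD="${SCWD:-$PWD}"

# Prepend the jar to any existing classpath for this terminal session.
export CLASSPATH="$SCWD/simpleclassifier.jar${CLASSPATH:+:$CLASSPATH}"

# Print the result so it can be checked.
echo "$CLASSPATH"
```

Alternatively, the same path can be passed on each invocation with java -cp instead of exporting CLASSPATH.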
Let's assume you are trying to distinguish between two classes, 'positive' and 'negative'. Create two files, one for each class. In the first file, say positive.txt, add all the positive cases, one per line. Similarly, in the second file, say negative.txt, add all the negative cases, one per line. If a case contains multiple paragraphs, they must be concatenated so that all of its text appears on one line (possibly a very long line!).
Issue the following command to run the Simple Classifier cross-validator for two classes:
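If your cases are currently stored as separate multi-line documents, a small shell function can build the one-case-per-line file. This is only a sketch: the function name flatten_dir and the directory names positive_docs and negative_docs are hypothetical, and it assumes each file in a directory holds exactly one case.

```shell
# Hypothetical helper: write every file in directory $1 as a single line of $2.
flatten_dir() {
  : > "$2"                          # start with an empty output file
  for doc in "$1"/*; do
    [ -s "$doc" ] || continue       # skip missing or empty files
    # Turn carriage returns and newlines into spaces, then squeeze runs of spaces.
    tr '\r\n' '  ' < "$doc" | tr -s ' ' >> "$2"
    printf '\n' >> "$2"             # end the case with exactly one newline
  done
}

# Example usage (directory names are assumptions):
# flatten_dir positive_docs positive.txt
# flatten_dir negative_docs negative.txt
```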
java edu.uwm.bionlp.simpleclassifier.MultiClassCrossValidator positive.txt negative.txt 100 1 weka.classifiers.functions.SMO mutual_information 10
In the above command, the first and the second arguments (positive.txt and negative.txt) are the input files. All other arguments are optional.
The third argument (100) is the number of top features to use for training. In the example above, we are training on the top 100 features. The default value is 100.
The fourth argument (1) is the type of n-grams to use for training. Here we are using unigrams (hence, 1), which means we are training on individual words. If we wanted to train on both unigrams and bigrams, we would use 1,2 as the argument. The default value is 1.
The fifth argument (weka.classifiers.functions.SMO) is the Weka classifier class to use for training. The default value is weka.classifiers.functions.SMO.
The sixth argument (mutual_information) is the feature selection algorithm to use. We can use either mutual_information or chi_squared. The default value is mutual_information.
The seventh and last argument (10) is the number of folds to use for cross-validation. The default value is 10, for ten-fold cross-validation.
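To make the argument positions easier to keep track of, the command can be assembled from named shell variables. The sketch below switches to unigrams plus bigrams and chi_squared feature selection and leaves the other values at their defaults. The variable names are our own, and we assume the arguments are strictly positional, so every argument before the one being changed has to be spelled out.

```shell
# Optional arguments 3 to 7, in order (variable names are illustrative only).
TOP_FEATURES=100                            # 3rd: number of top features
NGRAMS=1,2                                  # 4th: unigrams and bigrams
WEKA_CLASS=weka.classifiers.functions.SMO   # 5th: Weka classifier class
FEATURE_SELECTION=chi_squared               # 6th: mutual_information or chi_squared
FOLDS=10                                    # 7th: cross-validation folds

CMD="java edu.uwm.bionlp.simpleclassifier.MultiClassCrossValidator positive.txt negative.txt $TOP_FEATURES $NGRAMS $WEKA_CLASS $FEATURE_SELECTION $FOLDS"

# Dry run: print the command for inspection.
echo "$CMD"

# To actually run it (requires simpleclassifier.jar on the classpath),
# uncomment the next line:
# $CMD
```

Since all of these arguments are optional, passing only positive.txt and negative.txt should behave the same as the full command shown earlier.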
Running Simple Classifier for more than two classes is very similar to running it for two classes. Let's assume we are dealing with three classes: 'class1', 'class2', and 'class3'. As in the two-class case, add the cases for each class to a separate file, one case per line, for example class1.txt, class2.txt, and class3.txt. Note that none of the file names may contain a comma (,), because commas are used to separate the file paths in the command.
Then issue the following command to run the Simple Classifier cross-validator for multiple classes:
java edu.uwm.bionlp.simpleclassifier.MultiClassCrossValidator class1.txt,class2.txt,class3.txt class1_name,class2_name,class3_name 100 1 weka.classifiers.functions.SMO mutual_information 10
As in the two-class case, Simple Classifier for multiple classes needs a minimum of two arguments. The remaining arguments are the same as those for two classes, and are optional.
The first argument (class1.txt,class2.txt,class3.txt) contains the paths to the files containing the cases for each class, separated by commas.
The second argument (class1_name,class2_name,class3_name) contains the corresponding class names separated by commas.
Arguments 3 through 7 are optional and have the same meaning as in the two-class case.
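With many classes, typing the two comma-separated lists by hand is error-prone, so they can be built with a small helper. This is a sketch under our own assumptions: the function name build_multiclass_args is hypothetical, and it derives each class name from the file name by dropping the .txt extension, which Simple Classifier does not require. It also rejects file names that contain a comma.

```shell
# Hypothetical helper: print "<files> <names>" for the given class files.
build_multiclass_args() {
  files=""
  names=""
  for f in "$@"; do
    case "$f" in
      *,*) echo "error: comma in file name: $f" >&2; return 1 ;;
    esac
    name=$(basename "$f" .txt)          # assumption: class name = file name without .txt
    files="${files:+$files,}$f"         # append with a comma unless the list is empty
    names="${names:+$names,}$name"
  done
  echo "$files $names"
}

# Dry run: print the command rather than executing it.
ARGS=$(build_multiclass_args class1.txt class2.txt class3.txt) &&
  echo "java edu.uwm.bionlp.simpleclassifier.MultiClassCrossValidator $ARGS"
```

Remove the echo in the last line to run the cross-validator itself, and append any of the optional arguments 3 through 7 after the two lists.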