SGMWeka is a Weka package wrapper for SGM, a tidy Java toolkit for sparse generative models. SGM implements probabilistic generative models of count data using sparse matrix representations of parameters, with hash tables for estimation and inverted indices for posterior inference. This enables scalable modeling, with processing complexity depending on the sparsity of the data. The toolkit is intended for scalable and high-speed text mining, but it can equally be used for other types of sparse count data. SGM was used in the winning solution of the Kaggle LSHTC4 competition (6), as well as in the second-place solution of the Kaggle WISE competition.
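The inverted-index idea can be pictured with a small sketch (illustrative only, not SGM's actual data structures): posterior inference walks per-term posting lists of nonzero parameters, so cost scales with the nonzeros touched by a document rather than with a dense class-by-term matrix.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only, not SGM's actual code: score classes for a sparse
// document by walking per-term posting lists of nonzero log-parameters.
public class InvertedIndexScoring {
    static class Posting {
        final int classId; final float logProb;
        Posting(int c, float p) { classId = c; logProb = p; }
    }
    public static void main(String[] args) {
        // toy index: term 0 occurs in class 0; term 1 in classes 0 and 1
        Map<Integer, List<Posting>> index = new HashMap<>();
        index.put(0, List.of(new Posting(0, -1.0f)));
        index.put(1, List.of(new Posting(0, -2.0f), new Posting(1, -1.5f)));

        int[] terms = {0, 1};       // sparse document: term ids
        float[] counts = {2f, 1f};  // and their counts
        Map<Integer, Float> scores = new HashMap<>();
        for (int i = 0; i < terms.length; i++)
            for (Posting p : index.getOrDefault(terms[i], List.of()))
                scores.merge(p.classId, counts[i] * p.logProb, Float::sum); // sum count * log P(term|class)
        System.out.println(scores); // {0=-4.0, 1=-1.5}
    }
}
```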
Currently SGMWeka implements Multinomial Naive Bayes, Kernel Density, Centroid and KNN classifiers, as well as many of the current ranking functions for ad-hoc document retrieval. These provide high-performing solutions for classification and ranking, but can also be used for other tasks, such as clustering. A number of further modifications to the models are available, including TF-IDF feature transforms, smoothing, parameter pruning and model-based feedback.
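For orientation, the scoring rule shared by these models has the familiar multinomial form; the sketch below uses standard notation (not SGM-specific), with Jelinek-Mercer smoothing shown as one example of the smoothing options:

```latex
% Multinomial class scoring, standard form (illustrative notation):
\log p(c \mid d) \;\propto\; \log p(c) + \sum_{w \in d} \mathrm{tf}(w,d)\, \log p(w \mid c)
% One available smoothing choice, Jelinek-Mercer, interpolates
% class-conditionals with a background model:
p_{\mathrm{JM}}(w \mid c) = (1-\lambda)\, p_{\mathrm{ML}}(w \mid c) + \lambda\, p(w \mid \mathrm{BG})
```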
Outside the Weka interface a number of additional functions are available, either through command-line calls or by class inclusion of SGM.java into Java projects. Multi-label classification and ranked retrieval are supported, as well as stream training and a number of evaluation functions.
Use the Weka package manager to install the latest version. Use a StringToWordVector filter to process text data into count vectors, with the outputWordCounts=true option. The default classifier options implement Multinomial Naive Bayes with TF-IDF, which should provide high classification accuracy on most text datasets.
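A minimal Java sketch of this pipeline is given below. The wrapper class name is inferred from the source path src/main/java/weka/classifiers/bayes/SGM and the dataset path is a placeholder, so adjust both as needed; the Weka filter and evaluation calls are standard.

```java
import java.util.Random;
import weka.classifiers.AbstractClassifier;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class SGMQuickStart {
    public static void main(String[] args) throws Exception {
        // Placeholder path; any .arff with a string attribute and a nominal class works
        Instances raw = DataSource.read("r8-train.arff");
        raw.setClassIndex(raw.numAttributes() - 1);

        StringToWordVector filter = new StringToWordVector();
        filter.setOutputWordCounts(true);   // count vectors instead of binary indicators
        filter.setInputFormat(raw);
        Instances data = Filter.useFilter(raw, filter);

        // Class name inferred from the package path; verify against the installed package
        Classifier sgm = AbstractClassifier.forName(
                "weka.classifiers.bayes.SGM.SparseGenerativeModel", new String[0]);

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(sgm, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```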
The archive weka_cfgs_1.4.9.zip contains 11 classifier configurations for Weka. One is a baseline using Weka TF-IDF and NaiveBayesMultinomial, four are LibLinear models with TF-IDF, and the remaining six are SGMWeka models:

- MNB_TI_FB: Multinomial Naive Bayes with TF-IDF and model-based feedback
- TDM: Tied Document Mixture, a multinomial kernel density classifier with hierarchical smoothing
- PTDM: a parameter-free version of TDM that uses locally averaged Kneser-Ney and Witten-Bell estimates for smoothing
- PTDM_BAG: Weka Bagging with 50% sample sizes and 16 PTDM sub-models
- VSM_TI: a cosine-distance Centroid classifier with TF-IDF
- KNN_TI: a KNN classifier with cosine distances and TF-IDF

Three voting ensemble configurations are also provided, using all 11 models (vote_11), the 7 individually best-performing models (vote_7), and only the LibLinear models (vote_4). Using the datasets from http://web.ist.utl.pt/~acardoso/datasets/, the example configurations give the following accuracies (%):
| Classifier | R8 | R52 | WebKb | 20Ng | Cade12 | Mean |
|---|---|---|---|---|---|---|
| Weka_MNB_TI | 93.15 | 87.38 | 84.03 | 83.25 | 58.06 | 81.17 |
| MNB_TI_FB | 95.93 | 92.29 | 85.89 | 83.95 | 58.90 | 83.39 |
| TDM | 96.35 | 91.94 | 80.52 | 83.67 | 61.43 | 82.78 |
| PTDM | 96.85 | 91.67 | 82.09 | 83.58 | 57.57 | 82.35 |
| PTDM_BAG | 97.03 | 91.74 | 84.31 | 83.28 | 57.68 | 82.81 |
| VSM_TI | 95.52 | 92.25 | 85.75 | 78.20 | 51.69 | 80.68 |
| KNN_TI | 92.87 | 88.86 | 84.67 | 81.30 | 50.36 | 79.61 |
| LR_L2R_TI | 97.67 | 94.43 | 90.83 | 84.87 | 58.17 | 85.19 |
| LR_L1R_TI | 96.35 | 93.26 | 91.05 | 80.91 | 59.85 | 84.28 |
| SVM_L2R_TI | 97.35 | 95.02 | 90.11 | 84.05 | 53.09 | 83.92 |
| SVM_L1R_TI | 96.98 | 94.86 | 91.05 | 81.88 | 57.81 | 84.52 |
| - | - | - | - | - | - | - |
| vote_11 | 97.40 | 94.63 | 89.18 | 85.61 | 62.28 | 85.82 |
| vote_7 | 97.85 | 95.02 | 91.33 | 85.71 | 61.49 | 86.28 |
| vote_4 | 97.35 | 94.94 | 91.33 | 84.25 | 58.55 | 85.28 |
SGMWeka includes the SGM subdirectory at src/main/java/weka/classifiers/bayes/SGM, containing the SGM Java code. This can also be compiled into Java .class files without Weka. SGM_Tests.java is a test program that performs a number of functions depending on its arguments, including single- and multi-label classification and ranked retrieval. Compiling SGM_Tests produces a program that can be used directly for these functions, using word vector files in the LIBSVM data format.
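For example, the test program can be compiled with a plain javac call, run from the SGM source directory (adjust to the package layout if it differs); the usage examples at the end of this page then apply:

javac SGM_Tests.java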
Aside from command-line calls, SGM can be used by simple Java class inclusion into projects. SGM.java is the main class, while the other classes store parameters and required data structures. The SGM.java functions train_model_libsvm(String train_file) and infer_posterior(int[] terms, float[] counts) provide a simple interface for including SGM in projects. SGM_Tests.java gives an example program for using SGM. The Weka wrapper class SparseGenerativeModel.java is another example of use in a Weka Java project.
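A minimal sketch of such class inclusion follows. Only the two calls named above are taken from SGM.java; the no-arg construction and the handling of the returned posterior are assumptions and may need adapting to the actual API (see SGM_Tests.java for real usage).

```java
// Sketch of embedding SGM directly in a Java project; assumptions are marked.
public class EmbedSGM {
    public static void main(String[] args) throws Exception {
        SGM sgm = new SGM();                    // assumed construction; check SGM_Tests.java
        sgm.train_model_libsvm("train.txt");    // train from a LIBSVM-format file

        int[] terms = {12, 57, 301};            // sparse document: term ids
        float[] counts = {2f, 1f, 1f};          // and their counts
        System.out.println(sgm.infer_posterior(terms, counts)); // posterior over labels; return type may differ
    }
}
```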
For a quick start with Weka, download the preprocessed .arff files: http://sourceforge.net/projects/sgmweka/files/arff_datasets.zip . Raw text data comes in various formats, most commonly XML and plain text. Preprocessed text datasets for classification come mostly in two formats: .arff files for Weka and sparse LIBSVM .txt files for many other classification tools. SGMWeka supports .arff data through the Weka interface and the LIBSVM format through the SGM_Tests.java program. A set of Python scripts is provided for preparing text datasets in the LIBSVM format: http://sourceforge.net/projects/sgmweka/files/preprocessing_scripts.zip . Using the scripts on publicly available datasets produces the LIBSVM feature files: http://sourceforge.net/projects/sgmweka/files/text_datasets.zip .
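For reference, each line of a LIBSVM-format file is a label (or comma-separated labels, in the multi-label extensions of the format) followed by sparse term:count pairs, for example:

0 4:2 17:1 33:5
1,3 2:1 17:3 86:1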
The first 3 datasets are for spam classification, the next 3 for sentiment analysis, the next 5 for single-label multi-class classification, and the last 3 for large-scale multi-label classification. These files can be used for experiments with SGM_Tests, LibLinear, and other toolkits that read LIBSVM feature files. For reference results comparing some standard classifiers and SGMWeka on these files, see (3, 4, 7). The OHSU-TREC file is also processed into a ranked retrieval dataset that can be used with SGM_Tests for testing ranking functions.
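A plausible invocation for such ranking experiments, using only options documented below (file names are placeholders):

java -Xmx2000M SGM_Tests -train_file documents.txt -test_file queries.txt -max_retrieved 100 -no_priors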
SGM includes three programs (http://sourceforge.net/projects/sgmweka/files/ensemble_scripts.zip) for optimizing high-performing ensemble solutions:
Metaopt3.py performs continuous optimization of program calls, such as runs of SGM_Tests.java, using random search. It uses a parallelized Gaussian random search algorithm with decreasing step sizes and multiple best points, and supports constraining, transforming and fixing subsets of features. As long as the number of parameters is small, random search can optimize any performance measure, including non-smooth functions with multiple modes.
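An illustrative Java sketch of the general idea (a toy stand-in, not Metaopt3.py itself): perturb the best point with Gaussian noise and shrink the step size when no improvement is found.

```java
import java.util.Arrays;
import java.util.Random;

// Toy Gaussian random search with decreasing step size.
public class RandomSearch {
    static double objective(double[] x) {               // black-box score to maximize;
        return -(x[0] - 1) * (x[0] - 1) - x[1] * x[1];  // a real use would run a program and parse its output
    }
    public static void main(String[] args) {
        Random rng = new Random(1);
        double[] best = {0, 0};
        double bestScore = objective(best), step = 1.0;
        for (int iter = 0; iter < 200; iter++) {
            double[] cand = best.clone();
            for (int d = 0; d < cand.length; d++)
                cand[d] += step * rng.nextGaussian();   // Gaussian perturbation of the best point
            double s = objective(cand);
            if (s > bestScore) { best = cand; bestScore = s; }
            else step *= 0.99;                          // shrink step size on failure
        }
        System.out.println(Arrays.toString(best) + " -> " + bestScore);
    }
}
```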
SelectClassifiers3.py performs discrete optimization, such as classifier or feature selection for an ensemble model. It uses a parallelized hill-climbing Tabu search, with options for L0 regularization, multiple steps, and greedy local search. It uses the same basic Python scheduler as Metaopt3.py for parallelization.
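Again as an illustrative sketch only (not SelectClassifiers3.py): Tabu hill-climbing over classifier subsets with a single-flip neighborhood, a short tabu memory, and an L0 penalty. The per-model utilities are placeholders; a real use would evaluate the ensemble built from the selected subset.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

// Toy tabu hill-climbing for subset selection with L0 regularization.
public class TabuSelect {
    static double score(boolean[] sel, double l0) {
        double[] gain = {0.9, 0.4, 0.7, 0.1, 0.6};    // placeholder per-model utilities
        double s = 0; int k = 0;
        for (int i = 0; i < sel.length; i++) if (sel[i]) { s += gain[i]; k++; }
        return s - l0 * k;                            // L0-regularized objective
    }
    public static void main(String[] args) {
        boolean[] sel = new boolean[5];
        double best = score(sel, 0.5);
        Deque<Integer> tabu = new ArrayDeque<>();
        for (int iter = 0; iter < 50; iter++) {
            int move = -1; double moveScore = Double.NEGATIVE_INFINITY;
            for (int i = 0; i < sel.length; i++) {    // best non-tabu single flip
                if (tabu.contains(i)) continue;
                sel[i] = !sel[i];
                double s = score(sel, 0.5);
                sel[i] = !sel[i];
                if (s > moveScore) { moveScore = s; move = i; }
            }
            if (move < 0) break;
            sel[move] = !sel[move];                   // accept best move even if worse (tabu search)
            tabu.addLast(move);
            if (tabu.size() > 2) tabu.removeFirst();  // short tabu memory blocks immediate reversal
            if (moveScore > best) best = moveScore;
        }
        System.out.println(Arrays.toString(sel) + " best=" + best);
    }
}
```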
MetaComb5.java performs ensemble combination using an efficient variant of Feature-Weighted Linear Stacking. It uses metafeatures, such as correlations of base-classifier outputs, to predict a vote weight for each base classifier. Task-specific metafeatures and optimization measures can be implemented to give the best performance on a given task.
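Schematically, Feature-Weighted Linear Stacking (Sill et al., 2009) blends base-model outputs g_i(x) with weights that are themselves linear in metafeatures f_j(x); the variant here is described as an efficient version of this idea:

```latex
% Feature-Weighted Linear Stacking, schematic form:
b(x) \;=\; \sum_i \Big( \sum_j v_{ij}\, f_j(x) \Big)\, g_i(x)
```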
Currently these programs are less documented and require modification of the source code for actual use, except for Metaopt3.py, which is configured using a template file. The programs run on Linux and Cygwin/Windows, with no dependencies aside from Weka for MetaComb5.java.
License: Apache 2.0, Waikato University
(1) Puurula, A. Scalable Text Classification with Sparse Generative Modeling. Proceedings of the 12th Pacific Rim International Conference on Artificial Intelligence. 2012
(2) Puurula, A. and Bifet, A. Ensembles of Sparse Multinomial Classifiers for Scalable Text Classification. ECML/PKDD PASCAL Workshop on Large-Scale Hierarchical Classification. 2012
(3) Puurula, A. Combining Modifications to Multinomial Naive Bayes for Text Classification. Asian Information Retrieval Societies Conference. 2012
(4) Puurula, A. and Myaeng, S. Integrated Instance- and Class-based Generative Modeling for Text Classification. Proceedings of the Australasian Document Computing Symposium. 2013
(5) Puurula, A. Cumulative Progress in Language Models for Information Retrieval. Australasian Language Technology Association Workshop. 2013
(6) Puurula, A., Read, J. and Bifet, A. Kaggle LSHTC4 Winning Solution. 2014
(7) Puurula, A. Scalable Text Mining with Sparse Generative Models. PhD Thesis, Waikato University. 2015
The Weka GUI allows selection of a number of options for classification with SGM. These have matching command-line options in SGM_Tests.
useTFIDF -use_tfidf <int>
Use TFIDF: 0 = no feature transform, 1 = TF-IDF, 2 = TF, 3 = IDF
combination -combination <float>
Instance score combination: 1 = kernel density, 0 = voting, -1 = distance-weighted voting
pruneCountInsert -prune_count_insert <float>
Log-count pruning value of conditional parameters after each update. If used, typical values -6 to -10
pruneCountTable -prune_count_table <float>
Log-count pruning value of conditional parameters after training
idfLift -idf_lift <float>
IDF normalization parameter. Higher values for weaker IDF normalization. -1 = Croft-Harper IDF, 0 = Robertson-Walker IDF
bgUnifSmooth -bg_unif_smooth <float>
Uniform smoothing for the background model. 0 = unsmoothed background model, 1 = uniform background model
feedbackWeight -feedback <float>
Feedback model interpolation weight for model-based feedback
topK -top_k <int>
Top k instances for inference with kernel densities. Also top k results for model-based feedback
minCount -min_count <int>
Minimum document frequency of a term after training. 1 = no terms pruned
kernelJelinekMercer -kernel_jelinek_mercer <float>
Jelinek-Mercer smoothing of instance-conditionals with the class-conditionals
kernelDirichletPrior -kernel_dirichlet_prior <float>
Dirichlet prior smoothing of instance-conditionals with the class-conditionals
kernelPowerlawDiscount -kernel_powerlaw_discount <float>
Power-law discount smoothing of instance-conditionals with the class-conditionals
lengthScale -length_scale <float>
TF length normalization parameter. Higher values for stronger length normalization
jelinek_mercer -jelinek_mercer <float>
Jelinek-Mercer smoothing of class-conditionals with the background model
dirichlet_prior -dirichlet_prior <float>
Dirichlet prior smoothing of class-conditionals with the background model
absolute_discount -absolute_discount <float>
Absolute discount smoothing of class-conditionals with the background model
powerlaw_discount -powerlaw_discount <float>
Power-law discount smoothing of class-conditionals with the background model
priorScale -prior_scale <float>
Scaling of prior probabilities. Equivalent to language model scaling in HMM speech recognition
kernelDensities -kernel_densities
Use instances for inference. Implements kernel densities, or KNN if topK and combination options are specified
localPD -local_pd
Use locally averaged Kneser-Ney estimates for power-law discounting parameter
localDP -local_dp
Use locally averaged Witten-Bell estimates for Dirichlet prior parameter
condScale -cond_scale <float>
Scale conditional parameters after normalization
condNorm -cond_norm <float>
Norm of conditional parameter vectors after normalization, negative for exponentiated parameters. 1.0 = multinomial, -2.0 = cosine
noSmoothing -no_smoothing
No smoothing applied. With sparse data use only with condNorm < 0, to avoid log(0)
poolBackoffs -pool_backoffs
Form estimates for smoothing backoff-nodes by pooling counts, without L1-normalization of counts
-workdir <string>
Work directory for the data files
-train_file <string>
File for gathering statistics for model estimation
-load_model <string>
Load a saved model. Aggregates statistics if train_file is specified
-test_file <string>
File for evaluating a model
-save_model <string>
Save model to file
-results_file <string>
Print evaluation results to file, instead of stdout
-batch_size <int>
Number of instances to process in each batch of model training
-cond_hashsize <int>
Size of the conditional hash table. Maximum number of conditional parameters to store, fixed to 10000000 in the Weka wrapper
-label_threshold <float>
Pruning threshold for max-score pruning of labels in inference. If used, values closer to 0 do more pruning
-max_retrieved <int>
Maximum number of labels to return. With 1, single-label inference is performed; with >1, more labels are returned in ranked order
-label_powerset
Use the powerset method for multi-label classification. Encodes all encountered labelsets with a class identifier,
converts identifiers back to labelsets after classification
-use_label_weights
Use label-weighted training data. Data must be supplied with the weight of each label for each document
-no_priors
Use uniform priors for posterior inference. Can be useful for ranked retrieval and KNN
-load_iqf
Load Inverse Query Frequency (IQF) weights from a file, for weighting test documents
-iqf_lift
Lift the estimates used in IQF; works exactly like idf_lift for IDF
-load_clusters
Load clusters from a file for cluster-based smoothing of nodes; uses the LSHTC parent-node format
-cluster_jelinek_mercer <float>
Jelinek-Mercer smoothing weight for cluster nodes
-rand_seed <int>
Randomization seed for the SGM model
-skip_documents <int>
Use only every n-th document in training, skip others. Use with -rand_seed to train models from fully separate partitions
Single-label classification using standard options, print results to stdout:
java -Xmx2000M SGM_Tests -test_file test.txt -train_file train.txt -max_retrieved 1
Single-label classification, use smoothed kernel densities, prune kernel instances to top 50 for the combination:
java -Xmx2000M SGM_Tests -test_file test.txt -train_file train.txt -max_retrieved 1 -kernel_densities -kernel_jelinek_mercer 0.5 -top_k 50
Single-label classification, prune the model and change default TF-IDF settings:
java -Xmx2000M SGM_Tests -test_file test.txt -train_file train.txt -max_retrieved 1 -idf_lift 0.5 -length_scale 1.0 -prune_count_table -8.0 -prune_count_insert -4.0
Multi-label classification or ranked retrieval, pruning from top 10 instances to maximum 3, using threshold -0.5:
java -Xmx2000M SGM_Tests -test_file test.txt -train_file train.txt -max_retrieved 3 -top_k 10 -label_threshold -0.5
Multi-label classification with the label powerset method:
java -Xmx2000M SGM_Tests -test_file test.txt -train_file train.txt -label_powerset
Save model parameters, without normalization or testing:
java -Xmx2000M SGM_Tests -train_file train.txt -save_model model.txt -no_normalization
Load model parameters, normalize and test:
java -Xmx2000M SGM_Tests -test_file test.txt -load_model model.txt