SGMWeka is a Weka package wrapper for SGM, a tidy Java toolkit for sparse generative models. SGM implements probabilistic generative models of count data using sparse matrix representations of parameters, with hash tables for estimation and inverted indices for posterior inference. This enables scalable modeling, with processing complexity depending on the sparsity of the data. The toolkit is intended for scalable and high-speed text mining, but it can equally be used for other types of sparse count data. SGM was used in the winning solution of the Kaggle LSHTC4 competition (6), as well as in the second-place solution of the Kaggle WISE competition.
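The inverted-index idea can be pictured with a small sketch (illustrative only, not SGM's actual data structures): posterior inference walks per-term posting lists of nonzero parameters, so cost scales with the nonzeros touched by a document rather than with a dense class-by-term matrix.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only, not SGM's actual code: score classes for a sparse
// document by walking per-term posting lists of nonzero log-parameters.
public class InvertedIndexScoring {
    static class Posting {
        final int classId; final float logProb;
        Posting(int c, float p) { classId = c; logProb = p; }
    }
    public static void main(String[] args) {
        // toy index: term 0 occurs in class 0; term 1 in classes 0 and 1
        Map<Integer, List<Posting>> index = new HashMap<>();
        index.put(0, List.of(new Posting(0, -1.0f)));
        index.put(1, List.of(new Posting(0, -2.0f), new Posting(1, -1.5f)));

        int[] terms = {0, 1};       // sparse document: term ids
        float[] counts = {2f, 1f};  // and their counts
        Map<Integer, Float> scores = new HashMap<>();
        for (int i = 0; i < terms.length; i++)
            for (Posting p : index.getOrDefault(terms[i], List.of()))
                scores.merge(p.classId, counts[i] * p.logProb, Float::sum); // sum count * log P(term|class)
        System.out.println(scores); // {0=-4.0, 1=-1.5}
    }
}
```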
Currently SGMWeka implements Multinomial Naive Bayes, Kernel Density, Centroid and KNN classifiers, as well as many of the current ranking functions for ad-hoc document retrieval. These provide high-performing solutions for classification and ranking, but can also be used for other tasks, such as clustering. A number of further modifications to the models are available, including TF-IDF feature transforms, smoothing, parameter pruning and model-based feedback.
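For orientation, the scoring rule shared by these models has the familiar multinomial form; the sketch below uses standard notation (not SGM-specific), with Jelinek-Mercer smoothing shown as one example of the smoothing options:

```latex
% Multinomial class scoring, standard form (illustrative notation):
\log p(c \mid d) \;\propto\; \log p(c) + \sum_{w \in d} \mathrm{tf}(w,d)\, \log p(w \mid c)
% One available smoothing choice, Jelinek-Mercer, interpolates
% class-conditionals with a background model:
p_{\mathrm{JM}}(w \mid c) = (1-\lambda)\, p_{\mathrm{ML}}(w \mid c) + \lambda\, p(w \mid \mathrm{BG})
```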
Outside the Weka interface a number of additional functions are available, either through command-line calls or by class inclusion of SGM.java into Java projects. Multi-label classification and ranked retrieval are supported, as well as stream training and a number of evaluation functions.
Use the Weka package manager to install the latest version. Use a StringToWordVector filter to process text data into count vectors, with the outputWordCounts=true option. The default classifier options implement Multinomial Naive Bayes with TF-IDF, which should provide high classification accuracy on most text datasets.
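A minimal Java sketch of this pipeline is given below. The wrapper class name is inferred from the source path src/main/java/weka/classifiers/bayes/SGM and the dataset path is a placeholder, so adjust both as needed; the Weka filter and evaluation calls are standard.

```java
import java.util.Random;
import weka.classifiers.AbstractClassifier;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class SGMQuickStart {
    public static void main(String[] args) throws Exception {
        // Placeholder path; any .arff with a string attribute and a nominal class works
        Instances raw = DataSource.read("r8-train.arff");
        raw.setClassIndex(raw.numAttributes() - 1);

        StringToWordVector filter = new StringToWordVector();
        filter.setOutputWordCounts(true);   // count vectors instead of binary indicators
        filter.setInputFormat(raw);
        Instances data = Filter.useFilter(raw, filter);

        // Class name inferred from the package path; verify against the installed package
        Classifier sgm = AbstractClassifier.forName(
                "weka.classifiers.bayes.SGM.SparseGenerativeModel", new String[0]);

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(sgm, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```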
The archive weka_cfgs_1.4.9.zip contains 11 classifier configurations for Weka. One is a baseline using Weka TF-IDF and NaiveBayesMultinomial, four are LibLinear models with TF-IDF, and the remaining six are SGMWeka models:

- MNB_TI_FB: Multinomial Naive Bayes with TF-IDF and model-based feedback
- TDM: Tied Document Mixture, a multinomial kernel density classifier with hierarchical smoothing
- PTDM: a parameter-free version of TDM that uses locally averaged Kneser-Ney and Witten-Bell estimates for smoothing
- PTDM_BAG: Weka Bagging with 50% sample sizes and 16 PTDM sub-models
- VSM_TI: a cosine-distance Centroid classifier with TF-IDF
- KNN_TI: a KNN classifier with cosine distances and TF-IDF

Three voting ensemble configurations are also provided, using all 11 models (vote_11), the 7 individually best-performing models (vote_7), and only the LibLinear models (vote_4). Using the datasets from http://web.ist.utl.pt/~acardoso/datasets/, the example configurations give the following accuracies (%):
| Classifier | R8 | R52 | WebKb | 20Ng | Cade12 | Mean |
|---|---|---|---|---|---|---|
| Weka_MNB_TI | 93.15 | 87.38 | 84.03 | 83.25 | 58.06 | 81.17 |
| MNB_TI_FB | 95.93 | 92.29 | 85.89 | 83.95 | 58.90 | 83.39 |
| TDM | 96.35 | 91.94 | 80.52 | 83.67 | 61.43 | 82.78 |
| PTDM | 96.85 | 91.67 | 82.09 | 83.58 | 57.57 | 82.35 |
| PTDM_BAG | 97.03 | 91.74 | 84.31 | 83.28 | 57.68 | 82.81 |
| VSM_TI | 95.52 | 92.25 | 85.75 | 78.20 | 51.69 | 80.68 |
| KNN_TI | 92.87 | 88.86 | 84.67 | 81.30 | 50.36 | 79.61 |
| LR_L2R_TI | 97.67 | 94.43 | 90.83 | 84.87 | 58.17 | 85.19 |
| LR_L1R_TI | 96.35 | 93.26 | 91.05 | 80.91 | 59.85 | 84.28 |
| SVM_L2R_TI | 97.35 | 95.02 | 90.11 | 84.05 | 53.09 | 83.92 |
| SVM_L1R_TI | 96.98 | 94.86 | 91.05 | 81.88 | 57.81 | 84.52 |
| - | - | - | - | - | - | - |
| vote_11 | 97.40 | 94.63 | 89.18 | 85.61 | 62.28 | 85.82 |
| vote_7 | 97.85 | 95.02 | 91.33 | 85.71 | 61.49 | 86.28 |
| vote_4 | 97.35 | 94.94 | 91.33 | 84.25 | 58.55 | 85.28 |
SGMWeka includes the SGM subdirectory at src/main/java/weka/classifiers/bayes/SGM, containing the SGM Java code. This can also be compiled into Java .class files without Weka. SGM_Tests.java is a test program that performs a number of functions depending on its arguments, including single- and multi-label classification and ranked retrieval. Compiling SGM_Tests produces a program that can be used directly for these functions, using word vector files in the LIBSVM data format.
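For example, the test program can be compiled with a plain javac call, run from the SGM source directory (adjust to the package layout if it differs); the usage examples at the end of this page then apply:

javac SGM_Tests.java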
Aside from command-line calls, SGM can be used by simple Java class inclusion into projects. SGM.java is the main class, while the other classes store parameters and required data structures. The SGM.java functions train_model_libsvm(String train_file) and infer_posterior(int[] terms, float[] counts) provide a simple interface for including SGM in projects. SGM_Tests.java gives an example program for using SGM. The Weka wrapper class SparseGenerativeModel.java is another example of use in a Weka Java project.
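A minimal sketch of such class inclusion follows. Only the two calls named above are taken from SGM.java; the no-arg construction and the handling of the returned posterior are assumptions and may need adapting to the actual API (see SGM_Tests.java for real usage).

```java
// Sketch of embedding SGM directly in a Java project; assumptions are marked.
public class EmbedSGM {
    public static void main(String[] args) throws Exception {
        SGM sgm = new SGM();                    // assumed construction; check SGM_Tests.java
        sgm.train_model_libsvm("train.txt");    // train from a LIBSVM-format file

        int[] terms = {12, 57, 301};            // sparse document: term ids
        float[] counts = {2f, 1f, 1f};          // and their counts
        System.out.println(sgm.infer_posterior(terms, counts)); // posterior over labels; return type may differ
    }
}
```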
For a quick start with Weka, download the preprocessed .arff files: http://sourceforge.net/projects/sgmweka/files/arff_datasets.zip . Raw text data comes in various formats, most commonly XML and plain text. Preprocessed text datasets for classification come mostly in two formats: .arff files for Weka and sparse LIBSVM .txt files for many other classification tools. SGMWeka supports .arff data through the Weka interface and the LIBSVM format through the SGM_Tests.java program. A set of Python scripts is provided for preparing text datasets in the LIBSVM format: http://sourceforge.net/projects/sgmweka/files/preprocessing_scripts.zip . Using the scripts on publicly available datasets produces the LIBSVM feature files: http://sourceforge.net/projects/sgmweka/files/text_datasets.zip .
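For reference, each line of a LIBSVM-format file is a label (or comma-separated labels, in the multi-label extensions of the format) followed by sparse term:count pairs, for example:

0 4:2 17:1 33:5
1,3 2:1 17:3 86:1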
The first 3 datasets are for spam classification, the next 3 for sentiment analysis, the next 5 for single-label multi-class classification, and the last 3 for large-scale multi-label classification. These files can be used for experiments with SGM_Tests, LibLinear, and other toolkits that read LIBSVM feature files. For reference results comparing some standard classifiers and SGMWeka on these files, see (3, 4, 7). The OHSU-TREC file is also processed into a ranked retrieval dataset that can be used with SGM_Tests for testing ranking functions.
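A plausible invocation for such ranking experiments, using only options documented below (file names are placeholders):

java -Xmx2000M SGM_Tests -train_file documents.txt -test_file queries.txt -max_retrieved 100 -no_priors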
SGM includes three programs (http://sourceforge.net/projects/sgmweka/files/ensemble_scripts.zip) for optimizing high-performing ensemble solutions:
Metaopt3.py performs continuous optimization of program calls, such as runs of SGM_Tests.java, using random search. It uses a parallelized Gaussian random search algorithm with decreasing step sizes and multiple best points, and supports constraining, transforming and fixing subsets of features. As long as the number of parameters is small, random search can optimize any performance measure, including non-smooth functions with multiple modes.
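An illustrative Java sketch of the general idea (a toy stand-in, not Metaopt3.py itself): perturb the best point with Gaussian noise and shrink the step size when no improvement is found.

```java
import java.util.Arrays;
import java.util.Random;

// Toy Gaussian random search with decreasing step size.
public class RandomSearch {
    static double objective(double[] x) {               // black-box score to maximize;
        return -(x[0] - 1) * (x[0] - 1) - x[1] * x[1];  // a real use would run a program and parse its output
    }
    public static void main(String[] args) {
        Random rng = new Random(1);
        double[] best = {0, 0};
        double bestScore = objective(best), step = 1.0;
        for (int iter = 0; iter < 200; iter++) {
            double[] cand = best.clone();
            for (int d = 0; d < cand.length; d++)
                cand[d] += step * rng.nextGaussian();   // Gaussian perturbation of the best point
            double s = objective(cand);
            if (s > bestScore) { best = cand; bestScore = s; }
            else step *= 0.99;                          // shrink step size on failure
        }
        System.out.println(Arrays.toString(best) + " -> " + bestScore);
    }
}
```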
SelectClassifiers3.py performs discrete optimization, such as classifier or feature selection for an ensemble model. It uses a parallelized hill-climbing Tabu search, with options for L0 regularization, multiple steps, and greedy local search. It uses the same basic Python scheduler as Metaopt3.py for parallelization.
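Again as an illustrative sketch only (not SelectClassifiers3.py): Tabu hill-climbing over classifier subsets with a single-flip neighborhood, a short tabu memory, and an L0 penalty. The per-model utilities are placeholders; a real use would evaluate the ensemble built from the selected subset.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

// Toy tabu hill-climbing for subset selection with L0 regularization.
public class TabuSelect {
    static double score(boolean[] sel, double l0) {
        double[] gain = {0.9, 0.4, 0.7, 0.1, 0.6};    // placeholder per-model utilities
        double s = 0; int k = 0;
        for (int i = 0; i < sel.length; i++) if (sel[i]) { s += gain[i]; k++; }
        return s - l0 * k;                            // L0-regularized objective
    }
    public static void main(String[] args) {
        boolean[] sel = new boolean[5];
        double best = score(sel, 0.5);
        Deque<Integer> tabu = new ArrayDeque<>();
        for (int iter = 0; iter < 50; iter++) {
            int move = -1; double moveScore = Double.NEGATIVE_INFINITY;
            for (int i = 0; i < sel.length; i++) {    // best non-tabu single flip
                if (tabu.contains(i)) continue;
                sel[i] = !sel[i];
                double s = score(sel, 0.5);
                sel[i] = !sel[i];
                if (s > moveScore) { moveScore = s; move = i; }
            }
            if (move < 0) break;
            sel[move] = !sel[move];                   // accept best move even if worse (tabu search)
            tabu.addLast(move);
            if (tabu.size() > 2) tabu.removeFirst();  // short tabu memory blocks immediate reversal
            if (moveScore > best) best = moveScore;
        }
        System.out.println(Arrays.toString(sel) + " best=" + best);
    }
}
```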
MetaComb5.java performs ensemble combination using an efficient variant of Feature-Weighted Linear Stacking. It uses metafeatures, such as correlations of base-classifier outputs, to predict a vote weight for each base classifier. Task-specific metafeatures and optimization measures can be implemented to give the best performance on a given task.
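Schematically, Feature-Weighted Linear Stacking (Sill et al., 2009) blends base-model outputs g_i(x) with weights that are themselves linear in metafeatures f_j(x); the variant here is described as an efficient version of this idea:

```latex
% Feature-Weighted Linear Stacking, schematic form:
b(x) \;=\; \sum_i \Big( \sum_j v_{ij}\, f_j(x) \Big)\, g_i(x)
```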
Currently these programs are less documented and require modification of the source code for actual use, except for Metaopt3.py, which is configured using a template file. The programs run on Linux and Cygwin/Windows, with no dependencies aside from Weka for MetaComb5.java.
License: Apache 2.0, Waikato University
(1) Puurula, A. Scalable Text Classification with Sparse Generative Modeling. Proceedings of the 12th Pacific Rim International Conference on Artificial Intelligence. 2012
(2) Puurula, A. and Bifet, A. Ensembles of Sparse Multinomial Classifiers for Scalable Text Classification. ECML/PKDD PASCAL Workshop on Large-Scale Hierarchical Classification. 2012
(3) Puurula, A. Combining Modifications to Multinomial Naive Bayes for Text Classification. Asian Information Retrieval Societies Conference. 2012
(4) Puurula, A. and Myaeng, S. Integrated Instance- and Class-based Generative Modeling for Text Classification. Proceedings of the Australasian Document Computing Symposium. 2013
(5) Puurula, A. Cumulative Progress in Language Models for Information Retrieval. Australasian Language Technology Association Workshop. 2013
(6) Puurula, A., Read, J. and Bifet, A. Kaggle LSHTC4 Winning Solution. 2014
(7) Puurula, A. Scalable Text Mining with Sparse Generative Models. PhD Thesis, Waikato University. 2015
The Weka GUI allows selection of a number of options for classification with SGM. These have matching command-line options in SGM_Tests.
useTFIDF -use_tfidf <int>
Use TFIDF: 0 = no feature transform, 1 = TF-IDF, 2 = TF, 3 = IDF
combination -combination <float>
Instance score combination: 1 = kernel density, 0 = voting, -1 = distance-weighted voting
pruneCountInsert -prune_count_insert <float>
Log-count pruning value of conditional parameters after each update. If used, typical values -6 to -10
pruneCountTable -prune_count_table <float>
Log-count pruning value of conditional parameters after training
idfLift -idf_lift <float>
IDF normalization parameter. Higher values for weaker IDF normalization. -1 = Croft-Harper IDF, 0 = Robertson-Walker IDF
bgUnifSmooth -bg_unif_smooth <float>
Uniform smoothing for the background model. 0 = unsmoothed background model, 1 = uniform background model
feedbackWeight -feedback <float>
Feedback model interpolation weight for model-based feedback
topK -top_k <int>
Top k instances for inference with kernel densities. Also top k results for model-based feedback
minCount -min_count <int>
Minimum document frequency of a term after training. 1 = no terms pruned
kernelJelinekMercer -kernel_jelinek_mercer <float>
Jelinek-Mercer smoothing of instance-conditionals with the class-conditionals
kernelDirichletPrior -kernel_dirichlet_prior <float>
Dirichlet prior smoothing of instance-conditionals with the class-conditionals
kernelPowerlawDiscount -kernel_powerlaw_discount <float>
Power-law discount smoothing of instance-conditionals with the class-conditionals
lengthScale -length_scale <float>
TF length normalization parameter. Higher values for stronger length normalization
jelinek_mercer -jelinek_mercer <float>
Jelinek-Mercer smoothing of class-conditionals with the background model
dirichlet_prior -dirichlet_prior <float>
Dirichlet prior smoothing of class-conditionals with the background model
absolute_discount -absolute_discount <float>
Absolute discount smoothing of class-conditionals with the background model
powerlaw_discount -powerlaw_discount <float>
Power-law discount smoothing of class-conditionals with the background model
priorScale -prior_scale <float>
Scaling of prior probabilities. Equivalent to language model scaling in HMM speech recognition
kernelDensities -kernel_densities
Use instances for inference. Implements kernel densities, or KNN if topK and combination options are specified
localPD -local_pd
Use locally averaged Kneser-Ney estimates for power-law discounting parameter
localDP -local_dp
Use locally averaged Witten-Bell estimates for Dirichlet prior parameter
condScale -cond_scale <float>
Scale conditional parameters after normalization
condNorm -cond_norm <float>
Norm of conditional parameter vectors after normalization, negative for exponentiated parameters. 1.0 = multinomial, -2.0 = cosine
noSmoothing -no_smoothing
No smoothing applied. With sparse data use only with condNorm < 0, to avoid log(0)
poolBackoffs -pool_backoffs
Form estimates for smoothing backoff-nodes by pooling counts, without L1-normalization of counts
-workdir <string>
Work directory for the data files
-train_file <string>
File for gathering statistics for model estimation
-load_model <string>
Load a saved model. Aggregates statistics if train_file is specified
-test_file <string>
File for evaluating a model
-save_model <string>
Save model to file
-results_file <string>
Print evaluation results to file, instead of stdout
-batch_size <int>
Number of instances to process in each batch of model training
-cond_hashsize <int>
Size of the conditional hash table. Maximum number of conditional parameters to store, fixed to 10000000 in the Weka wrapper
-label_threshold <float>
Pruning threshold for max-score pruning of labels in inference. If used, values closer to 0 do more pruning
-max_retrieved <int>
Maximum number of labels to return. With 1, single-label inference is performed; with >1, more labels are returned in ranked order
-label_powerset
Use the powerset method for multi-label classification. Encodes all encountered labelsets with a class identifier,
converts identifiers back to labelsets after classification
-use_label_weights
Use label-weighted training data. Data must be supplied with the weight of each label for each document
-no_priors
Use uniform priors for posterior inference. Can be useful for ranked retrieval and KNN
-load_iqf
Load Inverse Query Frequency (IQF) weights from a file, for weighting test documents
-iqf_lift
Lift the estimates used in IQF; works exactly like idf_lift for IDF
-load_clusters
Load clusters from a file for cluster-based smoothing of nodes; uses the LSHTC parent-node format
-cluster_jelinek_mercer <float>
Jelinek-Mercer smoothing weight for cluster nodes
-rand_seed <int>
Randomization seed for the SGM model
-skip_documents <int>
Use only every n-th document in training, skip others. Use with -rand_seed to train models from fully separate partitions
Single-label classification using standard options, print results to stdout:
java -Xmx2000M SGM_Tests -test_file test.txt -train_file train.txt -max_retrieved 1
Single-label classification, use smoothed kernel densities, prune kernel instances to top 50 for the combination:
java -Xmx2000M SGM_Tests -test_file test.txt -train_file train.txt -max_retrieved 1 -kernel_densities -kernel_jelinek_mercer 0.5 -top_k 50
Single-label classification, prune the model and change default TF-IDF settings:
java -Xmx2000M SGM_Tests -test_file test.txt -train_file train.txt -max_retrieved 1 -idf_lift 0.5 -length_scale 1.0 -prune_count_table -8.0 -prune_count_insert -4.0
Multi-label classification or ranked retrieval, pruning from top 10 instances to maximum 3, using threshold -0.5:
java -Xmx2000M SGM_Tests -test_file test.txt -train_file train.txt -max_retrieved 3 -top_k 10 -label_threshold -0.5
Multi-label classification with the label powerset method:
java -Xmx2000M SGM_Tests -test_file test.txt -train_file train.txt -label_powerset
Save model parameters, without normalization or testing:
java -Xmx2000M SGM_Tests -train_file train.txt -save_model model.txt -no_normalization
Load model parameters, normalize and test:
java -Xmx2000M SGM_Tests -test_file test.txt -load_model model.txt