
SGMWeka Documentation v.1.4.9

Antti Puurula

What is SGMWeka?

SGMWeka is a Weka package wrapper for SGM, a tidy Java toolkit for sparse generative models. SGM implements probabilistic generative models of count data using a sparse matrix representation of parameters, with hash tables for estimation and inverted indices for posterior inference. This enables scalable modeling, with processing complexities that depend on the sparsity of the data. The toolkit is intended for scalable, high-speed text mining, but it can equally be used for other cases of sparse count data. SGM was used to win the Kaggle LSHTC4 competition, and to place second in the Kaggle WISE competition.
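The inverted-index inference mentioned above can be sketched as follows. This is a hypothetical simplified structure (class and method names invented for illustration), not SGM's actual code: only classes sharing at least one term with the document are ever touched, so scoring cost depends on sparsity rather than on the total number of classes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simplified sketch, not SGM's actual code: an inverted
// index from term id to the classes holding a nonzero parameter for it.
class SparseScorer {
    static class Posting {
        final int cls; final double weight;
        Posting(int cls, double weight) { this.cls = cls; this.weight = weight; }
    }

    private final Map<Integer, List<Posting>> index = new HashMap<>();

    // Store a nonzero (term, class) parameter in the inverted index.
    void add(int term, int cls, double weight) {
        index.computeIfAbsent(term, t -> new ArrayList<>()).add(new Posting(cls, weight));
    }

    // Score only the classes reachable through the document's terms;
    // cost depends on the sparsity of the document and the parameters.
    Map<Integer, Double> score(int[] terms, double[] counts) {
        Map<Integer, Double> scores = new HashMap<>();
        for (int i = 0; i < terms.length; i++) {
            List<Posting> postings = index.get(terms[i]);
            if (postings == null) continue;
            for (Posting p : postings)
                scores.merge(p.cls, counts[i] * p.weight, Double::sum);
        }
        return scores;
    }
}
```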

What does SGMWeka provide?

Currently SGMWeka implements Multinomial Naive Bayes, Kernel Density, Centroid and KNN classifiers, as well as many of the current ranking functions for ad-hoc document retrieval. These provide high-performing solutions for classification and ranking, but can also be used for other tasks, such as clustering. A number of further modifications for the models are available, including TF-IDF feature transforms, smoothing, parameter pruning and model-based feedback.
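As a minimal illustration of the core Multinomial Naive Bayes scoring rule, without the TF-IDF, smoothing, and pruning modifications SGMWeka adds, a toy scorer over sparse count vectors might look like this (names are invented for the sketch):

```java
// Toy Multinomial Naive Bayes scorer over sparse count vectors
// (illustrative only; SGMWeka's estimator adds much more).
class ToyMnb {
    private final double[] logPrior;   // logPrior[c]   = log P(c)
    private final double[][] logCond;  // logCond[c][t] = log P(t|c), already smoothed

    ToyMnb(double[] logPrior, double[][] logCond) {
        this.logPrior = logPrior;
        this.logCond = logCond;
    }

    // argmax_c  log P(c) + sum_i counts[i] * log P(terms[i] | c)
    int classify(int[] terms, double[] counts) {
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < logPrior.length; c++) {
            double s = logPrior[c];
            for (int i = 0; i < terms.length; i++)
                s += counts[i] * logCond[c][terms[i]];
            if (s > bestScore) { bestScore = s; best = c; }
        }
        return best;
    }
}
```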

Outside the Weka interface a number of additional functions are available, either through command-line calls or class inclusion of SGM.java into Java projects. Multi-label classification and ranked retrieval are supported, as well as stream training and a number of evaluation functions.

How to use in Weka?

Use the Weka package manager to install the latest version. Use a StringToWordVector filter to process text data into count vectors, with the outputWordCounts=true option. The default classifier options implement Multinomial Naive Bayes with TF-IDF, which should provide high classification accuracy on most text datasets.
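To illustrate what the TF-IDF transform does to a count vector, here is one common variant (tf = log(1 + count), idf = log(N / df)); SGMWeka's exact weighting is configurable through options such as idfLift and lengthScale, so treat this as a sketch only:

```java
// One common TF-IDF variant: tf = log(1 + count), idf = log(N / df).
// SGMWeka's exact weighting is configurable (idfLift, lengthScale),
// so this is an illustrative sketch only.
class ToyTfidf {
    static double[] transform(double[] counts, int[] docFreq, int numDocs) {
        double[] out = new double[counts.length];
        for (int i = 0; i < counts.length; i++) {
            if (counts[i] == 0.0 || docFreq[i] == 0) continue; // keep zeros sparse
            double tf = Math.log(1.0 + counts[i]);
            double idf = Math.log((double) numDocs / docFreq[i]);
            out[i] = tf * idf;
        }
        return out;
    }
}
```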

The archive weka_cfgs_1.4.9.zip contains 11 classifier configurations for Weka. One is a baseline using Weka TF-IDF and NaiveBayesMultinomial, four are LibLinear models with TF-IDF, and the remaining six are SGMWeka models. The first SGM model is MNB with TF-IDF and model-based feedback (MNB_TI_FB). The second is Tied Document Mixture (TDM), a multinomial kernel density classifier with hierarchical smoothing. The third is a parameter-free version of TDM that uses locally averaged Kneser-Ney and Witten-Bell estimates for smoothing (PTDM). The fourth is Weka Bagging with 50% sample sizes and 16 PTDM sub-models (PTDM_BAG). The fifth is a Cosine-distance Centroid Classifier with TF-IDF (VSM_TI), and the sixth is a KNN with Cosine distances and TF-IDF (KNN_TI). Three voting ensemble configurations are also provided, using all models (vote_11), the 7 individually best-performing models (vote_7), and only the LibLinear models (vote_4). Using the datasets from http://web.ist.utl.pt/~acardoso/datasets/, the example configurations give the following accuracies:

Classifier R8 R52 WebKb 20Ng Cade12 Mean
Weka_MNB_TI 93.15 87.38 84.03 83.25 58.06 81.17
MNB_TI_FB 95.93 92.29 85.89 83.95 58.90 83.39
TDM 96.35 91.94 80.52 83.67 61.43 82.78
PTDM 96.85 91.67 82.09 83.58 57.57 82.35
PTDM_BAG 97.03 91.74 84.31 83.28 57.68 82.81
VSM_TI 95.52 92.25 85.75 78.20 51.69 80.68
KNN_TI 92.87 88.86 84.67 81.30 50.36 79.61
LR_L2R_TI 97.67 94.43 90.83 84.87 58.17 85.19
LR_L1R_TI 96.35 93.26 91.05 80.91 59.85 84.28
SVM_L2R_TI 97.35 95.02 90.11 84.05 53.09 83.92
SVM_L1R_TI 96.98 94.86 91.05 81.88 57.81 84.52
- - - - - - -
vote_11 97.40 94.63 89.18 85.61 62.28 85.82
vote_7 97.85 95.02 91.33 85.71 61.49 86.28
vote_4 97.35 94.94 91.33 84.25 58.55 85.28

How to use outside Weka?

SGMWeka includes the SGM subdirectory at src/main/java/weka/classifiers/bayes/SGM, containing the SGM Java code. This can also be compiled into Java .class files without Weka. SGM_Tests.java is a test script that performs a number of functions based on arguments, including single- and multi-label classification and ranked retrieval. Compiling SGM_Tests produces a program that can be used directly for these functions, using word vector files in the LIBSVM data format.

Aside from command line calls, SGM can be used by simple Java class inclusion into projects. SGM.java is the main class, while the other classes store parameters and required data structures. The SGM.java functions train_model_libsvm(String train_file) and infer_posterior(int[] terms, float[] counts) provide a simple interface for including SGM into projects. SGM_Tests.java gives an example program for using SGM. The Weka wrapper class SparseGenerativeModel.java is another example of use in a Weka Java project.

How to get the text data?

For a quick start with Weka, download the preprocessed .arff files: http://sourceforge.net/projects/sgmweka/files/arff_datasets.zip . Raw text data comes in various formats, most commonly XML and plain text. Preprocessed text datasets for classification come mostly in two formats: .arff files for Weka and the sparse LIBSVM .txt format used by many other classification tools. SGMWeka supports .arff data through the Weka interface and the LIBSVM format through the SGM_Tests.java program. A set of Python scripts is provided for preparing text datasets in the LIBSVM format: http://sourceforge.net/projects/sgmweka/files/preprocessing_scripts.zip . Using the scripts on publicly available datasets produces the LIBSVM feature files: http://sourceforge.net/projects/sgmweka/files/text_datasets.zip .
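As an illustration of the sparse LIBSVM format, a minimal parser for one single-label line could look like the following. This is a hypothetical helper, not the reader SGM_Tests actually uses, and multi-label lines with comma-separated labels are not handled:

```java
// Minimal parser for one single-label line of the sparse LIBSVM format,
// e.g. "3 1:2 7:1" -> label 3, terms [1, 7], counts [2.0, 1.0].
// Hypothetical helper; SGM_Tests has its own reader, and multi-label
// lines with comma-separated labels are not handled here.
class LibsvmLine {
    final int label;
    final int[] terms;
    final double[] counts;

    LibsvmLine(String line) {
        String[] parts = line.trim().split("\\s+");
        label = Integer.parseInt(parts[0]);
        terms = new int[parts.length - 1];
        counts = new double[parts.length - 1];
        for (int i = 1; i < parts.length; i++) {
            String[] kv = parts[i].split(":");   // "index:value" pair
            terms[i - 1] = Integer.parseInt(kv[0]);
            counts[i - 1] = Double.parseDouble(kv[1]);
        }
    }
}
```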

Dataset Source Labels Train docs Terms/doc Dev. fold Eval. docs
TREC06 http://plg.uwaterloo.ca/~gvcormac/treccorpus06/ 2 35039 106.6 1x1000 2783
ECUE1 http://www.dit.ie/computing/staff/sjdelany/datasets/ 2 9978 186.1 5x200 1000
ECUE2 http://www.dit.ie/computing/staff/sjdelany/datasets/ 2 10865 144.1 5x200 1000
ACL-IMDB http://ai.stanford.edu/~amaas/data/sentiment/ 2 47000 136.2 1x2000 3000
TripAdvisor12 http://times.cs.uiuc.edu/~wang296/Data/ 2 60298 105.4 1x4999 10077
Amazon12 http://times.cs.uiuc.edu/~wang296/Data/ 2 267875 30.7 1x9998 100556
R8 http://web.ist.utl.pt/~acardoso/datasets/ 8 2785 77.1 5x200 1396
R52 http://web.ist.utl.pt/~acardoso/datasets/ 52 5485 41.2 5x200 2189
WebKb http://web.ist.utl.pt/~acardoso/datasets/ 4 6532 43.1 5x200 2568
20Ng http://web.ist.utl.pt/~acardoso/datasets/ 20 11293 84.3 5x200 7528
Cade http://web.ist.utl.pt/~acardoso/datasets/ 12 27322 62.3 5x200 13661
RCV1-v2-Ind http://daviddlewis.com/resources/testcollections/rcv1 19587 343117 22.4 1x1000 8644
EUR-Lex http://www.ke.tu-darmstadt.de/resources/eurlex 14240 17381 270.3 1x1000 1933
OHSU-TREC http://trec.nist.gov/data/t9_filtering.html 196415 197555 40.1 1x1000 35890

The first 3 datasets are for spam classification, the next 3 for sentiment analysis, the next 5 for single-label multi-class classification, and the last 3 for large-scale multi-label classification. These files can be used for experiments with SGM_Tests, LibLinear, and other toolkits that read LIBSVM feature files. For reference results comparing some standard classifiers and SGMWeka on these files, see (3,4,7). The OHSU-TREC file is also processed into a ranked retrieval dataset that can be used with SGM_Tests for testing ranking functions.

How to build ensembles?

SGM includes three programs (http://sourceforge.net/projects/sgmweka/files/ensemble_scripts.zip) for optimizing high-performing ensemble solutions:

Metaopt3.py does continuous optimization of program calls, such as runs of SGM_Tests.java, using a random search. It uses a parallelized Gaussian random search algorithm with decreasing step sizes and multiple best points. It supports constraining, transforming and fixing subsets of features. As long as the number of parameters is small, a random search can optimize any performance measure, including non-smooth functions with multiple modes.
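The Gaussian random-search idea behind Metaopt3.py can be sketched as follows; this simplified single-threaded version omits the parallelization, the multiple best points, and the feature constraints of the real script:

```java
import java.util.Random;
import java.util.function.Function;

// Simplified, single-threaded sketch of a Gaussian random search with
// a decaying step size; the real Metaopt3.py is parallelized and keeps
// multiple best points.
class RandomSearch {
    static double[] minimize(Function<double[], Double> f, double[] start,
                             double step, double decay, int iters, long seed) {
        Random rng = new Random(seed);
        double[] best = start.clone();
        double bestVal = f.apply(best);
        for (int it = 0; it < iters; it++) {
            double[] cand = best.clone();
            for (int d = 0; d < cand.length; d++)
                cand[d] += step * rng.nextGaussian();   // Gaussian perturbation
            double v = f.apply(cand);
            if (v < bestVal) { bestVal = v; best = cand; }
            step *= decay;   // shrink the search radius over time
        }
        return best;
    }
}
```

Because only improvements are accepted and the step size decays, this tolerates non-smooth objectives, matching the point made above about optimizing arbitrary performance measures.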

SelectClassifiers3.py does discrete optimization, such as classifier or feature selection for an ensemble model. It uses a parallelized hill-climbing Tabu search, with options for L0-regularization, multiple steps, and greedy local search. It uses the same basic Python scheduler as Metaopt3.py for parallelization.
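A much-simplified stand-in for this kind of discrete selection is greedy forward selection, which repeatedly adds the member that most improves the ensemble score; the Tabu search in SelectClassifiers3.py additionally allows removals and can escape local optima, so this sketch only conveys the basic idea:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Greedy forward selection: repeatedly add the candidate that most
// improves the ensemble score, stopping when no addition helps.
class GreedySelect {
    static List<Integer> select(int n, Function<Set<Integer>, Double> score) {
        Set<Integer> chosen = new LinkedHashSet<>();
        double best = Double.NEGATIVE_INFINITY;
        while (true) {
            int bestAdd = -1;
            double bestVal = best;
            for (int i = 0; i < n; i++) {
                if (chosen.contains(i)) continue;
                Set<Integer> trial = new LinkedHashSet<>(chosen);
                trial.add(i);
                double v = score.apply(trial);
                if (v > bestVal) { bestVal = v; bestAdd = i; }
            }
            if (bestAdd < 0) break;   // no single addition improves the score
            chosen.add(bestAdd);
            best = bestVal;
        }
        return new ArrayList<>(chosen);
    }
}
```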

MetaComb5.java does ensemble combination using an efficient variant of Feature-Weighted Linear Stacking. It uses metafeatures, such as correlations of base-classifier outputs, to predict a vote weight for each base classifier. Task-specific metafeatures and optimization measures can be implemented to give the best performance on a specific task.
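At its core, this kind of combination reduces to a weighted linear combination of base-classifier scores. In Feature-Weighted Linear Stacking the weights are functions of metafeatures; the sketch below fixes them as constants for illustration (names invented, not MetaComb5's API):

```java
// Weighted linear combination of base-classifier score vectors.
// In Feature-Weighted Linear Stacking the weights are predicted from
// metafeatures; here they are fixed constants for illustration.
class WeightedVote {
    // scores[m][c] = score of base classifier m for class c.
    static int combine(double[][] scores, double[] weights) {
        int numClasses = scores[0].length;
        int best = 0;
        double bestSum = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < numClasses; c++) {
            double sum = 0.0;
            for (int m = 0; m < scores.length; m++)
                sum += weights[m] * scores[m][c];
            if (sum > bestSum) { bestSum = sum; best = c; }
        }
        return best;
    }
}
```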

Currently these programs are lightly documented and require modification of the source code for actual use, apart from Metaopt3.py, which is configured using a template file. The programs run on Linux and Cygwin/Windows, with no dependencies aside from Weka for MetaComb5.java.

License

Apache License 2.0, University of Waikato

References:

(1) Puurula, A. Scalable Text Classification with Sparse Generative Modeling. Proceedings of the 12th Pacific Rim International Conference on Artificial Intelligence. 2012
(2) Puurula, A. and Bifet, A. Ensembles of Sparse Multinomial Classifiers for Scalable Text Classification. ECML/PKDD PASCAL Workshop on Large-Scale Hierarchical Classification. 2012
(3) Puurula, A. Combining Modifications to Multinomial Naive Bayes for Text Classification. Asian Information Retrieval Societies Conference. 2012
(4) Puurula, A. and Myaeng, S. Integrated Instance- and Class-based Generative Modeling for Text Classification. Proceedings of the Australasian Document Computing Symposium. 2013
(5) Puurula, A. Cumulative Progress in Language Models for Information Retrieval. Australasian Language Technology Association Workshop. 2013
(6) Puurula, A. and Read, J. and Bifet, A. Kaggle LSHTC4 Winning Solution. 2014
(7) Puurula, A. Scalable Text Mining with Sparse Generative Models. PhD Thesis, Waikato University. 2015

Options in Weka

Weka GUI allows selection of a number of options for classification with SGM.
These have matching command-line options with SGM_Tests.

  • useTFIDF -use_tfidf <int>
    Use TFIDF: 0 = no feature transform, 1 = TF-IDF, 2 = TF, 3 = IDF

  • combination -combination <float>
    Instance score combination: 1 = kernel density, 0 = voting, -1 = distance-weighted voting

  • pruneCountInsert -prune_count_insert <float>
    Log-count pruning value of conditional parameters after each update. If used, typical values are -6 to -10

  • pruneCountTable -prune_count_table <float>
    Log-count pruning value of conditional parameters after training

  • idfLift -idf_lift <float>
    IDF normalization parameter. Higher values give weaker IDF normalization. -1 = Croft-Harper IDF, 0 = Robertson-Walker IDF

  • bgUnifSmooth -bg_unif_smooth <float>
    Uniform smoothing for the background model. 0 = unsmoothed background model, 1 = uniform background model

  • feedbackWeight -feedback <float>
    Feedback model interpolation weight for model-based feedback

  • topK -top_k <int>
    Top k instances for inference with kernel densities. Also the top k results for model-based feedback

  • minCount -min_count <int>
    Minimum document frequency of terms after training. 1 = no terms pruned

  • kernelJelinekMercer -kernel_jelinek_mercer <float>
    Jelinek-Mercer smoothing of instance-conditionals with the class-conditionals

  • kernelDirichletPrior -kernel_dirichlet_prior <float>
    Dirichlet prior smoothing of instance-conditionals with the class-conditionals

  • kernelPowerlawDiscount -kernel_powerlaw_discount <float>
    Power-law discount smoothing of instance-conditionals with the class-conditionals

  • lengthScale -length_scale <float>
    TF length normalization parameter. Higher values give stronger length normalization

  • jelinek_mercer -jelinek_mercer <float>
    Jelinek-Mercer smoothing of class-conditionals with the background model

  • dirichlet_prior -dirichlet_prior <float>
    Dirichlet prior smoothing of class-conditionals with the background model

  • absolute_discount -absolute_discount <float>
    Absolute discount smoothing of class-conditionals with the background model

  • powerlaw_discount -powerlaw_discount <float>
    Power-law discount smoothing of class-conditionals with the background model

  • priorScale -prior_scale <float>
    Scaling of prior probabilities. Equivalent to language model scaling in HMM speech recognition

  • kernelDensities -kernel_densities
    Use instances for inference. Implements kernel densities, or KNN if the topK and combination options are specified

  • localPD -local_pd
    Use locally averaged Kneser-Ney estimates for the power-law discounting parameter

  • localDP -local_dp
    Use locally averaged Witten-Bell estimates for the Dirichlet prior parameter

  • condScale -cond_scale <float>
    Scale conditional parameters after normalization

  • condNorm -cond_norm <float>
    Norm of conditional parameter vectors after normalization, negative for exponentiated parameters. 1.0 = multinomial, -2.0 = cosine

  • noSmoothing -no_smoothing
    No smoothing applied. With sparse data use only with condNorm < 0, to avoid log(0)

  • poolBackoffs -pool_backoffs
    Form estimates for smoothing backoff-nodes by pooling counts, without L1-normalization of counts
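The smoothing options above follow standard language-modeling estimators. A sketch of the two most common forms, with Pml the maximum-likelihood class-conditional and Pbg the background model (SGM combines these with backoff structures, so this is illustrative only):

```java
// Standard language-modeling forms of two of the smoothing options:
//   Jelinek-Mercer: P(t|c) = (1 - a) * Pml(t|c) + a * Pbg(t)
//   Dirichlet:      P(t|c) = (count(t,c) + mu * Pbg(t)) / (len(c) + mu)
// Sketch only; SGM combines these with backoff structures.
class Smoothing {
    static double jelinekMercer(double pMl, double pBg, double a) {
        return (1.0 - a) * pMl + a * pBg;
    }

    static double dirichlet(double count, double classLen, double pBg, double mu) {
        return (count + mu * pBg) / (classLen + mu);
    }
}
```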

Additional options with SGM_Tests.java

  • -workdir <string>
    Work directory for the data files

  • -train_file <string>
    File for gathering statistics for model estimation

  • -load_model <string>
    Load a saved model. Aggregates statistics if train_file is also specified

  • -test_file <string>
    File for evaluating a model

  • -save_model <string>
    Save the model to a file

  • -results_file <string>
    Print evaluation results to a file instead of stdout

  • -batch_size <int>
    Number of instances to process in each batch of model training

  • -cond_hashsize <int>
    Size of the conditional hash table. Maximum number of conditional parameters to store; fixed to 10000000 in the Weka wrapper

  • -label_threshold <float>
    Pruning threshold for max-score pruning of labels in inference. If used, values closer to 0 do more pruning

  • -max_retrieved <int>
    Maximum number of labels to return. With 1, single-label inference is performed; with >1, more labels are returned in ranked order

  • -label_powerset
    Use the powerset method for multi-label classification. Encodes all encountered labelsets with a class identifier,
    converts identifiers back to labelsets after classification

  • -use_label_weights
    Use label-weighted training data. Data must be supplied with the weight of each label for each document

  • -no_priors
    Use uniform priors for posterior inference. Can be useful for ranked retrieval and KNN

  • -load_iqf
    Load Inverse Query Frequency weights from a file, for weighting test documents

  • -iqf_lift
    Lift the estimates used in IQF; works exactly like idf_lift for IDF

  • -load_clusters
    Load clusters from a file for cluster-based smoothing of nodes, using the LSHTC parent-node format

  • -cluster_jelinek_mercer <float>
    Jelinek-Mercer smoothing weight for cluster nodes

  • -rand_seed <int>
    Randomization seed for the SGM model

  • -skip_documents <int>
    Use only every n-th document in training, skipping the others. Use with -rand_seed to train models from fully separate partitions

Example uses with SGM_Tests

  • Single-label classification using standard options, print results to stdout:
    java -Xmx2000M SGM_Tests -test_file test.txt -train_file train.txt -max_retrieved 1

  • Single-label classification, use smoothed kernel densities, prune kernel instances to top 50 for the combination:
    java -Xmx2000M SGM_Tests -test_file test.txt -train_file train.txt -max_retrieved 1 -kernel_densities -kernel_jelinek_mercer 0.5 -top_k 50

  • Single-label classification, prune the model and change default TF-IDF settings:
    java -Xmx2000M SGM_Tests -test_file test.txt -train_file train.txt -max_retrieved 1 -idf_lift 0.5 -length_scale 1.0 -prune_count_table -8.0 -prune_count_insert -4.0

  • Multi-label classification or ranked retrieval, pruning from top 10 instances to maximum 3, using threshold -0.5:
    java -Xmx2000M SGM_Tests -test_file test.txt -train_file train.txt -max_retrieved 3 -top_k 10 -label_threshold -0.5

  • Multi-label classification with the label powerset method:
    java -Xmx2000M SGM_Tests -test_file test.txt -train_file train.txt -label_powerset

  • Save model parameters, without normalization or testing:
    java -Xmx2000M SGM_Tests -train_file train.txt -save_model model.txt -no_normalization

  • Load model parameters, normalize and test:
    java -Xmx2000M SGM_Tests -test_file test.txt -load_model model.txt

