Featurama is a library that implements various sequence-labeling algorithms.
Currently, Michael Collins' averaged perceptron algorithm is fully
implemented.

About Featurama
---------------
Main highlights:
- high performance (implemented in C)
- n-gram hidden Markov model (n may be any number greater than 1)
- perceptron may use 32-bit or 64-bit coefficients
- C-style library interface, easy to use from any programming language
  (using SWIG or otherwise)
- supports pruning of Viterbi states
- works on Linux (x86, x64) and Windows (using the MinGW compiler)

Requirements
------------
- GCC 4+, make
- autotools (autoconf, automake, libtool)
- for the Perl wrapper, you need SWIG 1.3.17+ and MakeMaker 0.28+
- for tests, you need wget, unzip, tar, xsltproc and a working internet
  connection

License
-------
GPLv3

Installation
------------
$ ./configure; make; make check; make install

- configure options:
  --enable-perl           enable Perl bindings (use this if you want to use
                          Featurama from Perl)
  --disable-64bit-alpha   use 32-bit alpha coefficients for the perceptron
                          algorithm (smaller memory footprint, may be faster)
  --enable-debug          turn on debugging support

- Win32 installation:
  To build Featurama for Win32 (*.exe), use MinGW (http://www.mingw.org/) as
  the compiler, e.g.:
  $ ./configure --host=mingw32
  or
  $ ./configure --host=i586-mingw32msvc

Memory usage and speed tips:
- disable asserts (-DNDEBUG)

File Formats
------------
1) Data file format

Data are stored in a tabular format where values are separated by spaces.
The first row is a header with feature names; all other rows are data, one
item (to be labelled) per line. Every column specifies one feature value, and
the last column is the selected label. In every data row, the feature columns
are followed by a set of possible labels. An example of POS-tagging input may
look like this:

Form     Prefix1  Suffix1  Num  Tag
The      T        e        0    DT   DT NNP
Arizona  A        a        0    NNP  FW JJ NN NNS RB VB VBP NNP NNPS

Form, Prefix1, Suffix1 and Num are feature names, Tag is the label name, and
the remaining (unlabeled) columns form the set of possible labels the
algorithm should choose from (for "The" there are two possible labels, DT and
NNP, of which DT is the correct one). Sentences are separated by an empty
line.

This data format is used for making features, for training and for testing.
If you actually want to run the classifier on real data, set all label (Tag)
values to "NULL" and the algorithm selects one label for every row.

2) Feature templates file format

Features are specified one feature template per line. Every feature template
contains one or more parts separated by a slash. Each part has the form
%[A,B], where A is a relative position in the sentence and B is a feature
name. Usually, one part of a feature is the label predicted for the current
word.

For the POS-tagging example, a feature template may look at the previous two
tags and the current one:

%[0,Tag]/%[-1,Tag]/%[-2,Tag]

Another feature may look at the prefix of the current word:

%[0,Tag]/%[0,Prefix2]

A complete feature template file for English POS-tagging may look like this:

# Template set: Collins, data: WSJ (English)
%[0,Tag]/%[0,Form]
%[0,Tag]/%[0,Suffix1]
%[0,Tag]/%[0,Suffix2]
%[0,Tag]/%[0,Suffix3]
%[0,Tag]/%[0,Suffix4]
%[0,Tag]/%[0,Prefix1]
%[0,Tag]/%[0,Prefix2]
%[0,Tag]/%[0,Prefix3]
%[0,Tag]/%[0,Prefix4]
%[0,Tag]/%[0,Num]
%[0,Tag]/%[0,Cap]
%[0,Tag]/%[0,Dash]
%[0,Tag]/%[-1,Tag]
%[0,Tag]/%[-1,Tag]/%[-2,Tag]
%[0,Tag]/%[-1,Form]
%[0,Tag]/%[-2,Form]
%[0,Tag]/%[1,Form]
%[0,Tag]/%[2,Form]
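To make the %[A,B] notation concrete, here is a small illustrative sketch (in
Python, separate from Featurama's C implementation) of how such a template
could be instantiated at one token position. The dictionary-based sentence
representation and the "<s>" padding for positions outside the sentence are
assumptions made for the example only; in Featurama itself this expansion is
performed by the "perc make_features" step described under Basic Usage.

    # Illustrative only: expand a template of the form %[A,B]/%[A,B]/...
    # against one token position of a sentence.
    import re

    PART = re.compile(r"%\[(-?\d+),(\w+)\]")

    def expand(template, sentence, i):
        """sentence: list of dicts mapping column names to values; i: token index."""
        parts = []
        for offset, column in PART.findall(template):
            j = i + int(offset)
            if 0 <= j < len(sentence):
                parts.append(sentence[j][column])
            else:
                parts.append("<s>")   # assumed padding for out-of-sentence positions
        return "/".join(parts)

    sentence = [
        {"Form": "The",     "Prefix2": "Th", "Tag": "DT"},
        {"Form": "Arizona", "Prefix2": "Ar", "Tag": "NNP"},
    ]
    print(expand("%[0,Tag]/%[-1,Tag]", sentence, 1))     # prints: NNP/DT
    print(expand("%[0,Tag]/%[0,Prefix2]", sentence, 1))  # prints: NNP/Ar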
It is also possible to use so-called feature functions to transform values
and select row indices at run time (see the "Feature functions" section
below).

Basic Usage
-----------
To use the program, you need a feature templates file (test.ft) and
train.data + test.data files (in the format described above).

First, we use the feature template file and the training data together to
build real features (feature templates with filled-in values) and a
dictionary (string-to-integer translation):

$ perc make_features -t test.ft -F test.f -D test.dict train.data

Two files are generated: test.f and test.dict. Then you may start training
(in our example, 10 iterations):

$ perc train -f test.f -d test.dict -i 10 -A test.alpha -v train.data
0 kw processed
Iteration: 1, words processed: 6322, accuracy: 80.734 (80.244)
Iteration: 2, words processed: 6322, accuracy: 92.233 (92.186)
Iteration: 3, words processed: 6322, accuracy: 96.473 (96.441)
Iteration: 4, words processed: 6322, accuracy: 98.165 (98.134)
Iteration: 5, words processed: 6322, accuracy: 98.244 (98.165)
Iteration: 6, words processed: 6322, accuracy: 98.893 (98.861)
Iteration: 7, words processed: 6322, accuracy: 99.510 (99.478)
Iteration: 8, words processed: 6322, accuracy: 99.304 (99.272)
Iteration: 9, words processed: 6322, accuracy: 99.415 (99.383)
Iteration: 10, words processed: 6322, accuracy: 99.889 (99.858)

The output of this step is the test.alpha file, which contains the trained
coefficients. Now you can simply run the testing process on the test data:

$ perc test -f test.f -d test.dict -a test.alpha test.data >/dev/null
982 words processed, accuracy: 84.623 (83.707)

For running the tagging task itself, just prepare the data in the same format
as test.data, but with the correct tag replaced by "NULL". The standard
output of the "perc test" command contains a tag for every input word. For
additional examples, see the test directory, which contains scripts for both
training and testing.
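As a small sketch of that preparation step (illustrative Python, not a
Featurama tool), the snippet below copies a file in the data format described
above while replacing the Tag column with "NULL". It assumes the five-column
header of the POS-tagging example and leaves the candidate labels and the
empty sentence-separator lines untouched.

    # Illustrative only: hide the gold Tag column so the tagger must choose.
    # Assumes the header "Form Prefix1 Suffix1 Num Tag" from the example above.
    LABEL_COL = 4   # 0-based index of the Tag column

    with open("test.data") as src, open("untagged.data", "w") as dst:
        dst.write(src.readline())                # copy the header row unchanged
        for line in src:
            if not line.strip():                 # keep empty sentence separators
                dst.write(line)
                continue
            cols = line.split()
            cols[LABEL_COL] = "NULL"             # candidate labels after it stay as they are
            dst.write(" ".join(cols) + "\n")

The resulting untagged.data (an assumed file name) can then be passed to the
same "perc test" invocation shown above, and the chosen tags are read from
its standard output.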
Advanced Usage
--------------
It is possible to optimize the whole process in several ways:

- select the number of training iterations (iterations); the default is to
  perform just one iteration.
- keep only an arbitrary number of Viterbi states for every word (prune); the
  default is to keep all states.
- show more than one solution (nbest); the default is to show only the best
  solution.
- change the order of the n-gram model (order); the default is to use
  3-grams.
- filter features by frequency (min-freq); the default is to keep all
  features, even if they appear just once in the training data.

This functionality is selected using switches of the perc program. To see the
help, run just "perc make_features", "perc train" or "perc test".

Feature functions
-----------------
It is also possible to create feature functions: C functions that transform a
label in some way. This makes it possible to use features whose values cannot
be computed in advance and are only known at run time.

There are two types of feature functions: row and column. The former selects
the row we look at (the A part), the latter transforms the label in some way
(the B part). An example of a row feature function is "look for the index of
the previous verb tag"; an example of a column feature function is "give me a
substring of a label". To find out more about feature functions, see the
lib/feature_functions/* examples.

Most of the time, however, feature functions are not necessary, as most
feature values can be computed in advance (such as a word prefix or suffix).
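As an illustration of computing such values in advance, here is a minimal
sketch (in Python, separate from Featurama itself) that writes a training
file in the data format from the "File Formats" section, deriving the
Prefix1, Suffix1 and Num columns directly from the word form. The toy
sentence, the use of the full tag set as candidate labels, and the train.data
file name are assumptions made for the example only.

    # Illustrative only: precompute simple feature columns (prefix, suffix,
    # digit flag) and write them in the space-separated data format.

    sentences = [
        [("The", "DT"), ("Arizona", "NNP"), ("Corporations", "NNPS"),
         ("Commission", "NNP"), ("authorized", "VBD")],
    ]

    # For this toy example, every token simply gets the full tag set as candidates.
    all_tags = sorted({tag for sent in sentences for _, tag in sent})

    with open("train.data", "w") as out:
        out.write("Form Prefix1 Suffix1 Num Tag\n")
        for sent in sentences:
            for form, tag in sent:
                prefix1 = form[0]                                     # first character
                suffix1 = form[-1]                                    # last character
                num = "1" if any(c.isdigit() for c in form) else "0"  # contains a digit?
                out.write(" ".join([form, prefix1, suffix1, num, tag] + all_tags) + "\n")
            out.write("\n")                                           # empty line between sentences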