Featurama is a library that implements various sequence-labeling algorithms.
Currently, Michael Collins' averaged perceptron algorithm is fully
implemented.

About Featurama
---------------
Main highlights:
- high performance (implemented in C)
- n-gram hidden Markov model (n may be any number greater than 1)
- perceptron may use 32-bit or 64-bit coefficients
- C-style library interface, easy to use from any programming language
  (using SWIG or otherwise)
- supports pruning of Viterbi states
- works on Linux (x86, x64) and Windows (using the MinGW compiler)

Requirements
------------
- GCC 4+, make
- autotools (autoconf, automake, libtool)
- for the Perl wrapper, you need SWIG 1.3.17+ and MakeMaker 0.28+
- for tests, you need wget, unzip, tar, xsltproc and a working internet
  connection

License
-------
GPLv3

Installation
------------
$ ./configure; make; make check; make install

- configure options:
  --enable-perl           enable Perl bindings (use this if you want to use
                          Featurama from Perl)
  --disable-64bit-alpha   use 32-bit alpha coefficients for the perceptron
                          algorithm (smaller memory footprint, may be faster)
  --enable-debug          turn on debugging support

- Win32 installation:
  To build Featurama for Win32 (*.exe), use MinGW (http://www.mingw.org/) as
  the compiler, e.g.:
  $ ./configure --host=mingw32
  or
  $ ./configure --host=i586-mingw32msvc

Memory usage and speed tips:
- disable asserts (-DNDEBUG)

File Formats
------------
1) Data file format

Data are stored in a tabular format where values are separated by spaces.
The first row is a header with feature names; all other rows are data, one
item (to be labelled) per line. Every column specifies one feature value, and
the last column is the selected label. In every data row, the feature columns
are followed by a set of possible labels. An example of POS-tagging input may
look like this:

Form     Prefix1  Suffix1  Num  Tag
The      T        e        0    DT   DT NNP
Arizona  A        a        0    NNP  FW JJ NN NNS RB VB VBP NNP NNPS

Form, Prefix1, Suffix1 and Num are feature names, Tag is the label name, and
the remaining (unlabeled) columns form the set of possible labels the
algorithm should choose from (for "The" there are two possible labels, DT and
NNP, of which DT is the correct one). Sentences are separated by an empty
line.

This data format is used for making features, for training and for testing.
If you actually want to run the classifier on real data, set all label (Tag)
values to "NULL" and the algorithm selects one label for every row.

2) Feature templates file format

Features are specified one feature template per line. Every feature template
contains one or more parts separated by a slash. Each part has the form
%[A,B], where A is a relative position in the sentence and B is a feature
name. Usually, one part of a feature is the label predicted for the current
word.

For the POS-tagging example, a feature template may look at the previous two
tags and the current one:

%[0,Tag]/%[-1,Tag]/%[-2,Tag]

Another feature may look at the prefix of the current word:

%[0,Tag]/%[0,Prefix2]

A complete feature template file for English POS-tagging may look like this:

# Template set: Collins, data: WSJ (English)
%[0,Tag]/%[0,Form]
%[0,Tag]/%[0,Suffix1]
%[0,Tag]/%[0,Suffix2]
%[0,Tag]/%[0,Suffix3]
%[0,Tag]/%[0,Suffix4]
%[0,Tag]/%[0,Prefix1]
%[0,Tag]/%[0,Prefix2]
%[0,Tag]/%[0,Prefix3]
%[0,Tag]/%[0,Prefix4]
%[0,Tag]/%[0,Num]
%[0,Tag]/%[0,Cap]
%[0,Tag]/%[0,Dash]
%[0,Tag]/%[-1,Tag]
%[0,Tag]/%[-1,Tag]/%[-2,Tag]
%[0,Tag]/%[-1,Form]
%[0,Tag]/%[-2,Form]
%[0,Tag]/%[1,Form]
%[0,Tag]/%[2,Form]
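To make the %[A,B] notation concrete, here is a small illustrative sketch (in
Python, separate from Featurama's C implementation) of how such a template
could be instantiated at one token position. The dictionary-based sentence
representation and the "<s>" padding for positions outside the sentence are
assumptions made for the example only; in Featurama itself this expansion is
performed by the "perc make_features" step described under Basic Usage.

    # Illustrative only: expand a template of the form %[A,B]/%[A,B]/...
    # against one token position of a sentence.
    import re

    PART = re.compile(r"%\[(-?\d+),(\w+)\]")

    def expand(template, sentence, i):
        """sentence: list of dicts mapping column names to values; i: token index."""
        parts = []
        for offset, column in PART.findall(template):
            j = i + int(offset)
            if 0 <= j < len(sentence):
                parts.append(sentence[j][column])
            else:
                parts.append("<s>")   # assumed padding for out-of-sentence positions
        return "/".join(parts)

    sentence = [
        {"Form": "The",     "Prefix2": "Th", "Tag": "DT"},
        {"Form": "Arizona", "Prefix2": "Ar", "Tag": "NNP"},
    ]
    print(expand("%[0,Tag]/%[-1,Tag]", sentence, 1))     # prints: NNP/DT
    print(expand("%[0,Tag]/%[0,Prefix2]", sentence, 1))  # prints: NNP/Ar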
It is also possible to use so-called feature functions to transform values
and select row indices at run time (see the "Feature functions" section
below).

Basic Usage
-----------
To use the program, you need a feature templates file (test.ft) and
train.data + test.data files (in the format described above).

First, we use the feature template file and the training data together to
build real features (feature templates with filled-in values) and a
dictionary (string-to-integer translation):

$ perc make_features -t test.ft -F test.f -D test.dict train.data

Two files are generated: test.f and test.dict. Then you may start training
(in our example, 10 iterations):

$ perc train -f test.f -d test.dict -i 10 -A test.alpha -v train.data
0 kw processed
Iteration: 1, words processed: 6322, accuracy: 80.734 (80.244)
Iteration: 2, words processed: 6322, accuracy: 92.233 (92.186)
Iteration: 3, words processed: 6322, accuracy: 96.473 (96.441)
Iteration: 4, words processed: 6322, accuracy: 98.165 (98.134)
Iteration: 5, words processed: 6322, accuracy: 98.244 (98.165)
Iteration: 6, words processed: 6322, accuracy: 98.893 (98.861)
Iteration: 7, words processed: 6322, accuracy: 99.510 (99.478)
Iteration: 8, words processed: 6322, accuracy: 99.304 (99.272)
Iteration: 9, words processed: 6322, accuracy: 99.415 (99.383)
Iteration: 10, words processed: 6322, accuracy: 99.889 (99.858)

The output of this step is the test.alpha file, which contains the trained
coefficients. Now you can simply run the testing process on the test data:

$ perc test -f test.f -d test.dict -a test.alpha test.data >/dev/null
982 words processed, accuracy: 84.623 (83.707)

For running the tagging task itself, just prepare the data in the same format
as test.data, but with the correct tag replaced by "NULL". The standard
output of the "perc test" command contains a tag for every input word. For
additional examples, see the test directory, which contains scripts for both
training and testing.
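As a small sketch of that preparation step (illustrative Python, not a
Featurama tool), the snippet below copies a file in the data format described
above while replacing the Tag column with "NULL". It assumes the five-column
header of the POS-tagging example and leaves the candidate labels and the
empty sentence-separator lines untouched.

    # Illustrative only: hide the gold Tag column so the tagger must choose.
    # Assumes the header "Form Prefix1 Suffix1 Num Tag" from the example above.
    LABEL_COL = 4   # 0-based index of the Tag column

    with open("test.data") as src, open("untagged.data", "w") as dst:
        dst.write(src.readline())                # copy the header row unchanged
        for line in src:
            if not line.strip():                 # keep empty sentence separators
                dst.write(line)
                continue
            cols = line.split()
            cols[LABEL_COL] = "NULL"             # candidate labels after it stay as they are
            dst.write(" ".join(cols) + "\n")

The resulting untagged.data (an assumed file name) can then be passed to the
same "perc test" invocation shown above, and the chosen tags are read from
its standard output.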
Advanced Usage
--------------
It is possible to optimize the whole process in several ways:

- select the number of training iterations (iterations); the default is to
  perform just one iteration.
- keep only an arbitrary number of Viterbi states for every word (prune); the
  default is to keep all states.
- show more than one solution (nbest); the default is to show only the best
  solution.
- change the order of the n-gram model (order); the default is to use
  3-grams.
- filter features by frequency (min-freq); the default is to keep all
  features, even if they appear just once in the training data.

This functionality is selected using switches of the perc program. To see the
help, run just "perc make_features", "perc train" or "perc test".

Feature functions
-----------------
It is also possible to create feature functions: C functions that transform a
label in some way. This makes it possible to use features whose values cannot
be computed in advance and are only known at run time.

There are two types of feature functions: row and column. The former selects
the row we look at (the A part), the latter transforms the label in some way
(the B part). An example of a row feature function is "look for the index of
the previous verb tag"; an example of a column feature function is "give me a
substring of a label". To find out more about feature functions, see the
lib/feature_functions/* examples.

Most of the time, however, feature functions are not necessary, as most
feature values can be computed in advance (such as a word prefix or suffix).
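As an illustration of computing such values in advance, here is a minimal
sketch (in Python, separate from Featurama itself) that writes a training
file in the data format from the "File Formats" section, deriving the
Prefix1, Suffix1 and Num columns directly from the word form. The toy
sentence, the use of the full tag set as candidate labels, and the train.data
file name are assumptions made for the example only.

    # Illustrative only: precompute simple feature columns (prefix, suffix,
    # digit flag) and write them in the space-separated data format.

    sentences = [
        [("The", "DT"), ("Arizona", "NNP"), ("Corporations", "NNPS"),
         ("Commission", "NNP"), ("authorized", "VBD")],
    ]

    # For this toy example, every token simply gets the full tag set as candidates.
    all_tags = sorted({tag for sent in sentences for _, tag in sent})

    with open("train.data", "w") as out:
        out.write("Form Prefix1 Suffix1 Num Tag\n")
        for sent in sentences:
            for form, tag in sent:
                prefix1 = form[0]                                     # first character
                suffix1 = form[-1]                                    # last character
                num = "1" if any(c.isdigit() for c in form) else "0"  # contains a digit?
                out.write(" ".join([form, prefix1, suffix1, num, tag] + all_tags) + "\n")
            out.write("\n")                                           # empty line between sentences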