A quick reply for now (I will send a more detailed one in a day or two).
1. If there are n classes, then each "property" yields n features. By definition, a feature is a function of both the property you want to represent and the state/label. For example, if the property is "isCapitalized" and there are 2 classes, there will be 2 features (with different feature ids):
i) isCapitalized is true and state = 0  ( featureID = 0)
ii) isCapitalized is true and state = 1 ( featureID = 1)
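
As a rough illustration of the point above (not the package's actual code), the sketch below enumerates one feature per (property, state) pair. The function name `enumerate_features` and the id scheme `property_index * n_states + state` are assumptions made only for this example.

```python
def enumerate_features(properties, n_states):
    """Return {feature_id: (property_name, state)}, one feature per pair.

    Assumed id scheme (illustrative only): property_index * n_states + state.
    """
    features = {}
    for p_index, name in enumerate(properties):
        for state in range(n_states):
            features[p_index * n_states + state] = (name, state)
    return features


features = enumerate_features(["isCapitalized"], n_states=2)
for feature_id, (name, state) in sorted(features.items()):
    print(f"featureID = {feature_id}: {name} is true and state = {state}")
```

With this scheme the number of features, and hence the length of the weight vector, is the number of properties times the number of states.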

2. If you look at the log-likelihood equation in the paper (Shallow Parsing), you will see that the numerator contains the label sequences of all the training instances, and the denominator contains a normalizing term over all possible label sequences.

For a given position (say pos = 0), we fire features for every state in which the feature can be true (this happens in all the FeatureTypes classes).

Then in the CRF trainer, while computing F(Y|X) for a particular label sequence Y, we take only those feature values whose state matches the state present in that label sequence.

For example, say n=2 [ 0 = other, 1 = NounPhrase], and feature = "isCapitalized", and datasequence = "Today is Thursday ."
Training data: Y = [ 1 0 1 0 ]

We fire 2 features at each position in the sequence whose word is capitalized. So at pos = 0, where the word is capitalized, the features we fire are:
i) isCapitalized=1 and y=0 (featureId=0)
ii) isCapitalized=1 and y=1 (featureId=1)

Note that the length of the feature vector = 2.

Similarly at the rest of the positions, we fire features by looking if the word at that position is capitalized or not.

Now consider the global feature vector F() for different possible label sequences:
For Y = [ 0 0 0 0]
F(Y|X) = f(y = 0, pos=0) + f(y=0, pos=1) + f(y=0, pos=2) + f(y=0, pos=3)
F(Y|X) = [ 1 0 ] + [ 0 0 ] + [ 1 0 ] + [ 0 0]
F(Y|X) = [ 2 0 ]

For Y = [ 1 0 0 0 ]
F(Y|X) = f(y=1, pos=0) + f(y=0, pos=1) + f(y=0, pos=2) + f(y=0, pos=3)
F(Y|X) = [ 0 1 ] + [ 0 0 ] + [ 1 0 ] + [ 0 0]
F(Y|X) = [ 1 1 ]

If you look at the equations in the paper, it is this global feature vector F(,) that is used.
So F(Y|X) needs to be computed for each possible label sequence Y. That is why we fire features for all the possible states; the CRF trainer takes care of generating the proper F(Y|X) for a particular label sequence Y.
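
The worked example above can be sketched in a few lines of Python. This is only an illustration of the idea, not the trainer's code: the helper names `fired_features` and `global_feature_vector` are made up for this example, and the feature id is simply taken to be the state, since there is a single property.

```python
def is_capitalized(word):
    return word[:1].isupper()


def fired_features(words, pos, n_states):
    """Fire the isCapitalized feature for every state at this position.

    Returns (feature_id, state, value) triples. With one property, the
    feature id is assumed to equal the state.
    """
    if not is_capitalized(words[pos]):
        return []
    return [(state, state, 1.0) for state in range(n_states)]


def global_feature_vector(words, labels, n_states):
    """Sum, over positions, the fired features whose state matches the label."""
    total = [0.0] * n_states
    for pos, y in enumerate(labels):
        for feature_id, state, value in fired_features(words, pos, n_states):
            if state == y:
                total[feature_id] += value
    return total


words = ["Today", "is", "Thursday", "."]
print(global_feature_vector(words, [0, 0, 0, 0], 2))  # [2.0, 0.0]
print(global_feature_vector(words, [1, 0, 0, 0], 2))  # [1.0, 1.0]
print(global_feature_vector(words, [1, 0, 1, 0], 2))  # [0.0, 2.0]
```

The last call uses the training label sequence; the same fired features serve every candidate Y, and only the matching step changes.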

3. The answer to your second question is fairly simple. All the features are mapped to a contiguous array, and each feature is given a unique id. You can look at the iitb.Model.FeatureGenImpl class to see the implementation details.
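
To make the idea of a contiguous id space concrete, here is a minimal sketch. It is not how FeatureGenImpl is written; the class `FeatureIndex` and its method `id_for` are hypothetical names used only to show one way of assigning unique, contiguous ids that can index a weight vector.

```python
class FeatureIndex:
    """Assigns each distinct (feature name, state) pair the next free id."""

    def __init__(self):
        self._ids = {}

    def id_for(self, name, state):
        key = (name, state)
        if key not in self._ids:
            self._ids[key] = len(self._ids)  # ids are 0, 1, 2, ... with no gaps
        return self._ids[key]

    def __len__(self):
        return len(self._ids)


index = FeatureIndex()
for name in ["isCapitalized", "isAllCaps"]:
    for state in range(2):
        index.id_for(name, state)

weights = [0.0] * len(index)  # one weight (lambda) per feature id
print(len(weights), index.id_for("isAllCaps", 1))
```

Looking up the same pair again returns the id it was given the first time, so the ids used at test time line up with the trained weights.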

4. Now about unseen words encountered during testing. WordFeatures is the feature type that fires all the word features. There is an integer parameter called RARE_THRESHOLD. Any word that is not seen at least RARE_THRESHOLD times in the training data is considered a rare/unknown word and is not fired as a feature. There is another feature type called UnknownFeatures, which is fired only for such rare words.

So UnknownFeatures is fired for any word that is seen only in the test data (its frequency in the training data is 0, which is less than RARE_THRESHOLD).
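
The thresholding rule can be sketched as follows. This is an illustration under assumptions, not the package's WordFeatures or UnknownFeatures code: the value 2 for RARE_THRESHOLD and the helpers `build_vocabulary` and `word_feature` are chosen only for the example.

```python
from collections import Counter

RARE_THRESHOLD = 2  # assumed value, for illustration only


def build_vocabulary(training_words, rare_threshold=RARE_THRESHOLD):
    """Keep only the words seen at least `rare_threshold` times in training."""
    counts = Counter(training_words)
    return {word for word, count in counts.items() if count >= rare_threshold}


def word_feature(word, vocabulary):
    """Fire a word feature for known words, an unknown-word feature otherwise."""
    if word in vocabulary:
        return ("word", word)
    return ("unknown", None)


vocabulary = build_vocabulary(["main", "street", "main", "road", "main"])
print(word_feature("main", vocabulary))     # ('word', 'main')
print(word_feature("street", vocabulary))   # ('unknown', None): seen once in training
print(word_feature("avenue", vocabulary))   # ('unknown', None): never seen in training
```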

5. The last question is a bit confusing. First try to understand the meaning of the feature vector and how it is implemented in the CRF package. If you need specific details about any class/package, please send a mail.

6. Excellent documentation on using the CRF package is given in

Hope this helps.


On 1/25/07, crf-users-request@lists.sourceforge.net < crf-users-request@lists.sourceforge.net> wrote:

Today's Topics:

   1. CRF Implementation (debanjan.ghosh@thomson.com)


Message: 1
Date: Wed, 24 Jan 2007 09:13:50 -0500
From: <debanjan.ghosh@thomson.com>
Subject: [Crf-users] CRF Implementation
To: < crf-users@lists.sourceforge.net>

> Dear Dr. Sarawagi,
> I am working in the Information Extraction research group of Thomson
> Corp. and recently got a chance to use the CRF package that you have
> created. Thanks for the excellent implementation; it is really a very
> useful module, and personally I feel more comfortable with it than with
> any other available implementation.
> However, I have a few very basic doubts regarding the code, especially
> on the usage and values of the weight vector (lambda) during the
> training procedure. I will be grateful if you can clarify them.
> Firstly, I started working with the same data corpus (address
> sequence) you have implemented in the sample example. From my
> understanding of the original McCallum paper I thought the lambda vector
> (weight vector) will have the same length as the number of feature
> functions. I generated 4 state feature functions (depending upon the
> address data) and 3 transition (emission) feature functions. So the
> weight vector has a length of 7 in my case. Whenever I find any feature
> in the training data (that is, if the data passes any particular
> feature function, which is a boolean function) I update the lambda of
> the particular feature function index. Whereas, in the Java CRF
> implementation, the weight vector has a length equal to the total number
> of possible feature vectors (I think it is 220). This is a little
> confusing for me.
> Secondly, I checked that the generated features are composed of
> some functions (all caps, alpha-numeric property, the "word" etc.).
> Based on the "feature index" for the weight vector, you apply the
> Viterbi algorithm. My doubt is, while finding the index in the weight
> vector (during evaluation), how do you match the index of the trained
> weight vector? In the training implementation every word is a feature in
> your case; if some unseen word (which is very possible) occurs during
> the testing procedure, then do you index this word as "unseen feature"?
> Lastly, many words can be represented as various features (such as the
> "word", alphanumeric or not, starts with caps etc.). While finding the
> index of the weight vector to match during evaluation, how do you select
> which feature index is to be used (from the weight vector)? Do you give
> any particular feature a higher weight than the others (say, if the word
> matches the training data, it has higher priority)?
> Thanks in advance,
> Regards,
> Debanjan


End of Crf-users Digest, Vol 5, Issue 3