I have a question similar to one that I saw unanswered in the archives (http://sourceforge.net/mailarchive/message.php?msg_id=18752494). I am trying to use your CRF implementation for POS tagging, and I have the initial code up and running on the Penn Treebank ATIS corpus. I would also like to use orthographic features from the data for training. For example, in addition to the word itself, I would supply features indicating capitalization (caps), common English suffixes (e.g. -ing and -s), and whether the word starts with a number.
Currently I am able to run the program on data in which each line consists of a word token separated from its part of speech by a | delimiter, with sentences separated by blank lines.
I would like to modify this data and use it in the format below, where each word carries 2 features:
<Word1> <feature1> <feature2>|<number_indicating_POS_tag>
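To make the target format concrete, here is a rough sketch of how I imagine generating it from the current word|tag lines. It is only an illustration of the data layout, not of anything in the CRF package: the function names, the feature labels (CAPS, SUF_ing, and so on), the choice of caps and suffix as the two features, and the mapping of tags to integers in order of first appearance are all my own assumptions.

```python
def orthographic_features(word):
    """Return two illustrative feature strings for a token: (caps, suffix)."""
    # Assumed feature 1: does the token start with an uppercase letter?
    caps = "CAPS" if word[:1].isupper() else "NOCAPS"
    # Assumed feature 2: which of the common suffixes (-ing, -s) it ends with.
    lower = word.lower()
    if lower.endswith("ing"):
        suffix = "SUF_ing"
    elif lower.endswith("s"):
        suffix = "SUF_s"
    else:
        suffix = "SUF_none"
    return caps, suffix


def convert_line(line, tag_ids):
    """Turn 'word|TAG' into 'word feature1 feature2|tag_number'."""
    # Split on the last '|' so a token containing '|' stays intact.
    word, tag = line.rsplit("|", 1)
    # Assumption: tags are numbered in order of first appearance.
    tag_id = tag_ids.setdefault(tag, len(tag_ids))
    caps, suffix = orthographic_features(word)
    return f"{word} {caps} {suffix}|{tag_id}"


def convert_corpus(lines):
    """Convert all lines, keeping blank lines as sentence separators."""
    tag_ids = {}
    converted = []
    for raw in lines:
        line = raw.strip()
        converted.append(convert_line(line, tag_ids) if line else "")
    return converted, tag_ids
```

A feature for words that start with a number could be added as a further column in the same way (for example with `word[:1].isdigit()`).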
It would be really helpful if you could tell me how these features should be passed to the CRF module during training and testing. I have gone through the code and previous posts on the mailing list, but orthographic features that are part of the training data itself do not seem to have been addressed.