Is the feature extractor of NameFinder only using the tokens seen in the training set, or is it able to use structured gazetteer features such as a list of international first names, organization abbreviations, and so on?
If not, do you think it could be a good idea to package such feature extractors as pre-trained Bloom filter vectors built from lists coming from Wikipedia or Freebase, for instance?
AFAIK the Lucene, Hadoop, and Cassandra projects already provide optimized implementations of Bloom filters under the ASL license.
It's possible to modify the built-in feature generation and write a feature generator which exploits an external dictionary resource, such as a database of first and last names. We would like to add support for a dictionary-based feature generator.
The current implementation is not good enough, and we have to come up with a new set of features generated
from a dictionary lookup; in my opinion, the lookup feature should be combined with token context features.
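To make the idea concrete, here is a minimal sketch of what such a generator could look like. The class name, method signature, and feature strings are all invented for illustration and are not the NameFinder API; the point is only that a dictionary hit is emitted both on its own and combined with the neighboring tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical dictionary-based feature generator: on a lookup hit it emits
// the plain dictionary feature plus features combining the hit with the
// previous and next token (the token context).
public class DictionaryFeatureGenerator {
    private final Set<String> firstNames;

    public DictionaryFeatureGenerator(Set<String> firstNames) {
        this.firstNames = firstNames;
    }

    public List<String> createFeatures(String[] tokens, int index) {
        List<String> features = new ArrayList<>();
        String token = tokens[index].toLowerCase();
        if (firstNames.contains(token)) {
            features.add("dict=firstname");
            // Combine the lookup with token context features.
            if (index > 0)
                features.add("dict=firstname,prev=" + tokens[index - 1].toLowerCase());
            if (index + 1 < tokens.length)
                features.add("dict=firstname,next=" + tokens[index + 1].toLowerCase());
        }
        return features;
    }

    public static void main(String[] args) {
        DictionaryFeatureGenerator g =
            new DictionaryFeatureGenerator(Set.of("john", "maria"));
        System.out.println(g.createFeatures(new String[] {"Mr.", "John", "Smith"}, 1));
    }
}
```

A real integration would plug this into the existing feature generation chain instead of standing alone, but the feature shapes would be the same.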
Even relatively large dictionaries can fit into memory; that's why our focus should be on generating better features first,
before we start to scale the dictionary. But work on a Bloom filter implementation is very welcome; we also have
plans to add a Bloom-filter-based language model.
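For the gazetteer use case, the Bloom filter only needs set-membership queries, which keeps the memory footprint small at the cost of occasional false positives (never false negatives). Below is a minimal self-contained sketch; the class is invented for illustration and is not the Lucene, Hadoop, or Cassandra implementation. It derives k hash positions from two base hashes, a common trick to avoid k independent hash functions.

```java
import java.util.BitSet;

// Minimal Bloom filter sketch for dictionary membership (illustration only).
public class NameBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int numHashes;

    public NameBloomFilter(int size, int numHashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.numHashes = numHashes;
    }

    // Derive the i-th hash from two base hashes: h1 + i * h2.
    private int hash(String item, int i) {
        int h1 = item.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9e3779b9;
        return Math.floorMod(h1 + i * h2, size);
    }

    public void add(String item) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(hash(item, i));
        }
    }

    // May return a false positive, but never a false negative.
    public boolean mightContain(String item) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(hash(item, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        NameBloomFilter names = new NameBloomFilter(1 << 16, 3);
        names.add("john");
        names.add("maria");
        System.out.println(names.mightContain("john"));  // true (added items always hit)
    }
}
```

A pre-trained gazetteer could then be shipped as the serialized bit vector plus the two parameters (size and number of hashes), which is what makes the "pre-trained Bloom filter vectors" idea attractive for distribution.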
If you want to work on this, you need a corpus you can train the name finder on. Luckily, we now have support for
the Conll03 and Conll02 data, depending on which language you prefer. James Kosin is also working on this using
the English Reuters data from Conll03. There is a wiki page which describes how to create training data out
of the Conll03 data:
Training data creation is very similar for Conll02, but is still undocumented.
Here is a paper about a Conll03 NER system where they compare the performance they
get depending on the feature generation; one feature generation strategy uses a dictionary.
With the dictionary (see Table 2, the row where they add feature I) they get an improvement
of around one percent in both recall and precision.