Efficient gazetteer features for NameFinderME

2010-11-02
2013-04-16
  • Olivier Grisel
    2010-11-02

    Hi all,

    Is the feature extractor of NameFinder using only the tokens seen in the training set, or can it also use structured gazetteer features such as lists of international first names, organization abbreviations, and so on?

    If not, do you think it would be a good idea to package such feature extractors as pre-built Bloom filter bit vectors populated from lists taken from Wikipedia or Freebase, for instance?

    AFAIK the Lucene, Hadoop and Cassandra projects already provide optimized Bloom filter implementations under the Apache License (ASL).
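
To make the idea concrete, here is a minimal sketch of a Bloom filter used as a gazetteer membership test. Everything in it is an assumption for illustration: the class name `GazetteerBloomFilter`, the FNV-1a hash with double hashing, and the lowercase normalization are not taken from OpenNLP or from the Lucene, Hadoop or Cassandra implementations, which would be the better choice in practice.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;
import java.util.Locale;

/** Hypothetical Bloom filter over gazetteer entries; not part of OpenNLP. */
final class GazetteerBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    GazetteerBloomFilter(int numBits, int numHashes) {
        if (numBits <= 0 || numHashes <= 0) {
            throw new IllegalArgumentException("numBits and numHashes must be positive");
        }
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    /** Adds one gazetteer entry, e.g. a first name. */
    void add(String entry) {
        long h = fnv1a64(normalize(entry));
        int h1 = (int) h;
        int h2 = ((int) (h >>> 32)) | 1; // force an odd step for double hashing
        for (int i = 0; i < numHashes; i++) {
            bits.set(Math.floorMod(h1 + i * h2, numBits));
        }
    }

    /** False means "definitely absent"; true means "probably present". */
    boolean mightContain(String token) {
        long h = fnv1a64(normalize(token));
        int h1 = (int) h;
        int h2 = ((int) (h >>> 32)) | 1;
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(Math.floorMod(h1 + i * h2, numBits))) {
                return false;
            }
        }
        return true;
    }

    // Assumed normalization: trim and lowercase so lookups ignore case.
    private static String normalize(String s) {
        return s.trim().toLowerCase(Locale.ROOT);
    }

    // 64-bit FNV-1a over the UTF-8 bytes of the string.
    private static long fnv1a64(String s) {
        long hash = 0xcbf29ce484222325L;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xff);
            hash *= 0x100000001b3L;
        }
        return hash;
    }
}
```

The filter never reports a false negative for an entry that was added, but it can report false positives, so a feature derived from it would be a noisy "probably in the list" signal rather than an exact lookup.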

     
  • Joern Kottmann
    2010-11-09

    Hi,

    It's possible to modify the built-in feature generation and write a feature generator that exploits an external dictionary resource, such as a database of first and last names.  We would like to add support for dictionary-based feature generators.

    The current implementation is not good enough, and we have to come up with a new set of features generated
    from a dictionary lookup. In my opinion, the lookup feature should be combined with token context features.
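
As a rough illustration of combining a dictionary lookup with token context, here is a standalone sketch. It is only an assumption of what such features could look like: the class `DictionaryContextFeatures`, the feature string format, and the `<s>`/`</s>` boundary markers are made up for this example, and the class does not implement any OpenNLP interface.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical generator: dictionary hit plus neighbouring tokens. */
final class DictionaryContextFeatures {
    private final String prefix;
    private final Set<String> dictionary = new HashSet<>();

    DictionaryContextFeatures(String prefix, Set<String> entries) {
        this.prefix = prefix;
        for (String entry : entries) {
            dictionary.add(entry.toLowerCase(Locale.ROOT)); // case-insensitive lookup
        }
    }

    /** Returns features for tokens[index]; empty if the token is not in the dictionary. */
    List<String> createFeatures(String[] tokens, int index) {
        List<String> features = new ArrayList<>();
        String current = tokens[index].toLowerCase(Locale.ROOT);
        if (!dictionary.contains(current)) {
            return features;
        }
        // Plain lookup feature.
        features.add(prefix + "=in");
        // Lookup feature conjoined with the previous and next token.
        String prev = index > 0 ? tokens[index - 1].toLowerCase(Locale.ROOT) : "<s>";
        String next = index < tokens.length - 1
                ? tokens[index + 1].toLowerCase(Locale.ROOT) : "</s>";
        features.add(prefix + "=in,prev=" + prev);
        features.add(prefix + "=in,next=" + next);
        return features;
    }
}
```

With a first-name dictionary containing "Olivier", the token "Olivier" in "Mr. Olivier said" would get the plain lookup feature plus one feature tied to "mr." and one tied to "said", so the model can learn that a dictionary hit after a title is stronger evidence than a hit alone.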

    Even relatively large dictionaries fit into memory, which is why our focus should be on generating better features first,
    before we start to scale the dictionary. But work on a Bloom filter implementation is very welcome; we also
    plan to add a Bloom filter based language model.

    If you want to work on this, you need a corpus to train the name finder on. Luckily, we now have support for
    the Conll03 and Conll02 data, depending on which language you prefer. James Kosin is also working on this using
    the English Reuters data from Conll03. There is a wiki page which describes how to create training data from
    the Conll03 data:
    https://sourceforge.net/apps/mediawiki/opennlp/index.php?title=Conll03

    Training data creation is very similar for Conll02, but still undocumented.

    Jörn

     
  • Joern Kottmann
    2010-11-09

    Here is a paper about a Conll03 NER system in which the authors compare the performance
    they get with different feature generation strategies, one of which uses a
    dictionary:
    http://www.stat.rutgers.edu/home/tzhang/papers/conll03-rrm.pdf

    With the dictionary (see Table 2, the row where they add feature I), they get an improvement
    of around one percent in both recall and precision.

    Jörn