NameFinder Dictionary

Developers
2010-07-21
2013-04-16
  • James Kosin
    2010-07-21

    All,

    I'm thinking there has to be a simpler way, at least where names are concerned:

    (1)  Personal names usually consist of a first name followed by either a middle name or a surname.  Examples are:  "John Henry", "Smith, Troy", "John Smith Robinson", etc.

    (2)  Street addresses are usually a number followed by a name (which can also be confused with a person's name), followed by keywords like "Street", "Circle", "Lane", "Avenue", etc.

    (3)  Company names are usually a name (which can also be confused with a person's name), followed by keywords like "Incorporated", "Corporation", "Company", "Enterprise", etc.
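
    The three patterns above can be sketched as a small rule-based guesser. This is only an illustration of the idea, not anything that exists in the name finder: the class NameHeuristics, the method guessType, and the returned labels are all hypothetical, and the keyword lists contain only the words quoted above.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper (not an OpenNLP class): guesses a coarse entity type
// from the leading-number and trailing-keyword patterns in (1)-(3).
class NameHeuristics {
    static final Set<String> STREET_WORDS =
        Set.of("street", "circle", "lane", "avenue");
    static final Set<String> COMPANY_WORDS =
        Set.of("incorporated", "corporation", "company", "enterprise");

    // Tokens are assumed to be non-empty strings.
    static String guessType(List<String> tokens) {
        if (tokens.size() < 2) {
            return "unknown";
        }
        String first = tokens.get(0);
        String last = tokens.get(tokens.size() - 1).toLowerCase(Locale.ROOT);
        boolean startsWithNumber = first.chars().allMatch(Character::isDigit);

        if (startsWithNumber && STREET_WORDS.contains(last)) {
            return "street";
        }
        if (COMPANY_WORDS.contains(last)) {
            return "company";
        }
        // Ignore punctuation such as the comma in "Smith, Troy", then require
        // every remaining token to start with a capital letter.
        boolean allCapitalized = tokens.stream()
            .filter(t -> Character.isLetterOrDigit(t.charAt(0)))
            .allMatch(t -> Character.isUpperCase(t.charAt(0)));
        return allCapitalized ? "person" : "unknown";
    }
}
```

    Note that guessType(List.of("John", "Lane")) falls through to "person" because there is no leading number, which is exactly the person/street confusion mentioned in (2).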

    Is there already intelligence like this in the name finder?
    If not, how difficult would it be to add?

    I posted a patch for creating name dictionaries, for anyone who wants to try this out or work on some of the variations.
    I'm still plugging away at the code right now, so it isn't complete yet…

    James

     
  • Joern Kottmann
    2010-07-21

    Actually, the name finder already generates features for the context of a token. This context includes the tokens before and after the current token. A combination of the context and a dictionary feature could be quite valuable for boosting the detection of unknown names.
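
    As a rough picture of what "context plus a dictionary feature" means, here is a minimal sketch. It does not use the real OpenNLP feature generator interfaces; the class ContextFeatures, the feature string prefixes (w=, p1=, n1=, indict), and the lowercase dictionary are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch, not the OpenNLP implementation.
class ContextFeatures {
    // Emits string features for tokens[index]: the token itself, its
    // neighbors within `window` positions on each side, and a flag when the
    // token appears in a (lowercased) name dictionary.
    static List<String> generate(String[] tokens, int index, int window,
                                 Set<String> nameDictionary) {
        List<String> features = new ArrayList<>();
        String current = tokens[index].toLowerCase(Locale.ROOT);
        features.add("w=" + current);
        for (int offset = 1; offset <= window; offset++) {
            if (index - offset >= 0) {
                features.add("p" + offset + "="
                    + tokens[index - offset].toLowerCase(Locale.ROOT));
            }
            if (index + offset < tokens.length) {
                features.add("n" + offset + "="
                    + tokens[index + offset].toLowerCase(Locale.ROOT));
            }
        }
        if (nameDictionary.contains(current)) {
            features.add("indict");
        }
        return features;
    }
}
```

    For the tokens "Mr.", "Kosin", "wrote" with index 1, a window of 1, and a dictionary containing "kosin", this yields the features w=kosin, p1=mr., n1=wrote, and indict.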

    Maybe you can try writing a dedicated person-name feature generator that produces better features for the context of a person name.

    The feature generation for the name finder can easily be changed by passing feature generators during training and the same feature generators for name finding. In the next version after 1.5 we will hopefully be able to store the feature generator configuration within the model. Then it will be easy to customize the feature generation for the type of entity that should be detected.

    Jörn

     
  • James Kosin
    2010-07-22

    2010-07-21 07:44:21 EDT
    Actually, the name finder already generates features for the context of a token. This context includes the tokens before and after the current token. A combination of the context and a dictionary feature could be quite valuable for boosting the detection of unknown names.

    I think I know what you mean: by augmenting the feature generator, we should be able to catch more names without relying on training text that lists every known name.  The dictionary could be used either to validate that a candidate really is a name by looking it up, or to help generalize the training data by replacing or annotating the names with tags such as (male), (female), (surname), etc. as properties of the names.

    Then we could feed generalized rules into training and generate large sets of training data from a small set of templates.
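
    A minimal sketch of the template idea, assuming placeholders are written as the parenthesized tags mentioned above and that each tag maps to a list of dictionary entries. The class TemplateExpander, the example template, and the "person" type name are illustrative assumptions, not existing tooling.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: expands a tagged template into training sentences.
class TemplateExpander {
    // For each tag in the dictionary, replaces its "(tag)" placeholder with
    // every entry for that tag, producing the cross product of all choices.
    // Within one sentence, all occurrences of a tag get the same entry.
    static List<String> expand(String template,
                               Map<String, List<String>> dictionary) {
        List<String> results = new ArrayList<>();
        results.add(template);
        for (Map.Entry<String, List<String>> entry : dictionary.entrySet()) {
            String placeholder = "(" + entry.getKey() + ")";
            List<String> next = new ArrayList<>();
            for (String partial : results) {
                if (!partial.contains(placeholder)) {
                    next.add(partial); // tag not used in this sentence
                    continue;
                }
                for (String value : entry.getValue()) {
                    next.add(partial.replace(placeholder, value));
                }
            }
            results = next;
        }
        return results;
    }
}
```

    With the template "<START:person> (male) (surname) <END> called yesterday ." and a dictionary of two male first names and two surnames, expand returns four training sentences, one per combination.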

    I'll look into this as well, but I've got to get back to my thesis for now.  I may still need this for my thesis anyway, if I run into problems with names beyond the issues I currently have.  I still need to find out whether the name finder will help sort some of the words into an order of some kind (or at least separate the names from the rest of the text).  That way I could keep each name together with its sex, to allow future comparisons with common pronouns that may be used to replace the name, i.e. he, she, him, her, etc.

    James

     
  • Joern Kottmann
    2010-07-22

    Yes, that sounds nice. But instead of replacing the names, just adjust the feature generation.

    To get the same effect as replacing, we should generate a dictionary feature and then combine it with the context words.

    You can see how the default feature generation is configured in NameFinderME.createFeatureGenerator().
    For testing, I suggest creating a new feature generator that uses the InSpanFeatureGenerator and combines the in-span feature with the features of the TokenFeatureGenerator and TokenClassFeatureGenerator.
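
    The following sketch shows the general shape of such a combination without depending on the OpenNLP classes named above. It is an assumption-laden stand-in: the class InSpanSketch, its methods, the token class labels (num, ic, lc, other), and the feature strings are invented for illustration and do not reproduce what InSpanFeatureGenerator, TokenFeatureGenerator, or TokenClassFeatureGenerator actually emit.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for combining an in-span (dictionary) feature with
// token and token-class features.
class InSpanSketch {
    // Marks every token covered by a dictionary entry; entries are token
    // sequences and are matched case-insensitively.
    static boolean[] markSpans(String[] tokens, List<List<String>> entries) {
        boolean[] inSpan = new boolean[tokens.length];
        for (List<String> entry : entries) {
            for (int start = 0; start + entry.size() <= tokens.length; start++) {
                boolean match = true;
                for (int k = 0; k < entry.size(); k++) {
                    if (!tokens[start + k].equalsIgnoreCase(entry.get(k))) {
                        match = false;
                        break;
                    }
                }
                if (match) {
                    Arrays.fill(inSpan, start, start + entry.size(), true);
                }
            }
        }
        return inSpan;
    }

    // Coarse token class; tokens are assumed to be non-empty.
    static String tokenClass(String token) {
        if (token.chars().allMatch(Character::isDigit)) return "num";
        if (Character.isUpperCase(token.charAt(0))) return "ic"; // initial capital
        if (token.equals(token.toLowerCase(Locale.ROOT))) return "lc";
        return "other";
    }

    // Emits the span flag alone and in conjunction with token and class.
    static List<String> generate(String[] tokens, int index, boolean[] inSpan) {
        String span = inSpan[index] ? "inspan" : "outspan";
        String word = tokens[index].toLowerCase(Locale.ROOT);
        return List.of(span,
                       span + "&w=" + word,
                       span + "&wc=" + tokenClass(tokens[index]));
    }
}
```

    The conjunction features are the point here: they let the model learn, for example, that a capitalized token inside a dictionary span is stronger evidence than either signal alone, which is the effect name replacement was meant to achieve.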

    Do you have data you can train the name finder on?

    Jörn

     
  • James Kosin
    2010-07-25

    Do you have data you can train the name finder on?

    Jorn,

    I've been creating the training data manually: mostly small training sets, without a good variety of data or formats.

    From what I've seen, the tags for the name finder are fairly simple, with only two major formats.  The first and simplest is a <START> … <END> pair around a name.  The second is a variation on this that adds a parameter for the type of name: <START:(nametype)> … <END>.
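
    To make the two tag formats concrete, here is a small sketch that reads one annotated line into tokens and name spans. It assumes the tags and tokens are separated by whitespace; the class NameSampleParser, the Span holder, and the "default" label for untyped tags are hypothetical and are not the name finder's own parsing code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser for lines such as:
//   <START:person> John Smith <END> works here .
class NameSampleParser {
    static final class Span {
        final int start;   // index of first token in the name
        final int end;     // index one past the last token
        final String type; // name type, or "default" for a bare <START>
        Span(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    // Appends the plain tokens to tokensOut and returns the tagged spans.
    static List<Span> parse(String line, List<String> tokensOut) {
        List<Span> spans = new ArrayList<>();
        boolean open = false;
        int start = -1;
        String type = null;
        for (String part : line.trim().split("\\s+")) {
            if (part.isEmpty()) {
                continue; // blank input line
            }
            if (part.equals("<START>") || part.startsWith("<START:")) {
                if (open) throw new IllegalArgumentException("nested <START>");
                open = true;
                start = tokensOut.size();
                // "<START:person>" carries the type between ':' and '>'.
                type = part.equals("<START>")
                    ? "default"
                    : part.substring(7, part.length() - 1);
            } else if (part.equals("<END>")) {
                if (!open) throw new IllegalArgumentException("<END> without <START>");
                spans.add(new Span(start, tokensOut.size(), type));
                open = false;
            } else {
                tokensOut.add(part);
            }
        }
        if (open) throw new IllegalArgumentException("missing <END>");
        return spans;
    }
}
```

    For the example line in the comment, the parser collects the five tokens John, Smith, works, here, and . and returns one span of type person covering token indices 0 up to (but not including) 2.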

    It seems simple enough to create the training data for this one.  I'll give it a shot before asking for help on this.

    Thanks,
    James