NameFinder featureGenerator

Developers
2010-09-22
2013-04-16
  • James Kosin

    James Kosin - 2010-09-22

    Jorn,

    I'm sending in the request for the name-finder training data tomorrow.

    Some questions as I scope out the changes:
    The feature generators will need to be modified; I didn't see a way to hook into the name-finder at the end, and the other models were not catching the name as an organization or another type…

    createFeatureGenerator()
    

    seems to be the place to start here.  Right?

    Unless I add the days and months of the year to the Dictionary, it won't be useful and may get in the way when training other models.  I'm guessing I need to key on the

    type
    

    parameter to the

    train()
    

    function to determine this?

    Currently, I'm building 3 dictionaries from the data.  Do you think it would be better to keep a single dictionary and have another token field for the possible types (S-surname, F-female first, M-male first)?  Maybe even with probabilities for each?
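    For what it's worth, that single-dictionary idea could be sketched roughly like this (entirely illustrative; the class and method names are invented, not anything in OpenNLP):

    ```java
    import java.util.HashMap;
    import java.util.Map;

    // Sketch of a single name dictionary where each token carries
    // per-type weights (S = surname, F = female first, M = male first).
    public class NameDictionary {
        private final Map<String, Map<Character, Double>> entries = new HashMap<>();

        // Record that a token occurred with the given type; weights are
        // normalized lazily in probability().
        public void add(String token, char type, double weight) {
            entries.computeIfAbsent(token.toLowerCase(), k -> new HashMap<>())
                   .merge(type, weight, Double::sum);
        }

        // P(type | token), or 0.0 if the token is unknown.
        public double probability(String token, char type) {
            Map<Character, Double> counts = entries.get(token.toLowerCase());
            if (counts == null) return 0.0;
            double total = counts.values().stream().mapToDouble(Double::doubleValue).sum();
            return counts.getOrDefault(type, 0.0) / total;
        }
    }
    ```

    A single structure like this would avoid keeping three parallel dictionaries in sync, at the cost of a lookup that returns a distribution instead of a yes/no answer.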

    Thanks,
    James K.

     
  • Joern Kottmann

    Joern Kottmann - 2010-09-22

    Hi James,

    maybe let's try to get started with the name finder documentation.

    The intended way to customize the feature generation is to pass a feature generator
    to the train method; after training, the same feature generator must be passed
    to the NameFinderME constructor together with the model the train method
    returned.
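    In outline, that workflow pairs one feature generator with both the training call and the tagger. The sketch below uses a simplified stand-in for the real contract; the actual interface is AdaptiveFeatureGenerator in opennlp.tools.util.featuregen, and the exact train()/constructor signatures should be checked against the javadoc:

    ```java
    import java.util.List;
    import java.util.Set;

    // Simplified stand-in for OpenNLP's feature generator contract:
    // given the token window and an index, emit string features.
    interface FeatureGenerator {
        void createFeatures(List<String> features, String[] tokens, int index);
    }

    // Example custom generator: flags tokens found in a day/month word
    // list, as discussed above (the list contents are illustrative).
    class CalendarWordFeatureGenerator implements FeatureGenerator {
        private static final Set<String> WORDS = Set.of(
            "monday", "tuesday", "january", "february");

        @Override
        public void createFeatures(List<String> features, String[] tokens, int index) {
            if (WORDS.contains(tokens[index].toLowerCase())) {
                features.add("calendar=true");
            }
        }
    }
    ```

    The essential point is that the identical generator configuration must be supplied both at training time and to the NameFinderME constructor; otherwise the features the model was trained on won't match the features produced at tagging time.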

    Jörn

     
  • James Kosin

    James Kosin - 2010-09-23

    Jorn,

    First, thanks for putting the documentation on wiki.  It looks fine for the user documentation.  I'll go through the javadoc documentation to see what it says on how to prepare a new feature generator.

    Next, I sent the signed document to the Reuters request email address and hope to get a response in about a week or two, maybe sooner.  Anyway, it looks like I really can't test anything before I get the training data in for the model then …

    James

     
  • Joern Kottmann

    Joern Kottmann - 2010-09-23

    James,

    added a section for you which explains how to do custom feature generation.
    Please have a look there and of course feel free to extend it.

    Jörn

     
  • James Kosin

    James Kosin - 2010-09-30

    Jorn,

    Just got confirmation they will be sending the corpus Thursday.  As a first attempt, I should be able to do some retraining to see whether the default of 100 used in training the current models may have been too small.

    James

     
  • Joern Kottmann

    Joern Kottmann - 2010-09-30

    Very nice, I will then also try to get a copy.  I hope we can put CoNLL03 support on the feature list for 1.5.1 :)

    Jörn

     
  • James Kosin

    James Kosin - 2010-10-08

    Jorn,

    Got the code into CVS for the CoNLL03 series, currently only the English Reuters Corpus.  I do have both Volume I and Volume II.  Volume II has news in languages other than English; however, it doesn't parallel the news in the English Corpus.
    I'll look into the POS parser and what I'd need to do to be able to use the data to train the POS tagger as well with the Reuters Corpus.

    Thanks for helping with that… and the new interface with the opennlp.tools.formats helps a great deal.

    Now I see what you were talking about.

    James

     
  • Joern Kottmann

    Joern Kottmann - 2010-10-08

    Checked out the code, looks good :)  Don't they just have machine-created,
    uncorrected POS tags in this corpus?  But I might be mistaken.

    Jörn

     
  • Joern Kottmann

    Joern Kottmann - 2010-10-08

    Can you please prefix your next commit message with the issue id?  So in your case it should have been:
    "  Added the CoNLL 03 converter for the english data set for the Reuters data."

    Jörn

     
  • James Kosin

    James Kosin - 2010-10-19

    Jorn,

    In the factory method, is there any reason why you limit the -types parameter to selecting only one of the types?
    Other than the obvious longer training time when you have more than 2 or 3 outcomes to be trained.

    James

     
  • Joern Kottmann

    Joern Kottmann - 2010-10-19

    James, can you point me to a code line?
    Could be a bug …

    Jörn

     
  • James Kosin

    James Kosin - 2010-10-19

    Jorn,

    Starting at line 73 in Conll02NameSampleStreamFactory.java.

        if (params.getTypes().contains("per")) {
        }
        else if (params.getTypes().contains("org")) {
        }
    

    They are all tested as if () / else if () / else if () blocks, making them mutually exclusive.
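    To make the effect concrete, here is a stand-alone sketch of the two behaviors (hypothetical method names; this is not the actual factory code):

    ```java
    import java.util.ArrayList;
    import java.util.List;

    class TypeSelection {
        // Mirrors the else-if chain: at most one type is ever selected,
        // no matter how many are requested.
        static List<String> exclusive(String types) {
            List<String> selected = new ArrayList<>();
            if (types.contains("per")) { selected.add("per"); }
            else if (types.contains("org")) { selected.add("org"); }
            else if (types.contains("loc")) { selected.add("loc"); }
            return selected;
        }

        // Independent ifs: every requested type is selected.
        static List<String> combined(String types) {
            List<String> selected = new ArrayList<>();
            if (types.contains("per")) { selected.add("per"); }
            if (types.contains("org")) { selected.add("org"); }
            if (types.contains("loc")) { selected.add("loc"); }
            return selected;
        }
    }
    ```

    With the else-if chain, passing "per,org" silently drops org; independent ifs would let the stream produce samples for both types.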

    James

     
  • James Kosin

    James Kosin - 2010-10-21

    Jorn,

    I'm also finding some interesting data on the CoNLL.
    The baselines for the data are:

       bin/baseline eng.train eng.testa | bin/conlleval
    and the results are:
       eng.testa: precision:  78.33%; recall:  65.23%; FB1:  71.18
       eng.testb: precision:  71.91%; recall:  50.90%; FB1:  59.61
       deu.testa: precision:  37.19%; recall:  26.07%; FB1:  30.65
       deu.testb: precision:  31.86%; recall:  28.89%; FB1:  30.30
    

    I trained the models up to 3000 iterations, with just per types:

    2997:  .. loglikelihood=-488.09912654040284     0.9994057587380476
    2998:  .. loglikelihood=-488.0450621507984      0.9994057587380476
    2999:  .. loglikelihood=-487.9910243102479      0.9994057587380476
    3000:  .. loglikelihood=-487.9370129970654      0.9994057587380476
    Writing name finder model ... done (1.716s)
    Wrote name finder model to
    path: C:\Users\James Kosin\Documents\NetBeansProjects\thesis\DocCompare\enNameFi
    nder.model
    Loading Token Name Finder model ... done (0.328s)
    current: 176.5 sent/s avg: 176.5 sent/s total: 185 sent
    current: 590.7 sent/s avg: 377.6 sent/s total: 771 sent
    current: 450.5 sent/s avg: 401.5 sent/s total: 1221 sent
    current: 647.3 sent/s avg: 462.5 sent/s total: 1868 sent
    current: 832.8 sent/s avg: 535.9 sent/s total: 2700 sent
    current: 507.6 sent/s avg: 531.3 sent/s total: 3203 sent
    Average: 531.1 sent/s
    Total: 3251 sent
    Runtime: 6.121s
    Precision: 0.9094630554148683
    Recall: 0.7176981541802389
    F-Measure: 0.8022806250756065
    

    This model was not able to recognize any names in my sample sent earlier, with Blanche and Otis as the names used.
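    As a side note, the F-Measure lines in these logs are just the harmonic mean of precision and recall, which can be checked against the numbers reported above:

    ```java
    class FMeasure {
        // F1 = 2PR / (P + R): the harmonic mean of precision and recall.
        static double f1(double precision, double recall) {
            return 2 * precision * recall / (precision + recall);
        }

        public static void main(String[] args) {
            // Numbers from the per-only run above.
            System.out.println(f1(0.9094630554148683, 0.7176981541802389));
        }
    }
    ```

    The harmonic mean is dominated by the smaller of the two values, which is why the all-types run below ends up with a nearly identical F-Measure despite trading precision for recall.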

    Then I trained the models using all 4 groupings: org, per, loc, misc.

    2997:  .. loglikelihood=-1164.2517100206799     0.9987329401191429
    2998:  .. loglikelihood=-1164.1095495124086     0.9987329401191429
    2999:  .. loglikelihood=-1163.9674624347153     0.9987329401191429
    3000:  .. loglikelihood=-1163.8254487250354     0.9987378512039524
    Writing name finder model ... done (1.996s)
    Wrote name finder model to
    path: C:\Users\James Kosin\Documents\NetBeansProjects\thesis\DocCompare\enNameFi
    nder.model
    Loading Token Name Finder model ... done (0.359s)
    current: 192.3 sent/s avg: 192.3 sent/s total: 195 sent
    current: 689.4 sent/s avg: 438.9 sent/s total: 883 sent
    current: 581.9 sent/s avg: 486.8 sent/s total: 1473 sent
    current: 870.9 sent/s avg: 582.1 sent/s total: 2343 sent
    current: 685.4 sent/s avg: 602.6 sent/s total: 3027 sent
    Average: 602.3 sent/s
    Total: 3251 sent
    Runtime: 5.398s
    Precision: 0.8194893838087334
    Recall: 0.7834062605183439
    F-Measure: 0.8010416847487303
    

    This model actually caught Otis in the sample document.  Hmmm, maybe pointing to contextual information that let the model see something in the document that wasn't there with the singly-trained model.  (hmmmmmmm…)

     
  • Joern Kottmann

    Joern Kottmann - 2010-10-21

    Now you could start to work on a new dictionary feature.  I strongly recommend using more than a few sentences to evaluate the new features.  One option is to use cross validation, but that only makes sense if the data set contains enough names which are mentioned in only a very few places.  If you use a dictionary, it's important not to optimize it to your data set.

    Another option is to play with the cutoff parameter; you could set it to 0 and train with Gaussian smoothing instead.
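    Cross validation, as suggested here, means partitioning the training samples into k folds and, in each round, training on k-1 folds while evaluating on the held-out one. A minimal index-partitioning sketch (no OpenNLP types involved; this only shows the idea):

    ```java
    import java.util.ArrayList;
    import java.util.List;

    class CrossValidation {
        // Split sample indices 0..n-1 into k folds; fold i is the held-out
        // evaluation set in round i, the rest are used for training.
        static List<List<Integer>> folds(int n, int k) {
            List<List<Integer>> folds = new ArrayList<>();
            for (int f = 0; f < k; f++) folds.add(new ArrayList<>());
            for (int i = 0; i < n; i++) folds.get(i % k).add(i);
            return folds;
        }
    }
    ```

    Averaging the evaluation scores over the k rounds gives a more stable estimate than a single small held-out sample.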

    Jörn

     
