Jorn,
I'm sending in the request for the name-finder training data tomorrow.
Some questions as I scope out the changes:
The feature generators will need to be modified. I didn't see a way to hook into the name finder at the end, and the other models were not catching the name as an organization or other type… createFeatureGenerator() seems to be the place to start here. Right?
Unless I add the days and months of the year to the Dictionary, it won't be useful and may get in the way when training other models. I'm guessing I need to key on the type parameter to the train() function to determine this?
Currently, I'm building 3 dictionaries from the data. Do you think it would be better to keep a single dictionary and have another token field for the possible types (S = surname, F = female first, M = male first)? Maybe even with probabilities for each?
Thanks,
James K.
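(Purely as an illustration of the single-dictionary idea James describes above, not something taken from the data: one entry per token, a field listing the possible type codes, and optional probabilities. The codes and numbers here are made up.)

  Blanche   F:0.95
  Otis      M:0.80 S:0.20
  Baker     S:0.75 F:0.05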
Hi James,
Maybe let's try to get started with the name finder documentation.
The intended way to customize the feature generation is to pass a feature generator to the train method; after training, the same feature generator must be passed to the NameFinderME constructor together with the model the train method returned.
Jörn
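(For illustration, a rough sketch of the wiring described above, against the 1.5-era API. The particular generators in the aggregate and the iteration/cutoff values are only examples; check the javadoc for the exact train signature.)

  import opennlp.tools.namefind.NameFinderME;
  import opennlp.tools.namefind.NameSample;
  import opennlp.tools.namefind.TokenNameFinderModel;
  import opennlp.tools.util.ObjectStream;
  import opennlp.tools.util.featuregen.AdaptiveFeatureGenerator;
  import opennlp.tools.util.featuregen.AggregatedFeatureGenerator;
  import opennlp.tools.util.featuregen.PreviousMapFeatureGenerator;
  import opennlp.tools.util.featuregen.TokenClassFeatureGenerator;
  import opennlp.tools.util.featuregen.TokenFeatureGenerator;
  import opennlp.tools.util.featuregen.WindowFeatureGenerator;

  // Sketch: wire one feature generator through both training and tagging.
  public static NameFinderME trainAndCreate(ObjectStream<NameSample> samples)
      throws java.io.IOException {

    AdaptiveFeatureGenerator featureGenerator = new AggregatedFeatureGenerator(
        new WindowFeatureGenerator(new TokenFeatureGenerator(), 2, 2),
        new WindowFeatureGenerator(new TokenClassFeatureGenerator(true), 2, 2),
        new PreviousMapFeatureGenerator());

    // The generator is passed to the train method ...
    TokenNameFinderModel model = NameFinderME.train("en", "person", samples,
        featureGenerator, null, 100, 5);

    // ... and the same generator must be passed to the NameFinderME
    // constructor together with the model the train method returned.
    return new NameFinderME(model, featureGenerator, NameFinderME.DEFAULT_BEAM_SIZE);
  }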
Jorn,
First, thanks for putting the documentation on the wiki. It looks fine for the user documentation. I'll go through the javadoc to see what it says about how to prepare a new feature generator.
Next, I sent the signed document to the Reuters request email address and hope to get a response in about a week or two, maybe sooner. Anyway, it looks like I really can't test anything before I get the training data in for the model, then…
James
James,
added a section for you which explains how to do custom feature generation.
Please have a look there and of course feel free to extend it.
Jörn
Jorn,
Just got confirmation they will be sending the corpus Thursday. As a first attempt, I should be able to do some retraining to see whether the default of 100 iterations used in training the current models may have been too small.
James
Very nice, I will then also try to get a copy. I hope we can put CoNLL03 support on the feature list for 1.5.1 :)
Jörn
Jorn,
Got the code into CVS for the CoNLL03 series, currently only the English Reuters Corpus. I do have both Volume I and Volume II. Volume II has news in languages other than English; however, it doesn't parallel the news in the English corpus.
I'll look into the POS parser and what I'd need to do to be able to use the data to train the POS tagger as well with the Reuters Corpus.
Thanks for helping with that… and the new interface with the opennlp.tools.formats helps a great deal.
Now I see what you were talking about.
James
Checked out the code, looks good :) Don't they just have machine-created, non-corrected POS tags in this corpus? But I might be mistaken.
Jörn
Can you please prefix your next commit message with the issue id? So in your case it should have been:
"Added the CoNLL 03 converter for the English data set for the Reuters data."
Jörn
Jorn,
In the factory method, is there any reason why you limit the -types parameter to selecting only one of the types?
Other than the obviously longer training when you have more than 2 or 3 outcomes to be trained.
James
James, can you point me to a code line?
Could be a bug…
Jörn
Jorn,
Starting at line 73 in Conll02NameSampleStreamFactory.java.
They are all tested as if () / else if () / else if () / else if () blocks, making them exclusive in nature.
James
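(For illustration only, a rough sketch of what a non-exclusive variant could look like, assuming the GENERATE_* flags in Conll02NameSampleStream can simply be OR-ed together and -types is treated as a comma-separated list. parseTypes and typesParam are made-up names; this is not the actual file contents.)

  import opennlp.tools.formats.Conll02NameSampleStream;

  // Hypothetical non-exclusive handling of -types, e.g. "per,org": each
  // recognized type ORs another flag into the bit mask instead of the
  // if/else-if chain picking exactly one.
  static int parseTypes(String typesParam) {
    int typesToGenerate = 0;
    for (String t : typesParam.split(",")) {
      if ("per".equals(t))
        typesToGenerate |= Conll02NameSampleStream.GENERATE_PERSON_ENTITIES;
      else if ("org".equals(t))
        typesToGenerate |= Conll02NameSampleStream.GENERATE_ORGANIZATION_ENTITIES;
      else if ("loc".equals(t))
        typesToGenerate |= Conll02NameSampleStream.GENERATE_LOCATION_ENTITIES;
      else if ("misc".equals(t))
        typesToGenerate |= Conll02NameSampleStream.GENERATE_MISC_ENTITIES;
    }
    return typesToGenerate;
  }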
Jorn,
I'm also finding some interesting data on the CoNLL.
The baselines for the data are:
I trained the models up to 3000 iterations, with just per types:
This model was not able to recognize any names from my sample sent earlier, with Blanche and Otis as the names used.
Then I trained the models using all 4 groupings… org, per, loc, misc.
This model actually caught Otis in the sample document. Hmmm, maybe pointing to a context situation that let the model see something in the document that it couldn't see in the singly trained model. (hmmmmmmm…)
Now you could start to work on a new dictionary feature. I strongly recommend using more than a few sentences to evaluate the new features. One option you have is to use cross validation, but that only makes sense if the data set contains enough names which are mentioned in only a very few places. If you use a dictionary, it's important not to optimize it to your data set.
Another option you have is to play with the cutoff parameter: you could set it to 0 and train with Gaussian smoothing instead.
Jörn
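(For illustration, a minimal sketch of how a dictionary could be plugged in as a feature generator, assuming the existing DictionaryFeatureGenerator and Dictionary classes. The entries, the "fname" prefix, and the surrounding generators are made up.)

  import opennlp.tools.dictionary.Dictionary;
  import opennlp.tools.util.StringList;
  import opennlp.tools.util.featuregen.AdaptiveFeatureGenerator;
  import opennlp.tools.util.featuregen.AggregatedFeatureGenerator;
  import opennlp.tools.util.featuregen.DictionaryFeatureGenerator;
  import opennlp.tools.util.featuregen.TokenFeatureGenerator;
  import opennlp.tools.util.featuregen.WindowFeatureGenerator;

  // Sketch: build a tiny first-name dictionary and wrap it in a
  // DictionaryFeatureGenerator (in practice the entries would be loaded
  // from the data files, not hard-coded).
  public static AdaptiveFeatureGenerator createDictionaryFeatureGenerator() {
    Dictionary firstNames = new Dictionary();
    firstNames.put(new StringList("Blanche"));
    firstNames.put(new StringList("Otis"));

    // Combine the dictionary feature with the usual token features; the
    // aggregate is then passed to train() and to the NameFinderME
    // constructor as in the earlier sketch.
    return new AggregatedFeatureGenerator(
        new WindowFeatureGenerator(new TokenFeatureGenerator(), 2, 2),
        new DictionaryFeatureGenerator("fname", firstNames));
  }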