Jorn,
I'm sending in the request for the name-finder training data tomorrow.
Some questions as I scope out the changes:
The feature generators will need to be modified. I didn't see a way to hook into the name finder at the end, and the other models were not catching the name as an organization or other type… createFeatureGenerator() seems to be the place to start here. Right?
Unless I add the days and months of the year to the Dictionary, it won't be useful and may get in the way when training other models. I'm guessing I need to key on the type parameter to the train() function to determine this?
Currently, I'm building 3 dictionaries from the data. Do you think it would be better to keep a single dictionary and have another token field for the possible types (S = surname, F = female first, M = male first)? Maybe even with probabilities for each?
Thanks,
James K.
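(Purely as an illustration of the single-dictionary idea James describes above, not something taken from the data: one entry per token, a field listing the possible type codes, and optional probabilities. The codes and numbers here are made up.)

  Blanche   F:0.95
  Otis      M:0.80 S:0.20
  Baker     S:0.75 F:0.05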
Hi James,
Maybe let's try to get started with the name finder documentation.
The intended way to customize the feature generation is to pass a feature generator to the train method; after training, the same feature generator must be passed to the NameFinderME constructor together with the model the train method returned.
Jörn
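(For illustration, a rough sketch of the wiring described above, against the 1.5-era API. The particular generators in the aggregate and the iteration/cutoff values are only examples; check the javadoc for the exact train signature.)

  import opennlp.tools.namefind.NameFinderME;
  import opennlp.tools.namefind.NameSample;
  import opennlp.tools.namefind.TokenNameFinderModel;
  import opennlp.tools.util.ObjectStream;
  import opennlp.tools.util.featuregen.AdaptiveFeatureGenerator;
  import opennlp.tools.util.featuregen.AggregatedFeatureGenerator;
  import opennlp.tools.util.featuregen.PreviousMapFeatureGenerator;
  import opennlp.tools.util.featuregen.TokenClassFeatureGenerator;
  import opennlp.tools.util.featuregen.TokenFeatureGenerator;
  import opennlp.tools.util.featuregen.WindowFeatureGenerator;

  // Sketch: wire one feature generator through both training and tagging.
  public static NameFinderME trainAndCreate(ObjectStream<NameSample> samples)
      throws java.io.IOException {

    AdaptiveFeatureGenerator featureGenerator = new AggregatedFeatureGenerator(
        new WindowFeatureGenerator(new TokenFeatureGenerator(), 2, 2),
        new WindowFeatureGenerator(new TokenClassFeatureGenerator(true), 2, 2),
        new PreviousMapFeatureGenerator());

    // The generator is passed to the train method ...
    TokenNameFinderModel model = NameFinderME.train("en", "person", samples,
        featureGenerator, null, 100, 5);

    // ... and the same generator must be passed to the NameFinderME
    // constructor together with the model the train method returned.
    return new NameFinderME(model, featureGenerator, NameFinderME.DEFAULT_BEAM_SIZE);
  }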
Jorn,
First, thanks for putting the documentation on the wiki. It looks fine for the user documentation. I'll go through the javadoc to see what it says about how to prepare a new feature generator.
Next, I sent the signed document to the Reuters request email address and hope to get a response in about a week or two, maybe sooner. Anyway, it looks like I really can't test anything before I get the training data in for the model, then…
James
James,
added a section for you which explains how to do custom feature generation.
Please have a look there and of course feel free to extend it.
Jörn
Jorn,
Just got confirmation they will be sending the corpus Thursday. As a first attempt, I should be able to do some retraining to see whether the default of 100 iterations used in training the current models may have been too small.
James
Very nice, I will then also try to get a copy. I hope we can put CoNLL03 support on the feature list for 1.5.1 :)
Jörn
Jorn,
Got the code into CVS for the CoNLL03 series, currently only the English Reuters Corpus. I do have both Volume I and Volume II. Volume II has news in languages other than English; however, it doesn't parallel the news in the English corpus.
I'll look into the POS parser and what I'd need to do to be able to use the data to train the POS tagger as well with the Reuters Corpus.
Thanks for helping with that… and the new interface with the opennlp.tools.formats helps a great deal.
Now I see what you were talking about.
James
Checked out the code, looks good :) Don't they just have machine-created, non-corrected POS tags in this corpus? But I might be mistaken.
Jörn
Can you please prefix your next commit message with the issue id? So in your case it should have been:
"Added the CoNLL 03 converter for the English data set for the Reuters data."
Jörn
Jorn,
In the factory method, is there any reason why you limit the -types parameter to selecting only one of the types?
Other than the obviously longer training when you have more than 2 or 3 outcomes to be trained.
James
James, can you point me to a code line?
Could be a bug…
Jörn
Jorn,
Starting at line 73 in Conll02NameSampleStreamFactory.java.
They are all tested as if () / else if () / else if () / else if () blocks, making them exclusive in nature.
James
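(For illustration only, a rough sketch of what a non-exclusive variant could look like, assuming the GENERATE_* flags in Conll02NameSampleStream can simply be OR-ed together and -types is treated as a comma-separated list. parseTypes and typesParam are made-up names; this is not the actual file contents.)

  import opennlp.tools.formats.Conll02NameSampleStream;

  // Hypothetical non-exclusive handling of -types, e.g. "per,org": each
  // recognized type ORs another flag into the bit mask instead of the
  // if/else-if chain picking exactly one.
  static int parseTypes(String typesParam) {
    int typesToGenerate = 0;
    for (String t : typesParam.split(",")) {
      if ("per".equals(t))
        typesToGenerate |= Conll02NameSampleStream.GENERATE_PERSON_ENTITIES;
      else if ("org".equals(t))
        typesToGenerate |= Conll02NameSampleStream.GENERATE_ORGANIZATION_ENTITIES;
      else if ("loc".equals(t))
        typesToGenerate |= Conll02NameSampleStream.GENERATE_LOCATION_ENTITIES;
      else if ("misc".equals(t))
        typesToGenerate |= Conll02NameSampleStream.GENERATE_MISC_ENTITIES;
    }
    return typesToGenerate;
  }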
Jorn,
I'm also finding some interesting data on the CoNLL.
The baselines for the data are:
I trained the models up to 3000 iterations, with just per types:
This model was not able to recognize any names from my sample sent earlier, with Blanche and Otis as the names used.
Then I trained the models using all 4 groupings… org, per, loc, misc.
This model actually caught Otis in the sample document. Hmmm, maybe pointing to a context situation that let the model see something in the document that it couldn't see in the singly trained model. (hmmmmmmm…)
Now you could start to work on a new dictionary feature. I strongly recommend using more than a few sentences to evaluate the new features. One option you have is to use cross validation, but that only makes sense if the data set contains enough names which are mentioned in only a very few places. If you use a dictionary, it's important not to optimize it to your data set.
Another option you have is to play with the cutoff parameter: you could set it to 0 and train with Gaussian smoothing instead.
Jörn
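(For illustration, a minimal sketch of how a dictionary could be plugged in as a feature generator, assuming the existing DictionaryFeatureGenerator and Dictionary classes. The entries, the "fname" prefix, and the surrounding generators are made up.)

  import opennlp.tools.dictionary.Dictionary;
  import opennlp.tools.util.StringList;
  import opennlp.tools.util.featuregen.AdaptiveFeatureGenerator;
  import opennlp.tools.util.featuregen.AggregatedFeatureGenerator;
  import opennlp.tools.util.featuregen.DictionaryFeatureGenerator;
  import opennlp.tools.util.featuregen.TokenFeatureGenerator;
  import opennlp.tools.util.featuregen.WindowFeatureGenerator;

  // Sketch: build a tiny first-name dictionary and wrap it in a
  // DictionaryFeatureGenerator (in practice the entries would be loaded
  // from the data files, not hard-coded).
  public static AdaptiveFeatureGenerator createDictionaryFeatureGenerator() {
    Dictionary firstNames = new Dictionary();
    firstNames.put(new StringList("Blanche"));
    firstNames.put(new StringList("Otis"));

    // Combine the dictionary feature with the usual token features; the
    // aggregate is then passed to train() and to the NameFinderME
    // constructor as in the earlier sketch.
    return new AggregatedFeatureGenerator(
        new WindowFeatureGenerator(new TokenFeatureGenerator(), 2, 2),
        new DictionaryFeatureGenerator("fname", firstNames));
  }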