Hi Morton and group members,
I am using the OpenNLP NameFinder to extract named entities from a given web page. The process I am currently following is (only for person names, but it can be extended to other entity types as well):
a. Get the news feeds and extract named entities from each feed using the existing model. If the model fails to identify a person's name, it is tagged using a simple regex.
b. Using this data, I re-train the model (I have reduced the cutoff to 3).
c. Then I repeat step a on the previous feeds to see whether the originally missed names get identified, and on some test feeds used to measure the improvement.
But after 2-3 runs the model's quality starts degrading; in other words, the original model appears cleaner than the one I am training.
Can any of you guide me on getting my training data right?
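The regex fallback in step a might look something like the sketch below (the pattern and class name are illustrative guesses, not the actual code from this thread). It also hints at why such auto-tagged data can poison retraining: naive capitalization patterns happily swallow sentence-initial words.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexPersonFallback {
    // Very rough heuristic: an optional honorific followed by two or three
    // consecutive capitalized words. Purely illustrative.
    private static final Pattern PERSON = Pattern.compile(
        "\\b(?:Mr\\.|Mrs\\.|Ms\\.|Dr\\.)?\\s*(?:[A-Z][a-z]+\\s+){1,2}[A-Z][a-z]+\\b");

    // Returns candidate person-name spans the model may have missed.
    public static List<String> findCandidates(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = PERSON.matcher(text);
        while (m.find()) {
            hits.add(m.group().trim());
        }
        return hits;
    }

    public static void main(String[] args) {
        // Note the false positive: the sentence-initial "Yesterday" is
        // absorbed into the "name". Feeding such spans back as gold training
        // data is one way the retrained model degrades.
        System.out.println(findCandidates("Yesterday John Smith met reporters in Mumbai."));
        // prints [Yesterday John Smith]
    }
}
```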
--Thanks and Regards
Vaijanath
Hi,
Are you starting with the model that is distributed with OpenNLP, or only with your own model? If you're starting with the OpenNLP one, then the issue is that you don't have the data the original model was trained on: when you re-train it, you lose the information contained in the original model. The degradation probably starts after the first re-training.
If you are training only on your own data, then this suggests that the new data you are adding may be noise that is degrading your model's performance. Also, if you are only looking at a couple of cases, the model may now miss some cases it didn't before while still performing OK on the whole. You just need to make sure you have a reasonably sized testing corpus to evaluate performance improvements.
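The advice about a held-out testing corpus can be made concrete: keep the test feeds out of training entirely and score every retrained model against the same gold annotations, e.g. with exact-span-match F1. A minimal scorer (the entity strings below are made-up examples, not data from this thread) might look like:

```java
import java.util.HashSet;
import java.util.Set;

public class NerEval {
    // Exact-match F1 over predicted vs. gold entity spans.
    public static double f1(Set<String> gold, Set<String> predicted) {
        Set<String> correct = new HashSet<>(predicted);
        correct.retainAll(gold);  // true positives
        if (gold.isEmpty() || predicted.isEmpty() || correct.isEmpty()) {
            return 0.0;
        }
        double precision = (double) correct.size() / predicted.size();
        double recall = (double) correct.size() / gold.size();
        return 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        Set<String> gold = Set.of("John Smith", "Mary Jones", "Bill Clinton");
        Set<String> pred = Set.of("John Smith", "Mary Jones", "Yesterday John");
        // 2 of 3 predictions correct, 2 of 3 gold entities found:
        // precision = recall = F1 = 2/3.
        System.out.printf("F1 = %.3f%n", f1(gold, pred));
    }
}
```

If this number drops after a retraining round, the newly added (regex-tagged) data is hurting more than it helps.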
Hope this helps...Tom
Hi,
But what is a reasonable (minimum) size for the testing corpus?
Where could I get the data on which the original model was trained?
How could I train the model incrementally? Is there any other approach, given that we don't have the original data?
Thanks
Ashu
Hi,
I would say a reasonable minimum is about 10k-15k sentences. You can get about 13k of that data via:
http://www.cnts.ua.ac.be/conll2003/ner/ but you'll also need to order the text from NIST, which isn't too big of a deal.
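For context, the CoNLL-2003 shared-task files referenced above use one token per line with space-separated columns (token, POS tag, chunk tag, NER tag), blank lines separating sentences, and person tokens tagged I-PER or B-PER. A minimal sketch of pulling person names out of that layout (assuming this column order) could be:

```java
import java.util.ArrayList;
import java.util.List;

public class ConllReader {
    // Collects runs of person-tagged tokens from CoNLL-2003 style lines
    // ("token POS chunk NER"). For simplicity this merges adjacent
    // B-PER/I-PER runs into one name.
    public static List<String> persons(List<String> lines) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : lines) {
            String[] cols = line.trim().split("\\s+");
            String tag = cols[cols.length - 1];  // NER tag is the last column
            if (tag.endsWith("-PER")) {
                if (current.length() > 0) current.append(' ');
                current.append(cols[0]);
            } else if (current.length() > 0) {
                out.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) out.add(current.toString());
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "John NNP I-NP I-PER",
            "Smith NNP I-NP I-PER",
            "visited VBD I-VP O",
            "London NNP I-NP I-LOC");
        System.out.println(persons(lines)); // prints [John Smith]
    }
}
```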
There isn't currently a good workaround for this. I've been looking at setting up a service that would let people annotate their own data, train models on that plus other data that I can't distribute, and then download their model, but I'm not there yet.
Hope this helps...Tom
Thanks Tom,
But I am unable to open the specified link. Could you provide another link where I can get the CoNLL-2003 corpus?
Thanks
Ashu
Hi,
This is the only source I know of for this data...Tom