I have just started using OpenNLP, mainly for named entity recognition.
I am using OpenNLP primarily to find people and organisation names in the text for an HTML scraper that parses company details from the HTML.
I wonder if anybody could point me in the right direction on a few questions; I am looking for any sort of guidance or input at all.
I would like to be able to tell the difference between a normal sentence like:
"J & E Hall provides the complete service to refrigeration and air conditioning users. Over 300 staff are employed, including around 120 service engineers throughout the UK."
Obviously, at a very basic level, I could compare string sizes or use regexes to check capitalisation or punctuation, but I was wondering whether there is any way to tell the difference using OpenNLP. That would help me better understand the framework.
What would be even better is if I could tell that a phrase like "Contact this company" was not part of an address but "Questor House" was. Again, at a very basic level, capitalisation of the first letter of every word is something that could be used, but it is definitely not guaranteed. Is there any other logic that can be used?
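For context, the basic person/organisation lookup I am doing follows the standard OpenNLP name finder workflow: load a `TokenNameFinderModel`, tokenise the text, and call `NameFinderME.find`. A minimal sketch (the model path and sample sentence are just placeholders; it needs the opennlp-tools jar on the classpath):

```java
import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import opennlp.tools.util.Span;

public class OrgFinder {
    public static void main(String[] args) throws Exception {
        // Path is an assumption; point it at your copy of the
        // pre-trained en-ner-organization.bin model.
        InputStream in = new FileInputStream("en-ner-organization.bin");
        TokenNameFinderModel model = new TokenNameFinderModel(in);
        in.close();

        NameFinderME finder = new NameFinderME(model);

        // The name finder operates on tokens, not raw strings.
        String[] tokens = WhitespaceTokenizer.INSTANCE.tokenize(
            "J & E Hall provides the complete service to refrigeration "
            + "and air conditioning users .");

        // Each Span holds the start/end token indices of a detected name.
        Span[] spans = finder.find(tokens);
        for (String name : Span.spansToStrings(spans, tokens)) {
            System.out.println("organization: " + name);
        }

        // Reset document-level adaptive data before the next document.
        finder.clearAdaptiveData();
    }
}
```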
Also, with regards to named entity recognition, I have been using en-ner-organization.bin and it has been recognising text like
"TERMS & CONDITIONS"
as an organisation. I can see why, as it is capitalised and on a line of its own, but is there anything that can be done to train the model that this is not an organisation name, apart from something basic like a dictionary of excluded terms?
I love the framework so far and I want to know more.
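Even the basic "dictionary of excluded terms" idea mentioned above works as a simple post-processing step on the name finder's output. A minimal sketch (the class name and phrase list are illustrative, not part of OpenNLP):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class NameFilter {
    // Hypothetical exclusion list; extend it with whatever boilerplate
    // phrases your scraper keeps misclassifying as organisations.
    private static final Set<String> EXCLUDED = new HashSet<>(Arrays.asList(
        "terms & conditions", "contact this company", "privacy policy"));

    /** Drops detected names that match the exclusion list (case-insensitive). */
    public static List<String> filter(List<String> detectedNames) {
        List<String> kept = new ArrayList<>();
        for (String name : detectedNames) {
            if (!EXCLUDED.contains(name.toLowerCase(Locale.ENGLISH))) {
                kept.add(name);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("TERMS & CONDITIONS", "HALL (J&E) LTD");
        System.out.println(filter(names)); // prints [HALL (J&E) LTD]
    }
}
```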
OpenNLP does not have a component which can detect the structure of the text, e.g. headlines and sub-headlines, or distinguish such pieces of text from real sentences.
Sadly, the name finder training data consists of news articles which are quite a few years old. Right now we do not have a system where our users can complement our training data with text pieces which are detected incorrectly.
To get around your issue you would have to get a corpus and then extend it with your text snippets which are not detected correctly. James Kosin added support to OpenNLP to convert the Conll03 data into the name finder training format. See the wiki page he created for further information:
The Conll03 data provides annotation for a Reuters corpus; you can get both free of charge after doing a little paperwork.
Note: Conll03 support is only in the current 1.5.1-SNAPSHOT version, not in the released 1.5.0 version.
The OpenNLP NER models are trained on MUC data, which is different from the Conll03 data, so your results will likely differ.
Hope that helps,
Thank you for your detailed answer, Jörn.
As far as assembling my own corpus goes, could I put the tags in the HTML, or would I need to extract the data from the HTML, like this:
HALL (J&E) LTD
191 Hawley Road
Tel:+44 (0)1322 223456
Fax: +44 (0)1322 291458
If you train it entirely yourself I would recommend keeping the HTML tags; if you want to complement your training data with an existing corpus you should remove them.
I think the HTML tags contain important information about the structure which the name finder could use. It will quickly learn that HTML tags are never part of a name.
After you have tagged your first 100 or 200 snippets you can do semi-automatic tagging, where you let the name finder tag a snippet first, then you verify or correct it and add the snippet to the training data.
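For reference, the OpenNLP name finder training format expects one tokenised sentence per line, with each name wrapped in <START:type> ... <END> markers. A hypothetical snippet built from the company listing above might look like:

```
<START:organization> HALL ( J&E ) LTD <END>
The registered office is at 191 Hawley Road .
Over 300 staff are employed by <START:organization> J & E Hall <END> throughout the UK .
```

An empty line marks a document boundary, which is also where the name finder's adaptive data is cleared during training.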