I want to use OpenNLP to extract organisation names, locations and people from company descriptions like this:
Aardvark Clear Mine Ltd
Tel: +44 (0)1464 820122
Fax: +44 (0)1464 820985
Managing Director: David Sadler
I am using the en-ner-organization.bin, en-ner-person.bin and en-ner-location.bin models with separate instances of the NameFinderME class.
The problem is that en-ner-organization.bin is not giving very accurate results in this context. For example, given the above text, only Clear Mine is tagged as a company, and Shevock Estate is also tagged as an organisation, I am guessing because of the capitalised first letters. I believe the models were trained on news text, which would make this context very different, especially with respect to capitalisation.
Would I get better results if I constructed a corpus of company names and addresses and retrained the model?
Also, should I create my own tokeniser that splits the addresses line by line, or is it better to evaluate the whole text?
It will probably depend on the size of the possible training set. Usually, though, if you are collecting the information yourself, the data tools will put it in a specific order. A good starting question would be: how are you collecting the information?
I am screen scraping the details from directory websites like this: http://www.armedforces.co.uk/companies/co/h.
I am stripping out the HTML and I wanted to use NLP to get better results. The problem is that the company name can appear in a number of different places in the markup, or you might get an ambiguous sequence of text like the one below:
191 Hawley street
In the above text, it is hard for the parser to deduce which of the PRODUCTS/SERVICE, Head Office or J&E Hawley entries is the company name. Sometimes the markup helps, but it cannot be relied upon.
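For what it's worth, here is a minimal sketch of the "strip the HTML but keep the line structure" step, using only Python's standard html.parser module. The set of tags treated as line breaks and the names TextLines and html_to_lines are my own assumptions, not anything from the site or from OpenNLP. Script and style content is not filtered out here.

```python
from html.parser import HTMLParser

# Tags treated as line breaks. This set is an assumption; adjust it per site.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "td", "h1", "h2", "h3"}


class TextLines(HTMLParser):
    """Collects visible text, starting a new line at each block-level tag."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self.current = []

    def flush(self):
        # Join the buffered text chunks and collapse runs of whitespace.
        text = " ".join(" ".join(self.current).split())
        if text:
            self.lines.append(text)
        self.current = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        self.current.append(data)


def html_to_lines(html):
    """Return the text of an HTML fragment as a list of non-empty lines."""
    parser = TextLines()
    parser.feed(html)
    parser.close()
    parser.flush()
    return parser.lines


sample = ("<div><h2>Aardvark Clear Mine Ltd</h2>"
          "<p>Tel: +44 (0)1464 820122<br>Managing Director: David Sadler</p></div>")
print(html_to_lines(sample))
```

Keeping one entry per line preserves the ordering information from the page, which is about all that is left once the markup is gone.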
As you already know, our organization model, trained on old news, will not really help you.
You have to train the name finder yourself on your data to get better results.
With your own training data you can even use the html tags and optimize the
Try to get started with a few hundred scraped samples and see how well the
name finder performs.
Then we can see what could be done to get better results.
If you have any questions about how to do the training, please just ask
Hope this helps,
Excellent stuff and thank you for your answer.
I have an automated process that spiders around websites like the one I mentioned, so I should not have too many problems getting a few hundred samples.
I will see how I get on with the wiki and the tests in the opennlp source as my guide for training.
The training set needs to be tagged.
For the name finder, the simplest format is the <START> name goes here <END> … series, for example:
<START> Jane Boule <END> wrote her best works in the evening.
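If you already know some of the names in your scraped samples, tagging can be partly automated. The sketch below is only an illustration under a few assumptions: it splits on whitespace (real training data should be tokenized the same way as the text you will later run the name finder on), the function name annotate is made up, and the typed form <START:person> is the variant described in the OpenNLP manual for marking the entity type.

```python
def annotate(line, names, entity_type=None):
    """Wrap each occurrence of a known name in <START> ... <END> tags.

    line        -- one line of plain text
    names       -- known names to tag, e.g. ["David Sadler"]
    entity_type -- optional type, giving <START:type> instead of <START>
    """
    tokens = line.split()
    start_tag = "<START:%s>" % entity_type if entity_type else "<START>"
    # Try longer names first so "Jane Boule" wins over a shorter "Jane".
    name_tokens = sorted((n.split() for n in names), key=len, reverse=True)

    out = []
    i = 0
    while i < len(tokens):
        for nt in name_tokens:
            if nt and tokens[i:i + len(nt)] == nt:
                out.extend([start_tag] + nt + ["<END>"])
                i += len(nt)
                break
        else:
            # No known name starts at this token; copy it through unchanged.
            out.append(tokens[i])
            i += 1
    return " ".join(out)


print(annotate("Jane Boule wrote her best works in the evening.", ["Jane Boule"]))
print(annotate("Managing Director: David Sadler", ["David Sadler"], "person"))
```

The tags have to be separated from the surrounding tokens by whitespace, which the final join takes care of.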
Also, the other thing would be to have the web pages separated somehow so the name finder can be reset between them while training. I'm guessing the best approach would be to create an HTML document parser that splits the input into pages using the <html> and </html> tags that mark the start and end of each page.
I like the HtmlDocumentParser idea a lot, and I am going to do some experimentation. I'll try writing the parser in JRuby in the first instance, as it will let me move quicker and gives me access to the libxml2 parser via the nokogiri gem, which is very good for parsing HTML. I am not sure if there is anything in Java that acts as a wrapper around libxml2 and libxslt2.
You raise an interesting point when you mention resetting the name finder after each document.
How and where are the <START> and <END> tags defined?
Also, in normal document training, how are the beginning and end of a document "marked", or is the training done on separate files?
Thanks for the idea!
An empty line in the training data marks the document boundary and is used to reset the adaptive feature generation. Right now only the Previous Map Feature Generator is adaptive and maybe not that useful in your case. It remembers how tokens have been classified before.
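To make the empty-line convention concrete, here is a small sketch that joins several already tagged documents into one training text, with one blank line between documents. The function name to_training_text and the sample lines are illustrative assumptions only.

```python
def to_training_text(documents):
    """Join annotated documents into a single training text.

    documents -- list of documents, each a list of annotated lines.
    Blank lines inside a document are dropped so that they are not
    mistaken for document boundaries.
    """
    blocks = []
    for doc in documents:
        lines = [line.strip() for line in doc if line.strip()]
        if lines:
            blocks.append("\n".join(lines))
    # Exactly one empty line between documents marks the boundary.
    return "\n\n".join(blocks) + "\n"


docs = [
    ["<START:organization> Aardvark Clear Mine Ltd <END>",
     "Managing Director: <START:person> David Sadler <END>"],
    ["<START:person> Jane Boule <END> wrote her best works in the evening."],
]
print(to_training_text(docs), end="")
```

Each scraped page would become one entry in the list, so the adaptive features are reset once per page.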
Hope that helps,