
Organisation Name Recognition

Forum: Help
Creator: Anonymous
Created: 2010-09-29
Updated: 2013-04-16

  • Anonymous
    2010-09-29

    Hi,

    I want to use OpenNLP to extract organisation names, locations and people from company descriptions like this:

    Aardvark Clear Mine Ltd
    Aberdeen
     Shevock Estate
     Insch
     Aberdeenshire
    AB52 6XQ
     United Kingdom
    Tel:  +44 (0)1464 820122
    Fax:  +44 (0)1464 820985
    Website:  
    Managing Director:  David Sadler
    

    I am using the en-ner-organization.bin, en-ner-person.bin and en-ner-location.bin models with separate instances of the NameFinderME class.

    The problem is that en-ner-organization.bin is not giving very accurate results in this context. For example, given the above text, only "Clear Mine" is tagged as a company, and "Shevock Estate" is also tagged as an organisation, I am guessing because of the capitalisation of the first letters. I believe the models were trained on news text, which would make this context very different, especially with respect to capitalised words.

    Would I get better results if I constructed a corpus of company names and addresses and retrained the model?

    Also, should I create my own tokeniser that splits the addresses line by line, or is it better to evaluate the whole text?

    Thank You

    Paul

     
  • James Kosin
    2010-09-30

    Paul,

    It will probably depend on the size of the available training set. Usually, if you are collecting the information yourself, the collection tools will put the information in a specific order. A good starting question: how are you collecting the information?

    James

     

  • Anonymous
    2010-09-30

    Hi James,

    I am screen scraping the details from directory websites like this: http://www.armedforces.co.uk/companies/co/h.

    I am stripping out the HTML, and I wanted to use NLP to get better results. The problem is that the company name can appear in a number of different places in the markup, or you might get an ambiguous sequence of text like below:

    PRODUCTS/SERVICES
    Head Office:
    J&E LTD
    191 Hawley street
    etc.

    In the above text, it is hard for the parser to deduce which of the entries (PRODUCTS/SERVICES, Head Office, or J&E LTD) is the company name. Sometimes the markup helps, but it is something that cannot be relied upon.

    Cheers

    Paul

     
  • Joern Kottmann
    2010-09-30

    Our organization model, trained on old news text, will not really help you, as you already know.
    You have to train the name finder yourself on your data to get better results.
    With your own training data you can even use the HTML tags and optimize the
    feature generation.

    Try to get started with a few hundred scraped samples and see how well the
    name finder performs.

    Then we can see what could be done to get better results.
    If you have any questions about how to do the training, please just ask.

    Hope this helps,
    Jörn

     

  • Anonymous
    2010-09-30

    Excellent stuff and thank you for your answer.

    I have an automated process that spiders around websites like the one I mentioned, so I should not have too many problems getting a few hundred samples.

    I will see how I get on, with the wiki and the tests in the OpenNLP source as my guide for training.

     
  • James Kosin
    2010-10-01

    Paul,

    The training set needs to be tagged.
    For the name finder, the simplest format is the <START> name goes here <END> markup.

    example:

    <START> Jane Boule <END> wrote her best works in the evening.
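    Just to make the markup concrete, here is a rough sketch of how those tags delimit a name. This is only an illustration of the format, not OpenNLP's own training-data reader:

```java
import java.util.ArrayList;
import java.util.List;

public class AnnotationDemo {
    // Extracts the annotated names from a <START> ... <END> tagged sentence.
    static List<String> extractNames(String line) {
        List<String> names = new ArrayList<>();
        StringBuilder current = null;
        for (String token : line.trim().split("\\s+")) {
            if (token.equals("<START>")) {
                current = new StringBuilder();          // a name begins here
            } else if (token.equals("<END>")) {
                if (current != null) {
                    names.add(current.toString().trim()); // the name is complete
                    current = null;
                }
            } else if (current != null) {
                current.append(token).append(' ');      // token inside a name
            }
        }
        return names;
    }

    public static void main(String[] args) {
        String sample = "<START> Jane Boule <END> wrote her best works in the evening.";
        System.out.println(extractNames(sample)); // prints [Jane Boule]
    }
}
```

    A real training file would have one such annotated sentence per line; one of Paul's records might start, for example, with <START> Aardvark Clear Mine Ltd <END> followed by the address tokens.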
    

    Also, the other thing would be to have the web pages separated somehow so the name finder could be reset while training.  I'm guessing the best approach would be an HTML document parser that splits the dump at the tags that mark the start and end of each page, i.e. the <html> … </html> tags.
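    The page-splitting idea can be sketched without any HTML library. A minimal version (assuming the scraped dump really is a concatenation of complete <html> … </html> documents) just cuts the text at each closing tag:

```java
import java.util.ArrayList;
import java.util.List;

public class PageSplitter {
    // Splits a dump of concatenated pages into one chunk per page,
    // cutting after each closing </html> tag.
    static List<String> splitPages(String dump) {
        List<String> pages = new ArrayList<>();
        String lower = dump.toLowerCase();
        int start = 0;
        int end;
        while ((end = lower.indexOf("</html>", start)) >= 0) {
            int cut = end + "</html>".length();
            pages.add(dump.substring(start, cut).trim());
            start = cut;
        }
        return pages;
    }

    public static void main(String[] args) {
        String dump = "<html>page one</html>\n<html>page two</html>";
        System.out.println(splitPages(dump).size()); // prints 2
    }
}
```

    Each chunk could then be stripped and annotated as one document, so the name finder's per-document state lines up with the real page boundaries.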

    James

     

  • Anonymous
    2010-10-02

    Hi James,

    I like the HtmlDocumentParser idea a lot; I am going to do some experimentation.  I'll try to write the parser in JRuby in the first instance, as it will allow me to move more quickly, and I also have access to the libxml2 parser via the nokogiri gem, which is very good for parsing HTML.  I am not sure if there is anything in Java that acts as a wrapper around libxml2 and libxslt.

    You do raise an interesting point when you mention resetting the name finder after each document.

    How and where are the <START> and <END> tags defined?

    Also, in normal document training, how are the documents "marked" with a beginning and end, or is the training done on separate files?

    Thanks for the idea!

    Paul

     
  • Joern Kottmann
    2010-10-02

    An empty line in the training data marks a document boundary and is used to reset the adaptive feature generation. Right now only the PreviousMapFeatureGenerator is adaptive, and it is maybe not that useful in your case: it remembers how tokens have been classified before.
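    To illustrate, the convention amounts to grouping lines into documents at the empty lines. This is a sketch of the bookkeeping only, not OpenNLP's actual stream classes:

```java
import java.util.ArrayList;
import java.util.List;

public class DocumentSplitter {
    // Groups training lines into documents, treating empty lines as the
    // boundaries at which adaptive feature generators would be reset.
    static List<List<String>> splitDocuments(List<String> lines) {
        List<List<String>> documents = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                if (!current.isEmpty()) {   // boundary: close current document
                    documents.add(current);
                    current = new ArrayList<>();
                }
            } else {
                current.add(line);          // still inside the same document
            }
        }
        if (!current.isEmpty()) {
            documents.add(current);         // last document has no trailing blank
        }
        return documents;
    }

    public static void main(String[] args) {
        List<String> lines = java.util.Arrays.asList(
                "<START> Acme Ltd <END> Insch Aberdeenshire",
                "",
                "<START> J&E LTD <END> 191 Hawley street");
        System.out.println(splitDocuments(lines).size()); // prints 2
    }
}
```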

    Hope that helps,
    Jörn