Training data format for NE Recogniser

  • Anonymous - 2010-06-01

    I am using the OpenNLP wrapper for UIMA to recognise Named Entities. The Person Finder works for English names, but misses out on non-English names. I want to train the model with the names it fails to identify. Can someone tell me where to find training data? As I understand it, I need the full training data set that the model was originally trained with, add the names I want (in the same format), and then train the model.

    I am also wondering whether the Named Entity Finder uses some kind of dictionary, so that I could simply add the missing names to it.

    Thanks for any help.

  • Joern Kottmann

    Joern Kottmann - 2010-06-01

    Sorry for the late reply.
    The name finder is trained on the MUC-6 training data.

    You should retrain the models with your additional data, as you described.
    The models as they are right now cannot use a dictionary, because they were trained
    without one.

    Hope that helps,

  • Anonymous - 2010-06-01

    MUC 6 is not freely available. Please let me know if there is anywhere I can get a full or partial copy of it for free. I need to play with creating models to see if it fits my work (before buying the corpus).

    Is it possible to train the model with "CONLL shared task corpora"?

    Can you please give me an example of what the MUC 6 training data looks like, so that I can create a few sample training data docs?




  • Joern Kottmann

    Joern Kottmann - 2010-06-01

    I am not aware of a source where you can get MUC 6 data for free.

    There are basically two ways to train the name finder, either via a
    training file or via its API.

    OpenNLP has a one sentence per line name finder training format.
    Here is a sample of one sentence:
    So I called <START> Julie <END> , a friend who's still in contact with him .

    Between each document you should have an empty line; the empty line
    causes a reset of the adaptive document features.
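    To make the format concrete, here is a small sketch in plain Python (not OpenNLP code) that parses such a line into tokens and name spans; it assumes whitespace tokenisation, as in the sample sentence above:

    ```python
    # A minimal sketch (plain Python, not OpenNLP code) of how the
    # <START> ... <END> training format can be read: tokens are
    # whitespace-separated and the tags mark name boundaries.

    def parse_annotated_sentence(line):
        """Return (tokens, spans); spans are (begin, end) token offsets."""
        tokens, spans = [], []
        start = None
        for tok in line.split():
            if tok == "<START>":
                start = len(tokens)              # name begins at the next token
            elif tok == "<END>":
                spans.append((start, len(tokens)))
                start = None
            else:
                tokens.append(tok)
        return tokens, spans

    tokens, spans = parse_annotated_sentence(
        "So I called <START> Julie <END> , a friend who's still in contact with him ."
    )
    names = [" ".join(tokens[b:e]) for b, e in spans]
    ```

    For the sample sentence, `names` comes out as `["Julie"]` and the tags themselves never become tokens.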

    A larger sample can be found in the CVS at this path:

    To use the API, call the train method of NameFinderME; it expects an
    EventStream. Usually you use the NameFinderEventStream class and implement
    a custom NameSampleStream, which is responsible for retrieving your training data
    and producing NameSample objects from it.
    For a sample, have a look at the main method of the NameFinderME class.

    The models on the website are all trained with a cutoff of 5 and 100 iterations.
    Especially when you add new names, you should make sure every name
    occurs at least 5 times or reduce the cutoff.


  • Anonymous - 2010-06-02

    Thanks a lot for the information.

    I am actually using the UIMA wrapper for OpenNLP. There is a descriptor file for training the name finder. I am not sure whether this descriptor file is complete, or how to run it. I tried running it in the CAS Visual Debugger with no luck.

    I will try following your instructions to use OpenNLP tools directly instead. In the meantime, if you know how to use the UIMA wrapper to train models, please let me know.



  • Joern Kottmann

    Joern Kottmann - 2010-06-03

    Actually I did write most of the UIMA wrapper code.

    Yes, it's possible to train the name finder with the UIMA wrapper; there is a NameFinderTrainer consumer.
    The descriptor must be set up to use your types; it must know your token, sentence and name annotation names.
    In the descriptors folder there is one sample name finder trainer descriptor; that file should work for you after you change
    the type system and type names to match your new type system.

    There is a runCPE script; I suggest you use that. Here you can learn more about it:

    For testing you could use one of the sample annotators and let the name finder learn its output, e.g. the MeetingFinder.

    It looks like you are from a research background. In OpenNLP 1.5 I added evaluation for most of the components; that could
    be interesting for you if you need to measure the performance on some test data or via cross validation.

    Hope that helps,

  • Anonymous - 2010-06-03

    Thanks a lot again. That's great information.

    I was trying to train a model using the sample data file "AnnotatedSentences.txt", which is available on the download site. I simply passed the arguments to "NameFinderME.main()". The source code of opennlp-tools-1.4.2 is also added to my Eclipse project. I set the "cutoff" to ONE, as the sample data does not have names repeated more than 5 times. It builds a model, but the model doesn't work (in the UIMA Document Analyzer). I zipped up the model file, as the UIMA wrapper needs zipped models. The error I get now is:

       Error: attempting to load a S model as a GIS model

    I have no idea what this error means!

    I tried the descriptor file in the UIMA wrapper too. I have not changed the type system (the one used by all the annotator descriptors that came with the UIMA wrapper). I need to have a look at it again after seeing your reply.

    Do I need to create an aggregate descriptor to make a pipeline of tokenising, sentence chunking, etc. (the trainer descriptor is not an aggregate descriptor)?

    Yes, I am not a hard-core developer; I have a research background (good guess). I would love to try testing/evaluation etc. once I manage to build models with my own data.

    Thanks again for your willing help.

    Rohana Rajapakse

  • Anonymous - 2010-06-10

    I have now changed the model writer in NameFinderME.main() to "BinaryGISModelWriter" (the original was "SuffixSensitiveGISModelWriter"). I can now build a model, but my model does not find any Person entities.
    The original model does. I debugged the code a bit and found that the bestSequence returned by beam.bestSequence(tokens, additionalContext) (in the NameFinderME.find(…) method) only has the outcome "other". When I use the original model, I can see
    "start" and "cont" appear where the token list has names. Any clue what I am doing wrong?

    I am a bit concerned now that there may be a problem with the UIMA version I am using (2.3.0).

    I have also tried the PersonNameFinderTrainer.xml descriptor in runCPE. It throws the following exception:

    Exception in thread "main" org.apache.uima.util.InvalidXMLException: An object of class org.apache.uima.collection.metadata.CpeDescription was requested, but the XML input contained an object of class org.apache.uima.collection.impl.CasConsumerDescription_impl. 
    at org.apache.uima.util.impl.XMLParser_impl.parseCpeDescription(
    at org.apache.uima.examples.cpe.SimpleRunCPE.<init>(
    at org.apache.uima.examples.cpe.SimpleRunCPE.main(

    Any ideas why?

    Thanks for any help.

  • Joern Kottmann

    Joern Kottmann - 2010-06-11

    You need to create a CPE descriptor; it then specifies a collection reader and the PersonNameFinderTrainer.xml as a CAS Consumer. Maybe you can have a quick look over the UIMA documentation; it's explained there. Otherwise the UIMA tutorial is also worth a look. It should work fine with UIMA 2.3.0.

    If you are interested we could start writing a small tutorial in the new wiki to explain how to get this working.

    Maybe the cutoff is too high and all your names just get removed from the training data? Can it detect names in the training data itself, after training?

    What was your issue with the SuffixSensitiveGISModelWriter? Maybe there is a bug which should be fixed.

    Hope that helps,

  • Anonymous - 2010-06-14

    It is not clear to me how the cut-off works. First I trained the model with cut-off 1 using the sample training data file "AnnotatedSentences.txt". It could not detect any names in the training data file itself. However, when I trained the model again using cut-off 3 with the same training data repeated three times, the model was able to detect names. Now I have trained the model with a bigger set of ~1000 small documents (from Reuters). The model is able to detect names, but there are lots of misses and lots of misrecognitions ("So", "therefore" etc. as names?). This is the same regardless of the cut-off I use. Are there any sentence patterns that you should avoid having in the training data file? How big is the training data used to train the pre-trained model? Anything else I should know regarding training data?

    The SuffixSensitiveGISModelWriter does create a model, but when I try to use it, it throws an exception (something to do with loading an "S model"…). This could be because the code uses a BinaryGISModelReader to read in the model…

    I will have a look at the UIMA documentation about creating a CPE descriptor. I thought I could use "PersonNameFinderTrainer.xml" in runCPE directly.

    OK to start writing a tutorial. It would be good to put together sample descriptor(s), training data and AE code to get it started for people like me.


  • Joern Kottmann

    Joern Kottmann - 2010-06-16

    During training of the model, the name finder generates features; every feature which is seen fewer than n times is just removed. So with a cutoff of 5, a "name token" must appear 5 times for the features generated for it to be included in the model, where a "name token" is a token which is part of a named entity.
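    As a rough illustration of that rule (a sketch, not the OpenNLP implementation; the feature strings are made up), you can think of it as dropping every feature that occurs fewer than `cutoff` times across the training events:

    ```python
    from collections import Counter

    # Rough sketch of the cutoff rule (not the OpenNLP implementation):
    # count how often each feature occurs over all training events and
    # drop every feature seen fewer than `cutoff` times.

    def apply_cutoff(events, cutoff):
        counts = Counter(f for feats in events for f in feats)
        return [[f for f in feats if counts[f] >= cutoff] for feats in events]

    # Hypothetical features: "w=Tom" occurs twice, "w=Julie" only once.
    events = [["w=Tom", "cap=true"], ["w=Tom", "cap=true"], ["w=Julie", "cap=true"]]
    filtered = apply_cutoff(events, cutoff=2)
    ```

    With `cutoff=2` the rare `w=Julie` feature is removed while the frequent `cap=true` survives, which is why a new name needs to occur often enough (or the cutoff has to be lowered) before it can influence the model.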

    In my experience, the way the name finder performs is traceable in the training data. Just use a text editor and search for the cases which are detected incorrectly. Maybe do a search for "therefore" in your training data and check whether it is annotated correctly; maybe it does not occur more frequently than your cutoff value. You can also check the context and see how that is annotated in your training data.

    Hope that helps,

  • Anonymous - 2010-06-21

    Thanks for your reply.

    I tried training a model with the sample data file (available for download) again, and tried an unannotated version of it (made by removing all start and end tags) to test it. The results are very poor and unpredictable. I wonder if anyone has tried training a model with such a small set of data (the sample data file)? The model trained with cut-off 1 does not find any names, while the model trained with cut-off 3 finds a couple of names that occur only once in the test file. The only conclusion I can draw is that either the training is not working OR the training data file doesn't have enough samples in it…

    By the way, I have almost given up on using the UIMA CPE to train models. I just couldn't get it to work. I think we really need a simple tutorial with sample code/data to help people like me get started.

    In the meantime, I have looked at Julie Lab's JNET. It works fine and its name detection results are predictable. Again, I did not manage to get the UIMA wrapper to work. Anyway, I prefer OpenNLP for its simple training data file format and its possibly wider use by other open source projects…

  • Joern Kottmann

    Joern Kottmann - 2010-06-21

    > I tried training a model with the sample data file (available to download) again, and tried unannotated version of it (by removing all start and end tags) to test it. The results are very poor and unpredictable. I wonder if anyone has tried training a model with such a small set of data (sample data file)? The model trained with cut-off 1 do not find any names while the model trained with cut-off 3 finds a couple of names that occur only once in the test file. The only conclusion I can draw is that, either the training is not working OR the training data file doesn't have enough samples in it…

    The small file just does not contain enough data to train a name finder model. We use it only for unit testing, to make sure the code does not have runtime errors.

    Did you train JNET with the Reuters corpus? Do you have a link where I can obtain it?

    I will try tonight to make a small UIMA sample for you.


  • Anonymous - 2010-06-22

    Yes, I know that the small training data file does not contain enough data, but I expect the model trained with that data file to be able to detect names in the same data file (with annotations removed). What do you think?

    I have trained JNET with the CONLL corpus. CONLL is not really a corpus, but a subset of the Reuters collection. Have a look here

    The tricky bit is to get the train/test data sets built. CONLL has Linux script files to build them; the Linux shell scripts make calls to a couple of Perl scripts too. You can build a model with or without using POS tags. CONLL's train/test data files need to be pre-processed to extract only the columns you need. I created two files out of the train data set. They are in the following format:

      Token file            POS file
      ==========            ========
      My     O              My     PRP$
      name   O              name   NN
      is     O              is     VBZ
      Tom    person         Tom    NNP
      .      O              .      .

    Test files are the same but have no entity types provided (e.g. Tom   O).

    When you have these files, you can use JNET to create a PPD file out of them. It is this PPD file that JNET needs for training/testing.
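    For OpenNLP, the same two-column token data can be converted into the <START>/<END> sentence format. Here is a small sketch (the "person"/"O" tags follow the example above; this is illustrative code, not part of the CONLL scripts):

    ```python
    # Sketch: turn (token, entity-tag) rows, as in the token file above,
    # into one OpenNLP-style training sentence. "O" means outside a name;
    # anything else is treated as part of a name.

    def columns_to_opennlp(rows):
        out, inside = [], False
        for token, tag in rows:
            if tag != "O" and not inside:
                out.append("<START>")
                inside = True
            elif tag == "O" and inside:
                out.append("<END>")
                inside = False
            out.append(token)
        if inside:                      # name runs to the end of the sentence
            out.append("<END>")
        return " ".join(out)

    rows = [("My", "O"), ("name", "O"), ("is", "O"), ("Tom", "person"), (".", "O")]
    sentence = columns_to_opennlp(rows)
    ```

    For the example rows this produces "My name is <START> Tom <END> .". Note the sketch handles a single entity type only; adjacent names with different tags would be merged into one span.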

    I may be able to pass you the train/test files if you need them. CONLL is free but Reuters is not, so CONLL does not provide the necessary Reuters data, only the scripts and some list files to generate the train/test data. Anyway, if you have a collection of annotated data, then you can easily create train files for JNET in the required format. What data do you use to train the OpenNLP models (MUC?)? I don't have MUC with me. Any possibility of getting hold of a full or partial copy of the training data?


  • Anonymous - 2010-06-23

    Thanks for the link. I will certainly try training OpenNLP and JNET with that data.

    By the way, if you get the UIMA descriptor to train OpenNLP models working, please pass me whatever is necessary (if you don't mind) so that I can try it.

