OpenNLP Tools 1.5.0 released

Developers
2010-09-23
2013-08-22
  • Joern Kottmann
    2010-09-23

    We are proud to announce the release of OpenNLP Tools 1.5.0.
    It has been a long time since our last release, and we have added a lot of
    new features that make OpenNLP easier to use.

    Here are the highlights:
    Model packages now group all resources needed by a component in a zip package
    together with metadata. A component can be instantiated from this single zip package resource
    instead of loading multiple resources depending on the training setup.
    The command line interface has been rewritten and extended.
    Built-in evaluation for most components.
    New training API for most components.
    Training support for conll02 and conll06 data.
    The POS Tagger can now use a perceptron (sequence) model.
    The license was changed from LGPL to ASL.
    The Ant build system was replaced by Maven.

    Jörn

     
    • Weihong Zhang
      2013-08-22

      I am using a few R packages (openNLP 0.2-1, webmining, sentiment) to extract some sentences about the stock JPM, but I am running into the following error:

      Error in eval(expr, envir, enclos) : could not find function "sentDetect"

      Here is the code I used. I made sure that all packages are installed. I checked the "corpus" variable and it is "a corpus with 20 text documents". I also used "library(help=openNLP)" to list all the functions in the package "openNLP" but did not find "sentDetect" in the list.

      library(XML)
      library(tm)
      library(tm.plugin.webmining)
      library(tm.plugin.sentiment)
      library(NLP)
      library(openNLP)
      stock <-"JPM"
      corpus <- WebCorpus(GoogleFinanceSource(stock))
      sentences <- sentDetect(corpus)

      Here is the running environment. Is it possibly related to the R 3.0.1 version (too new for openNLP) or 64-bit Windows system?

      R version 3.0.1 (2013-05-16) -- "Good Sport" Copyright (C) 2013
      The R Foundation for Statistical Computing
      Platform: x86_64-w64-mingw32/x64 (64-bit)

      Thank you very much.

      Weihong

       
  • Martin
    2010-11-18

    Hi, I have two questions:
    1) Can I download the 1.5 version with the source code? What I downloaded from this link only contains a .jar file.
        http://opennlp.sourceforge.net/

    2) In order to run POS tagger, do I have to run SentenceDetector & Tokenizer first? The OpenNLP instruction gives the following example:
    "
    POS Tagging:

    bin/opennlp SentenceDetector models/en-sent.bin < text |
    bin/opennlp TokenizerME models/en-token.bin |
    bin/opennlp POSTagger models/en-pos-maxent.bin

    "

    Can I directly run bin/opennlp POSTagger models/en-pos-maxent.bin without running the first two?

    Thanks.

     
  • Joern Kottmann
    2010-11-18

    The SourceForge download page also offers a source release. Please click on "View all files", next to the download button.

    The POS Tagger analyzes one tokenized sentence at a time. Using the sentence detector and tokenizer as in the example above is one way to produce such input and is intended as a demonstration only.

    Depending on your use case you might already have text which is segmented into sentences and tokens. The API of the POS Tagger could be used to directly pass these sentences to the POS Tagger without using a command line or file system based interface.

    Hope that helps,
    Jörn

     
  • Martin
    2010-11-19

    Thanks joernkottmann.

    Another question: the 'find' method, 'Span[] find(String[] tokens)', takes an array of String as its argument. Why doesn't it take a String directly, which is typically a sentence? As it is, a sentence string first has to be converted into a String array, which seems very inconvenient.
     

     
  • Joern Kottmann
    2010-11-19

    Hi,

    The name finder's find method expects a string array. This string array "models" a tokenized sentence: each string is one token, and all tokens in the array form a sentence.
    To just feed it with a string, the token spans must still be passed. Such a method could be added but is not there
    right now.
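
To illustrate what "token spans" over a plain sentence string could look like, here is a minimal sketch in plain Java. It assumes whitespace-separated tokens, as in the examples in this thread. The class WhitespaceSpans and its methods are hypothetical and are not part of OpenNLP.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not an OpenNLP class.
class WhitespaceSpans {

    /** Returns {start, end} character offsets (end exclusive) for each whitespace-separated token. */
    static List<int[]> tokenSpans(String sentence) {
        List<int[]> spans = new ArrayList<>();
        int start = -1; // -1 means we are currently not inside a token
        for (int i = 0; i < sentence.length(); i++) {
            boolean ws = Character.isWhitespace(sentence.charAt(i));
            if (!ws && start < 0) {
                start = i;                       // a token begins here
            } else if (ws && start >= 0) {
                spans.add(new int[] {start, i}); // the token ended just before i
                start = -1;
            }
        }
        if (start >= 0) {
            spans.add(new int[] {start, sentence.length()}); // last token runs to the end
        }
        return spans;
    }

    /** Builds the token array that the spans point at. */
    static String[] tokens(String sentence, List<int[]> spans) {
        String[] out = new String[spans.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = sentence.substring(spans.get(i)[0], spans.get(i)[1]);
        }
        return out;
    }

    public static void main(String[] args) {
        String sentence = "Rudolph Agnew , 55 years old";
        List<int[]> spans = tokenSpans(sentence);
        String[] toks = tokens(sentence, spans);
        for (int i = 0; i < toks.length; i++) {
            System.out.println(spans.get(i)[0] + ".." + spans.get(i)[1] + " " + toks[i]);
        }
    }
}
```

The token array produced this way has the shape that find expects, while the character offsets keep the link back to the original string.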

    In the end, to make everything more efficient we should probably think about moving to CharSequence and
    away from String.

    Hope that helps,
    Jörn

     
  • Martin
    2010-11-23

    Joernkottmann:

    What's the problem with this code?
      NameFinderME nameModel = new NameFinderME(new TokenNameFinderModel(modelIn));
      String sentence = new String("Rudolph Agnew , 55 years old and former chairman of Consolidated Gold Fields PLC , was named a director of this British industrial conglomerate .");
      String[] tokens = sentence.split(" ");
      Span[] neTokens = nameModel.find(tokens);

      String myTokens = new String();
      String[] tons = Span.spansToStrings(neTokens, myTokens); /* ERROR occurs here ! */

    Exception in thread "main" java.lang.IllegalArgumentException: The span 0..2 is outside the given text!
    at opennlp.tools.util.Span.getCoveredText(Span.java:178)
    at opennlp.tools.util.Span.spansToStrings(Span.java:262)
    at Test.main(Test.java:46)

    How can I print out the identified named entities?

     
  • James Kosin
    2010-11-23

    marlomin,

    Just remove the line 'String myTokens = new String();' and change the following line to 'String[] tons = Span.spansToStrings(neTokens, tokens);'. That should fix the problem.
    You are passing an empty string for spansToStrings() to extract the text from.

    James

     
  • Martin
    2010-11-23

    This seems confusing:
    Span[] neTokens = nameModel.find(tokens);

    The passed param "tokens" contains the sentence, and the neTokens returned by "find" contains "token spans for any identified names."

    Here,
    String[] tons = Span.spansToStrings(neTokens, tokens);

    1) Why do we need to pass the sentence array "tokens" again to get the NE tokens?
        Why can't we directly get the NE tokens from the NE token span array "neTokens"?

    2) Another question has to do with the ChunkerME class.
    The method "chunk(String[] toks, String[] tags)" only returns "chunk tags for the given sequence returning the result in an array." It only returns a tag array. How do I get the chunked text array?

    Thanks.

     
  • James Kosin
    2010-11-24

    Here,
    String[] tons = Span.spansToStrings(neTokens, tokens);

    1) Why do we need to pass the sentence array "tokens" again to get the NE tokens?
        Why can't we directly get the NE tokens from the NE token span array "neTokens"?

    Span is a class that contains only the start and end indexes into the array you originally passed in, so it is lightweight.
    It doesn't keep a copy of the original array for that purpose. You can get the strings with spansToStrings(), or you can read the spans yourself and use their indexes into the original tokens to build the resulting strings. It is just a matter of preference.
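
The idea of a span as a pair of indexes into the token array can be sketched in plain Java as follows. TokenSpan is a hypothetical stand-in, not OpenNLP's Span class. It assumes the end index is exclusive, which matches the half-open "[4538..4545)" notation in the error message quoted later in this thread. The exception message imitates the one in the stack trace above.

```java
import java.util.Arrays;

// Hypothetical stand-in for a span over a token array; not OpenNLP's Span class.
final class TokenSpan {
    final int start; // index of the first covered token
    final int end;   // index one past the last covered token (exclusive, by assumption)

    TokenSpan(int start, int end) {
        if (start < 0 || end < start) {
            throw new IllegalArgumentException("invalid span " + start + ".." + end);
        }
        this.start = start;
        this.end = end;
    }

    /** Looks the covered tokens up in the array the span was computed from. */
    String coveredText(String[] tokens) {
        if (end > tokens.length) {
            // The span only makes sense against the original array.
            throw new IllegalArgumentException(
                "The span " + start + ".." + end + " is outside the given text!");
        }
        return String.join(" ", Arrays.copyOfRange(tokens, start, end));
    }

    public static void main(String[] args) {
        String[] tokens = "Rudolph Agnew , 55 years old".split(" ");
        TokenSpan name = new TokenSpan(0, 2);
        System.out.println(name.coveredText(tokens)); // prints "Rudolph Agnew"
        try {
            name.coveredText(new String[0]); // wrong (empty) input, as in the failing code
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Because the span stores only two integers, the same token array has to be supplied again whenever the covered text is needed.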

    2) Another question has to do with the ChunkerME class.
    The method "chunk(String[] toks, String[] tags)" only returns "chunk tags for the given sequence returning the result in an array." It only returns a tag array. How do I get the chunked text array?

    Start by looking at the cmdline interface classes; they will show you more. Many of the main() functions in the classes are no longer well maintained and are being deprecated.
    Basically, it would be the same operation. The tags you get back should match up in number with the tokens you passed in, so you can print each token with its tag by iterating through both arrays.
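
A minimal sketch of that iteration in plain Java, grouping tokens into chunk strings. It assumes the chunk tags follow a "B-XX" / "I-XX" / "O" scheme (begin, inside, outside); check that against the tags your model actually returns. ChunkGrouper is a hypothetical helper, not part of OpenNLP, and the tag arrays in main are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not an OpenNLP class.
class ChunkGrouper {

    /** Groups tokens into bracketed chunks based on parallel B-/I-/O chunk tags. */
    static List<String> group(String[] tokens, String[] chunkTags) {
        if (tokens.length != chunkTags.length) {
            throw new IllegalArgumentException("tokens and tags must have the same length");
        }
        List<String> chunks = new ArrayList<>();
        StringBuilder current = null; // the chunk being built, if any
        for (int i = 0; i < tokens.length; i++) {
            String tag = chunkTags[i];
            if (tag.startsWith("I-") && current != null) {
                // Continuation of the open chunk (the chunk type is not re-checked here).
                current.append(' ').append(tokens[i]);
                continue;
            }
            if (current != null) {
                chunks.add(current.append(']').toString()); // close the open chunk
                current = null;
            }
            if (tag.startsWith("B-") || tag.startsWith("I-")) {
                current = new StringBuilder("[" + tag.substring(2) + " " + tokens[i]);
            } else {
                chunks.add(tokens[i]); // "O": token outside any chunk
            }
        }
        if (current != null) {
            chunks.add(current.append(']').toString());
        }
        return chunks;
    }

    public static void main(String[] args) {
        String[] tokens = {"He", "reckons", "the", "current", "account", "deficit", "."};
        String[] tags = {"B-NP", "B-VP", "B-NP", "I-NP", "I-NP", "I-NP", "O"};
        for (String chunk : group(tokens, tags)) {
            System.out.println(chunk);
        }
    }
}
```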

    James

     
  • Joern Kottmann
    2010-11-24

    Why can't we directly get the NE tokens from the NE token span array "neTokens"?

    There are people who need the spans, so just returning the name tokens wouldn't work for them,
    but returning the spans works for both cases.
    In the older versions the name finder just returned a string array containing tags which indicate the spans;
    now it's really easy to retrieve the tokens of the names.
    The chunker still works the old way, and will be updated to the new style in one of the next versions.

    Jörn

     
  • Purvi Desai
    2012-05-05

    I am getting "java.lang.IllegalArgumentException: The span [4538..4545) is outside the given text!" when I use the tokenize(text) method. Any idea why that might be happening? I am passing it a String called text, and I get this error only in some cases. Any help would be appreciated!