#31 unexpected behavior for parseParse(String)

Status: open
Owner: Thomas Morton
Labels: None
Priority: 5
Updated: 2009-09-17
Created: 2009-09-13
Private: No

Example:
public static void main(String[] args) {
    String parseString = "(INC (NP (NNP A.P.E.X)(NNP .)))";
    System.out.println(parseString);

    Parse differentParse = Parse.parseParse(parseString);
    StringBuffer b = new StringBuffer();
    differentParse.show(b);
    System.out.println(b.toString());
}

Output:
(INC (NP (NNP A.P.E.X)(NNP .)))
(TOP (INC (NP (NNP A.P.E.X) (NNP .))) )

1.) Converting the parse-string back into a Parse object adds another TOP node.
2.) Furthermore, a space is inserted between "A.P.E.X" and the final ".".
The second point in particular is very troublesome for me and took hours to debug :-(
It would be very kind of you to describe this behaviour in the API documentation.
Thank you very much,
Johannes Neubarth
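
To check that the two strings in the Output above differ only in layout and in the extra TOP node, and not in their tokens, the leaf tokens can be pulled out of each bracketed string and compared. The following is a minimal sketch, not part of OpenNLP: the class name LeafTokens and the method leafTokens are hypothetical, and the regular expression assumes that tokens never contain whitespace or raw brackets (for example, that brackets are escaped as -LRB- and -RRB-).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not an OpenNLP class.
class LeafTokens {

    // Matches a leaf node "(TAG token)": a tag and a token, neither of which
    // contains whitespace or brackets (assumption stated in the lead-in).
    private static final Pattern LEAF =
            Pattern.compile("\\(([^\\s()]+)\\s+([^\\s()]+)\\)");

    // Returns the tokens of all leaf nodes, in left-to-right order.
    static List<String> leafTokens(String bracketed) {
        List<String> tokens = new ArrayList<>();
        Matcher m = LEAF.matcher(bracketed);
        while (m.find()) {
            tokens.add(m.group(2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String before = "(INC (NP (NNP A.P.E.X)(NNP .)))";
        String after = "(TOP (INC (NP (NNP A.P.E.X) (NNP .))) )";

        System.out.println(leafTokens(before)); // [A.P.E.X, .]
        System.out.println(leafTokens(after));  // [A.P.E.X, .]
        System.out.println(leafTokens(before).equals(leafTokens(after))); // true
    }
}
```

Both strings yield the same two tokens, so the round trip changes the spacing and the wrapping node but not the token sequence.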

Discussion

  • Thomas Morton
    2009-09-17

    • assigned_to: nobody --> tsmorton
     
  • Thomas Morton
    2009-09-17

    Hi,
    For the first point, I could check for the INC tag and not add the TOP tag in that case. In general it needs to be added, as parses used for training need a TOP node.

    As far as the formatting of the output goes, I don't see this as a bug. All output parses have a space between tokens. I'm not sure where the expectation that the input format is maintained comes from. If one puts XML into a DOM and outputs it, it is not formatted in exactly the same way, as whitespace information is not maintained when parsing.

    Hope this helps...Tom
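
The XML analogy above can be tried out with the JDK's built-in XML APIs. The following is a minimal sketch; the class name DomRoundTrip and the sample element are made up for illustration. It parses a small document into a DOM and serializes it again. The text content survives, but layout details inside the tags (extra spaces, the choice of quote character) are not stored in the DOM, so the output is normally not byte-for-byte identical to the input.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Illustration of the analogy only; unrelated to OpenNLP itself.
class DomRoundTrip {

    // Parses the XML string into a DOM and serializes the DOM back to a string.
    static String roundTrip(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            // Leave out the <?xml ...?> header so input and output are comparable.
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("XML round trip failed", e);
        }
    }

    public static void main(String[] args) {
        // Well-formed XML with unusual spacing and single quotes inside the tags.
        String input = "<np  tag = 'NNP' >A.P.E.X</np >";
        String output = roundTrip(input);
        System.out.println(input);
        System.out.println(output);
        System.out.println("identical: " + input.equals(output));
    }
}
```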

     
  • First of all, thanks for your response :-)
    I attached a file, Test.java, that shows how output parses can end up without a space between tokens (my example really came out of the Tokenizer like this).

    In the meantime, I solved the problem by copying the file Parse.java and modifying the parseParse() function, so this problem is not urgent anymore.
    Background: I use the FrameNet corpus for Semantic Role Labeling, so I need to parse all its sentences. Since parsing takes a lot of time, I wanted to preparse the corpus and store the resulting parses (as a kind of serialization). The Parse.show() function seemed like a good way to create these preparses. Unfortunately, as I explained above, reading them back in with parseParse() always adds spaces between tokens.

    Anyway, I just wanted to put this information here, in case anyone ever encounters the same problem.

    Thanks for maintaining OpenNLP, Tom, it's a great piece of work :)
    Hannes

     
  • Hm, I cannot attach files... here's the code:

    package com.vionto.rnd.linguistic.srl.main;

    import java.io.IOException;
    import opennlp.tools.lang.english.TreebankParser;
    import opennlp.tools.parser.AbstractBottomUpParser;
    import opennlp.tools.parser.Parse;
    import opennlp.tools.parser.Parser;
    import opennlp.tools.util.Span;

    public class Test {

        public static final String PARSER_MODEL_DIR_NAME = "/path/to/models";

        public static void main(String[] args) throws IOException {
            Parser parser = TreebankParser.getParser(
                    PARSER_MODEL_DIR_NAME,
                    false,
                    false,
                    30,
                    AbstractBottomUpParser.defaultAdvancePercentage);

            // "A.P.E.X" and the following "." are adjacent tokens: no whitespace between them.
            String sentence = "A.P.E.X. slept .";
            Span[] tokens = new Span[]{new Span(0, 7), new Span(7, 8), new Span(9, 14), new Span(15, 16)};

            Parse parse = new Parse(sentence, new Span(0, sentence.length()), "INC", 1, null);
            for (int i = 0; i < tokens.length; i++) {
                parse.insert(new Parse(
                        sentence,
                        tokens[i],
                        AbstractBottomUpParser.TOK_NODE,
                        0,
                        i));
            }
            Parse[] parses = parser.parse(parse, 1);
            StringBuffer b = new StringBuffer();
            parses[0].show(b);
            System.out.println(b.toString());

            // Round trip: read the printed parse back in and print it again.
            Parse differentParse = Parse.parseParse(b.toString());
            b = new StringBuffer();
            differentParse.show(b);
            System.out.println(b.toString());
        }
    }

    gives the following output:
    (TOP (S (NP (NNP A.P.E.X)(NNP .)) (VP (VBD slept)) (. .)))
    (TOP (S (NP (NNP A.P.E.X) (NNP .)) (VP (VBD slept)) (. .)) )
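
The two output lines above differ only in whitespace. For anyone who wants to compare stored parse strings with re-read ones, a whitespace-insensitive comparison is one option. The following is a minimal sketch under stated assumptions: the class ParseStringCompare and the method normalize are hypothetical and not part of OpenNLP, and the approach assumes that tokens never contain raw brackets or whitespace. Note that it deliberately discards the spacing information, so it does not recover whether two tokens were adjacent in the original text.

```java
// Hypothetical helper for illustration only; not an OpenNLP class.
class ParseStringCompare {

    // Brings a bracketed parse string into one canonical layout:
    // single spaces, one space between sibling nodes, none before ")".
    static String normalize(String bracketed) {
        return bracketed.trim()
                .replaceAll("\\s+", " ")           // collapse runs of whitespace
                .replaceAll("\\)\\s*\\(", ") (")   // exactly one space between siblings
                .replaceAll("\\s+\\)", ")");       // no whitespace before a closing bracket
    }

    public static void main(String[] args) {
        String first = "(TOP (S (NP (NNP A.P.E.X)(NNP .)) (VP (VBD slept)) (. .)))";
        String second = "(TOP (S (NP (NNP A.P.E.X) (NNP .)) (VP (VBD slept)) (. .)) )";

        System.out.println(first.equals(second));                       // false
        System.out.println(normalize(first).equals(normalize(second))); // true
    }
}
```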