Example:

    public static void main(String[] args) {
        String parseString = "(INC (NP (NNP A.P.E.X)(NNP .)))";
        System.out.println(parseString);
        // Read the bracketed string back into a Parse object ...
        Parse differentParse = Parse.parseParse(parseString);
        // ... and print it again for comparison with the input.
        StringBuffer b = new StringBuffer();
        differentParse.show(b);
        System.out.println(b.toString());
    }

Output:
(INC (NP (NNP A.P.E.X)(NNP .)))
(TOP (INC (NP (NNP A.P.E.X) (NNP .))) )
1.) Converting the parse-string back into a Parse object adds another TOP node.
2.) Furthermore, a space is inserted between "A.P.E.X" and the final ".".
The second point in particular was very troublesome for me and took hours to debug :-(
It would be very kind of you to describe this behaviour in the API documentation.
Thank you very much,
Johannes Neubarth
Hi,
For the first point, I could check for the INC tag and not add the TOP tag in that case. In general it needs to be added, as parses used for training need a TOP node.
As far as the formatting of the output goes, I don't see this as a bug. All output parses have a space between tokens. I'm not sure where the expectation that the input format is maintained comes from. If one puts XML into a DOM and outputs it, it is not formatted exactly the same way, as whitespace information is not maintained when parsing.
Hope this helps...Tom
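For anyone who only needs to check that a parse survives the round trip, the two differences discussed above (the extra TOP node and the changed whitespace) can be ignored when comparing. The following is a minimal sketch, not part of OpenNLP: ParseStrings and its methods are hypothetical names, and it assumes each string holds exactly one well-formed bracketed tree whose tokens contain no literal parentheses.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not an OpenNLP class): compares bracketed parse strings. */
final class ParseStrings {

    /** Splits a parse string into "(", ")" and label/word tokens; whitespace is dropped. */
    static List<String> tokenize(String parse) {
        List<String> out = new ArrayList<String>();
        StringBuilder word = new StringBuilder();
        for (char c : parse.toCharArray()) {
            boolean space = Character.isWhitespace(c);
            if (c == '(' || c == ')' || space) {
                // A bracket or whitespace ends the current label or word.
                if (word.length() > 0) {
                    out.add(word.toString());
                    word.setLength(0);
                }
                if (!space) {
                    out.add(String.valueOf(c));
                }
            } else {
                word.append(c);
            }
        }
        if (word.length() > 0) {
            out.add(word.toString());
        }
        return out;
    }

    /** Removes one enclosing (TOP ...) node if present (assumes a single tree). */
    static List<String> stripTop(List<String> t) {
        boolean wrapped = t.size() >= 4
                && t.get(0).equals("(")
                && t.get(1).equals("TOP")
                && t.get(t.size() - 1).equals(")");
        return wrapped ? t.subList(2, t.size() - 1) : t;
    }

    /** True if both strings describe the same tree, ignoring whitespace and an outer TOP. */
    static boolean sameTree(String a, String b) {
        return stripTop(tokenize(a)).equals(stripTop(tokenize(b)));
    }

    public static void main(String[] args) {
        // Input and output strings taken from the example in this thread.
        String in = "(INC (NP (NNP A.P.E.X)(NNP .)))";
        String out = "(TOP (INC (NP (NNP A.P.E.X) (NNP .))) )";
        System.out.println(sameTree(in, out)); // prints true
    }
}
```

Note that this only compares tree structure. It cannot tell whether two tokens were adjacent in the original text, which is exactly the information that is lost in the bracketed format.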
First of all, thanks for your response :-)
I attached a file Test.java that shows how it can happen that output parses do not have a space between tokens (my example really came out of the Tokenizer like this).
In the meantime, I solved the problem by copying the file Parse.java and modifying the parseParse() function, so this problem is not urgent anymore.
Background: I use the FrameNet corpus for Semantic Role Labeling, so I need to parse all its sentences. Since parsing takes a lot of time, I wanted to preparse the corpus and store the resulting parses (as some sort of serialization). The Parse.show() function seemed a good way to create these preparses. Unfortunately, as I explained above, reading them back in with parseParse() always adds spaces between tokens.
Anyway, I just wanted to put this information here, in case anyone ever encounters the same problem.
Thanks for maintaining OpenNLP, Tom, it's a great piece of work :)
Hannes
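For the preparsing use case described above, one alternative to modifying parseParse() would be to store the token offsets next to each serialized parse, so that token adjacency can be recovered no matter how the bracketed string is spaced. This is a different approach from the one Hannes took, and only a sketch: TokenOffsets and the "start-end" line format are made up for illustration and are not part of OpenNLP. It assumes at least one token per sentence.

```java
/** Hypothetical helper (not an OpenNLP class): stores token offsets as text. */
final class TokenOffsets {

    /** Encodes spans as "start-end" pairs separated by single spaces. */
    static String encode(int[][] spans) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < spans.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(spans[i][0]).append('-').append(spans[i][1]);
        }
        return sb.toString();
    }

    /** Reverses encode(); expects at least one "start-end" pair. */
    static int[][] decode(String line) {
        String[] parts = line.trim().split("\\s+");
        int[][] spans = new int[parts.length][2];
        for (int i = 0; i < parts.length; i++) {
            String[] startEnd = parts[i].split("-");
            spans[i][0] = Integer.parseInt(startEnd[0]);
            spans[i][1] = Integer.parseInt(startEnd[1]);
        }
        return spans;
    }

    /** True if token i+1 starts exactly where token i ends (no whitespace between them). */
    static boolean adjacent(int[][] spans, int i) {
        return spans[i][1] == spans[i + 1][0];
    }

    public static void main(String[] args) {
        // Sentence and offsets taken from Test.java in this thread.
        String sentence = "A.P.E.X. slept .";
        int[][] spans = {{0, 7}, {7, 8}, {9, 14}, {15, 16}};

        String stored = encode(spans);
        System.out.println(stored); // prints 0-7 7-8 9-14 15-16

        int[][] restored = decode(stored);
        for (int i = 0; i < restored.length; i++) {
            String token = sentence.substring(restored[i][0], restored[i][1]);
            boolean glued = i + 1 < restored.length && adjacent(restored, i);
            System.out.println(token + (glued ? " (no space after)" : ""));
        }
    }
}
```

With the offsets stored this way, "A.P.E.X" and the following "." are still known to be adjacent (7 == 7) even after parseParse() and show() have put a space between them.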
Hm, I cannot attach files... here's the code:

    package com.vionto.rnd.linguistic.srl.main;

    import java.io.IOException;

    import opennlp.tools.lang.english.TreebankParser;
    import opennlp.tools.parser.AbstractBottomUpParser;
    import opennlp.tools.parser.Parse;
    import opennlp.tools.parser.Parser;
    import opennlp.tools.util.Span;

    public class Test {

        public static final String PARSER_MODEL_DIR_NAME = "/path/to/models";

        public static void main(String[] args) throws IOException {
            Parser parser = TreebankParser.getParser(
                    PARSER_MODEL_DIR_NAME,
                    false,
                    false,
                    30,
                    AbstractBottomUpParser.defaultAdvancePercentage);

            // "A.P.E.X" and the following "." are separate tokens with no space between them.
            String sentence = "A.P.E.X. slept .";
            Span[] tokens = new Span[]{new Span(0, 7), new Span(7, 8), new Span(9, 14), new Span(15, 16)};

            Parse parse = new Parse(sentence, new Span(0, sentence.length()), "INC", 1, null);
            for (int i = 0; i < tokens.length; i++) {
                parse.insert(new Parse(
                        sentence,
                        tokens[i],
                        AbstractBottomUpParser.TOK_NODE,
                        0,
                        i));
            }

            // First line of output: the parse as produced by the parser.
            Parse[] parses = parser.parse(parse, 1);
            StringBuffer b = new StringBuffer();
            parses[0].show(b);
            System.out.println(b.toString());

            // Second line of output: the same parse after a parseParse() round trip.
            Parse differentParse = Parse.parseParse(b.toString());
            b = new StringBuffer();
            differentParse.show(b);
            System.out.println(b.toString());
        }
    }

gives the following output:
(TOP (S (NP (NNP A.P.E.X)(NNP .)) (VP (VBD slept)) (. .)))
(TOP (S (NP (NNP A.P.E.X) (NNP .)) (VP (VBD slept)) (. .)) )