I am looking for a tool that performs sentence boundary detection, part-of-speech tagging and phrase chunking. My problem is that most of these tools return annotated text. What I need, however, is the absolute positions in the text of the sentence boundaries and of the chunks. For example, consider the following sentences:
"This is a sentence. And here is another one."
I would need the information that the 19th and the 44th characters in the text are sentence boundaries. For the chunks, the start position and the length of each chunk would be ideal.
I have checked OpenNLP, GATE, LingPipe and MontyLingua, but did not find any information about such output.
Is anyone aware of such a tool?
The OpenNLP sentence detector is able to return positions. Just have a look at the sentPosDetect method. It takes a string and returns an array of Spans (in the 1.5 code; in 1.4 it returns an array of boundary indexes). The Spans hold the start and end offsets of the detected sentences.
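To make the idea of offset spans concrete, here is a minimal Python sketch. It is not OpenNLP code and does not use a trained model: the function name sentence_spans and the punctuation rule are assumptions made purely for illustration. It only shows what start/end offsets for the example text look like, with the end offset exclusive.

```python
import re

def sentence_spans(text):
    """Return (start, end) character offsets per sentence, end exclusive.

    Naive rule (an assumption for this sketch, not a trained detector):
    a sentence runs from a non-space character up to '.', '!' or '?'
    that is followed by whitespace or the end of the text.
    """
    pattern = re.compile(r"\S.*?(?:[.!?]+(?=\s|$)|$)", flags=re.DOTALL)
    return [(m.start(), m.end()) for m in pattern.finditer(text)]

text = "This is a sentence. And here is another one."
for start, end in sentence_spans(text):
    # text[start:end] recovers the sentence from the original string
    print(start, end, repr(text[start:end]))
```

With 0-based offsets and an exclusive end, the end values for the example text line up with the 19th and 44th characters mentioned in the question.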
The Chunker cannot output Spans right now, but I am in favor of changing that. If you want to speed this up, just send us a patch.
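Until then, chunk positions can be derived outside the chunker. The Python sketch below assumes the chunker emits one B-/I-/O style tag per token (for example B-NP, I-NP, O) and that the character offsets of the tokens are known. The helper chunk_spans, the simple regex tokenizer and the tags in the example are all hypothetical; none of this is an existing OpenNLP API.

```python
import re

def chunk_spans(token_spans, tags):
    """Merge per-token (start, end) offsets into (start, length, label) chunks."""
    chunks = []
    current = None  # [start, end, label] of the chunk being built
    for (start, end), tag in zip(token_spans, tags):
        if tag.startswith("I-") and current is not None and current[2] == tag[2:]:
            current[1] = end  # token continues the open chunk
            continue
        if current is not None:
            # any other tag closes the open chunk
            chunks.append((current[0], current[1] - current[0], current[2]))
            current = None
        if tag != "O":
            current = [start, end, tag[2:]]  # start a new chunk
    if current is not None:
        chunks.append((current[0], current[1] - current[0], current[2]))
    return chunks

text = "This is a sentence."
# Simple tokenizer for the sketch: words and single punctuation marks
token_spans = [(m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]
tags = ["B-NP", "B-VP", "B-NP", "I-NP", "O"]  # hypothetical chunker output
for start, length, label in chunk_spans(token_spans, tags):
    print(start, length, label, repr(text[start:start + length]))
```

Each result is a (start, length, label) triple, which matches the "position and length" form asked for in the question.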
Do you have your own training data? Or do you want to use our pre-trained models?
Hope this helps,