I am new to NLP and I am looking for a starting point, in terms of some tutorials, documentation or example code.
I have been told to research the possibilities of processing natural text to extract some structured data from it.
For example, I want to extract (annotate) height and weight from the following statements:
"He is 6 feet tall and weighs 200 pounds" or
"His height is 6 feet and weight is 200" etc.
I have looked into UIMA, but it seems to be a self-created regex dictionary with no training capabilities.
So, in a nutshell, what Java framework can I use to create an annotation engine that can also be trained?
Any help (pointers) on this will be greatly appreciated.
UIMA is just a framework for building the kind of application you are investigating. It does not
offer any analysis capabilities itself, but instead helps you put together an analysis application
out of pre-built NLP engines.
OpenNLP can help you to perform tokenization and sentence detection. In order
to extract your structured data, you first need to find the "pieces" you want to extract;
this is where the name finder can help. Once trained on annotated examples, it can detect the height and weight mentions.
Afterwards you would need an additional step to extract the detected numbers
and convert them to a machine-understandable format, e.g. a double for the height.
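That last conversion step could be sketched roughly like this in plain Java. This is only a minimal sketch and not part of OpenNLP or UIMA: the class name MeasurementParser, the Measurement type and the regular expression are all made up for illustration, and it assumes the name finder has already handed you a text span such as "6 feet" or "200".

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not an OpenNLP or UIMA class): turns a detected
// text span into a numeric value plus the unit word, if one is present.
class MeasurementParser {

    // A number (integer or decimal), optionally followed by a unit word.
    private static final Pattern NUMBER_WITH_UNIT =
            Pattern.compile("(\\d+(?:\\.\\d+)?)\\s*([A-Za-z]+)?");

    /** Parsed result: the value as a double and the unit as written ("" if absent). */
    static final class Measurement {
        final double value;
        final String unit;

        Measurement(double value, String unit) {
            this.value = value;
            this.unit = unit;
        }
    }

    /** Returns the first number found in the span, or empty if there is none. */
    static Optional<Measurement> parse(String span) {
        Matcher m = NUMBER_WITH_UNIT.matcher(span);
        if (!m.find()) {
            return Optional.empty();
        }
        double value = Double.parseDouble(m.group(1));
        String unit = m.group(2) == null ? "" : m.group(2).toLowerCase(Locale.ROOT);
        return Optional.of(new Measurement(value, unit));
    }

    public static void main(String[] args) {
        // Spans as they might come back from a name finder (assumed input).
        for (String span : new String[] {"6 feet", "200 pounds", "200"}) {
            parse(span).ifPresent(m ->
                    System.out.println(span + " -> value=" + m.value + ", unit='" + m.unit + "'"));
        }
    }
}
```

Note that a span like "200" comes back with an empty unit, so deciding whether it means pounds or kilograms is left to your application.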
OpenNLP also offers a UIMA integration, which can be used to solve the
tokenization, sentence detection and named entity extraction tasks you might need.
Hope that helps.