Hello,
I am new to NLP and I am looking for a starting point, in terms of some tutorials, documentation or example code.
I have been told to research the possibilities of processing natural text to extract some structured data from it.
For example, I want to extract (annotate) height and weight from the following statements:
"He is 6 feet tall and weighs 200 pounds" or
"His height is 6 feet and weight is 200" etc.
I have looked into UIMA, but it seems to be a self-created regex dictionary with no training capabilities.
So in a nutshell, what Java framework can I use to create an annotation engine that can also be trained?
Any help (pointers) on this will be greatly appreciated.
Thanks
UIMA is just a framework for building the kind of application you are investigating. It does not
offer any analysis capabilities itself, but instead helps you to put together an analysis application
out of pre-built NLP engines.
OpenNLP can help you to perform tokenization and sentence detection. In order
to extract your structured data, you first need to find the "pieces" you want to extract;
this is where the name finder could help you. Once trained on annotated examples, it can
detect the height and weight mentions.
Afterwards you would need an additional step to extract the detected numbers
and convert them to a machine-understandable format, e.g. a double for the height
in cm.
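
The conversion step could look roughly like the sketch below. It is plain Java and does not use any OpenNLP API; it assumes the name finder has already returned a text span such as "6 feet" or "200 pounds". The class name, the method names and the list of unit spellings are made up for illustration.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: turns a detected span like "6 feet" into a normalized double.
class MeasurementNormalizer {

    // Centimetres per unit. The unit spellings listed here are an assumption.
    private static final Map<String, Double> CM_PER_UNIT = Map.of(
            "feet", 30.48, "foot", 30.48, "ft", 30.48,
            "inches", 2.54, "inch", 2.54, "in", 2.54,
            "cm", 1.0, "m", 100.0);

    // Kilograms per unit. Again, the spellings are only an assumption.
    private static final Map<String, Double> KG_PER_UNIT = Map.of(
            "pounds", 0.45359237, "pound", 0.45359237,
            "lbs", 0.45359237, "lb", 0.45359237,
            "kg", 1.0);

    public static double toCentimetres(String span) {
        return convert(span, CM_PER_UNIT);
    }

    public static double toKilograms(String span) {
        return convert(span, KG_PER_UNIT);
    }

    // Expects exactly "<number> <unit>"; anything else is rejected.
    private static double convert(String span, Map<String, Double> factors) {
        String[] tokens = span.trim().toLowerCase(Locale.ROOT).split("\\s+");
        if (tokens.length != 2) {
            throw new IllegalArgumentException("Expected '<number> <unit>': " + span);
        }
        double value = Double.parseDouble(tokens[0]);
        Double factor = factors.get(tokens[1]);
        if (factor == null) {
            throw new IllegalArgumentException("Unknown unit: " + tokens[1]);
        }
        return value * factor;
    }

    public static void main(String[] args) {
        System.out.println("height in cm: " + toCentimetres("6 feet"));
        System.out.println("weight in kg: " + toKilograms("200 pounds"));
    }
}
```

Note that a span without a unit, like the "200" in your second example, is rejected by this sketch. You would have to decide yourself which default unit to assume in that case.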
OpenNLP also offers a UIMA integration, which can be used to solve the
tokenization, sentence detection and named entity extraction tasks you might need
to solve.
Hope that helps,
Jörn