The OpenNLP Grok Library / Discussion / Help: Extracting info from ill-formed data.

Anonymous - 2002-04-22

Can Grok be useful as a tool to read through technical records describing malfunctions and extract structured information in the form “Fault symptoms - Fix” into XML or database tables?
The records are very short, ill-formed, and full of abbreviations, specific terms and codes pertaining to the subject domain.
My guess - I need to start with a partial, rule-based parser and a subject domain lexicon.
Good advice would be very much appreciated.

Thanks
Alex.

- Jason Baldridge - 2002-04-23
  
  I think what you need here is probably not a full-scale parser, but instead lots of shallow linguistic processors, such as those found in the opennlp.grok.preprocess package. You could use the framework that we have implemented, adding in your own implementations of the opennlp.common.preprocess.Pipelink interface and mixing and matching them with the ones we already have implemented.
  
  Alternatively, you could check out Gate (http://gate.ac.uk), which might be well suited to your task. We may also be moving our preprocessing architecture to something much more like Gate's in the near future.
  
  Yet another alternative... if your domain is really *that* specific and the variation might not be that much, why not whip up a Perl script that munges these records into a different format? Without looking at your actual data I can't say for sure, but it seems that your task might be a likely candidate for the power of Perl.
  
- Anonymous - 2002-04-23
  
  Thanks for the prompt reply.
  I'm going to try your recommendations one after another and start experimenting with Grok and Gate.
  BTW here is an example of the records:
  ". B.R.D. FOUND U/S ON FUNCTIONAL CHECK INTERNAL FAILURE"
  ". SERVICABLE B.R.D. INSTALLED. FUNCTIONAL CHECK CARRIED OUT A.O.K."
  And here is the required output:
  
  <problem>
    <unit>BRD</unit>
    <procedure>Functional check</procedure>
    <symptoms>
      <s1>Internal failure</s1>
    </symptoms>
  </problem>
  <fix>Serviceable BRD installed</fix>
  
  Sorry for the possibly messy text formatting.
  I haven't figured out yet how to deal with it.
  
  Thank you.
  Alex.
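
To make the "shallow processors" idea from earlier in the thread concrete for records like these, here is a minimal sketch in Python. It is not Grok's API and does not implement the Pipelink interface; every function name is hypothetical, and the lexicon entries (for example, reading U/S as "unserviceable" and A.O.K. as "OK") are assumptions about the domain, not facts taken from the records.

```python
import re
from typing import Callable, Sequence

# A stage is any function from text to text (hypothetical design,
# loosely inspired by the idea of chaining small preprocessors).
Stage = Callable[[str], str]

# Assumed domain lexicon; the expansions are guesses for illustration.
LEXICON = {
    "U/S": "UNSERVICEABLE",
    "AOK": "OK",
    "SERVICABLE": "SERVICEABLE",  # spelling normalization
}


def strip_leading_marker(text: str) -> str:
    """Drop the '. ' marker that starts each sample record."""
    return text.strip().lstrip(". ")


def collapse_dotted_abbreviations(text: str) -> str:
    """Turn dotted abbreviations such as 'B.R.D.' into 'BRD'."""
    return re.sub(
        r"\b(?:[A-Z]\.){2,}",
        lambda m: m.group(0).replace(".", ""),
        text,
    )


def expand_lexicon(text: str) -> str:
    """Replace whole tokens that appear in the assumed lexicon."""
    return " ".join(LEXICON.get(tok, tok) for tok in text.split())


DEFAULT_STAGES: Sequence[Stage] = (
    strip_leading_marker,
    collapse_dotted_abbreviations,
    expand_lexicon,
)


def run_pipeline(text: str, stages: Sequence[Stage] = DEFAULT_STAGES) -> str:
    """Apply each stage in order, feeding the output of one into the next."""
    for stage in stages:
        text = stage(text)
    return text


if __name__ == "__main__":
    print(run_pipeline(". B.R.D. FOUND U/S ON FUNCTIONAL CHECK INTERNAL FAILURE"))
```

Because each stage is an independent function, stages can be reordered, dropped, or replaced without touching the others, which is the "mixing and matching" property the earlier reply describes.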
  
- Jason Baldridge - 2002-04-24
  
  This really does look like a job for Perl, if you ask me.
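
For illustration, the kind of record-munging script suggested here might look like the sketch below. It is written in Python rather than Perl, and it rests on assumptions drawn only from the two sample records: that a problem record has the shape "UNIT FOUND U/S ON ... CHECK SYMPTOM", that a fix record starts with "... INSTALLED.", and that BRD is the only abbreviation to keep in capitals. The patterns and function names are hypothetical and would need extending for real data.

```python
import re
from xml.sax.saxutils import escape

# Assumed record shapes, inferred from the two sample records only.
PROBLEM_RE = re.compile(
    r"^(?P<unit>.+?) FOUND U/S ON (?P<procedure>.+? CHECK) (?P<symptom>.+)$"
)
FIX_RE = re.compile(r"^(?P<fix>.+? INSTALLED)\.")

# Abbreviations to keep in capitals (assumed domain knowledge).
ABBREVIATIONS = {"BRD"}


def normalize(record):
    """Strip the leading '. ', collapse 'B.R.D.' to 'BRD', fix one known typo."""
    text = record.strip().lstrip(". ")
    text = re.sub(
        r"\b(?:[A-Z]\.){2,}",
        lambda m: m.group(0).replace(".", ""),
        text,
    )
    return text.replace("SERVICABLE", "SERVICEABLE")


def sentence_case(text):
    """Lowercase everything except known abbreviations, then capitalize."""
    words = [w if w in ABBREVIATIONS else w.lower() for w in text.split()]
    sentence = " ".join(words)
    return sentence[:1].upper() + sentence[1:]


def to_xml(problem_record, fix_record):
    """Build the problem/fix XML fragment from one pair of records."""
    problem = PROBLEM_RE.match(normalize(problem_record))
    fix = FIX_RE.match(normalize(fix_record))
    if problem is None or fix is None:
        raise ValueError("record does not match the assumed patterns")
    return "\n".join([
        "<problem>",
        f"  <unit>{escape(problem['unit'])}</unit>",
        f"  <procedure>{escape(sentence_case(problem['procedure']))}</procedure>",
        "  <symptoms>",
        f"    <s1>{escape(sentence_case(problem['symptom']))}</s1>",
        "  </symptoms>",
        "</problem>",
        f"<fix>{escape(sentence_case(fix['fix']))}</fix>",
    ])


if __name__ == "__main__":
    print(to_xml(
        ". B.R.D. FOUND U/S ON FUNCTIONAL CHECK INTERNAL FAILURE",
        ". SERVICABLE B.R.D. INSTALLED. FUNCTIONAL CHECK CARRIED OUT A.O.K.",
    ))
```

Records that do not fit the assumed patterns raise a ValueError instead of producing partial output, so unmatched cases can be collected and used to extend the patterns.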
  
