Extracting info from ill-formed data.

Help
Anonymous
2002-04-22
2002-04-24
  • Anonymous

    Anonymous - 2002-04-22

    Can Grok be used as a tool to read through technical records describing malfunctions and extract structured information in the form “Fault symptoms - Fix” into XML or database tables?
    The records are very short, ill-formed, and full of abbreviations, domain-specific terms, and codes.
    My guess is that I need to start with a partial, rule-based parser and a subject-domain lexicon.
    Any advice would be very much appreciated.

    Thanks
    Alex.

     
    • Jason Baldridge

      Jason Baldridge - 2002-04-23

      I think what you need here is probably not a full-scale parser, but instead lots of shallow linguistic processors, such as those found in the opennlp.grok.preprocess package.  You could use the framework that we have implemented, adding in your own implementations of the opennlp.common.preprocess.Pipelink interface and mixing and matching them with the ones we already have implemented.
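
      The mix-and-match idea can be sketched independently of Grok. The Python below is a minimal illustration, not the opennlp API: the stage functions, the `run_pipeline` helper, and the tiny `LEXICON` are all hypothetical names, and the expansions in the lexicon (for example reading "U/S" as "unserviceable") are assumptions about the domain.

```python
import re
from typing import Callable, List

# A stage takes a list of tokens and returns a new list of tokens.
Stage = Callable[[List[str]], List[str]]

# Hypothetical domain lexicon; the expansions are assumptions, not taken from real data.
LEXICON = {"U/S": "UNSERVICEABLE", "AOK": "OK", "SERVICABLE": "SERVICEABLE"}


def strip_punctuation(tokens: List[str]) -> List[str]:
    """Remove dots, commas, semicolons and colons; drop tokens left empty."""
    cleaned = [re.sub(r"[.,;:]", "", t) for t in tokens]
    return [t for t in cleaned if t]


def expand_abbreviations(tokens: List[str]) -> List[str]:
    """Replace tokens found in the lexicon; leave all others unchanged."""
    return [LEXICON.get(t, t) for t in tokens]


def run_pipeline(text: str, stages: List[Stage]) -> List[str]:
    """Split on whitespace, then apply each stage in order."""
    tokens = text.split()
    for stage in stages:
        tokens = stage(tokens)
    return tokens
```

      Each stage is an ordinary function, so stages can be reordered, dropped, or replaced without touching the others, for example `run_pipeline(record, [strip_punctuation, expand_abbreviations])`.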

      Alternatively, you could check out Gate (http://gate.ac.uk), which might be well suited to your task too. We may be moving our preprocessing architecture to something much more like Gate's in the near future.

      Yet another alternative... if your domain is really *that* specific and there isn't much variation in the records, why not whip up a Perl script that munges them into a different format?  Without looking at your actual data I can't say for sure, but your task seems a likely candidate for the power of Perl.

       
    • Anonymous

      Anonymous - 2002-04-23

      Thanks for the prompt reply.
      I'm going to try your recommendations one after another and start experimenting with Grok and Gate.
      BTW, here is an example of the records:
      ". B.R.D. FOUND U/S ON FUNCTIONAL CHECK INTERNAL FAILURE"
      ". SERVICABLE B.R.D. INSTALLED. FUNCTIONAL CHECK CARRIED OUT A.O.K."
      And here is the required output:

      <problem>
        <unit>BRD</unit>
        <procedure>Functional check</procedure>
        <symptoms>
          <s1>Internal failure</s1>
        </symptoms>
      </problem>
      <fix>Serviceable BRD installed</fix>

      Sorry for any mess in the text formatting.
      I haven't figured out yet how to deal with it.

      Thank you.
      Alex.

       
    • Jason Baldridge

      Jason Baldridge - 2002-04-24

      This really does look like a job for Perl, if you ask me.
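
      A rough sketch of such a script, written here in Python rather than Perl, might look like the following. It assumes that records arrive as a fault line followed by a fix line, and its two regular expressions are fitted only to the sample pair posted above; `LEXICON`, `normalize`, `sentence_case`, and `convert` are hypothetical names, and real data would almost certainly need more patterns.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical spelling fixes; a real lexicon would be built from the data.
LEXICON = {"SERVICABLE": "SERVICEABLE"}

# Patterns fitted to the two sample records only (an assumption about the format).
FAULT_RE = re.compile(
    r"^(?P<unit>\S+) FOUND U/S ON (?P<procedure>.+? CHECK) (?P<symptom>.+)$"
)
FIX_RE = re.compile(r"^(?P<fix>.+? INSTALLED)\b")


def normalize(record):
    """Drop the leading '. ', collapse dotted abbreviations, fix known misspellings."""
    text = record.strip().lstrip(". ")
    # 'B.R.D.' -> 'BRD'; a single trailing full stop ('INSTALLED.') is left alone.
    text = re.sub(r"\b(?:[A-Z]\.){2,}", lambda m: m.group(0).replace(".", ""), text)
    return " ".join(LEXICON.get(w, w) for w in text.split())


def sentence_case(text, keep=()):
    """Lower-case every word except those in `keep`, then capitalize the first word."""
    words = [w if w in keep else w.lower() for w in text.split()]
    if words and words[0] not in keep:
        words[0] = words[0].capitalize()
    return " ".join(words)


def convert(fault_record, fix_record):
    """Turn one fault/fix pair into a <problem> element and a <fix> element."""
    fault = FAULT_RE.match(normalize(fault_record))
    fix = FIX_RE.match(normalize(fix_record))
    if fault is None or fix is None:
        raise ValueError("record does not match the assumed patterns")

    unit = fault["unit"]
    problem = ET.Element("problem")
    ET.SubElement(problem, "unit").text = unit
    ET.SubElement(problem, "procedure").text = sentence_case(fault["procedure"])
    symptoms = ET.SubElement(problem, "symptoms")
    ET.SubElement(symptoms, "s1").text = sentence_case(fault["symptom"])

    fix_el = ET.Element("fix")
    # Keep the unit code upper-case inside the fix description.
    fix_el.text = sentence_case(fix["fix"], keep={unit})
    return problem, fix_el
```

      Calling `convert` on the two sample records and passing each returned element to `ET.tostring(element, encoding="unicode")` yields the `<problem>` and `<fix>` fragments without indentation. Records that fit neither pattern raise `ValueError`, which makes it easy to collect the lines that still need a rule.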

       
