
Can OpenNLP parse Chinese sentences?

ldd
2008-02-29
2013-04-16
  • ldd

    ldd - 2008-02-29

    I'd like to know whether the OpenNLP parser can parse Chinese sentences, and how to create the model files for Chinese parsing. I already have the Chinese Treebank. Any suggestions would be appreciated. Thanks.

     
    • Thomas Morton

      Thomas Morton - 2008-02-29

      Hi,
         It doesn't do this out of the box, but you could make it support Chinese parsing without much work.  I helped a group do this a couple of years back with the code base that eventually became the current parser, so I think I remember all the steps.  Here's what you need to do:

      Get the POS tagger working:
      The POS tagger is integrated into the parser, so you need it to work.
      It already has support for specifying the encoding of your text, so this should be pretty straightforward.
      You might also consider changing some of the features used to improve performance.
      Specifically, the prefix and suffix features don't make as much sense in Chinese.
      You might want to add some features which split the character into its radical and the other piece (I forget the name of the other piece).
      Make a tag dictionary from your training corpus to improve speed. (See POSDictionaryWriter.main())
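
A minimal sketch of what building a tag dictionary from the training corpus involves. It does not use POSDictionaryWriter or any other OpenNLP class. The class name `TagDictionarySketch` is hypothetical, and the input is assumed to be one sentence per line in `word_TAG` form, separated by whitespace.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not part of OpenNLP: collects every tag seen for each word.
final class TagDictionarySketch {

    // Assumes each line looks like "word_TAG word_TAG ..." (one sentence per line).
    static Map<String, Set<String>> build(Iterable<String> lines) {
        Map<String, Set<String>> dict = new TreeMap<>();
        for (String line : lines) {
            for (String token : line.trim().split("\\s+")) {
                // Split on the last underscore so words containing '_' still work.
                int sep = token.lastIndexOf('_');
                if (sep <= 0 || sep == token.length() - 1) {
                    continue; // skip empty or malformed tokens
                }
                String word = token.substring(0, sep);
                String tag = token.substring(sep + 1);
                dict.computeIfAbsent(word, k -> new TreeSet<>()).add(tag);
            }
        }
        return dict;
    }
}
```

To feed it a corpus in a specific encoding, the lines can be read with `java.nio.file.Files.readAllLines(path, java.nio.charset.Charset.forName("GB18030"))`, replacing the charset name with whatever your treebank actually uses.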

      Get the parser working:
      Add encoding support to the training and testing routines for the parser (see the POS tagger for how to do this; it's easy).
      Prep your data. One parse per line.  I also remove unary productions of the form X->X.  These are mostly of the form NP->NP.  Without this the parser can get stuck building these useless productions.  There may even be code to prevent these from being generated at all.
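
The X->X cleanup described above can be sketched as a small standalone preprocessing step. This is an illustration only, not code from OpenNLP: the class name `UnaryCollapser` is hypothetical, the input is assumed to be a well-formed Penn-style bracketed parse on one line, and labels are compared exactly (so NP-SBJ over NP is left alone).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical preprocessing helper: removes unary productions of the form X->X.
final class UnaryCollapser {

    static final class Node {
        final String label;                          // constituent or POS label; null for a word
        final String word;                           // token text for a leaf; null otherwise
        final List<Node> children = new ArrayList<>();
        Node(String label, String word) { this.label = label; this.word = word; }
    }

    // Rewrites one bracketed parse, e.g. "(NP (NP (NN x)))" becomes "(NP (NN x))".
    static String collapse(String parse) {
        String[] tokens = parse.replace("(", " ( ").replace(")", " ) ").trim().split("\\s+");
        Node root = simplify(read(tokens, new int[] {0}));
        StringBuilder out = new StringBuilder();
        write(root, out);
        return out.toString();
    }

    // Recursive-descent reader; pos[0] is the index of the next unread token.
    private static Node read(String[] tokens, int[] pos) {
        if (!tokens[pos[0]].equals("(")) {
            return new Node(null, tokens[pos[0]++]); // a plain word
        }
        pos[0]++;                                    // consume "("
        Node node = new Node(tokens[pos[0]++], null);
        while (!tokens[pos[0]].equals(")")) {
            node.children.add(read(tokens, pos));
        }
        pos[0]++;                                    // consume ")"
        return node;
    }

    // Bottom-up: replace a node by its only child while both carry the same label.
    private static Node simplify(Node node) {
        for (int i = 0; i < node.children.size(); i++) {
            node.children.set(i, simplify(node.children.get(i)));
        }
        while (node.children.size() == 1
                && node.label != null
                && node.label.equals(node.children.get(0).label)) {
            node = node.children.get(0);
        }
        return node;
    }

    private static void write(Node node, StringBuilder out) {
        if (node.word != null) {
            out.append(node.word);
            return;
        }
        out.append('(').append(node.label);
        for (Node child : node.children) {
            out.append(' ');
            write(child, out);
        }
        out.append(')');
    }
}
```

Running `UnaryCollapser.collapse(line)` over each line of the treebank file before training keeps the one-parse-per-line layout intact. Malformed bracketing is not handled here and would need its own check.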

      You'll also need a different set of head rules.  You can probably get these from some other researcher, and they might even be in the same format that the English (non-NP) head rules use.  Or, if you know a little about Chinese grammar, you can construct your own.
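
To make the idea of head rules concrete, here is a rough sketch of a priority-list head finder. Everything in it is an assumption for illustration: the class `HeadRuleSketch`, the rule representation, and the rightmost-child fallback are hypothetical and do not reflect OpenNLP's actual head-rule file format or classes. The rules in the usage note are placeholders, not a validated Chinese rule set.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical head finder: for each parent label, scan the children in a fixed
// direction looking for the first label on a priority list.
final class HeadRuleSketch {

    static final class Rule {
        final boolean leftToRight;      // scan direction over the children
        final List<String> priorities;  // child labels, most preferred first
        Rule(boolean leftToRight, String... priorities) {
            this.leftToRight = leftToRight;
            this.priorities = Arrays.asList(priorities);
        }
    }

    private final Map<String, Rule> rules = new HashMap<>();

    void add(String parent, Rule rule) {
        rules.put(parent, rule);
    }

    // Returns the index of the head child; childLabels is assumed to be non-empty.
    int headIndex(String parent, List<String> childLabels) {
        int n = childLabels.size();
        Rule rule = rules.get(parent);
        if (rule == null) {
            return n - 1;               // assumed fallback: rightmost child
        }
        for (String wanted : rule.priorities) {
            for (int k = 0; k < n; k++) {
                int i = rule.leftToRight ? k : n - 1 - k;
                if (childLabels.get(i).equals(wanted)) {
                    return i;
                }
            }
        }
        // No priority label matched: take the first child in scan order.
        return rule.leftToRight ? 0 : n - 1;
    }
}
```

As a usage example, a placeholder rule such as `finder.add("NP", new HeadRuleSketch.Rule(false, "NN", "NR", "NP"))` would pick the rightmost NN child of an NP as its head. Real rules should come from a published Chinese head table or from your own analysis of the treebank.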

      I think that is it. 

      I'd love for you to donate this back as a patch.  The main reason I haven't built this myself is that, to make this useful in general, I think OpenNLP would also need to provide a sentence detector and a segmenter/tokenizer.  These are also not difficult to build in the framework, but all together it represents a good chunk of work and time, and I don't currently need to do any Chinese text processing.

      Hope this helps...Tom

       
    • ldd

      ldd - 2008-03-03

      Thanks for your suggestion. I will try in the coming days.

       
