Digits in CCG

Robert
2007-02-02
2013-04-08
  • Robert

    Robert - 2007-02-02

    I'm sorry if this is a very newbie thing to ask, but how would one go about creating grammar file entries so that numbers, both integer and floating point, could be recognized with openCCG?

    I can understand that you could have lexicon entries for all the digits and a decimal point, and then have grammar rules to allow for any combination of digits separated by a single decimal point, which the lexer could evaluate to a number.  However, with openCCG it seems as if it is impossible to do this, because it will only recognize 1523 as a single "word" made up of the digits 1523.  I can have lexicon entries as I stated above; however, I would have to have an entry for each and every possible number, which would mean an infinite number of lexicon entries. :)

    So, is it possible to represent numbers in openCCG for recognition?  If so, could you give me a point in the right direction for how it would be done?

    Thank you so much!

     
    • Michael White

      Michael White - 2007-02-03

      Hi

      A good question, as this is not documented.  There's a customizable interface for tokenizers that can handle this.  You can have a look at Tokenizer.java and DefaultTokenizer.java to see how it works, if you're curious.

      Since the default tokenizer should work for you, what you should look at is the routes sample grammar.  In this grammar, numbers are handled by this entry in morph:

        <entry word="[*NUM*]" class="num" pos="Num" macros="@pl"/>

      The link with the tokenizer is made via the [*NUM*] keyword.  The categories for numbers are the ones in lexicon.xml with pos="Num".

      For example, "2.3 miles" can be parsed as follows:

      tccg> drive west for 2.3 miles .
      1 parse found.

      Parse: sent{index=E_4:action} :
        @d1:action(drive ^
                   <mood>imp ^
                   <Actor>(p1:animate-being ^ pro2) ^
                   <Direction>(w1:direction ^ west) ^
                   <Dist>(m1:dist ^ mile ^
                          <det>nil ^
                          <num>pl ^
                          <Card>(n1:num ^ 2.3)))
      tccg>

      Hope this helps

      -Mike
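
As a rough illustration of the idea in the reply above (one morph entry keyed on [*NUM*] standing in for every possible number), here is a minimal Python sketch. It is not the logic of DefaultTokenizer.java; the regular expression, the function name `morph_key`, and the set-based lexicon are all assumptions made for illustration.

```python
import re

# Hypothetical pattern for integers and simple decimals; the real
# tokenizer may accept a different set of number formats.
NUM_RE = re.compile(r"[+-]?\d+(?:\.\d+)?")


def morph_key(token, lexicon_words):
    """Return the morph entry key to look up for a token, or None.

    Ordinary words are looked up as themselves; any token that looks
    like a number is mapped to the single special key "[*NUM*]".
    """
    if token in lexicon_words:
        return token
    if NUM_RE.fullmatch(token):
        return "[*NUM*]"
    return None
```

With this kind of mapping, "1523" and "2.3" both resolve to the same [*NUM*] entry, so no per-number lexicon entries are needed.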

       
  • Anonymous - 2011-02-08

    Sorry for digging out this old thread, but I have a similar problem.

    Though in my case for a German grammar, the core problem should be similar: We are trying to parse date formats, containing numbers. For example: "The share value dropped between 15th and 19th October."
    While we did look up the [*DATE*] and [*NUM*] expressions in the xml flights grammar, we couldn't figure out how to adapt this to the dot-syntax, which we are working with. Are there similar expressions, or is it actually necessary to modify the ccg2xml compiler?

    Best regards,
    Eric

     
  • Michael White

    Michael White - 2011-02-08

    Not sure about handling special tokens like numbers and dates in the dot ccg syntax.  Does anyone?  Does ccg2xml need to know that [*NUM*] is a special word?  Feel free to post a solution if anyone figures this out.  Meanwhile, there's always the workaround of writing a little compile script that calls ccg2xml and then inserts a few lines into the xml files as needed.
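
The insertion step of the workaround mentioned above could look roughly like the following Python sketch. It assumes ccg2xml has already been run and that the generated morph file has a root element with `entry` children; the function name `add_morph_entry` is hypothetical, and the attribute values are taken from the routes grammar entry quoted earlier in the thread. Real generated files may carry namespace attributes that ElementTree rewrites on output, so treat this as a starting point only.

```python
import xml.etree.ElementTree as ET


def add_morph_entry(morph_xml, word, pos, extra=None):
    """Return morph XML text with an <entry> for `word` appended.

    If an entry with the same word and pos already exists, the input
    text is returned unchanged, so the script can be rerun safely.
    """
    root = ET.fromstring(morph_xml)
    for entry in root.iter("entry"):
        if entry.get("word") == word and entry.get("pos") == pos:
            return morph_xml
    attrs = {"word": word, "pos": pos}
    attrs.update(extra or {})
    ET.SubElement(root, "entry", attrs)
    return ET.tostring(root, encoding="unicode")
```

A compile script would read the generated morph file, call `add_morph_entry(text, "[*NUM*]", "Num", {"class": "num", "macros": "@pl"})`, and write the result back.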

     
  • Jason Baldridge

    Jason Baldridge - 2011-02-08

    I think it is quite likely that the dotccg format parser would get confused by [*NUM*]: it is probably trying to interpret the brackets as features. (Or it simply gets totally confused.)

     
  • Jean

    Jean - 2011-03-07

    Hi,

    I'm working with Eric on the German grammar (he asked about [*DATE*] and [*NUM*] in DotCCG a couple of posts above this one).  Since DotCCG can't handle these, I decided to extend the compiler to do so. Now it's able to handle AMT, DATE, DUR, NUM and TIME (btw, are there more of these 'magic' tokens?).  E.g.:

    family Num(closed=false) {
        entry: num;
    }
    word [*NUM*]:Num;
    

    This allows numbers to be parsed:

    tccg> 100
    1 parse found.
    Parse: num
    ----------------------------------
    (lex) 100 :- num
    

    Note that the family needs to be explicitly declared open (closed=false), otherwise no magic will happen. This is because the DotCCG compiler declares families closed by default (and claims that this is due to a bizarreness in OpenCCG; maybe someone knows what this is about?), and because I didn't want to change too much of the code, since I'm not very familiar with OpenCCG's internals or with all of the DotCCG compiler's details.

    Please let me know whether someone is interested in the modification, I would of course contribute the code.

    Greets,

    Jean

     
  • Michael White

    Michael White - 2011-03-08

    Cool.  That does cover all the magic tokens (see Tokenizer.java), except one for "other" named entities (e.g. people, places, etc.), for which a custom tokenizer must be written if it's to be used.  (So covering these five makes perfect sense.)

    Are the compiler changes all in just one or two files?  It would be great to get them, maybe just by email to me and then I can check 'em over and check 'em in.

    -Mike

     
  • Jean

    Jean - 2011-03-08

    The changes are minor (and commented), only ccg2xml.py needs to be modified. I'll send you a PM.

    Btw, tokens of the bracketed form (like [*NUM*]) are no longer restricted to the special ones mentioned above. I removed the restriction after recognizing that other tokens of this form can be perfectly legal surface forms. One further benefit is that the compiler doesn't need to be modified in case new magic IDs are introduced. I've commented everything in case someone wants the restriction back.

    Jean

     
  • Giuliano Lancioni

    Dear Jean,

    could you post the modified ccg2xml.py to the list (if it is possible)? I think your "totally unrestricted" version could be very useful to many openccg users (doubtless to myself!).

    Thank you,

    Giuliano

     
  • Jean

    Jean - 2011-03-08

    I think Mike's going to do that, right? (I don't think I can, on the other hand I'm not used to SourceForge so let me know if I'm wrong ;))

    Btw, I'm not sure if I get "totally unrestricted" right. The new token type does have to have the form [*x*], where x is an arbitrary string not containing a star. Thus, a token whose x itself contains a star will result in a syntax error. Let me know whether there are cases where this is still too restrictive.

    I hope I could help,

    Jean
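
The token shape described in the post above can be checked with a short regular expression. This Python sketch is only an illustration and is not the rule from ccg.ply; the name `is_special_token` is hypothetical, and requiring x to be non-empty is an assumption the post does not state.

```python
import re

# "[*" + x + "*]", where x contains no star.
# Assumption: x must be non-empty (the post does not say either way).
SPECIAL_TOKEN_RE = re.compile(r"\[\*[^*]+\*\]")


def is_special_token(word):
    """Return True if `word` has the bracketed-star token shape."""
    return SPECIAL_TOKEN_RE.fullmatch(word) is not None
```

Under these assumptions, `is_special_token("[*NUM*]")` holds, while a token with an extra star inside the brackets is rejected.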

     
  • Michael White

    Michael White - 2011-03-09

    Hello Jean

    I've had a quick look at your file, and it seems you've edited ccg2xml.py, which is actually a generated file (see note at the top of the file).  The changes should instead be made to openccg/src/ccg2xml/ccg.ply, which is the real source file (openccg/bin/ccg2xml.py is generated as part of the build process).

    Do you think you could make the changes in the source file, openccg/src/ccg2xml/ccg.ply?  It would be helpful if you could make the changes there and try them out, to make sure they still work as intended.  That way I should be able to quickly ensure it works and check it in to the cvs.

    Thanks,

    -Mike

     
  • Jean

    Jean - 2011-03-09

    Hi Mike,

    Done. I checked it and everything seems to work fine. I'll send you a PM with the new ccg.ply.

    Best regards,

    Jean

     
  • Michael White

    Michael White - 2011-03-10

    Thanks, I've checked in the update to ccg.ply to the cvs, so it should be accessible there for anyone who wants these changes.  I tried out the numbers and they seemed to work; didn't try any of the other 'magic' tokens, but it seems they should work as well.

    -Mike

     
