UTF-8 encoding?

sracioppa
2011-12-01
2013-04-08
  • sracioppa

    sracioppa - 2011-12-01

    Hello,
    in my Italian grammar I have to use non-ASCII characters - like "è". This is mandatory for the project I use the grammar in.

    Since I converted the DotCCG file to UTF-8, testing with ccg-test works without problems. However, special characters are not displayed correctly in the output, for example:

    ok      -       andrÓ meglio la prossima volta

    Well - it doesn't look fine but I can live with that ;-)
    My real problem is that VisCCG doesn't like UTF-8 files. On top of that, I can't enter special characters when testing with tccg.

    The command line can display these characters, so it doesn't seem to be an operating system issue. Is there an OpenCCG setting I can change so that its tools "support" special characters as well?

    Thanks in advance
    Best, Stefania
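
The garbled "andrÓ" above is consistent with a mismatch between the charset Java writes and the charset the console displays. The sketch below reproduces it under an assumption the thread does not confirm: that the output was encoded as Latin-1 (or windows-1252), where "à" is the single byte 0xE0, and shown by a console using code page 850, where 0xE0 is "Ó". The class and method names are made up for illustration, and the code page 850 step is skipped if the running JVM does not ship that charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encode text with one charset, then decode the same bytes with another.
    // This mimics a program and a terminal that disagree about the encoding.
    static String misdecode(String text, Charset writer, Charset reader) {
        return new String(text.getBytes(writer), reader);
    }

    public static void main(String[] args) {
        // "andrà", written with an escape so this source file stays ASCII-only.
        String original = "andr\u00e0";

        // Assumption: the console uses code page 850 (not stated in the thread).
        if (Charset.isSupported("IBM850")) {
            Charset console = Charset.forName("IBM850");
            System.out.println(misdecode(original, StandardCharsets.ISO_8859_1, console));
        }

        // A second common mismatch: UTF-8 bytes of "è" read as Latin-1.
        System.out.println(misdecode("\u00e8", StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1));
    }
}
```

Note that what this program prints is itself subject to the console's encoding, so the results are easier to inspect in a debugger or a test than on the affected terminal.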

     
  • Michael White

    Michael White - 2011-12-01

    Hello Stefania

    Thanks for bringing this issue to our attention.  Can you provide a minimal grammar that illustrates the problem and can be used for testing?  It may be that tccg and ccg-test can be fixed by setting the charset to utf8 (not sure about visccg, I haven't used it myself).  If you have a chance to try editing the Java code and manage to fix the problem, please also let us know.

    Ciao, -Mike

     
  • Michael White

    Michael White - 2011-12-06

    I've consulted with Amy Isard on this, and she says she's been using utf8 with no problems for years with German and Greek.  I just did my own testing and it all works fine on linux (tccg input and output, ccg-test display), but on a mac I'm running into display issues in tccg and crashes with 'Invalid byte 1 of 1-byte UTF-8 sequence' errors using ccg-test or with realization in tccg.  I'm at a loss for why this is happening on macs but not on linux.  So, you might try using linux for the time being, if you've been using a mac.
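
An "Invalid byte ... UTF-8 sequence" error of this kind typically means that a file is being read as UTF-8 while its bytes were written in some other encoding. For example, "è" is the two bytes 0xC3 0xA8 in UTF-8 but the single byte 0xE8 in Latin-1, and a lone 0xE8 is not a valid UTF-8 sequence. The following is a minimal sketch of a strict validity check using the standard `java.nio.charset` decoder; it is not OpenCCG code, and the class name `Utf8Check` is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Check {
    // Returns true only if the bytes form well-formed UTF-8.
    // REPORT makes the decoder throw instead of silently substituting U+FFFD.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xC3, (byte) 0xA8};   // "è" encoded as UTF-8
        byte[] latin1 = {(byte) 0xE8, ' ', 'x'};    // "è x" encoded as Latin-1
        System.out.println(isValidUtf8(utf8));
        System.out.println(isValidUtf8(latin1));
    }
}
```

Running such a check on a generated grammar file can help tell whether the file itself is mis-encoded or whether the problem lies in how it is displayed.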

     
  • Michael White

    Michael White - 2011-12-06

    Aha!  More googling has finally turned up the answer, namely to set the default file encoding on the command line as follows (this is at the end of bin/ccg-env; bin/ccg-env.bat is similar):

    JAVA="$JAVA_HOME/bin/java"
    JAVA_MEM="-Xmx256m"
    #JAVA_MEM="-Xmx2048m"
    JAVA_ARGS="$JAVA_MEM -classpath $CP -Dfile.encoding=UTF8"
    

    I've checked this into CVS.  It seems to work fine on Linux and Mac OS X; on the Mac, the issue seemed to be that UTF-8 is not the default encoding for Java.  Haven't tested this on Windows yet.

    Mike
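
The `-Dfile.encoding` flag changes the JVM-wide default that applies whenever code opens a reader or writer without naming a charset. A complementary approach in Java code is to pass the charset explicitly, so the result no longer depends on the platform default. The sketch below illustrates that idea only; it does not show how OpenCCG actually reads its files, and `ExplicitCharsetRead` and `readUtf8` are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitCharsetRead {
    // Decode a stream as UTF-8 regardless of the JVM's default charset.
    static String readUtf8(InputStream in) {
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            StringBuilder text = new StringBuilder();
            char[] buffer = new char[1024];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
            return text.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Shows which default the JVM picked up (affected by -Dfile.encoding).
        System.out.println("Default charset: " + Charset.defaultCharset());

        byte[] utf8Bytes = {'M', 'a', 'r', 'i', 'a', ' ', (byte) 0xC3, (byte) 0xA8};
        String decoded = readUtf8(new ByteArrayInputStream(utf8Bytes));
        System.out.println(decoded.length()); // 7 characters: "Maria " plus "è"
    }
}
```

On Java 18 and later, UTF-8 is already the default charset, so the flag mainly matters for older JVMs like the ones discussed here.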

     
  • sracioppa

    sracioppa - 2012-03-05

    Hello, sorry for the late answer!
    Thank you for the hint. Setting -Dfile.encoding helped a little bit.

    In the Windows XP command box the special characters are still not displayed correctly, and VisCCG still doesn't like the encoding. But at least I can paste special characters into the tccg prompt to test sentences and realize them properly.

    Well, it's not the best solution but I can work with it - for now.

    Maybe the problem is my German keyboard: to create accents, I have to type a special "accent key" followed by the corresponding letter (e.g. for è: ` followed by "e"). The tccg prompt doesn't seem to handle that.

    Please find below a very small grammar containing the character "è".

    Thank you for your help :-)
    Best, Stefania

    ####################### Features #######################
    feature {
      CASE<2>: nom+acc {nom acc} dat;
      GEN<1,2,12>: fem+masc {fem masc};
      PERS<1,2>: non-3rd {1st 2nd} 3rd indet;
      NUM<1,2,12>: sg+pl {sg pl};
      DEGREE<2>: non-sup {base comp} sup;
      TENSE<1>: pres past impf fut;
      VFORM<1>: infin fin prpt ppt;
      
      ontology: entity {
        physical { 
          animate { person animal }
          thing
          }
        situation {
          change { action }
          state
          }
        };
      }
    ######################## Words #########################
    word Maria: Name(person): 3rd sg fem;
    word Marco: Name(person): 3rd sg masc;
    word essere: PredV(state) {
      sono: fin 1st sg pres;
      sei: fin 2nd sg pres;
      è: fin 3rd sg pres;
      siamo: fin 1st pl pres;
      siete: fin 2nd pl pres;
      sono: fin 3rd pl pres;
    }
    word curioso: Adj {
      *: masc sg base;
      curiosa: fem sg base;
      curiosi: masc pl base;
      curiose: fem pl base;
      curiosissimo: masc sg sup;
      curiosissima: fem sg sup;
      curiosissimi: masc pl sup;
      curiosissime: fem pl sup;
    }
    ######################## Lexicon #########################
    family Name(N) {
      entry: np<2>[X]: X (*);
    }
    family Adj {
      entry: pred<2>[X]: X (*);
    }
    family PredV(V) {
      entry: s<1>[E MOOD] \ np<2>[X nom GEN] / pred<12>[Y GEN]: E:state (* <Actor>X:entity <Pred>Y);
    }
    ####################### Testbed ########################
    testbed {
      "Maria è curiosa": 1;
      "Maria è curioso": 0;
      "Marco è curiosa": 0;
      "Marco è curioso": 1;
    }
    
     
  • Michael White

    Michael White - 2012-03-05

    Hello Stefania

    I spent some time looking into this, and unfortunately from what I can tell, command line input of these accented characters is just hopelessly broken on windows.  (I tried various tips posted on the web as well as 3rd-party command line interfaces but nothing was satisfactory - groan.)  However, file output should work just fine, so maybe it's possible to live with the issue.  Seems to work fine on linux and macs.

    Mike
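
Since file output is the part that works reliably, one way to sidestep the console is to write results to a file with an explicit charset and inspect them in a UTF-8-aware editor. This is a minimal sketch of that idea using standard `java.nio.file` calls; it is not part of OpenCCG, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class Utf8FileOutput {
    // Write text as UTF-8 bytes, independent of the platform default charset.
    static void writeUtf8(Path file, String text) {
        try {
            Files.write(file, text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read the file back, decoding its bytes as UTF-8.
    static String readUtf8(Path file) {
        try {
            return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Write to a temporary file, read it back, and clean up.
    static String roundTrip(String text) {
        try {
            Path file = Files.createTempFile("ccg-output", ".txt");
            try {
                writeUtf8(file, text);
                return readUtf8(file);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String sentence = "Maria \u00e8 curiosa"; // "Maria è curiosa", kept ASCII-only in source
        System.out.println(sentence.equals(roundTrip(sentence)));
    }
}
```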

     
  • sracioppa

    sracioppa - 2012-03-06

    Hello Mike,

    thanks a lot. That's what I also supposed. But as you said - the file output works and I can both parse and realize in TCCG, so I will just live with the strange characters.

    Best, Stefania

     
