
#297 hfst-lexc reads unescaped zeros as literal zeros

future
accepted
None
1
2015-12-11
2015-04-28
sjurum
No

There's a lexc parsing bug related to zeros used to align symbol pairs when the lexc entry contains flag diacritics. Consider the following data:

Eksekutivkomitea+OLang/UND:Eksekutiv#komite ALBMOTMUSEA-org ;

LEXICON ALBMOTMUSEA-org !vow+a
0@U.Cap.Obl@+N+Prop+Sem/Org:%>@U.Cap.Obl@000 ALBMOTMUSEA-OBL ;

Actual output:

Eksekutivkomitea+N+Prop+Sg+Nom  Eksekutivkomite000a 0

(+Sem/Org is made optional and thus not needed for generation).

Expected output: no zeros in the generated word form.

Discussion

  • Erik Axelson

    Erik Axelson - 2015-05-18
    • status: open --> accepted
    • assigned_to: Erik Axelson
     
  • Kimmo Koskenniemi

    See my message of 09.12.2015 14:23 to hfst-bugs@helsinki.fi, which does not conflict with the above request. I would like hfst-lexc to respect Multichar_Symbols that contain zeros. Currently (hfst 3.8.3) it seems to break multicharacter symbols into single characters if they contain unprotected 0 characters. E.g.:

    Multichar_Symbols
    {a0}
    LEXICON Root
    k{a0}k #;
    k{a0}i #;

    $ hfst-lexc test.lexc | hfst-fst2txt
    hfst-lexc: warning: Defaulting to OpenFst tropical type
    Root...0 1 k k 0.000000
    1 2 { { 0.000000
    2 3 a a 0.000000
    3 4 } } 0.000000
    4 5 k k 0.000000
    4 5 i i 0.000000
    5 0.000000

    I would like {a0} to act as a single symbol {a0}, not as a sequence of three symbols {, a, }.

     
  • Erik Axelson

    Erik Axelson - 2015-12-11

    I tested the above case with xfst and foma. xfst handles {a0} as a single symbol, but foma tokenizes it into { a }, as does hfst-lexc. If I add a percent sign in front of the 0, then foma, xfst and hfst-lexc all behave the same way, i.e. they all tokenize {a0} as a single symbol.

     
  • Kimmo Koskenniemi

    Agreed. With foma's read lexc command, one has to write {a%0} in the lexical entries in order to represent the symbol {a0}. I prefer the XFST convention, where the Multichar_Symbols definition takes priority over the treatment of 0 as an epsilon. Treating 0 as an epsilon first effectively prevents using an unescaped 0 as part of a multicharacter symbol.
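
The two conventions discussed above differ only in which rule is applied first. The following Python sketch is a toy model of that difference, not the actual tokenizer of xfst, foma or hfst-lexc. The function name `tokenize` and the `multichar_first` switch are hypothetical, and the model assumes that multicharacter symbols are declared unescaped (`{a0}`) while entries may contain `%`-escapes.

```python
def tokenize(entry, multichar_symbols, multichar_first=True):
    """Split a lexc-like entry string into a list of symbols (toy model).

    multichar_first=True  models the convention attributed to xfst above:
                          a declared multichar symbol wins over 0-as-epsilon.
    multichar_first=False models the behaviour reported for foma and
                          hfst-lexc 3.8.3: an unescaped 0 is epsilon and
                          cannot take part in a multichar match.
    """
    # Step 1: resolve %-escapes into (character, was_escaped) units.
    units = []
    i = 0
    while i < len(entry):
        if entry[i] == "%" and i + 1 < len(entry):
            units.append((entry[i + 1], True))
            i += 2
        else:
            units.append((entry[i], False))
            i += 1

    def is_epsilon(unit):
        # Only an unescaped 0 counts as epsilon.
        return unit == ("0", False)

    # Try longer declared symbols first; ignore empty declarations.
    candidates = sorted((s for s in multichar_symbols if s),
                        key=len, reverse=True)

    tokens = []
    pos = 0
    while pos < len(units):
        matched = None
        for sym in candidates:
            window = units[pos:pos + len(sym)]
            if "".join(ch for ch, _ in window) != sym:
                continue
            if not multichar_first and any(is_epsilon(u) for u in window):
                # Epsilon-first: an unescaped 0 blocks the match.
                continue
            matched = sym
            break
        if matched is not None:
            tokens.append(matched)
            pos += len(matched)
        elif is_epsilon(units[pos]):
            pos += 1  # epsilon contributes no symbol
        else:
            tokens.append(units[pos][0])
            pos += 1
    return tokens


if __name__ == "__main__":
    print(tokenize("k{a0}k", ["{a0}"], multichar_first=True))
    print(tokenize("k{a0}k", ["{a0}"], multichar_first=False))
    print(tokenize("k{a%0}k", ["{a0}"], multichar_first=False))
```

Under these assumptions, the unescaped entry `k{a0}k` yields a single `{a0}` symbol only when the multichar declaration takes priority, while the escaped entry `k{a%0}k` yields it under both policies, which matches the behaviour described in the comments above.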