There's a lexc parsing bug related to zeros used to align symbol pairs when the lexc entry contains flag diacritics. Consider the following data:
Eksekutivkomitea+OLang/UND:Eksekutiv#komite ALBMOTMUSEA-org ;
LEXICON ALBMOTMUSEA-org !vow+a
0@U.Cap.Obl@+N+Prop+Sem/Org:%>@U.Cap.Obl@000 ALBMOTMUSEA-OBL ;
Eksekutivkomitea+N+Prop+Sg+Nom Eksekutivkomite000a 0
(+Sem/Org is made optional and thus not needed for generation).
Expected output: no zeros in the generated word form.
See also my message of 09.12.2015 14:23 to hfst-bugs@helsinki.fi, which does not conflict with the above request. I would like hfst-lexc to respect Multichar_Symbols that contain zeros. Currently (hfst 3.8.3) it seems to break multicharacter symbols into single characters if they contain unprotected 0 characters. For example:
Multichar_Symbols
{a0}
LEXICON Root
k{a0}k #;
k{a0}i #;
$ hfst-lexc test.lexc | hfst-fst2txt
hfst-lexc: warning: Defaulting to OpenFst tropical type
Root...0 1 k k 0.000000
1 2 { { 0.000000
2 3 a a 0.000000
3 4 } } 0.000000
4 5 k k 0.000000
4 5 i i 0.000000
5 0.000000
I would like {a0} to act as a single symbol {a0}, not as a sequence of three symbols {, a, }.
I tested the above case with xfst and foma. xfst handles {a0} as a single symbol, but foma tokenizes it into { a }, as does hfst-lexc. If I add a percent sign in front of the 0, foma, xfst and hfst-lexc all behave the same way, i.e. they all tokenize {a0} as a single symbol.
Agreed. With foma's read lexc command, one has to write {a%0} in the lexical entries in order to represent the symbol {a0}. I prefer the XFST convention, where the multicharacter symbol definition takes priority over the treatment of 0 as epsilon. Giving the epsilon reading priority effectively prevents using an unescaped 0 as part of a multicharacter symbol.
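To make the difference between the two conventions concrete, here is a minimal Python sketch of the tokenization order being discussed. It is a toy model only, not the actual code of hfst-lexc, foma or xfst: the function name tokenize and the flag multichar_first are made up for illustration, and epsilon is modelled simply by emitting no symbol.

```python
def tokenize(entry, multichar_symbols, multichar_first=True):
    """Split one side of a lexc entry into symbols (toy model).

    multichar_first=True:  declared multicharacter symbols win over the
                           0-as-epsilon reading (the XFST-like convention).
    multichar_first=False: a bare 0 is consumed as epsilon before symbol
                           matching, so an unescaped {a0} can never match
                           (the behaviour reported for hfst-lexc and foma).
    """
    # Step 1: resolve %-escapes into (character, was_escaped) pairs.
    chars = []
    i = 0
    while i < len(entry):
        if entry[i] == "%" and i + 1 < len(entry):
            chars.append((entry[i + 1], True))
            i += 2
        else:
            chars.append((entry[i], False))
            i += 1

    # Step 2 (epsilon-first convention only): drop bare zeros up front.
    if not multichar_first:
        chars = [(c, esc) for c, esc in chars if esc or c != "0"]

    # Step 3: greedy matching, longest declared symbol first.
    symbols = sorted(multichar_symbols, key=len, reverse=True)
    tokens = []
    pos = 0
    while pos < len(chars):
        rest = "".join(c for c, _ in chars[pos:])
        match = next((s for s in symbols if rest.startswith(s)), None)
        if match is not None:
            tokens.append(match)
            pos += len(match)
        elif chars[pos] == ("0", False):
            pos += 1  # bare 0 outside any declared symbol: epsilon, emits nothing
        else:
            tokens.append(chars[pos][0])
            pos += 1
    return tokens


if __name__ == "__main__":
    declared = ["{a0}"]
    print(tokenize("k{a0}k", declared, multichar_first=True))
    print(tokenize("k{a0}k", declared, multichar_first=False))
    print(tokenize("k{a%0}k", declared, multichar_first=False))
```

Under this model, the unescaped entry k{a0}k keeps {a0} as one symbol only when declared symbols are matched first, while the escaped entry k{a%0}k yields the single symbol {a0} under both orderings, which mirrors the observations reported above. Zeros that are not part of any declared symbol, such as the alignment zeros in the first report, still disappear in either mode.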