hfst-lexc reads unescaped zeros as literal zeros

#297 hfst-lexc reads unescaped zeros as literal zeros

Milestone: future

Status: accepted

Owner: Erik Axelson

Labels: None

Priority: 1

Updated: 2015-12-11

Created: 2015-04-28

Creator: sjurum

Private: No

hfst-lexc has a parsing bug: zeros used to align the two sides of a symbol pair are read as literal zeros when the lexc entry contains flag diacritics. Consider the following data:

Eksekutivkomitea+OLang/UND:Eksekutiv#komite ALBMOTMUSEA-org ;

LEXICON ALBMOTMUSEA-org !vow+a
0@U.Cap.Obl@+N+Prop+Sem/Org:%>@U.Cap.Obl@000 ALBMOTMUSEA-OBL ;

Actual output:

Eksekutivkomitea+N+Prop+Sg+Nom  Eksekutivkomite000a 0

(+Sem/Org is made optional and thus not needed for generation).

Expected output: no zeros in the generated word form.

Discussion

Erik Axelson - 2015-05-18

status: open --> accepted

assigned_to: Erik Axelson

Kimmo Koskenniemi - 2015-12-11

See my message of 09.12.2015 14:23 to hfst-bugs@helsinki.fi, which does not conflict with the above request. I would like hfst-lexc to respect Multichar_Symbols that contain zeros. Currently (hfst 3.8.3) it seems to break multicharacter symbols into single characters if they contain unprotected 0 characters. E.g.

Multichar_Symbols
{a0}
LEXICON Root
k{a0}k #;
k{a0}i #;

$ hfst-lexc test.lexc | hfst-fst2txt
hfst-lexc: warning: Defaulting to OpenFst tropical type
Root...0 1 k k 0.000000
1 2 { { 0.000000
2 3 a a 0.000000
3 4 } } 0.000000
4 5 k k 0.000000
4 5 i i 0.000000
5 0.000000

I would like {a0} to act as a single symbol {a0}, not as a sequence of three symbols {, a, }.

Erik Axelson - 2015-12-11

I tested the above case with xfst and foma. Xfst handles {a0} as a single symbol, but foma tokenizes it into { a }, as does hfst-lexc. If I add a percent sign in front of 0, foma, xfst and hfst-lexc work in the same way, i.e. they all tokenize {a0} as a single symbol.

Kimmo Koskenniemi - 2015-12-11

Agreed. With foma's read lexc command, one has to write {a%0} in the lexical entries to represent the symbol {a0}. I prefer the XFST convention, where the multicharacter symbol definition takes priority over the treatment of 0 as an epsilon. Treating 0 as an epsilon first effectively prevents using it as part of a multicharacter symbol.
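
To make the two conventions concrete, here is a toy Python model of the tokenization step. It is only an illustrative sketch, not the code of hfst-lexc, foma or xfst: the function names and the multichar_first switch are hypothetical, and it assumes that the declared symbol is stored without its escape characters.

```python
def read_units(entry):
    """Split an entry into (char, escaped) pairs, resolving %-escapes."""
    units = []
    i = 0
    while i < len(entry):
        if entry[i] == "%" and i + 1 < len(entry):
            units.append((entry[i + 1], True))   # %0 is a literal zero
            i += 2
        else:
            units.append((entry[i], False))
            i += 1
    return units


def tokenize(entry, multichars, multichar_first):
    """Toy tokenizer for one side of a lexc entry (hypothetical model).

    multichar_first=True:  declared multicharacter symbols are matched
        before an unescaped 0 is read as epsilon (the XFST-like convention).
    multichar_first=False: unescaped zeros are dropped as epsilon before
        symbol matching (the behaviour reported for foma and hfst-lexc).
    """
    units = read_units(entry)
    if not multichar_first:
        # Epsilon first: unescaped zeros vanish before any symbol matching.
        units = [(c, esc) for c, esc in units if esc or c != "0"]
    symbols = sorted(multichars, key=len, reverse=True)  # longest match first
    tokens = []
    i = 0
    while i < len(units):
        rest = "".join(c for c, _ in units[i:])
        match = next((s for s in symbols if rest.startswith(s)), None)
        if match is not None:
            tokens.append(match)
            i += len(match)
        else:
            c, escaped = units[i]
            if escaped or c != "0":   # a leftover unescaped 0 is epsilon
                tokens.append(c)
            i += 1
    return tokens


print(tokenize("k{a0}k", ["{a0}"], multichar_first=True))
print(tokenize("k{a0}k", ["{a0}"], multichar_first=False))
print(tokenize("k{a%0}k", ["{a0}"], multichar_first=False))
```

In this model the first call keeps {a0} as one token, the second yields the k { a } k sequence seen in the hfst-fst2txt output above, and the third shows how escaping the zero as %0 restores the single symbol under the epsilon-first convention.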
