When @IDENTITY_SYMBOL@ is used to form the prefix in affix-guessify, lookup does not work properly with non-trivial automata such as omorfi. When I replace @IDENTITY_SYMBOL@ with x, it works as it should, but only for x. See patch.
$ tools/src/hfst-affix-guessify -w 1 ~/Koodit/omorfi/src/temporary.ftb3.hfst -o guess.hfst
$ hfst-lookup guess.hfst
hfst-lookup: warning: It is not possible to perform fast lookups with OpenFST, std arc, tropical semiring format automata. Using HFST basic transducer format and performing slow lookups
> xtalo
xtalo x#talo N Nom Sg 1,000000
xtalo xalo N Nom Sg 1,000000
xtalo xlo N Nom Sg 1,000000
xtalo xlo N Nom Sg 1,000000
xtalo xo N Nom Sg 1,000000
xtalo xtalo N Nom Sg 1,000000
xtalo xtalo N Prop Nom Sg 1,000000
xtalo xtalo N Nom Sg 1,000000
xtalo x N Abbr#talo N Nom Sg 2,000977
xtalo x#talo N Nom Sg 2,000977
xtalo x%#talo N Nom Sg 2,000977
xtalo x%%#talo N Nom Sg 2,000977
xtalo x%<Del%>%%#talo N Nom Sg 2,000977
xtalo x%>%%#talo N Nom Sg 2,000977
xtalo x-#talo N Nom Sg 2,000977
xtalo x<Del%>%%#talo N Nom Sg 2,000977
xtalo x>%%#talo N Nom Sg 2,000977
xtalo xDel%>%%#talo N Nom Sg 2,000977
xtalo xel%>%%#talo N Nom Sg 2,000977
xtalo xet#talo N Nom Sg 2,000977
xtalo xl%>%%#talo N Nom Sg 2,000977
xtalo xo#talo N Nom Sg 2,000977
xtalo xt#talo N Nom Sg 2,000977
xtalo x←%<Del%>%%#talo N Nom Sg 2,000977
> ytalo
ytalo ytalo+? inf
The symptoms look as if the two automata "." and "omorfi" were not
harmonized before being concatenated into a guesser. That is, if x does
not appear in OMorFi, the final concatenated automaton treats it as an
unknown character, but as y likely exists in OMorFi, without
harmonization it is treated as a known character that is intentionally
left out of the . affix.
--
Krister
On 1.6.2014 22:01, Flammie Pirinen wrote:
Related
Bugs: #246
It seems that the problem is in the harmonization of the guesser transducer. hfst-lookup itself seems to handle identities just fine:
echo "?" | hfst-regexp2fst > tmp
hfst-lookup tmp
echo "? - a" | hfst-regexp2fst > tmp
hfst-lookup tmp
The guesser is not a mere concatenation of
< ? @"lexicon" >
but can also lead into any suffix of the lexicon, so it is built by hand, "unharmonised". Inserting all symbols from the lexicon into all the identities ("harmonising") seems too heavy. Maybe simple weighted guessers just cannot be done using HFST.
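
To make the x/y behaviour concrete, here is a minimal toy model in Python. It is not HFST code and uses no HFST API. It assumes only that "?" matches symbols unknown to the automaton, and that harmonising means giving the "?" state explicit arcs for the symbols the other automaton knows. The alphabet, the function names and the single-character symbols are all made up for illustration.

```python
# Toy model only: symbols are single characters and an automaton is
# reduced to its alphabet plus the explicit arcs of the "?" prefix state.

def identity_matches(symbol, alphabet):
    """'?' matches only symbols that are unknown to the automaton."""
    return symbol not in alphabet


def prefix_accepts(symbol, explicit_arcs, alphabet):
    """The prefix state accepts via an explicit arc or via '?'."""
    return symbol in explicit_arcs or identity_matches(symbol, alphabet)


def harmonize(prefix_alphabet, lexicon_alphabet):
    """Explicit arcs '?' must gain so that it keeps meaning 'any symbol'."""
    return lexicon_alphabet - prefix_alphabet


# Hypothetical alphabets: y is known to the lexicon, x is not.
lexicon_alphabet = set("taloy")
prefix_alphabet = set()  # the bare "?" prefix knows no symbols

# Built by hand, unharmonised: the result inherits the lexicon alphabet,
# but the prefix state still has only the '?' arc.
combined_alphabet = prefix_alphabet | lexicon_alphabet
unharmonised = {c: prefix_accepts(c, set(), combined_alphabet) for c in "xy"}

# Harmonised: '?' is first expanded with every lexicon symbol.
arcs = harmonize(prefix_alphabet, lexicon_alphabet)
harmonised = {c: prefix_accepts(c, arcs, combined_alphabet) for c in "xy"}

print(unharmonised)  # {'x': True, 'y': False}
print(harmonised)    # {'x': True, 'y': True}
```

In this model the harmonised prefix needs one explicit arc per lexicon symbol on every identity, which is the cost that makes the approach look too heavy for a lexicon the size of omorfi.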