Helsinki Finite-State Technology / Bugs / #153 hfst-apertium-proc confuses multiwords with partial matches

This one works even better, but it has a bug: it loses newlines and full stops.

Input:

Батыс шеті (46 27 ш.б.) Эльтон және Басқұншақ көлдері маңына, ал шығыс нүктесі 87 20 ш.б.
Бұқтырма өзенінің бастауына сай келеді.

Output:

^Батыс/Бат<v><iv><coop><imp><p2><sg>/Батыс<n><attr>/Батыс<n><nom>/Батыс<n><nom>+е<cop><p3><pl>/Батыс<n><nom>+е<cop><p3><sg>$ ^шеті/шет<n><px3sp><nom>/шет<n><px3sp><nom>+е<cop><p3><pl>/шет<n><px3sp><nom>+е<cop><p3><sg>$ ^(/(<lpar>$^46/46<num>/46<num><subst><nom>$ ^27/27<num>/27<num><subst><nom>$ ^ш/ш$^б./б.$ ^Эльтон/Эльтон$ ^және/және<cnjcoo>$ ^Басқұншақ/Басқұншақ$ ^көлдері/көл<n><pl><px3sp><nom>/көл<n><pl><px3sp><nom>+е<cop><p3><pl>/көл<n><pl><px3sp><nom>+е<cop><p3><sg>$ ^маңына/маң<n><px3sp><dat>$^,/,<cm>$ ^ал/ал<cnjcoo>/ал<v><tv><imp><p2><sg>/ал<vaux><imp><p2><sg>/алд<n><attr>/алд<n><nom>/алд<n><nom>+е<cop><p3><pl>/алд<n><nom>+е<cop><p3><sg>$ ^шығыс/шығыс<n><attr>/шығыс<n><nom>/шығыс<n><nom>+е<cop><p3><pl>/шығыс<n><nom>+е<cop><p3><sg>/шық<v><iv><coop><imp><p2><sg>$ ^нүктесі/нүкте<n><px3sp><nom>/нүкте<n><px3sp><nom>+е<cop><p3><pl>/нүкте<n><px3sp><nom>+е<cop><p3><sg>$ ^87/87<num>/87<num><subst><nom>$ ^20/20$ ^ш/ш$^б./б.$^Бұқтырма/Бұқтырма$ ^өзенінің/өзен<n><px3sp><gen>$ ^бастауына/баста<v><tv><ger><px3sp><dat>/баста<vaux><ger><px3sp><dat>$ ^сай/сай<adj>/сай<adj><subst><nom>/сай<adj><subst><nom>+е<cop><p3><pl>/сай<adj><subst><nom>+е<cop><p3><sg>$ 
^келеді/кел<v><iv><aor><p3><pl>/кел<v><iv><aor><p3><sg>/кел<vaux><aor><p3><pl>/кел<vaux><aor><p3><sg>$^./.<sent>$

(Note that the newline is lost before 'Бұқтырма', the full stop is lost between ш and б, and the closing parenthesis is lost before 'Эльтон'.)

I suspect the problem lies in how spaces and punctuation are handled. For example, lttoolbox has two different kinds of dictionary sections, "standard" and "inconditional" (the latter giving unconditional, i.e. strong, acceptance states).

See the difference between "strong acceptance states" and "weak acceptance states" in:

http://www.dlsi.ua.es/~mlf/docum/garrido02p.pdf (p.4)
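
To make the distinction concrete, here is a toy sketch in Python. It is not code from lttoolbox or hfst-proc; the function name `analyse`, the tiny lexicon and the simplified `^form$` output (no analyses) are all made up for illustration. It assumes my reading of the paper: a weak (standard) acceptance only counts if the next character is not alphabetic, while a strong (inconditional) acceptance counts regardless of what follows. Anything that is not matched, including blanks and newlines, is copied through unchanged, which is the behaviour the output above fails to show.

```python
def analyse(text, standard, inconditional):
    """Toy left-to-right longest-match tokeniser (illustration only).

    standard      -- forms accepted only at a boundary (weak acceptance)
    inconditional -- forms accepted whatever follows (strong acceptance)
    """
    out, i = [], 0
    while i < len(text):
        best = None  # end index of the longest accepted match from i
        for j in range(i + 1, len(text) + 1):
            form = text[i:j]
            # Weak acceptance needs a non-alphabetic char (or end) next.
            at_boundary = j == len(text) or not text[j].isalpha()
            if form in inconditional or (form in standard and at_boundary):
                best = j
        if best is not None:
            out.append(f"^{text[i:best]}$")
            i = best
        elif text[i].isalpha():
            # Unknown word: consume the whole alphabetic run, mark with '*'.
            j = i
            while j < len(text) and text[j].isalpha():
                j += 1
            out.append(f"^*{text[i:j]}$")
            i = j
        else:
            # Blank, newline or unknown punctuation: copy through, never drop.
            out.append(text[i])
            i += 1
    return "".join(out)


# Hypothetical lexicon: one multi-character abbreviation plus punctuation.
STANDARD = {"ал", "ш.б."}
INCONDITIONAL = {".", ",", "(", ")"}

if __name__ == "__main__":
    print(analyse("ал, алма ш.б.)\nал", STANDARD, INCONDITIONAL))
```

In this sketch "ал" inside "алма" is rejected because a letter follows it, "ш.б." wins over the shorter "." because the longest accepted match is kept, and the closing parenthesis and the newline both survive.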

Last edit: Francis Tyers 2013-03-18

hfst-proc-bug-153-2ndpatch.diff
