The bug comes from http://wiki.apertium.org/wiki/User:Firespeaker/HFST_bug. You can use the attached patch as a test case. In short: given a dictionary of three strings {word, formation, word form}, the input string {word formation} should be split into two tokens, but the proc parser strays into {word form} and, with input left over, fails to backtrack across the whitespace to the earlier valid string.
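To make the expected behaviour concrete, here is a minimal sketch of longest-match tokenization that remembers the last valid end position and falls back to it. This is not the hfst-proc code; the function name `tokenize`, the plain string set used as a lexicon, and the rule that a match is only valid at a whitespace boundary are all assumptions made for illustration.

```python
def tokenize(text, lexicon):
    """Left-to-right longest match with fallback to the last valid match.

    Hypothetical illustration only: `lexicon` is a plain set of strings,
    and a match counts as valid only at end of input or before whitespace.
    """
    tokens = []
    i = 0
    while i < len(text):
        if text[i].isspace():
            i += 1  # skip whitespace between tokens
            continue
        last_end = None  # end of the longest valid match seen so far
        for j in range(i + 1, len(text) + 1):
            candidate = text[i:j]
            # Stop as soon as no lexicon entry can continue this prefix.
            if not any(entry.startswith(candidate) for entry in lexicon):
                break
            # Remember the match only if it ends at a token boundary.
            if candidate in lexicon and (j == len(text) or text[j].isspace()):
                last_end = j
        if last_end is None:
            # Unknown word: consume up to the next whitespace.
            last_end = i
            while last_end < len(text) and not text[last_end].isspace():
                last_end += 1
        tokens.append(text[i:last_end])
        i = last_end  # resume from the remembered position (the backtrack)
    return tokens


lexicon = {"word", "formation", "word form"}
print(tokenize("word formation", lexicon))  # ['word', 'formation']
print(tokenize("word form", lexicon))       # ['word form']
```

The point of the sketch is the `last_end` variable: the scan may run on into "word form", but when that path dies at "word forma" the tokenizer resumes from the end of "word" instead of giving up.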
I'm having trouble applying the patch. Which directory should I apply it from?
This works better; it now prints out the space. It has not been tested.
This one works even better, but has a bug where newlines and full stops are lost.
Input:
Батыс шеті (46 27 ш.б.) Эльтон және Басқұншақ көлдері маңына, ал шығыс нүктесі 87 20 ш.б.
Бұқтырма өзенінің бастауына сай келеді.
Output:
^Батыс/Бат<v><iv><coop><imp><p2><sg>/Батыс<n><attr>/Батыс<n><nom>/Батыс<n><nom>+е<cop><p3><pl>/Батыс<n><nom>+е<cop><p3><sg>$ ^шеті/шет<n><px3sp><nom>/шет<n><px3sp><nom>+е<cop><p3><pl>/шет<n><px3sp><nom>+е<cop><p3><sg>$ ^(/(<lpar>$^46/46<num>/46<num><subst><nom>$ ^27/27<num>/27<num><subst><nom>$ ^ш/ш$^б./б.$ ^Эльтон/Эльтон$ ^және/және<cnjcoo>$ ^Басқұншақ/Басқұншақ$ ^көлдері/көл<n><pl><px3sp><nom>/көл<n><pl><px3sp><nom>+е<cop><p3><pl>/көл<n><pl><px3sp><nom>+е<cop><p3><sg>$ ^маңына/маң<n><px3sp><dat>$^,/,<cm>$ ^ал/ал<cnjcoo>/ал<v><tv><imp><p2><sg>/ал<vaux><imp><p2><sg>/алд<n><attr>/алд<n><nom>/алд<n><nom>+е<cop><p3><pl>/алд<n><nom>+е<cop><p3><sg>$ ^шығыс/шығыс<n><attr>/шығыс<n><nom>/шығыс<n><nom>+е<cop><p3><pl>/шығыс<n><nom>+е<cop><p3><sg>/шық<v><iv><coop><imp><p2><sg>$ ^нүктесі/нүкте<n><px3sp><nom>/нүкте<n><px3sp><nom>+е<cop><p3><pl>/нүкте<n><px3sp><nom>+е<cop><p3><sg>$ ^87/87<num>/87<num><subst><nom>$ ^20/20$ ^ш/ш$^б./б.$^Бұқтырма/Бұқтырма$ ^өзенінің/өзен<n><px3sp><gen>$ ^бастауына/баста<v><tv><ger><px3sp><dat>/баста<vaux><ger><px3sp><dat>$ ^сай/сай<adj>/сай<adj><subst><nom>/сай<adj><subst><nom>+е<cop><p3><pl>/сай<adj><subst><nom>+е<cop><p3><sg>$ 
^келеді/кел<v><iv><aor><p3><pl>/кел<v><iv><aor><p3><sg>/кел<vaux><aor><p3><pl>/кел<vaux><aor><p3><sg>$^./.<sent>$
(Note that the newline is lost before 'Бұқтырма', the full stop is lost between ш and б, and the closing parenthesis is lost before 'Эльтон'.)
I suspect the problem is how spaces and punctuation are dealt with. For example, in lttoolbox we have two different kinds of sections for the dictionary, "standard" and "inconditional" (unconditional strong acceptance state).
See the difference between "strong acceptance states" and "weak acceptance states" in:
http://www.dlsi.ua.es/~mlf/docum/garrido02p.pdf (p.4)
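As a rough illustration of that distinction, the sketch below shows how the two section types might differ when deciding whether a match can be emitted. It reflects my understanding of lttoolbox's behaviour in simplified form, not its actual code; the function `accepts` and the use of `str.isalpha` in place of the dictionary's declared alphabet are assumptions.

```python
def accepts(text, end, section):
    """Decide whether a dictionary match ending at index `end` may be emitted.

    Hypothetical simplification:
    - "inconditional" (strong acceptance): accept whatever follows.
    - "standard" (weak acceptance): accept only at end of input or when
      the next character is not alphabetic, i.e. at a token boundary.
    """
    if section == "inconditional":
        return True
    return end == len(text) or not text[end].isalpha()


text = "word formation"
print(accepts(text, 4, "standard"))       # True: "word" is followed by a space
print(accepts(text, 9, "standard"))       # False: "word form" is followed by "a"
print(accepts(text, 9, "inconditional"))  # True: accepted regardless
```

If punctuation such as the full stop or the closing parenthesis is only ever accepted weakly, or is consumed while the analyser is still trying to extend a longer match, it could be dropped in the way the output above shows.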
Last edit: Francis Tyers 2013-03-18