
#153 hfst-apertium-proc confuses multiwords with partial matches

future
open
nobody
proc (7)
1
2013-03-18
2013-01-19
No

This bug comes from http://wiki.apertium.org/wiki/User:Firespeaker/HFST_bug. You can use the attached patch as a test case. In short: given a dictionary of three strings {word, formation, word form}, the input string {word formation} should be split into two tokens, but the proc parser goes astray into {word form} and, with input left over, fails to backtrack across the whitespace to the earlier valid match {word}.
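The expected behaviour can be illustrated with a minimal Python sketch. This is not the hfst-apertium-proc code; the function name `tokenize` and the word-boundary test (`str.isalnum`) are assumptions made for illustration only. The point is that the scanner remembers the last match that ended on a word boundary, so the partial multiword match {word form} does not win over {word}.

```python
def tokenize(text, lexicon):
    """Longest-match tokenization that falls back to the last valid match.

    Hypothetical illustration only; not the hfst-apertium-proc algorithm.
    """
    tokens = []
    max_len = max(map(len, lexicon))
    i = 0
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        last_good = None  # end index of the longest boundary-valid match so far
        for j in range(i + 1, min(len(text), i + max_len) + 1):
            # A match only counts if it ends at the end of input or is
            # followed by a non-alphanumeric character (assumed boundary rule).
            at_boundary = j == len(text) or not text[j].isalnum()
            if text[i:j] in lexicon and at_boundary:
                last_good = j
        if last_good is None:
            # Unknown material: consume up to the next whitespace.
            j = i
            while j < len(text) and not text[j].isspace():
                j += 1
            last_good = j
        tokens.append(text[i:last_good])
        i = last_good
    return tokens


lexicon = {"word", "formation", "word form"}
print(tokenize("word formation", lexicon))
print(tokenize("word form", lexicon))
```

With this rule, {word form} is rejected in "word formation" because it is followed by "ation", while it is still returned as a single token when the input is exactly "word form".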

1 Attachments

Discussion

  • Jonathan

    Jonathan - 2013-02-21

    I'm having trouble applying the patch. Which directory should I apply it from?

     
  • Francis Tyers

    Francis Tyers - 2013-03-17

    This works better: it now prints out the space. It has not been tested.

     
  • Francis Tyers

    Francis Tyers - 2013-03-18

    This one works even better, but it has a bug: it loses newlines and full stops.

    Input:

    Батыс шеті (46 27 ш.б.) Эльтон және Басқұншақ көлдері маңына, ал шығыс нүктесі 87 20 ш.б.
    Бұқтырма өзенінің бастауына сай келеді.

    Output:

    ^Батыс/Бат<v><iv><coop><imp><p2><sg>/Батыс<n><attr>/Батыс<n><nom>/Батыс<n><nom>+е<cop><p3><pl>/Батыс<n><nom>+е<cop><p3><sg>$ ^шеті/шет<n><px3sp><nom>/шет<n><px3sp><nom>+е<cop><p3><pl>/шет<n><px3sp><nom>+е<cop><p3><sg>$ ^(/(<lpar>$^46/46<num>/46<num><subst><nom>$ ^27/27<num>/27<num><subst><nom>$ ^ш/ш$^б./б.$ ^Эльтон/Эльтон$ ^және/және<cnjcoo>$ ^Басқұншақ/Басқұншақ$ ^көлдері/көл<n><pl><px3sp><nom>/көл<n><pl><px3sp><nom>+е<cop><p3><pl>/көл<n><pl><px3sp><nom>+е<cop><p3><sg>$ ^маңына/маң<n><px3sp><dat>$^,/,<cm>$ ^ал/ал<cnjcoo>/ал<v><tv><imp><p2><sg>/ал<vaux><imp><p2><sg>/алд<n><attr>/алд<n><nom>/алд<n><nom>+е<cop><p3><pl>/алд<n><nom>+е<cop><p3><sg>$ ^шығыс/шығыс<n><attr>/шығыс<n><nom>/шығыс<n><nom>+е<cop><p3><pl>/шығыс<n><nom>+е<cop><p3><sg>/шық<v><iv><coop><imp><p2><sg>$ ^нүктесі/нүкте<n><px3sp><nom>/нүкте<n><px3sp><nom>+е<cop><p3><pl>/нүкте<n><px3sp><nom>+е<cop><p3><sg>$ ^87/87<num>/87<num><subst><nom>$ ^20/20$ ^ш/ш$^б./б.$^Бұқтырма/Бұқтырма$ ^өзенінің/өзен<n><px3sp><gen>$ ^бастауына/баста<v><tv><ger><px3sp><dat>/баста<vaux><ger><px3sp><dat>$ ^сай/сай<adj>/сай<adj><subst><nom>/сай<adj><subst><nom>+е<cop><p3><pl>/сай<adj><subst><nom>+е<cop><p3><sg>$ 
^келеді/кел<v><iv><aor><p3><pl>/кел<v><iv><aor><p3><sg>/кел<vaux><aor><p3><pl>/кел<vaux><aor><p3><sg>$^./.<sent>$

    (Note that the newline is lost before 'Бұқтырма', the full stop is lost between ш and б, and the closing parenthesis is lost before 'Эльтон'.)

    I suspect the problem is how spaces and punctuation are dealt with. For example, in lttoolbox we have two different kinds of sections for the dictionary, "standard" and "inconditional" (unconditional strong acceptance state).

    See the difference between "strong acceptance states" and "weak acceptance states" in:

    http://www.dlsi.ua.es/~mlf/docum/garrido02p.pdf (p.4)
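As a rough sketch of that distinction, under my reading of it: an entry from a "standard" section is only accepted when the following character is not alphabetic (weak acceptance), whereas an entry from an "inconditional" section is accepted whatever follows (strong acceptance), which is what lets punctuation be tokenized when it is glued to the next word. The function names and the dictionary layout below are hypothetical and are not the lttoolbox or HFST API.

```python
def is_accepted(section, next_char):
    """Decide whether a match may be accepted given the following character.

    Assumed rule: 'inconditional' entries are always accepted; 'standard'
    entries need end of input or a non-alphabetic character after them.
    """
    if section == "inconditional":
        return True
    return next_char is None or not next_char.isalpha()


def longest_match(text, start, lexicon):
    """Return the end index of the longest accepted match at `start`, or None.

    `lexicon` maps a surface form to its section type (hypothetical layout).
    """
    best = None
    for surface, section in lexicon.items():
        if text.startswith(surface, start):
            end = start + len(surface)
            next_char = text[end] if end < len(text) else None
            if is_accepted(section, next_char) and (best is None or end > best):
                best = end
    return best


lexicon = {"word": "standard", ".": "inconditional", ")": "inconditional"}
print(longest_match("word).", 0, lexicon))  # "word" is followed by ")"
print(longest_match("wordy", 0, lexicon))   # "word" is followed by a letter
print(longest_match(".x", 0, lexicon))      # "." is accepted unconditionally
```

If the full stop and the parentheses were handled like standard entries, a case such as "ш.б.)" could plausibly lose them in the way shown in the output above, though that is only a guess at the cause.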

     

    Last edit: Francis Tyers 2013-03-18