Helsinki Finite-State Technology / Bugs / #228 hfst-ospell runs forever on hyper-minimised fst's

Flammie Pirinen - 2014-03-12

It just crossed my mind that we also made minimisation optional in the spelling part, because it wouldn't finish for some languages. Maybe this is another thing to look at.

Other issues with hyperminimisation:

In the GTDivvun infrastructure, under langs/sma/, the following build segfaults:

cd GTLANGS/sma
./autogen.sh
./configure --with-hfst --without-xfst --enable-spellers
make HFST_LEXC_FLAGS=-F

It segfaults when building analyser-gt-desc.hfst:

Making all in .
  RGX2FST  analyser-gt-desc.tmp.hfst
/bin/sh: line 1: 67290 Done
/usr/bin/printf "\
    @\"filters/remove-derivation-position-tags.hfst\" \
.o. @\"filters/remove-dialect-tags.hfst\" \
.o. @\"filters/remove-homonymy-tags.hfst\" \
.o. @\"filters/remove-variant-tags.hfst\" \
.o. @\"filters/remove-norm-comp-tags.hfst\" \
.o. @\"filters/remove-number-string-tags.hfst\" \
.o. @\"filters/remove-usage-tags.hfst\" \
.o. @\"filters/remove-semantic-tags.hfst\" \
.o. @\"filters/remove-orig_lang-tags.hfst\" \
.o. @\"filters/remove-orthography-tags.hfst\" \
.o. @\"filters/remove-Orth_IPA-strings.hfst\" \
.o. @\"generator-raw-gt-desc.hfst\" \
.o. @\"filters/remove-hyphenation-marks.hfst\" \
.o. @\"filters/remove-infl_deriv-borders.hfst\" \
.o. @\"filters/remove-word-boundary.hfst\" \
.o. @\"orthography/inituppercase.hfst\" \
.o. @\"orthography/spellrelax.hfst\" ;"
     67291 Segmentation fault: 11  | /usr/local/bin/hfst-regexp2fst -S --harmonize-flags > analyser-gt-desc.tmp.hfst
make[2]: *** [analyser-gt-desc.tmp.hfst] Error 139
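
For reference, the "Error 139" that make reports is consistent with the segfault line above it: a POSIX shell reports 128 plus the signal number when a command is killed by a signal, and 139 - 128 = 11, which is SIGSEGV. A minimal sketch of that arithmetic follows; describe_status is a hypothetical helper name used only for illustration, not part of HFST or the GT build.

```shell
# describe_status STATUS
# Hypothetical helper: interpret a shell exit status.
# Statuses above 128 conventionally mean "killed by signal (STATUS - 128)".
describe_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        echo "killed by signal $((status - 128))"
    else
        echo "exited with status $status"
    fi
}

describe_status 139   # prints: killed by signal 11
```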

When building yrk for spellers, the build seems to go fine, but when testing the speller zhfst file, hfst-ospell segfaults. To reproduce:

cd GTLANGS/yrk
./autogen.sh
./configure --with-hfst --without-xfst --enable-spellers
make HFST_LEXC_FLAGS=-F
make check

The speller test ends as follows:

Making check in spellcheckers
/Applications/Xcode.app/Contents/Developer/usr/bin/make  check-TESTS
Following metadata was read from ZHFST archive:
locale: yrk
version: GT_VERSION [vcsrev: GT_REVISION]
date: DATE
producer: Giellatekno/Divvun/UiT contributors[email: <feedback@divvun.no>, website: <http://divvun.no>]
title [yrk]: Giellatekno/Divvun/UiT fst-based speller for Nenets
description [yrk]: This is an fst-based speller for Nenets. It is based
    on the normative subset of the morphological analyser for Nenets.
    The source code can be found at:
    https://victorio.uit.no/langtech/trunk/langs/yrk/
    License: GPL3+.
acceptor[default.] [id: acceptor.default.hfst, type: generaltrtype: ]
title [yrk]: Giellatekno/Divvun/UiT dictionary Nenets
description[yrk]: Giellatekno/Divvun/UiT dictionary for
    Nenets compiled for HFST.
errmodel[default.] [id: errmodel.default.hfst]
title [yrk]: Levenshtein edit distance transducer
description[yrk]: Correction model for keyboard misstrokes, at most 2 per
    word.
type: default
model: errormodel.default.hfst

./test-zhfst-file.sh: line 19: 67137 Done                    echo nuvviDspeller
     67138 Segmentation fault: 11  | ${HFST_OSPELL} -v ${ZHFSTFILE}
FAIL: test-zhfst-file.sh

Followup comments from the creator of the bug (I had forgotten to log in when I created it):

Using the latest hfst code (revision 3974), these problems seem to be mostly solved when the lexicon is compiled with both the -F and the -M flags of hfst-lexc (i.e. hyperminimisation and flag minimisation). The KAL test above now runs as follows:

$ time echo illuklu | hfst-ospell tools/spellcheckers/fstbased/hfst/kl.zhfst 
"illuklu" is NOT in the lexicon:
Corrections for "illuklu":
illualu    1.000000
illukulu    1.000000
illullu    1.000000
illulu    1.000000
illuilu    1.000000
illuku    1.000000
illukilu    1.000000
qillukilu    2.000000
qilluilu    2.000000
...

real    0m11.666s
user    0m11.582s
sys 0m0.032s

KAL (as well as SMA and YRK below) was built using:

$ make HFST_LEXC_FLAGS="-F -M"

The configuration was as reported in the original bug report.

The problem is solved to the extent that the speller now works and returns the expected output. It is also solved in the sense that the fst files are of a manageable size: the acceptor is 14 MB, the error model is now 8.3 MB, and the zhfst file is just 4.1 MB.

It is NOT solved, though, in the sense that the speller is still not usable: waiting more than 11 seconds to get suggestions is way too long. So further work needs to be done to speed up the speller.

It is interesting that speed is not an issue for the case reported in the second comment: SMA.

SMA now compiles without issues, and running the speller is definitely much faster than for KAL:

$ time echo gielemes | hfst-ospell tools/spellcheckers/fstbased/hfst/sma.zhfst 
"gielemes" is NOT in the lexicon:
Corrections for "gielemes":
dielemes    1.000000
giefemes    1.000000
jielemes    1.000000
gielemse    1.000000
gïeleles    1.500000
gïelemen    1.500000
gïesemes    1.500000
gïllemes    1.500000
gïelemse    1.500000
gïeleme    1.500000
gïelemem    1.500000
gielteme    2.000000
...

real    0m1.391s
user    0m1.356s
sys 0m0.018s

For shorter input strings the response time is less than a second, which is quite OK for most users.
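
To check response times for more than one test word, something like the following could be used. It is only a sketch: slow_words is a hypothetical helper, not part of hfst-ospell, and it measures in whole seconds with date +%s, so it will only catch the really slow cases such as the KAL one above.

```shell
# slow_words LIMIT CMD [ARGS...]
# Hypothetical helper: read one word per line from stdin, feed each word to
# CMD on its own, and print the words that took more than LIMIT whole seconds.
slow_words() {
    limit=$1
    shift
    while IFS= read -r word; do
        start=$(date +%s)
        # The command's own output and exit status are ignored here;
        # only the elapsed wall-clock time matters.
        printf '%s\n' "$word" | "$@" >/dev/null 2>&1 || true
        end=$(date +%s)
        elapsed=$((end - start))
        if [ "$elapsed" -gt "$limit" ]; then
            printf '%s\t%ss\n' "$word" "$elapsed"
        fi
    done
}
```

With a word list in a file (words.txt is an assumed name), usage would be along the lines of:

    slow_words 1 hfst-ospell tools/spellcheckers/fstbased/hfst/kl.zhfst < words.txt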

For YRK, the bug is also fixed; the speller now runs without a segmentation fault:

Making check in spellcheckers
/Applications/Xcode.app/Contents/Developer/usr/bin/make  check-TESTS
Following metadata was read from ZHFST archive:
locale: yrk
version: GT_VERSION [vcsrev: GT_REVISION]
date: DATE
producer: Giellatekno/Divvun/UiT contributors[email: <feedback@divvun.no>, website: <http://divvun.no>]
title [yrk]: Giellatekno/Divvun/UiT fst-based speller for Nenets
description [yrk]: This is an fst-based speller for Nenets. It is based
    on the normative subset of the morphological analyser for Nenets.
    The source code can be found at:
    https://victorio.uit.no/langtech/trunk/langs/yrk/
    License: GPL3+.
acceptor[default.] [id: acceptor.default.hfst, type: generaltrtype: ]
title [yrk]: Giellatekno/Divvun/UiT dictionary Nenets
description[yrk]: Giellatekno/Divvun/UiT dictionary for
    Nenets compiled for HFST.
errmodel[default.] [id: errmodel.default.hfst]
title [yrk]: Levenshtein edit distance transducer
description[yrk]: Correction model for keyboard misstrokes, at most 2 per
    word.
type: default
model: errormodel.default.hfst

Printing only 0 top suggestions per line
"nuvviDspeller" is NOT in the lexicon:
Corrections for "nuvviDspeller":
Divvun speller for Nenets    0.010000
yrk version 0.1, 2014.08.21, rev98559    0.020000
Built using HFST 3.7.1, rev3968    0.030000

PASS: test-zhfst-file.sh

The speed of the YRK speller also seems to be just fine, in line with SMA rather than KAL.

Conclusion: hyperminimisation together with the recently introduced flag minimisation seems to be stable now, producing working analysers and spellers. There is still a speed issue, but only with KAL.
