#228 hfst-ospell runs forever on hyper-minimised fst's

Milestone: future
Status: open
Owner: nobody
Labels: None
Priority: 1
Updated: 2014-08-21
Created: 2014-03-07
Creator: Anonymous
Private: No

Steps to reproduce:

  1. build hfst-ospell @HEAD (revision 3791 tested)
  2. build $GTHOME/langs/sme/, with ./configure --without-xfst --with-hfst --enable-spellers
  3. time echo illullu | ./hfst-ospell $GTHOME/langs/sme/tools/spellcheckers/fstbased/hfst/se.zhfst
  4. build $GTHOME/langs/kal/, with ./configure --without-xfst --with-hfst --enable-spellers,
    then run make as follows to turn hyper-minimisation on: make HFST_LEXC_FLAGS=-F
  5. time echo illuklu | ./hfst-ospell $GTHOME/langs/kal/tools/spellcheckers/fstbased/hfst/kl.zhfst

Note that sme returns in < 0.5 sec, whereas kal never returns (I waited more than 15 minutes).
The only difference is the hyper-minimisation.
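
Since step 5 never returns, it can help to put a time limit on the reproduction instead of waiting by hand. The following is a minimal sketch, not part of the original report: `run_limited` is a hypothetical helper name, and it assumes the command exits on SIGTERM. It is written in plain sh because the `timeout` utility is not installed by default on Mac OS X.

```shell
#!/bin/sh
# run_limited SECONDS CMD [ARGS...]   (hypothetical helper, not part of hfst-ospell)
# Runs CMD with the caller's stdin and sends it SIGTERM after SECONDS.
# Returns 124 if CMD was terminated by SIGTERM, otherwise CMD's own status.
run_limited() {
    limit=$1; shift
    # Duplicate stdin to fd 3 so the background command can still read it
    # (shells otherwise attach /dev/null to the stdin of background jobs).
    { "$@" <&3 3<&- & cmd_pid=$!; } 3<&0
    # Watchdog: sleep for the limit, then terminate the command.
    ( sleep "$limit"; kill "$cmd_pid" 2>/dev/null ) >/dev/null 2>&1 &
    dog_pid=$!
    rl_status=0
    wait "$cmd_pid" 2>/dev/null || rl_status=$?
    # The command is done; stop the watchdog if it is still sleeping.
    kill "$dog_pid" 2>/dev/null || true
    wait "$dog_pid" 2>/dev/null || true
    # 143 = 128 + 15 (SIGTERM): assume the watchdog fired.
    if [ "$rl_status" -eq 143 ]; then
        return 124
    fi
    return "$rl_status"
}
```

With this in place, step 5 could be run as `echo illuklu | run_limited 60 ./hfst-ospell $GTHOME/langs/kal/tools/spellcheckers/fstbased/hfst/kl.zhfst`, where the 60-second limit is an arbitrary choice.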

OS: MacOSX 10.9.2

Both SME and KAL are svn trunk@HEAD.

Discussion

  • Flammie Pirinen - 2014-03-12

    It just crossed my mind that we also made minimisation optional in the speller part of the build because it wouldn't finish for some languages; maybe this is another thing to look at.

     
  • sjurum - 2014-04-22

    Other issues with hyperminimisation:

    In the GTDivvun infrastructure (langs/sma/), the following build segfaults:

    cd GTLANGS/sma
    ./autogen.sh
    ./configure --with-hfst --without-xfst --enable-spellers
    make HFST_LEXC_FLAGS=-F
    

    It segfaults when building the analyser-gt-desc.hfst:

    Making all in .
      RGX2FST  analyser-gt-desc.tmp.hfst
    /bin/sh: line 1: 67290 Done
    /usr/bin/printf "\
        @\"filters/remove-derivation-position-tags.hfst\" \
    .o. @\"filters/remove-dialect-tags.hfst\" \
    .o. @\"filters/remove-homonymy-tags.hfst\" \
    .o. @\"filters/remove-variant-tags.hfst\" \
    .o. @\"filters/remove-norm-comp-tags.hfst\" \
    .o. @\"filters/remove-number-string-tags.hfst\" \
    .o. @\"filters/remove-usage-tags.hfst\" \
    .o. @\"filters/remove-semantic-tags.hfst\" \
    .o. @\"filters/remove-orig_lang-tags.hfst\" \
    .o. @\"filters/remove-orthography-tags.hfst\" \
    .o. @\"filters/remove-Orth_IPA-strings.hfst\" \
    .o. @\"generator-raw-gt-desc.hfst\" \
    .o. @\"filters/remove-hyphenation-marks.hfst\" \
    .o. @\"filters/remove-infl_deriv-borders.hfst\" \
    .o. @\"filters/remove-word-boundary.hfst\" \
    .o. @\"orthography/inituppercase.hfst\" \
    .o. @\"orthography/spellrelax.hfst\" ;"
         67291 Segmentation fault: 11  | /usr/local/bin/hfst-regexp2fst -S --harmonize-flags > analyser-gt-desc.tmp.hfst
    make[2]: *** [analyser-gt-desc.tmp.hfst] Error 139
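
The "Error 139" reported by make matches the "Segmentation fault: 11" line above it: by shell convention, a process killed by signal N is reported with exit status 128 + N, and 139 - 128 = 11 (SIGSEGV). A small sketch of that decoding follows; `describe_status` is a hypothetical helper name, not part of the build system.

```shell
#!/bin/sh
# describe_status STATUS   (hypothetical helper)
# Interprets an exit status using the shell convention that values above 128
# mean "terminated by signal STATUS - 128".
describe_status() {
    if [ "$1" -gt 128 ]; then
        echo "killed by signal $(($1 - 128))"
    else
        echo "exited with status $1"
    fi
}

describe_status 139   # prints: killed by signal 11
```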
    

    When building yrk for spellers, the build seems to go fine, but when testing the speller zhfst file, hfst-ospell segfaults. To reproduce:

    cd GTLANGS/yrk
    ./autogen.sh
    ./configure --with-hfst --without-xfst --enable-spellers
    make HFST_LEXC_FLAGS=-F
    make check
    

    The speller test ends as follows:

    Making check in spellcheckers
    /Applications/Xcode.app/Contents/Developer/usr/bin/make  check-TESTS
    Following metadata was read from ZHFST archive:
    locale: yrk
    version: GT_VERSION [vcsrev: GT_REVISION]
    date: DATE
    producer: Giellatekno/Divvun/UiT contributors[email: <feedback@divvun.no>, website: <http://divvun.no>]
    title [yrk]: Giellatekno/Divvun/UiT fst-based speller for Nenets
    description [yrk]: This is an fst-based speller for Nenets. It is based
        on the normative subset of the morphological analyser for Nenets.
        The source code can be found at:
        https://victorio.uit.no/langtech/trunk/langs/yrk/
        License: GPL3+.
    acceptor[default.] [id: acceptor.default.hfst, type: generaltrtype: ]
    title [yrk]: Giellatekno/Divvun/UiT dictionary Nenets
    description[yrk]: Giellatekno/Divvun/UiT dictionary for
        Nenets compiled for HFST.
    errmodel[default.] [id: errmodel.default.hfst]
    title [yrk]: Levenshtein edit distance transducer
    description[yrk]: Correction model for keyboard misstrokes, at most 2 per
        word.
    type: default
    model: errormodel.default.hfst
    
    ./test-zhfst-file.sh: line 19: 67137 Done                    echo nuvviDspeller
         67138 Segmentation fault: 11  | ${HFST_OSPELL} -v ${ZHFSTFILE}
    FAIL: test-zhfst-file.sh
    
     
  • sjurum - 2014-08-21

    Followup comments from the creator of the bug (I had forgotten to log in when I created it):

    Using the latest hfst code (revision 3974), these problems seem to be mostly solved when the lexicon is compiled with both the -F and the -M flags of hfst-lexc (i.e. hyperminimisation and flag minimisation). The KAL test above now runs as follows:

    $ time echo illuklu | hfst-ospell tools/spellcheckers/fstbased/hfst/kl.zhfst 
    "illuklu" is NOT in the lexicon:
    Corrections for "illuklu":
    illualu    1.000000
    illukulu    1.000000
    illullu    1.000000
    illulu    1.000000
    illuilu    1.000000
    illuku    1.000000
    illukilu    1.000000
    qillukilu    2.000000
    qilluilu    2.000000
    ...
    
    real    0m11.666s
    user    0m11.582s
    sys     0m0.032s
    

    KAL (and SMA, YRK below) was built using:

    $ make HFST_LEXC_FLAGS="-F -M"
    

    The configuration was as reported in the original bug report.

    The problem is solved to the extent that the speller now works and returns the expected output. It is also solved in the sense that the fst files are of a manageable size: the acceptor is 14 MB, the error model is now 8.3 MB, and the zhfst file is just 4.1 MB.
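
For anyone wanting to compare file sizes across builds, here is a minimal sketch of how such figures can be obtained. `size_mb` is a hypothetical helper name, and it assumes 1 MB = 1048576 bytes, which may differ from the convention used for the numbers above.

```shell
#!/bin/sh
# size_mb FILE   (hypothetical helper)
# Prints the size of FILE in megabytes with one decimal,
# assuming 1 MB = 1048576 bytes.
size_mb() {
    bytes=$(wc -c < "$1")
    awk -v b="$bytes" 'BEGIN { printf "%.1f\n", b / 1048576 }'
}
```

Usage would be along the lines of `size_mb tools/spellcheckers/fstbased/hfst/kl.zhfst`, run from the language directory.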

    It is NOT solved, though, in the sense that the speller is still not usable: waiting more than 11 seconds to get suggestions is way too long. So further work needs to be done to speed up the speller.

    It is interesting that speed is not an issue for the case reported in the second comment: SMA.

    SMA now compiles without issues, and running the speller is definitely much faster than for KAL:

    $ time echo gielemes | hfst-ospell tools/spellcheckers/fstbased/hfst/sma.zhfst 
    "gielemes" is NOT in the lexicon:
    Corrections for "gielemes":
    dielemes    1.000000
    giefemes    1.000000
    jielemes    1.000000
    gielemse    1.000000
    gïeleles    1.500000
    gïelemen    1.500000
    gïesemes    1.500000
    gïllemes    1.500000
    gïelemse    1.500000
    gïeleme    1.500000
    gïelemem    1.500000
    gielteme    2.000000
    ...
    
    real    0m1.391s
    user    0m1.356s
    sys     0m0.018s
    

    For shorter input strings the response time is less than a second, which is quite ok for most users.
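
To check response times for several input words at once, something like the following sketch could be used. `time_words` is a hypothetical helper name; it measures whole seconds only (via `date +%s`), so it is suited to spotting multi-second delays like the KAL case rather than sub-second differences.

```shell
#!/bin/sh
# time_words BUDGET_SECONDS CMD [ARGS...]   (hypothetical helper)
# Reads one word per line from stdin, feeds each word to CMD on its stdin,
# and prints "word<TAB>elapsed seconds<TAB>ok|slow" for each word.
# Elapsed time has whole-second resolution only.
time_words() {
    budget=$1; shift
    while IFS= read -r word; do
        start=$(date +%s)
        # The speller's own output is discarded; only the timing is kept.
        printf '%s\n' "$word" | "$@" >/dev/null 2>&1 || true
        end=$(date +%s)
        elapsed=$((end - start))
        if [ "$elapsed" -le "$budget" ]; then
            verdict=ok
        else
            verdict=slow
        fi
        printf '%s\t%s\t%s\n' "$word" "$elapsed" "$verdict"
    done
}
```

For example, `printf 'gielemes\n' | time_words 1 hfst-ospell tools/spellcheckers/fstbased/hfst/sma.zhfst` would flag any word that takes noticeably longer than a second; the one-second budget is an assumption based on the response times discussed above.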

    For YRK, the bug is also fixed: there is no segmentation fault when running the speller:

    Making check in spellcheckers
    /Applications/Xcode.app/Contents/Developer/usr/bin/make  check-TESTS
    Following metadata was read from ZHFST archive:
    locale: yrk
    version: GT_VERSION [vcsrev: GT_REVISION]
    date: DATE
    producer: Giellatekno/Divvun/UiT contributors[email: <feedback@divvun.no>, website: <http://divvun.no>]
    title [yrk]: Giellatekno/Divvun/UiT fst-based speller for Nenets
    description [yrk]: This is an fst-based speller for Nenets. It is based
        on the normative subset of the morphological analyser for Nenets.
        The source code can be found at:
        https://victorio.uit.no/langtech/trunk/langs/yrk/
        License: GPL3+.
    acceptor[default.] [id: acceptor.default.hfst, type: generaltrtype: ]
    title [yrk]: Giellatekno/Divvun/UiT dictionary Nenets
    description[yrk]: Giellatekno/Divvun/UiT dictionary for
        Nenets compiled for HFST.
    errmodel[default.] [id: errmodel.default.hfst]
    title [yrk]: Levenshtein edit distance transducer
    description[yrk]: Correction model for keyboard misstrokes, at most 2 per
        word.
    type: default
    model: errormodel.default.hfst
    
    Printing only 0 top suggestions per line
    "nuvviDspeller" is NOT in the lexicon:
    Corrections for "nuvviDspeller":
    Divvun speller for Nenets    0.010000
    yrk version 0.1, 2014.08.21, rev98559    0.020000
    Built using HFST 3.7.1, rev3968    0.030000
    
    PASS: test-zhfst-file.sh
    

    The speed of the YRK speller also seems to be fine, in line with SMA rather than KAL.

    Conclusion: hyperminimisation together with the recently introduced flag minimisation seems to be stable now, producing working analysers and spellers. There is still a speed issue, but only with KAL.