#12 Spellchecker efficiency

Group: v1.0 (example)
Status: closed-out-of-date
Owner: nobody
Labels: None
Priority: 5
Updated: 2014-10-17
Created: 2006-03-13
Creator: Anonymous
Private: No

Hi,

With larger lexica (several hundred thousand words), there are
efficiency problems. Profiling information suggests that
the data structures and algorithms could be improved.
It might be worthwhile to read up on the topic.
(Sorry that this is rather vague: I didn't profile the
code myself (colleagues did), but I work with the programme very
often, and I am a computational linguist, i.e. aware
of some of the issues.)

Discussion

  • Logged In: YES
    user_id=726595

    Hi,

    I have measured only an enormous time (5 s) for 320 000 words
    (Hebrew dictionary), and long suggestion times.
    A simple spell() call is reasonably fast.

    I can suggest some possible optimizations:

    - use affix compression (see src/tools/munch)
    - use alias compression (make alias-compressed aff and dic
    files with the src/tools/makealias script)
    - set a bigger hash size in the first line of the dic file
    - don't use suggestions, or switch off n-gram suggestions with
    the MAXNGRAMSUGS 0 affix file parameter
    - use twofold-suffix compression (unfortunately, I haven't
    implemented the right tool for it yet)
    - try the spell checker of Vim, or Aspell. Vim spell and Aspell
    use different (perhaps faster) algorithms (but they don't
    support twofold-suffix compression yet).
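
    Two of the suggestions above are plain edits to the dictionary
    files, so they can be scripted. The following is a minimal sketch,
    not part of Hunspell: the helper names set_dic_word_count and
    disable_ngram_suggestions are hypothetical, the sample file
    contents are made up, and the code assumes the first line of the
    dic file is the numeric word count used to size the hash table.
    Reading and writing the real files (in their own encoding) is left
    to the caller.

```python
# Sketch only: these helpers are hypothetical and not shipped with Hunspell.

def set_dic_word_count(dic_text: str, count: int) -> str:
    """Replace the first line of a .dic file with `count`.

    Assumes the first line is the (approximate) word count, which the
    thread says controls the hash size.
    """
    first, sep, rest = dic_text.partition("\n")
    if not first.strip().isdigit():
        raise ValueError("first line of the .dic text is not a word count")
    return f"{count}{sep}{rest}"


def disable_ngram_suggestions(aff_text: str) -> str:
    """Set MAXNGRAMSUGS to 0, replacing an existing line or appending one."""
    lines = aff_text.splitlines()
    found = False
    for i, line in enumerate(lines):
        # Match the directive by its first whitespace-separated token.
        if line.split()[:1] == ["MAXNGRAMSUGS"]:
            lines[i] = "MAXNGRAMSUGS 0"
            found = True
    if not found:
        lines.append("MAXNGRAMSUGS 0")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Made-up sample contents, for illustration only.
    dic = "3\nfoo\nbar/S\nbaz\n"
    aff = "SET UTF-8\nMAXNGRAMSUGS 4\n"
    print(set_dic_word_count(dic, 400000), end="")
    print(disable_ngram_suggestions(aff), end="")
```

    Whether a larger count actually helps depends on the dictionary,
    so it is worth timing loading and suggestion separately before and
    after the change.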

    I also plan an improved version with some optimizations.
    Was loading time or other run-time performance (suggestion)
    the bigger problem for you?

    Many thanks for your report.

    Best regards,

    Laci

     
  • Logged In: YES
    user_id=726595

    > I have measured only enormous time (5 s) for 320 000 words

    time = loading time of dictionary. Sorry.

    Laci

     
  • Eleonora
    2008-09-23

    Current versions of Aspell do handle twofold suffix compression (as of September 2008).

     
  • > Aspell current versions do handle twofold suffix compression

    That is good news, thanks. I believe it will be a big help for dictionary developers of agglutinative languages.

     
  • Does it make sense to keep this report open?

     
    • status: open --> closed-out-of-date
    • Group: --> v1.0 (example)