Hi,
With larger lexica (several hundred thousand words), there are
efficiency problems. Profiling data suggest that the data
structures and algorithms could be improved.
It might be worthwhile to read up on the topic.
(Sorry that this is rather vague: I didn't profile the
code myself (colleagues did), but I work with the program
very often, and as a computational linguist I am aware
of some of the issues.)
Hi,
I have measured only one enormous time (5 s) for 320,000 words
(a Hebrew dictionary), and long suggestion times.
A simple spell() call is reasonably fast.
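To see which phase dominates, here is a minimal timing sketch; it
assumes the current Hunspell C++ API in hunspell.hxx (older releases
expose a char***-based suggest() instead), and the dictionary file
names are only placeholders:

    #include <hunspell.hxx>   // Hunspell C++ API; link with -lhunspell
    #include <chrono>
    #include <cstdio>
    #include <string>
    #include <vector>

    int main() {
        using clk = std::chrono::steady_clock;
        auto ms = [](clk::time_point a, clk::time_point b) {
            return (long long)std::chrono::duration_cast<
                std::chrono::milliseconds>(b - a).count();
        };

        // Loading: the constructor parses the .aff/.dic pair and
        // builds the in-memory hash table (placeholder file names).
        auto t0 = clk::now();
        Hunspell h("he_IL.aff", "he_IL.dic");
        auto t1 = clk::now();

        // spell(): essentially one hash lookup plus affix stripping,
        // so it stays fast even with a very large lexicon.
        bool ok = h.spell("word");
        auto t2 = clk::now();

        // suggest(): the expensive path; with n-gram suggestions
        // enabled it scans the whole dictionary.
        std::vector<std::string> sugs = h.suggest("word");
        auto t3 = clk::now();

        std::printf("load %lld ms, spell %lld ms (ok=%d), "
                    "suggest %lld ms (%zu suggestions)\n",
                    ms(t0, t1), ms(t1, t2), (int)ok,
                    ms(t2, t3), sugs.size());
    }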
I can suggest some possible optimizations (see the sketches after
this list):
- use affix compression (see src/tools/munch);
- use alias compression (make alias-compressed .aff and .dic
files with the src/tools/makealias script);
- set a bigger hash size in the first line of the .dic file;
- don't use suggestions, or switch off n-gram suggestions with the
MAXNGRAMSUGS 0
affix file parameter;
- use twofold suffix compression (unfortunately, I haven't
implemented the right tool for it yet);
- try the spell checkers of Vim or Aspell. Vim spell and Aspell
use different (perhaps faster) algorithms (but they don't
support twofold suffix compression yet).
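To make some of these concrete, here is what the relevant lines can
look like. The word count, file names, and flags are only examples,
and the munch/makealias invocations are assumed, so please check the
usage messages of the in-tree tools for the exact arguments:

    # The first line of the .dic file is the hash table size hint;
    # declare at least the real word count for a 320,000-word list:
    320000
    ...320,000 entries...

    # In the .aff file, switch off the slow n-gram suggestion pass:
    MAXNGRAMSUGS 0

    # Twofold suffix compression in the .aff file: a suffix rule can
    # carry continuation flags after a slash, so a second suffix
    # class applies to its output (flags A and B are made up):
    SFX A Y 1
    SFX A 0 im/B .
    SFX B Y 1
    SFX B 0 a .

    # Affix and alias compression (assumed invocations):
    src/tools/munch words.txt he.aff > he.dic
    src/tools/makealias he.dic he.aff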
I also plan an improved version with some optimizations.
Was loading time or other run-time performance (suggestions)
the bigger problem for you?
Many thanks for your report.
Best regards,
Laci
> I have measured only one enormous time (5 s) for 320,000 words
By "time" I meant the loading time of the dictionary. Sorry.
Laci
Update (Sept 2008): current versions of Aspell do handle twofold suffix compression.
> current versions of Aspell do handle twofold suffix compression
That is good news, thank you. I believe it will be a big help for dictionary developers of agglutinative languages.
Does it make sense to keep this report open?