HunSpell, Perl and generic morphology

  • Amir E. Aharoni

    Amir E. Aharoni - 2009-12-06


    I have been looking for a Free Software program that can identify the part of speech and grammatical status of a word and that works generically for as many languages as possible. For example, given the word "parla" in a Catalan text, the program would return:

    1. verb, imperative, 2 sg.
    2. verb, present, 3 sg.
    3. noun, sg., fem.
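
    To make the wished-for interface concrete, here is a minimal sketch of a language-independent lookup that returns every reading of a word. It is in Python only for brevity. The names `Analysis`, `LEXICON` and `analyze`, and the language code "ca", are made up for illustration and are not part of HunSpell or any existing package. The lexicon is a hand-written stand-in for real dictionary data and holds only the "parla" example.

```python
from typing import NamedTuple


class Analysis(NamedTuple):
    """One possible reading of a surface form."""
    pos: str          # part of speech, e.g. "verb"
    features: tuple   # grammatical features, e.g. ("present", "3 sg.")


# Hypothetical hand-written lexicon keyed by (language code, surface form).
# A real tool would derive this from dictionary and affix files instead.
LEXICON = {
    ("ca", "parla"): [
        Analysis("verb", ("imperative", "2 sg.")),
        Analysis("verb", ("present", "3 sg.")),
        Analysis("noun", ("sg.", "fem.")),
    ],
}


def analyze(word, lang):
    """Return all known analyses of `word` in `lang` (empty list if unknown)."""
    return LEXICON.get((lang, word.lower()), [])


if __name__ == "__main__":
    for number, reading in enumerate(analyze("parla", "ca"), start=1):
        print(f"{number}. {reading.pos}, {', '.join(reading.features)}")
```

    The point of the sketch is only the shape of the call: one function, the same for every language, returning a list because a form can be ambiguous.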

    Since I, as a linguist, work with texts in many languages - Hebrew, Belarusian, Lithuanian, English and nearly all Romance languages - I would love to have the same interface for all of them and not reinvent the wheel for each language.

    It looks like HunSpell can be used for this - it is Free Software, it already has some support for parts of speech, and it has more or less complete dictionaries for several languages. If I'm already off the mark and should use some other piece of software, stop reading now and tell me what it should be.

    Now, my language of choice is Perl, since programs written in it are usually easy to deploy across different platforms and it is very good for text processing. The only Perl package I could find on CPAN that works with HunSpell is Text::Hunspell. Unfortunately, Text::Hunspell has several disadvantages. It is hard to install on Windows, since it requires libhunspell, which is not trivial to install there. I also tried to install it on Linux, and it didn't compile. Maybe it's just a small makefile problem that can be easily fixed, but even once it compiles, it will probably only be able to spellcheck texts, while I want to use it for actual morphological analysis. Then again, I may be missing something.

    So I started thinking about writing a pure-Perl implementation of HunSpell that would use the same dictionaries but possibly do other things besides spellchecking. It may not be a complete implementation of HunSpell, only a subset that is useful for me. I will, of course, release it as Free Software, so that anyone can extend it.

    So far I have written a simple parser for .aff and .dic files. But then it occurred to me that I should ask the HunSpell developers whether they know of other Perl implementations of HunSpell that don't immediately show up on Google.
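
    For readers who have not seen these files, here is a minimal sketch of what such a parser involves. It is not the parser mentioned above; it is in Python only for brevity, and the names `AffixRule`, `parse_dic` and `parse_aff` are made up for illustration. It assumes single-character flags, reads only PFX/SFX entries from the .aff file, and ignores morphological fields, continuation flags and every other directive.

```python
from dataclasses import dataclass


@dataclass
class AffixRule:
    kind: str       # "PFX" or "SFX"
    flag: str       # flag that dictionary words use to select this rule
    strip: str      # characters removed from the word ("" if none)
    affix: str      # characters added to the word ("" if none)
    condition: str  # pattern the word must match; "." means any word


def parse_dic(text):
    """Map each word to its flag string; the first line is the entry count."""
    entries = {}
    for line in text.splitlines()[1:]:
        fields = line.split()
        if not fields:
            continue
        # Anything after the first field (e.g. "po:verb") is ignored here.
        word, _, flags = fields[0].partition("/")
        entries[word] = flags
    return entries


def parse_aff(text):
    """Collect PFX/SFX rules, skipping all other directives."""
    rules = []
    remaining = 0  # rule lines still expected after the last header
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[0] not in ("PFX", "SFX"):
            continue
        if remaining == 0:
            # Header line: kind, flag, cross-product (Y/N), number of rules.
            remaining = int(fields[3])
            continue
        remaining -= 1
        strip = "" if fields[2] == "0" else fields[2]
        affix = fields[3].partition("/")[0]  # drop continuation flags
        if affix == "0":
            affix = ""
        condition = fields[4] if len(fields) > 4 else "."
        rules.append(AffixRule(fields[0], fields[1], strip, affix, condition))
    return rules


if __name__ == "__main__":
    dic = "3\nhello\ntry/B\nwork/AB po:verb\n"
    aff = "SET UTF-8\nSFX B Y 2\nSFX B 0 ed [^y]\nSFX B y ied y\n"
    print(parse_dic(dic))
    for rule in parse_aff(aff):
        print(rule)
```

    The sample data in the last lines is invented for the demonstration. Real dictionaries also use long and numeric flag types, escaped slashes and many more directives, which a usable parser would have to handle.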

    If I go on with this project, one thing I already plan to do is reuse the existing HunSpell test suite.

    I'll be glad to hear any other ideas.

  • Eleonora

    Eleonora - 2009-12-07
    You can find here a Perl implementation of Hunspell that works.
    Consider, however, the following points:
    - Hunspell's morphology implementation is neither stable, nor complete, nor documented.
    - Morphological analysis requires a specially prepared aff/dic pair, which at present exists only for Hungarian and English. Its preparation is undocumented; the tree itself and its setup is on

    Good luck, and please report your results back here. Thanks.