Hi hunspell developers,
We are trying to create a hunspell dictionary for Quechua (runasimi), an
indigenous language of the Andes, spoken by roughly 10 million people. There are
many dialects but we are going to use the Cusco dialect from Peru.
Unfortunately, most people who speak Quechua have no idea how to write the
language. Generally they write the words, sounding them out in the Spanish
alphabet which consists of the following letters:
Hopefully with a proper spell checker, we can help train Quechua
speakers how to write in the Quechua alphabet which consists of the following
For instance, Quechua speakers who are literate in Spanish will write the word
for "house" as "huasi" or "guasi". With a Quechua spell checker, they will learn
to write the word properly as "wasi". Most people learn to write in the same way
that they learn to speak a language--not through rules, but through trial and
repeated error correction. So a spell checker is an excellent way to teach
people to spell correctly.
We have investigated the other spelling formats (aspell, ispell, old myspell),
but only hunspell will work, because we need the agglutinative suffix
features of hunspell for Quechua spell checking. Quechua can have an almost
infinite number of combinations of suffixes. You can have words with as many
as 5 or 6 suffixes. The problem is that there is an order to the way that
suffixes can be added together, but some suffixes can occupy different places
in the order. It is a nightmare to try and list all the possibilities. I tried to
create an ispell dictionary for Southern Bolivian Quechua 2 years ago with very
ugly results. My affix file was 27,000 lines long when I finally gave up and
decided that it was impossible to cover the language with ispell. If you want to
see how ugly an ispell affix file can get, you can download it at:
I also had to create a special program to insert infixes into the words in the
word list, so a word list of 7000 lines expanded to over 50,000 lines. It could
have easily grown to over 200,000 lines if I had bothered to cover all the
possible verbal infixes, but I decided to just cover the most basic verbal
infixes. The people at aspell transformed my ispell dictionary into aspell and
posted it on ftp.gnu.org, but nobody has ever used it as far as I can tell.
I sent the dictionary to the AbiWord developer list twice, but they never
incorporated it into their program. I have no idea whether the massive affix
file caused them to reject it, or if it was simply oversight on their part.
I made 3 requests to the OpenOffice people that they add Quechua as a language
option, so we could incorporate our spell checker. My messages kept getting
forwarded on to other OpenOffice lists, but nobody ever responded as to how
to get Quechua added as a language code in OpenOffice. It was a very
frustrating experience to say the least. In the end, I gave up trying to
get my Bolivian Quechua ispell/aspell dictionary incorporated into AbiWord and
OpenOffice, because it didn't cover a lot of the language, so I figured that
it wouldn't be that helpful anyway. Another major problem is that I created the
word list with the dictionary written by Jesus Lara in the 1940-1950s. Its
spelling style is now out-of-date, and nobody has written an up-to-date
dictionary for Bolivian Quechua in the correct alphabet which is being used
But the situation is totally different with Peruvian Quechua, where there are
good dictionaries for the Cusco dialect, written in a good alphabet which
people currently use. Right now we are forming a group in Peru to translate
a lot of free software programs (AbiWord, Firefox, and eventually OpenOffice)
into Quechua, so we need a spell checker as well.
We had pretty much given up on spell-checking in Quechua until we found your
hunspell program and have decided to give it a try. In order to
understand how difficult this is going to be, take a look at how a Quechua
verb is formed:
verb root + ~15 possible verbal infixes (~50 combinations of infixes) +
2 progressive forms + ~100 combinations of person, number, and tense +
~20 possible suffixes + ~20 possible suffixes + ~20 possible suffixes
It is possible to have more than three verbal suffixes, but so rare that we
aren't going to bother trying to cover combinations with more than 3 suffixes.
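To give a feel for the size of the problem, here is a rough Python sketch that multiplies out the approximate counts quoted above. It is only an upper bound: it assumes the infix slot may be empty, treats the three suffix slots as independent, and ignores the order rules, so the real number of valid forms is smaller. The constant and function names are mine, for illustration only.

```python
# Rough upper bound on the inflected forms of a single verb root.
# The counts are the approximate figures quoted in the post; treating the
# infix slot as optional and ignoring suffix-order rules are assumptions.

INFIX_COMBINATIONS = 50   # ~50 combinations of verbal infixes
PROGRESSIVE_CHOICES = 3   # no progressive, or one of the 2 progressive forms
PERSON_NUMBER_TENSE = 100 # ~100 combinations of person, number, and tense
SUFFIXES_PER_SLOT = 20    # ~20 possible suffixes in each slot
MAX_SUFFIXES = 3          # combinations with more than 3 are not covered


def suffix_sequences(per_slot: int, max_len: int) -> int:
    """Count sequences of 0..max_len suffixes, ignoring order constraints."""
    return sum(per_slot ** n for n in range(max_len + 1))


def upper_bound_forms() -> int:
    """Upper bound on the surface forms generated from one verb root."""
    infix_choices = INFIX_COMBINATIONS + 1  # assumption: infixes may be absent
    endings = PROGRESSIVE_CHOICES * PERSON_NUMBER_TENSE
    return infix_choices * endings * suffix_sequences(SUFFIXES_PER_SLOT, MAX_SUFFIXES)


if __name__ == "__main__":
    print(f"{upper_bound_forms():,} forms per verb root (upper bound)")
```

Even if the order rules remove most of these, it shows why listing every form in a plain word list is not an option.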
In the case of most suffixes, they can appear as the first, middle or last
suffix, but the order changes according to which suffixes are used. For
instance, if the suffix "manta" and "pacha" appear together, then "manta" is
before "pacha". A few suffixes, on the other hand, can only be used as the last
This is how we are thinking of implementing verbs in hunspell. Here is how
I'm proposing to set it up:
In the word list file:
All verb roots will have the COMPOUNDBEGIN flag. Some verb roots can also be
used as nouns, but most of the verb roots will also have the ONLYINCOMPOUND
~50 verbal infix combinations:
50 compound words with COMPOUNDMIDDLE and ONLYINCOMPOUND flags.
2 progressive forms + ~100 combinations of person, number and tense:
This will be added together to form ~300 compound words with COMPOUNDEND
and ONLYINCOMPOUND flags. In addition these words will have ~20 suffix flags.
(There will be ~300, because ~100 without progressive, ~100 with "sha"
progressive, and ~100 with "sa" progressive)
In the affix file:
first ~20 suffixes:
~20 flags, all with additional suffix flags for double suffixes
Second ~20 suffixes + third ~20 suffixes:
These suffixes will be combined together for fewer than ~400 flags.
(There will be fewer than ~400 flags because some suffixes don't combine with
At this point, that is how we are thinking of doing it. Do you foresee any
problems? Will hunspell be able to handle it? Will hunspell choke or slow down
to a crawl with so many compound words and suffix flags?
With nouns, adjectives, and adverbs, we will not need to use any compound
words because they are less complex. We can use most of the same suffix
flags for verbs to represent combinations of up to 3 suffixes.
Apart from how to represent agglutination of infixes and suffixes, we also have
the problem of how to catch spelling mistakes for confusable letters in
Quechua. In Quechua there is an on-going debate about whether to use 3 or 5
vowels. The vowels "i" and "e" can often be interchanged and so can the vowels
"o" and "u". It will be relatively easy to represent this with hunspell's REP
REP i e
REP e i
REP o u
REP u o
These vowels are the most common letters in Quechua. Do you foresee a major
performance problem if hunspell has to transform so many letters?
Because these vowels are highly confusable, some Quechua linguists prefer to
only use the 3 vowels "a", "i", and "u". We are going to use 5 vowels, because
that is the more standard style here in Peru, but anyone who wants to only use
3 vowels can easily transform our 5 vowel spelling dictionary into a 3 vowel
dictionary with 2 simple global search and replace commands. On the other hand,
if we implement our dictionary in 3 vowels, it will be very difficult to
transform it into a 5 vowel dictionary. Hopefully in this way, we can satisfy
both the 3 and 5 vowel camps.
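The two search-and-replace steps could look like this minimal Python sketch. It assumes one entry per line in the form word/flags, and it only rewrites the word part so that flag letters after the "/" are not damaged. The function name and the flag letters used in the example are hypothetical.

```python
def to_three_vowels(dic_line: str) -> str:
    """Rewrite e -> i and o -> u in the word part of a .dic entry.

    Anything after the first "/" (the flag field) is left untouched.
    """
    word, separator, flags = dic_line.partition("/")
    word = word.replace("e", "i").replace("o", "u")
    return word + separator + flags


if __name__ == "__main__":
    # "eo" stands in for arbitrary flag letters; it must survive unchanged.
    for entry in ["wasi", "qelqey/eo"]:
        print(entry, "->", to_three_vowels(entry))
```

Going the other way is not mechanical, since nothing in a 3 vowel spelling says which "i" should become "e".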
It is relatively easy to represent confusable vowels in the hunspell format,
but it becomes more difficult to represent the confusable consonants ("ch","k",
"p","q", and "t") which have a normal form, an aspirated form and a glottalized
For instance, in Quechua there are 6 |k| sounds which are readily confusable:
k (|k| high in the throat)
kh (aspirated k)
k' (glottalized k)
q (|k| deep in the throat)
qh (aspirated q)
q' (glottalized q)
Quechua speakers will often confuse these different sounds when writing and
some dictionaries even list different spelling for the same word. For instance,
in some dictionaries, the word for "young man" is "kari" and in others it is
"qari". Likewise, in some dictionaries, the word "to write" is spelled "qhelqhey"
and in others it is spelled "qelqey". We are only going to allow one spelling
for these words in our hunspell dictionary, because we would like everyone to
standardize around one spelling for "young man" as "qari" and "to write" as
In addition, Quechua speakers often misspell the |k| sound, using the
Spanish alphabet. In Spanish, |k| is represented by the letter "c" (if
followed by an "a", "o", or "u") or by "qu" (if followed by an "e" or "i").
The tricky part is that there is no universal rule for when to replace one |k|
spelling with another |k| spelling. In aspell, with its "sounds like" feature
we could easily just transform all k-like sounds into k, so the spell-checker
could easily find all possible matches. For instance in aspell:
#c is used in ch, so can't just transform c into k, or will confuse with kh
ca => ka
co => ko
cu => ku
que => ke
qui => ki
kh => k
k' => k
q => k
qh => k
q' => k
So it didn't matter if the user spelled the word "to write" as "qhelqhey",
"qelqey", "q'elq'ey", "khelkhey", or "quelquey". The spell checker would
evaluate all the input as |kelkey| and then return the correct spelling
Is it possible to do something similar in hunspell?
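To make the "sounds like" idea concrete, here is a minimal Python sketch of the normalisation described above. It is not aspell's implementation, just the same rewrite rules applied longest pattern first, with a lookup from the resulting key back to the dictionary spelling. The function names and the two-word dictionary are made up for the example.

```python
# Ordered rewrite rules: longer patterns come first so that "que" is handled
# before the bare "q" rule, and "qh", "q'", "kh", "k'" before "q".
# "c" alone is never rewritten, because it also appears in "ch".
RULES = [
    ("que", "ke"), ("qui", "ki"),
    ("qh", "k"), ("q'", "k"), ("kh", "k"), ("k'", "k"),
    ("q", "k"),
    ("ca", "ka"), ("co", "ko"), ("cu", "ku"),
]


def sound_key(word: str) -> str:
    """Collapse every k-like spelling in a word to a plain "k"."""
    key = word.lower()
    for pattern, replacement in RULES:
        key = key.replace(pattern, replacement)
    return key


def build_index(words):
    """Map each sound key to the dictionary words that share it."""
    index = {}
    for word in words:
        index.setdefault(sound_key(word), []).append(word)
    return index


def suggest(word: str, index) -> list:
    """Return dictionary words whose sound key matches the input's key."""
    return index.get(sound_key(word), [])


if __name__ == "__main__":
    index = build_index(["qari", "wasi"])  # toy dictionary for illustration
    for attempt in ["kari", "khari", "k'ari", "cari"]:
        print(attempt, "->", suggest(attempt, index))
```

With these rules "qhelqhey", "qelqey", "q'elq'ey", "khelkhey", and "quelquey" all reduce to the same key, which is the behaviour I would like to reproduce.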
The documentation in man 4 hunspell doesn't give any details about how
the REP command works. Are the changes cumulative? For instance, if I have:
REP shon tion
REP dit dict
and I write the word "dishonary". Does hunspell transform it first to
"ditionary", and then transform "ditionary" to "dictionary"?
Or does hunspell, transform "dishonary" to "ditionary" and then stop?
Is it possible to have multiple REP commands with the same string?
For instance can you do this?
REP shon tion
REP shon gion
If I pass the word "dicshonary", it will get transformed to "dictionary" and
if I pass the word "reshon", it will get transformed to "region"?
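To make the two questions concrete, here is a small Python model of the non-cumulative reading: each candidate comes from applying exactly one rule at exactly one position, and only candidates found in the dictionary are kept. I am not claiming this is what hunspell does internally; it only shows what that reading would imply. The function names and the tiny dictionary are hypothetical.

```python
def single_replacement_candidates(word, rep_table):
    """Return every string obtained by applying ONE rule at ONE position."""
    candidates = []
    for source, target in rep_table:
        start = word.find(source)
        while start != -1:
            candidate = word[:start] + target + word[start + len(source):]
            if candidate not in candidates:
                candidates.append(candidate)
            start = word.find(source, start + 1)
    return candidates


def suggestions(word, rep_table, dictionary):
    """Keep only the candidates that are real dictionary words."""
    return [c for c in single_replacement_candidates(word, rep_table)
            if c in dictionary]


if __name__ == "__main__":
    rep_table = [("shon", "tion"), ("shon", "gion"), ("dit", "dict")]
    dictionary = {"dictionary", "region"}
    for attempt in ["dishonary", "dicshonary", "reshon"]:
        print(attempt, "->", suggestions(attempt, rep_table, dictionary))
```

Under this reading two rules with the same source string are harmless, because the wrong candidate is simply not in the dictionary, but "dishonary" would never reach "dictionary" because that needs two replacements in a row.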
I wrote out a really long REP table like this:
REP ca ka
REP ca k'a
REP ca kha
REP ca qa
REP ca q'a
REP ca qha
REP co ko
REP co k'o
REP co kho
REP co qo
REP co q'o
REP co qho
REP cu ku
REP cu k'u
REP cu khu
REP cu qu
REP cu q'u
REP cu qhu
REP que ke
REP que k'e
REP que khe
REP que qe
REP que q'e
REP que qhe
REP qui ki
REP qui k'i
REP qui khi
REP qui qi
REP qui q'i
REP qui qhi
REP k' k
REP k' kh
REP k' q
REP k' qh
REP k' q'
REP kh k
REP kh k'
REP kh q
REP kh qh
REP kh q'
REP k kh
REP k k'
REP k q
REP k qh
REP k q'
REP q' k
REP q' kh
REP q' k'
REP q' q
REP q' qh
REP qh k
REP qh kh
REP qh k'
REP qh q
REP qh q'
REP q k
REP q kh
REP q k'
REP q qh
REP q q'
Then I realized that I don't have any idea whether hunspell would allow multiple
replacements for the same string. And even if it does work, would it take up so
much processing time that it would be undesirable?
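A table like this is easy to get wrong by hand, so here is a minimal Python sketch that generates it instead. It assumes the same scheme as the table above: Spanish "c" before a, o, u and "qu" before e, i map to each of the six Quechua |k| spellings, and every |k| spelling maps to the other five. The first output line is the entry count that the REP table in an .aff file starts with. The function names are mine.

```python
K_FORMS = ["k", "k'", "kh", "q", "q'", "qh"]


def build_rep_pairs():
    """Build (wrong, right) pairs for the confusable |k| spellings."""
    pairs = []
    # Spanish "c" before a, o, u
    for vowel in "aou":
        for k_form in K_FORMS:
            pairs.append(("c" + vowel, k_form + vowel))
    # Spanish "qu" before e, i
    for vowel in "ei":
        for k_form in K_FORMS:
            pairs.append(("qu" + vowel, k_form + vowel))
    # Each Quechua |k| spelling confused with every other one
    for wrong in K_FORMS:
        for right in K_FORMS:
            if wrong != right:
                pairs.append((wrong, right))
    return pairs


def format_rep_table(pairs):
    """Format the pairs as REP lines, preceded by the entry count."""
    lines = [f"REP {len(pairs)}"]
    lines += [f"REP {wrong} {right}" for wrong, right in pairs]
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_rep_table(build_rep_pairs()))
```

Generating the table also makes it cheap to drop or add rules later and measure the effect on suggestion speed.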
Thanks for any advice you can give me (or any good luck charms you can
pass my way),
PS: I know that you guys probably thought that you had finally solved spell
checking for agglutinative languages, but as you can see, there are languages a
lot more complicated than Hungarian. And Quechua isn't unique in this regard.
The other major language spoken in the Andes, Aymara, is just as complicated.
I hear that some of the Southern African languages have the same problems with
agglutination as Quechua. Somebody will probably have to sit down and write a
special spell checking program just for extreme agglutinative languages like
Using hunspell's COMPOUNDMIDDLE flag to implement verbal infixes is a really
ugly way to use hunspell. If I were going to design the ideal spell checker
for Quechua, it would have a special infix flag that allowed for double infixes
and triple infixes to be combined together. Similarly with suffixes, it would
allow triple suffixes. I'm not sure, however, if I would be able to handle
all the weird order rules for suffixes and infixes. On the other hand, I am
sure that calculating all the possible combinations would hog a ton of memory
and processing cycles. I don't understand quite how hunspell works, but I
imagine that we are talking about cubing the combinations of infixes and
cubing the combinations of suffixes, plus all the combinations of those two
together. It becomes really hairy, really fast.
Hungarian also has a suffix depth of 5, e.g.:
ház-a-i-é-i-nek (of their houses')
This is solved purely by affixing, please check the Hungarian affix file.
Also: it is impossible to write such an affix file manually; you must use some scripting for this, such as perl, awk, or java. I suggest perl.
I'm developing a Kichwa dictionary for the Ecuadorian shukllachiska Kichwa, a nationwide unified orthography. You can get it here:
It's actually a zip file, so you can open it with file-roller (Ubuntu) or perhaps http://www.7-zip.org/ if you use Windows.
There you can see some of my reasoning. Perhaps there are better ways of dealing with the multi-affix-problem, but the best way I've found is to write a MASTER .dic file and run a script over it to create the possible different stems.
In the MASTER file there's for example
rurana//r>+-,whv # to do
and after running the script over it we get in the .dic file:
… and all of those can be conjugated in sex and number by hunspell.
If you remove a flag or two from the MASTER file, the verb will not be converted into that many stems.
anchuchina//r>,wv # remove
will for example just become
I don't know if all these are possible combinations, but at least I have a generator to build them for me :)
As for -ta- -mi- -ka- and other infixes, I've used the COMPOUND keywords. Works quite well.
I wonder why you do not:
1. choose one dialect with a relatively fixed orthography (do not handle multiple orthographies at first, at least not many of them).
2. use strictly only the prefix and suffix features. Do not use compound words and REPs at first.
3. use scripting (perl, awk, java) to create your affix file. At this complexity it is hopeless to do manually.
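As a small illustration of point 3, here is a Python sketch (perl would do just as well) that writes SFX blocks from a table instead of by hand. It uses the "manta" before "pacha" ordering mentioned earlier in the thread: "manta" carries a continuation flag that allows "pacha" to follow, while "pacha" carries none, so the reverse order is never generated. The flag letters A and B and the function name are hypothetical, and the always-true condition "." is a simplification.

```python
def sfx_block(flag, entries):
    """Format one SFX class.

    entries is a list of (suffix, continuation_flags) pairs; an empty
    continuation string means nothing may follow that suffix.
    """
    lines = [f"SFX {flag} Y {len(entries)}"]
    for suffix, continuation in entries:
        added = f"{suffix}/{continuation}" if continuation else suffix
        # "0" = strip nothing from the stem, "." = no condition on the stem
        lines.append(f"SFX {flag} 0 {added} .")
    return lines


if __name__ == "__main__":
    classes = {
        "A": [("manta", "B"), ("pacha", "")],  # first-position suffixes
        "B": [("pacha", "")],                  # may follow "manta"
    }
    for flag, entries in classes.items():
        print("\n".join(sfx_block(flag, entries)))
```

A stem flagged with A would then accept -manta, -pacha, and -mantapacha, but not -pachamanta. How many such levels hunspell will actually strip is something to verify with tests.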
This is the way Hungarian spell checking started, and the results were really encouraging. Compounding and the REP facility came much later.
I honestly cannot see why you consider your language to be more complicated than Hungarian. If you bring some concrete examples, we can discuss the proper way to solve spell checking for the given word class.
Ispell itself is very powerful, and hunspell/aspell, which have in fact the same features, are a bit more powerful. In my opinion any language can be handled with ispell, but hunspell/aspell add a bit more elegance to the handling.
I checked your aff/dic list. It is very interesting. I am not sure whether your use of compounding does not:
1. increase processing time unnecessarily
2. make the whole thing inflexible.
It would be good if you added some tests to it: words that should be considered bad, and words that should be considered correct. If I had that, I could run some tests that prove or disprove my assumptions above.
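Such a test could be as simple as the following Python sketch: two word lists and a function that reports every word the checker gets wrong. The checker here is a stand-in that looks words up in a set, since how hunspell is called depends on the setup. The example words are the "wasi" and "qari" cases from the first post, and the function name is hypothetical.

```python
def run_spelling_tests(is_correct, good_words, bad_words):
    """Return a list of (word, problem) pairs; an empty list means success."""
    failures = []
    for word in good_words:
        if not is_correct(word):
            failures.append((word, "rejected but should be accepted"))
    for word in bad_words:
        if is_correct(word):
            failures.append((word, "accepted but should be rejected"))
    return failures


if __name__ == "__main__":
    # Stand-in checker: a real run would ask hunspell instead of a set.
    known_words = {"wasi", "qari"}
    failures = run_spelling_tests(
        lambda word: word in known_words,
        good_words=["wasi", "qari"],
        bad_words=["huasi", "guasi", "kari"],
    )
    for word, problem in failures:
        print(f"{word}: {problem}")
    print("failures:", len(failures))
```

The deeper the agglutinated forms in both lists, the more the test says about the compounding approach.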
Thanks a lot for your comments. I have to say that I've not gotten very far with this project yet, but in a few months I'm moving to Ecuador and that might just speed up the progress and my own comprehension of the language.
Well for the scripting part, I actually use a generateQUdicfile.sh bash script to generate the .dic file.
For the use of compound: It's because we found that hunspell could not handle more than two affixes?
Of course, this may have been fixed since the above posts were written, or your approach may be better. I'll have a look soon.
Thanks for the reply.
Yes, the dic file contains the nominal forms of the words, but I thought a test file would contain some agglutinated forms. (The more depth the better.)
I tried to use unmunch to create some tests, however, I got this:
Not sure if this is a complete list.
I used this command:
/path_to/unmunch /path_to/qu_EC.dic /path_to/qu_EC.aff > /tmp/x
Is the list complete and is that what you expect? With a suffix/prefix word list, the result is words without trailing //+-s; I therefore assume unmunch cannot work properly with the compound logic qu_EC uses.
A language with agglutination depth 6 (Hungarian, for example) does not need more than one affix level. Two affix levels are just a kind of luxury, that's all.