Extreme agglutination Questions for Quechua

Help
amosbatto
2006-08-03
2013-06-03
  • amosbatto
    2006-08-03

    Hi hunspell developers,

    We are trying to create a hunspell dictionary for Quechua (runasimi), an
    indigenous language of the Andes, spoken by roughly 10 million people. There are
    many dialects but we are going to use the Cusco dialect from Peru.
    Unfortunately, most people who speak Quechua have no idea how to write the
    language.  Generally they sound the words out using the Spanish
    alphabet, which consists of the following letters:
    a,b,c,ch,d,e,f,h,i,j,k,l,ll,m,n,ñ,o,p,q,r,rr,s,t,u,v,w,x,y,z
    Hopefully with a proper spell checker, we can help train Quechua
    speakers how to write in the Quechua alphabet which consists of the following
    letters:
    a,ch,chh,ch',e,h,i,k,kh,k',l,ll,m,n,o,p,ph,p',q,qh,q',r,s,t,th,t',u,w,y

    For instance, Quechua speakers who are literate in Spanish will write the word
    for "house" as "huasi" or "guasi". With a Quechua spell checker, they will learn
    to write the word properly as "wasi". Most people learn to write in the same way
    that they learn to speak a language--not through rules, but through trial and
    repeated error correction. So a spell checker is an excellent way to teach
    people to spell correctly.

    We have investigated the other spelling formats (aspell, ispell, old myspell),
    but only hunspell will work, because we need the agglutinative suffix
    features of hunspell for Quechua spell checking.  Quechua can have an almost
    infinite number of combinations of suffixes.  You can have words with as many
    as 5 or 6 suffixes. The problem is that there is an order to the way that
    suffixes can be added together, but some suffixes can occupy different places
    in the order. It is a nightmare to try and list all the possibilities. I tried to
    create an ispell dictionary for Southern Bolivian Quechua 2 years ago with very
    ugly results. My affix file was 27,000 lines long when I finally gave up and
    decided that it was impossible to cover the language with ispell. If you want to
    see how ugly an ispell affix file can get, you can download it at:
    www.ciber-runa.org/qu-BO-0.02-0.zip

    I also had to create a special program to insert infixes into the words in the
    word list, so a word list of 7000 lines expanded to over 50,000 lines.  It could
    have easily grown to over 200,000 lines if I had bothered to cover all the
    possible verbal infixes, but I decided to just cover the most basic verbal
    infixes. The people at aspell transformed my ispell dictionary into aspell and
    posted it on ftp.gnu.org, but nobody has ever used it as far as I can tell. 

    I sent the dictionary to the AbiWord developer list twice, but they never
    incorporated it into their program. I have no idea whether the massive affix
    file caused them to reject it, or if it was simply oversight on their part.
    I made 3 requests to the OpenOffice people that they add Quechua as a language
    option, so we could incorporate our spell checker. My messages kept getting
    forwarded on to other OpenOffice lists, but nobody ever responded as to how
    to get Quechua added as a language code in OpenOffice.  It was a very
    frustrating experience to say the least.  In the end, I gave up trying to
    get my Bolivian Quechua ispell/aspell dictionary incorporated into AbiWord and
    OpenOffice, because it didn't cover a lot of the language, so I figured that
    it wouldn't be that helpful anyway.  Another major problem is that I created the
    word list with the dictionary written by Jesus Lara in the 1940-1950s. Its
    spelling style is now out-of-date, and nobody has written an up-to-date
    dictionary for Bolivian Quechua in the correct alphabet which is being used
    today.  

    But the situation is totally different with Peruvian Quechua, where there are
    good dictionaries for the Cusco dialect, written in a good alphabet which
    people currently use. Right now we are forming a group in Peru to translate
    a lot of free software programs (AbiWord, Firefox, and eventually OpenOffice)
    into Quechua, so we need a spell checker as well. 

    We had pretty much given up on spell-checking in Quechua until we found your
    hunspell program and have decided to give it a try.  In order to
    understand how difficult this is going to be, take a look at how a Quechua
    verb is formed:

    verb root + ~15 possible verbal infixes (~50 combinations of infixes) +
    2 progressive forms + ~100 combinations of person, number, and tense +
    ~20 possible suffixes + ~20 possible suffixes + ~20 possible suffixes

    It is possible to have more than three verbal suffixes, but so rare that we
    aren't going to bother trying to cover combinations with more than 3 suffixes.
    In the case of most suffixes, they can appear as the first, middle or last
    suffix, but the order changes according to which suffixes are used. For
    instance, if the suffix "manta" and "pacha" appear together, then "manta" is
    before "pacha".  A few suffixes, on the other hand, can only be used as the last
    suffix.
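
    The ordering constraints described above can be sketched in Python. This is
    only an illustration: the manta/pacha rule comes from this post, while "qa"
    is a placeholder for a last-only suffix and is not a claim about the grammar.

```python
from itertools import permutations

# Pairwise ordering rules: (a, b) means "a must come before b when both appear".
# Only the manta/pacha rule comes from the post above.
MUST_PRECEDE = {("manta", "pacha")}

# Suffixes that may only appear in last position.
# "qa" is a placeholder chosen for illustration (an assumption).
LAST_ONLY = {"qa"}


def is_valid(sequence):
    """Return True if a tuple of suffixes respects the ordering rules."""
    for i, suffix in enumerate(sequence):
        if suffix in LAST_ONLY and i != len(sequence) - 1:
            return False
    for first, second in MUST_PRECEDE:
        if first in sequence and second in sequence:
            if sequence.index(first) > sequence.index(second):
                return False
    return True


def suffix_sequences(suffixes, max_len=3):
    """Yield every valid ordering of 1..max_len distinct suffixes."""
    for length in range(1, max_len + 1):
        for seq in permutations(suffixes, length):
            if is_valid(seq):
                yield seq


if __name__ == "__main__":
    for seq in suffix_sequences(["manta", "pacha", "qa"]):
        print("".join(seq))
```

    A generator like this could feed a script that writes the affix file, so
    that invalid orderings never need to be listed by hand.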

    Here is how we are proposing to implement verbs in hunspell:

    In the word list file:
    --------------------------
    verb roots:
    All verb roots will have the COMPOUNDBEGIN flag. Some verb roots can also be
    used as nouns, but most of the verb roots will also have the ONLYINCOMPOUND
    flag.

    ~50 verbal infix combinations:
    50 compound words with COMPOUNDMIDDLE and ONLYINCOMPOUND flags.

    2 progressive forms + ~100 combinations of person, number and tense:
    This will be added together to form ~300 compound words with COMPOUNDEND
    and ONLYINCOMPOUND flags.  In addition these words will have ~20 suffix flags.

    (There will be ~300, because ~100 without progressive, ~100 with "sha"
    progressive, and ~100 with "sa" progressive)
    ---------------------------
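
    As a rough sketch of how the ~300 COMPOUNDEND entries could be generated
    rather than typed, assuming Python: the three endings below are only a small
    sample standing in for the ~100 person/number/tense forms, and the flag
    string "ZY" is hypothetical (real flags would be defined in the affix file).

```python
from itertools import product

# Progressive markers from the post: none, "sha", and "sa".
PROGRESSIVE = ["", "sha", "sa"]

# Illustrative sample only: three present-tense endings stand in for the
# ~100 person/number/tense combinations, which are not listed in the post.
ENDINGS = ["ni", "nki", "n"]

# Hypothetical flag names (an assumption, not taken from any real .aff file).
FLAGS = "ZY"


def compound_end_entries(progressive, endings, flags):
    """Build .dic-style lines, one per (progressive marker, ending) pair."""
    return [f"{prog}{end}/{flags}" for prog, end in product(progressive, endings)]


if __name__ == "__main__":
    for line in compound_end_entries(PROGRESSIVE, ENDINGS, FLAGS):
        print(line)
```

    With ~100 endings in place of the three samples, the same cross product
    yields the ~300 entries mentioned above.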

    In the affix file:
    ---------------------------
    first ~20 suffixes:
    ~20 flags, all with additional suffix flags for double suffixes
      
    Second ~20 suffixes + third ~20 suffixes:
    These suffixes will be combined together for fewer than ~400 flags.

    (There will be fewer than ~400 flags because some suffixes don't combine with
    other suffixes.)
    ---------------------------

    At this point, that is how we are thinking of doing it. Do you foresee any
    problems? Will hunspell be able to handle it?  Will hunspell choke or slow down
    to a crawl with so many compound words and suffix flags?

    With nouns, adjectives, and adverbs, we will not need to use any compound
    words because they are less complex.  We can use most of the same suffix
    flags for verbs to represent combinations of up to 3 suffixes.

    Apart from how to represent agglutination of infixes and suffixes, we also have
    the problem of how to catch spelling mistakes for confusable letters in
    Quechua. In Quechua there is an ongoing debate about whether to use 3 or 5
    vowels. The vowels "i" and "e" can often be interchanged and so can the vowels
    "o" and "u". It will be relatively easy to represent this with hunspell's REP
    command:

    REP i e
    REP e i
    REP o u
    REP u o

    These vowels are the most common letters in Quechua.  Do you foresee a major
    performance problem if hunspell has to transform so many letters?

    Because these vowels are highly confusable, some Quechua linguists prefer to
    only use the 3 vowels "a", "i", and "u".  We are going to use 5 vowels, because
    that is the more standard style here in Peru, but anyone who wants to only use
    3 vowels can easily transform our 5 vowel spelling dictionary into a 3 vowel
    dictionary with 2 simple global search and replace commands.  On the other hand,
    if we implement our dictionary in 3 vowels, it will be very difficult to
    transform it into a 5 vowel dictionary. Hopefully in this way, we can satisfy
    both the 3 and 5 vowel camps.
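
    The two search-and-replace steps could look like the following Python
    sketch. It assumes .dic lines of the form word/flags and leaves the flag
    part untouched, since flag characters are not spelling; the function names
    are made up for this example.

```python
# Map the two "extra" vowels onto their three-vowel counterparts.
FIVE_TO_THREE = str.maketrans({"e": "i", "o": "u", "E": "I", "O": "U"})


def to_three_vowels(text):
    """Rewrite five-vowel spelling as three-vowel spelling."""
    return text.translate(FIVE_TO_THREE)


def convert_dic_line(line):
    """Convert only the word part of a 'word/flags' line, keeping the flags."""
    word, separator, flags = line.partition("/")
    return to_three_vowels(word) + separator + flags


if __name__ == "__main__":
    for sample in ["qelqey", "qocha/eo"]:
        print(convert_dic_line(sample))
```

    The mapping loses information (both "e" and "i" end up as "i"), which is
    why the reverse conversion from 3 vowels to 5 cannot be done the same way.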

    It is relatively easy to represent confusable vowels in the hunspell format,
    but it becomes more difficult to represent the confusable consonants ("ch","k",
    "p","q", and "t") which have a normal form, an aspirated form and a glottalized
    form.

    For instance, in Quechua there are 6 |k| sounds which are readily confusable:
    k  (|k| high in the throat)
    kh (aspirated k)
    k' (glottalized k)
    q  (|k| deep in the throat)
    qh (aspirated q)
    q' (glottalized q)

    Quechua speakers will often confuse these different sounds when writing and
    some dictionaries even list different spelling for the same word. For instance,
    in some dictionaries, the word for "young man" is "kari" and in others it is
    "qari". Likewise, in some dictionaries, the word "to write" is spelled "qhelqhey"
    and in others it is spelled "qelqey".  We are only going to allow one spelling
    for these words in our hunspell dictionary, because we would like everyone to
    standardize around one spelling for "young man" as "qari" and "to write" as
    "qelqey".

    In addition, Quechua speakers often misspell the |k| sound by writing it with
    the Spanish alphabet. In Spanish, |k| is represented by the letter "c" (if
    followed by an "a", "o", or "u") or by "qu" (if followed by an "e" or "i").

    The tricky part is that there is no universal rule, for when to replace one |k|
    spelling with another |k| spelling.  In aspell, with its "sounds like" feature
    we could easily just transform all k-like sounds into k, so the spell-checker
    could easily find all possible matches. For instance in aspell:

    #c is used in ch, so can't just transform c into k, or will confuse with kh
    ca  => ka    
    co  => ko
    cu  => ku
    que => ke
    qui => ki
    kh  => k
    k'  => k
    q   => k
    qh  => k
    q'  => k

    So it didn't matter if the user spelled the word "to write" as "qhelqhey",
    "qelqey", "q'elq'ey", "khelkhey", or "quelquey".  The spell checker would
    evaluate all the input as |kelkey| and then return the correct spelling
    "qelqey".
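
    To make the idea concrete, here is a small Python sketch of that kind of
    "sounds like" key. It is not aspell's phonetic engine, just the rule table
    above applied in one left-to-right pass with longer spellings tried first;
    the name sound_key is made up for this example.

```python
import re

# Rules from the aspell table above, keyed by the spelling to replace.
RULES = {
    "ca": "ka", "co": "ko", "cu": "ku",
    "que": "ke", "qui": "ki",
    "kh": "k", "k'": "k",
    "q": "k", "qh": "k", "q'": "k",
}

# Try longer spellings first so "que" and "qh" win over a bare "q".
_PATTERN = re.compile(
    "|".join(re.escape(key) for key in sorted(RULES, key=len, reverse=True))
)


def sound_key(word):
    """Collapse every k-like spelling into plain "k" in a single pass."""
    return _PATTERN.sub(lambda match: RULES[match.group(0)], word.lower())


if __name__ == "__main__":
    for spelling in ["qhelqhey", "qelqey", "q'elq'ey", "khelkhey", "quelquey"]:
        print(spelling, "->", sound_key(spelling))
```

    Two words match when their keys are equal, so the dictionary only needs to
    be indexed once by sound_key of each correct spelling.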

    Is it possible to do something similar in hunspell?

    The documentation in man 4 hunspell doesn't give any details about how
    the REP command works. Are the changes cumulative?  For instance, if I have:

    REP shon tion
    REP dit dict

    and I write the word "dishonary". Does hunspell transform it first to
    "ditionary", and then transform "ditionary" to "dictionary"?
    Or does hunspell transform "dishonary" to "ditionary" and then stop?

    Is it possible to have multiple REP commands with the same string?
    For instance can you do this?

    REP shon tion
    REP shon gion

    If I pass the word "dicshonary", will it get transformed to "dictionary", and
    if I pass the word "reshon", will it get transformed to "region"?
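
    The two possibilities can be modelled in a few lines of Python. The sketch
    below assumes the non-cumulative reading: each candidate applies one table
    entry at one position, and only candidates found in the dictionary are kept.
    Whether hunspell really behaves this way is exactly the open question; the
    function names are made up for this example.

```python
def rep_candidates(word, rep_table):
    """Return candidates made by applying ONE table entry at ONE position."""
    candidates = []
    for pattern, replacement in rep_table:
        start = word.find(pattern)
        while start != -1:
            candidate = word[:start] + replacement + word[start + len(pattern):]
            if candidate not in candidates:
                candidates.append(candidate)
            start = word.find(pattern, start + 1)
    return candidates


def suggest(word, rep_table, dictionary):
    """Keep only the candidates that are real dictionary words."""
    return [c for c in rep_candidates(word, rep_table) if c in dictionary]


if __name__ == "__main__":
    table = [("shon", "tion"), ("shon", "gion"), ("dit", "dict")]
    words = {"dictionary", "region"}
    print(rep_candidates("dishonary", table))
    print(suggest("reshon", table, words))
```

    Under this model "dishonary" never reaches "dictionary", because that would
    need two replacements in a row, while two entries sharing the same left-hand
    string simply produce two candidates.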

    I wrote out a really long REP table like this:

    REP ca ka
    REP ca k'a
    REP ca kha
    REP ca qa
    REP ca q'a
    REP ca qha

    REP co ko
    REP co k'o
    REP co kho
    REP co qo
    REP co q'o
    REP co qho

    REP cu ku
    REP cu k'u
    REP cu khu
    REP cu qu
    REP cu q'u
    REP cu qhu

    REP que ke
    REP que k'e
    REP que khe
    REP que qe
    REP que q'e
    REP que qhe

    REP qui ki
    REP qui k'i
    REP qui khi
    REP qui qi
    REP qui q'i
    REP qui qhi

    REP k' k
    REP k' kh
    REP k' q
    REP k' qh
    REP k' q'

    REP kh k
    REP kh k'
    REP kh q
    REP kh qh
    REP kh q'

    REP k kh
    REP k k'
    REP k q
    REP k qh
    REP k q'

    REP q' k
    REP q' kh
    REP q' k'
    REP q' q
    REP q' qh

    REP qh k
    REP qh kh
    REP qh k'
    REP qh q
    REP qh q'

    REP q k   cocha qocha
    REP q kh
    REP q k'
    REP q qh
    REP q q'

    Then I realized that I don't have any idea whether hunspell would allow multiple
    replacements for the same string. And even if it does work, would it take up so
    much processing time that it would be undesirable?
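
    A table this repetitive is easier to generate than to type. The following
    Python sketch builds the same kind of list from the six k-like spellings and
    the Spanish "c"/"qu" spellings. It also emits a leading "REP <count>" line,
    which is how the hunspell affix format declares the table size as far as I
    understand it; build_rep_table is a made-up name.

```python
from itertools import permutations

K_SOUNDS = ["k", "kh", "k'", "q", "qh", "q'"]
VOWELS_AFTER_C = "aou"   # Spanish writes |k| as "c" before a, o, u
VOWELS_AFTER_QU = "ei"   # ... and as "qu" before e, i


def build_rep_table():
    """Return the REP lines, with the entry count as the first line."""
    pairs = []
    for vowel in VOWELS_AFTER_C:
        pairs += [("c" + vowel, k + vowel) for k in K_SOUNDS]
    for vowel in VOWELS_AFTER_QU:
        pairs += [("qu" + vowel, k + vowel) for k in K_SOUNDS]
    # Every k-like spelling may be mistaken for each of the other five.
    pairs += list(permutations(K_SOUNDS, 2))
    lines = [f"REP {len(pairs)}"]
    lines += [f"REP {wrong} {right}" for wrong, right in pairs]
    return lines


if __name__ == "__main__":
    print("\n".join(build_rep_table()))
```

    Generating the table also avoids copy-and-paste slips between the vowel
    blocks.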

    Thanks for any advice you can give me (or any good luck charms you can
    pass my way),
    Amos Batto
    www.ciber-runa.org

    PS: I know that you guys probably thought that you had finally solved spell
    checking for agglutinative languages, but as you can see, there are languages a
    lot more complicated than Hungarian.  And Quechua isn't unique in this regard.
    The other major language spoken in the Andes, Aymara, is just as complicated.
    I hear that some of the Southern African languages have the same problems with
    agglutination as Quechua.  Somebody will probably have to sit down and write a
    special spell checking program just for extreme agglutinative languages like
    Quechua.

    Using hunspell's COMPOUNDMIDDLE flag to implement verbal infixes is a really
    ugly way to use hunspell. If I were going to design the ideal spell checker
    for Quechua, it would have a special infix flag that allowed for double infixes
    and triple infixes to be combined together.  Similarly with suffixes, it would
    allow triple suffixes.  I'm not sure, however, if I would be able to handle
    all the weird order rules for suffixes and infixes.  On the other hand, I am
    sure that calculating all the possible combinations would hog a ton of memory
    and processing cycles.  I don't understand quite how hunspell works, but I
    imagine that we are talking about cubing the combinations of infixes and
    cubing the combinations of suffixes, plus all the combinations of those two
    together.  It becomes really hairy, really fast.
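
    For a sense of scale, here is the arithmetic as a short Python sketch, using
    only the rough figures from this post and ignoring every ordering rule, so
    it is an upper bound. Counting "no infix" as one extra option is an
    assumption.

```python
# Rough figures quoted in the post above.
INFIX_COMBINATIONS = 50      # ~50 combinations of verbal infixes
ENDINGS = 300                # ~300 progressive + person/number/tense forms
SUFFIXES = 20                # ~20 candidates per suffix slot
MAX_SUFFIXES = 3             # longer chains are ignored


def suffix_chains(per_slot, max_len):
    """Count chains of 0..max_len suffixes, ignoring all ordering rules."""
    return sum(per_slot ** length for length in range(max_len + 1))


def forms_per_root():
    """Upper bound on surface forms for a single verb root."""
    infix_options = INFIX_COMBINATIONS + 1   # assumption: a verb may have no infix
    return infix_options * ENDINGS * suffix_chains(SUFFIXES, MAX_SUFFIXES)


if __name__ == "__main__":
    print(suffix_chains(SUFFIXES, MAX_SUFFIXES))
    print(forms_per_root())
```

    The real number is smaller because many combinations are not allowed, but
    it shows why listing the forms explicitly is out of the question.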

     
  • Eleonora
    2009-09-30

    Hello,

    Hungarian also has a suffix depth of 5, e.g.:
    ház-a-i-é-i-nek  (of their houses')

    This is solved purely by affixing, please check the Hungarian affix file.

     
  • Eleonora
    2009-09-30

    Also: it is impossible to write such an affix file manually; you must use some scripting for this, such as perl, awk, or java. I suggest perl.

     
  • Arno Teigseth
    2009-11-06

    Hi

    I'm developing a Kichwa dictionary for the Ecuadorian shukllachiska Kichwa, a nationwide unified orthography. You can get it here:

    http://extensions.services.openoffice.org/files/2121/3/qu_EC.oxt

    It's actually a zip file, so you can open it with file-roller (ubuntu) or perhaps http://www.7-zip.org/ if you use windows.

    There you can see some of my reasoning. Perhaps there are better ways of dealing with the multi-affix-problem, but the best way I've found is to write a MASTER .dic file and run a script over it to create the possible different stems.

    In the MASTER file there's for example
    rurana//r>+-,whv # to do
    and after running the script over it we get in the .dic file:
    rurachikuna//v
    rurachinakuna//v
    rurachina//v
    rurachiwana//v
    rurakrina//v
    rurakuna//v
    ruramukuna//v
    ruramuna//v
    ruramuwana//v
    ruranakuna//v
    ruranamukuna//v
    rurana//v
    rurarana//v
    rurarichina//v
    rurarikuna//v
    rurarimuna//v
    rurarina//v
    rurawana//v
    … and all of those can be conjugated in sex and number by hunspell.

    If you remove a flag or two from the MASTER file, the verb will not be converted into that many stems.

    anchuchina//r>,wv # remove
    will for example just become
    anchuchikrina//v
    anchuchimuna//v
    anchuchimuwana//v
    anchuchina//v
    anchuchirimuna//v
    anchuchirina//v
    anchuchiwana//v

    I don't know if all these are possible combinations, but at least I have a generator to build them for me :)
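
    For readers who want to try the same approach, here is a minimal Python
    sketch of such a generator. It is not the actual script: how the MASTER flag
    characters map to infixes is not shown here, so the sketch takes the infix
    chains as an explicit list, read off the anchuchina output above.

```python
def expand_stems(root, infix_chains, ending="na", flags="v"):
    """Return sorted .dic lines: root + infix chain + ending + //flags."""
    words = sorted({root + chain + ending for chain in infix_chains})
    return [f"{word}//{flags}" for word in words]


# Infix chains read off the "anchuchina" output above; "" is the bare stem.
ANCHUCHI_CHAINS = ["", "kri", "mu", "muwa", "ri", "rimu", "wa"]


if __name__ == "__main__":
    for line in expand_stems("anchuchi", ANCHUCHI_CHAINS):
        print(line)
```

    Removing a chain from the list removes the corresponding stem from the
    generated .dic file, which mirrors the effect of dropping a flag in MASTER.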

    As for -ta- -mi- -ka- and other infixes, I've used the COMPOUND keywords. Works quite well.

     
  • Eleonora
    2009-11-06

    I wonder why you do not:
    1. Choose one dialect with a relatively fixed orthography (do not handle multiple orthographies at first, or at least not many of them).
    2. Use strictly only the prefix and suffix features. Do not use compound words and REPs at first.
    3. Use scripting (perl, awk, java) to create your affix file. At this level of complexity it is hopeless to do manually.

    This is the way Hungarian spell checking started, and the results were really encouraging. The compounding and REP facilities came much later.

    I honestly cannot see why you consider your language more complicated than Hungarian. If you bring some concrete examples, we can discuss the proper way to solve spell checking for the given word class.

    Ispell itself is very powerful, and hunspell/aspell, which have in fact the same features, are a bit more powerful. In my opinion any language can be handled with ispell, but hunspell/aspell add a bit more elegance to the handling.

     
  • Eleonora
    2009-11-07

    To arnotixe:
    I checked your aff/dic list. It is very interesting. I am not sure whether your use of compounding does not:
    1. increase the time unnecessarily
    2. make the whole thing inflexible.

    It would be good if you added some tests to it: words that should be considered bad, and words that should be considered correct. If I had those, I could run some tests that prove or disprove my assumptions above.
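
    Such a test could be as simple as the following Python sketch. The checker
    is passed in as a plain function, so nothing here depends on a particular
    hunspell binding; the stand-in checker and the sample words are made up for
    illustration.

```python
def evaluate(is_correct, good_words, bad_words):
    """Return (false_rejects, false_accepts) for a spell-check callable."""
    false_rejects = [w for w in good_words if not is_correct(w)]
    false_accepts = [w for w in bad_words if is_correct(w)]
    return false_rejects, false_accepts


if __name__ == "__main__":
    # Stand-in checker: a set lookup instead of a real dictionary.
    known = {"wasi"}
    rejects, accepts = evaluate(lambda w: w in known, ["wasi", "qari"], ["huasi"])
    print("good words rejected:", rejects)
    print("bad words accepted:", accepts)
```

    Both lists should be empty for a dictionary that behaves as intended.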

     
  • Arno Teigseth
    2009-11-08

    hi tyuk

    Thanks a lot for your comments. I have to say that I've not gotten very far with this project yet, but in a few months I'm moving to Ecuador and that might just speed up the progress and my own comprehension of the language.

    Well for the scripting part, I actually use a generateQUdicfile.sh bash script to generate the .dic file.

    As for the use of compounding: that's because we found that hunspell could not handle more than two affixes.
    https://sourceforge.net/projects/hunspell/forums/forum/480971/topic/3024781
    https://sourceforge.net/projects/hunspell/forums/forum/480971/topic/3030144

    Of course this may have been fixed since the above posts were written, or your approach may be better. I'll have a look soon.

    best
    Arno

     
  • Eleonora
    2009-11-08

    Thanks for the reply.
    Yes, the dic file contains the nominal forms of the words, but I thought, a test file would contain some agglutinated forms.  (The more depth the better)
    I tried to use unmunch to create some tests, however, I got this:
    adorachikuna//+
    adorachikuni//+
    adorachikunki//+
    adorachikun//+
    adorachikunchik//+
    adorachikunkichik//+

    adorachiwashkakuna//+
    adorachiwashpa//+
    adorachiwashkanka//+

    Not sure if this is a complete list.

    I used this command:
    /path_to/unmunch /path_to/qu_EC.dic /path_to/qu_EC.aff> /tmp/x

    Is the list complete, and is that what you expect? From a suffix/prefix word list, unmunch outputs words without the trailing //+, so I assume unmunch cannot work properly with the compound logic qu_EC uses.

    A language with agglutination of depth 6 (Hungarian, for example) does not need more than one affix level. Two affix levels are just a kind of luxury, that's all.