#195 Use hunspell for Dzongkha (like Tibetan)

open
nobody
None
5
2011-05-26
2011-05-26
No

Hi,

I am starting this project https://sourceforge.net/projects/dzongkha-dict/

I think that hunspell will work, so the project above is to make a hunspell dictionary. But first, I think, I will need to make a version of hunspell that can find Dzongkha words. The syllables are all connected with a "tsek" (little period), and I have an idea to use a simple "longest word" algorithm to find the Dzongkha words. What do you think? Are you interested? Here is how the algorithm would work (below).

Here is what Dzongkha looks like:
http://www.dzongkha.gov.bt/intro/index.html

Here is some high-level information on Dzongkha:
http://www.omniglot.com/writing/dzongkha.php

Here is my "biggest word" algorithm:

When I translate Dzongkha myself, I use this same "biggest word"
algorithm, just as the spell checker would. I start with the first
syllable and see whether the second, third, etc. are part of a single
word. Then, after making the largest word that I can, I start over
again with the first syllable after that word.
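
Before any dictionary lookup, the text has to be cut into syllables at each tsek. Here is a minimal Python sketch of that step. It assumes the tsek is the Unicode character U+0F0B; treating the non-breaking variant U+0F0C the same way is my assumption, and the name split_syllables is only illustrative, not anything in hunspell.

```python
TSEK = "\u0f0b"     # TIBETAN MARK INTERSYLLABIC TSHEG
NB_TSEK = "\u0f0c"  # non-breaking variant, treated the same here (assumption)


def split_syllables(text):
    """Split running text into syllables at each tsek.

    Whitespace around a syllable and empty pieces (for example after a
    trailing tsek) are dropped.
    """
    normalized = text.replace(NB_TSEK, TSEK)
    return [piece.strip() for piece in normalized.split(TSEK) if piece.strip()]
```

The resulting list of syllables is what the "biggest word" search would walk over, joining neighbours back together with a tsek before each dictionary lookup.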

Now I will give you an example of exactly how the suggested algorithm
would work. Take the Dzongkha example:
འབྲུག་གི་རྒྱལ་ཡོངས་སྐད་ཡིག་རྫོང་ཁ་འདི་ དུས་

Looking in your Dzongkha-English dictionary, I first look for འབྲུག. I
find "dragon/Bhutan", so the spell checker accepts this syllable as
correct but tries to make a bigger word. Now it checks whether
འབྲུག་གི is also a word; it is not. So it accepts འབྲུག and starts
again with གི. གི by itself is not a word in our dictionary. I believe
it is the genitive particle (from Tibetan), which we would add to the
Dzongkha spell checker dictionary file, so གི is accepted. གི་རྒྱལ is
not found, and so we start over with the next syllable, རྒྱལ.

རྒྱལ is an interesting case: by itself it is not a word, but many
words start with this syllable, so we say it is OK if we can match
more adjacent syllables. The next syllable is ཡོངས, so next we look
for རྒྱལ་ཡོངས. We find this adjective, "national", so we accept it,
and using our biggest-word algorithm, next we try རྒྱལ་ཡོངས་སྐད. We do
not find it, but our algorithm has a feature where it will check at
least 20 syllables forward before giving up and putting a red line
under རྒྱལ་ཡོངས་སྐད, in case there is a longer word waiting for us.
So next we try རྒྱལ་ཡོངས་སྐད་ཡིག, and we find it: "national
language". We accept this word and try རྒྱལ་ཡོངས་སྐད་ཡིག་རྫོང. It is
not a word, so we accept རྒྱལ་ཡོངས་སྐད་ཡིག and move on.
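
The look-ahead matters because a search that stops at the first miss would settle for རྒྱལ་ཡོངས and never reach the four-syllable word. A small Python sketch of the idea follows. The function name longest_match and the two-entry dictionary are made up for illustration, and I treat 20 as the size of the look-ahead window, which is my reading of the description above.

```python
TSEK = "\u0f0b"  # the tsek that joins syllables


def longest_match(syllables, start, dictionary, lookahead=20):
    """Return how many syllables the longest dictionary word at `start` spans.

    Returns 0 when no word starts at `start`. The loop does not stop at
    the first miss, because a longer word may still follow it.
    """
    best = 0
    limit = min(lookahead, len(syllables) - start)
    for n in range(1, limit + 1):
        candidate = TSEK.join(syllables[start:start + n])
        if candidate in dictionary:
            best = n
    return best


syllables = ["རྒྱལ", "ཡོངས", "སྐད", "ཡིག", "རྫོང"]
dictionary = {
    TSEK.join(["རྒྱལ", "ཡོངས"]),                # "national"
    TSEK.join(["རྒྱལ", "ཡོངས", "སྐད", "ཡིག"]),  # "national language"
}

full_window = longest_match(syllables, 0, dictionary)
short_window = longest_match(syllables, 0, dictionary, lookahead=3)
print(full_window, short_window)
```

With the full window the four-syllable word wins; with a window of only three syllables the search falls back to the two-syllable word, which is the behaviour the look-ahead is meant to avoid.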

Next we check for རྫོང. This is a word, "fortress", so we accept it
and try རྫོང་ཁ (biggest word algorithm). We find it: "national
language of Bhutan". So we try རྫོང་ཁ་འདི. We do not find it, so we
accept རྫོང་ཁ and start with འདི. I am guessing that, as in Tibetan,
this is the demonstrative pronoun? We would add this to our
dictionary file, so འདི counts as a word. འདི་དུས is not a word, so
we accept འདི and move on.

Next we start with དུས. Perhaps this is the particle of general
subordination (from Tibetan)? So we accept it as is, and move on.
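
The whole walk-through can be sketched in a few lines of Python. This is only an illustration of the idea, not hunspell code: the names segment, word and DICTIONARY are made up, the dictionary holds just the entries assumed in the example (including the particles I proposed adding), and 20 is again treated as the size of the look-ahead window.

```python
TSEK = "\u0f0b"  # the tsek that joins syllables


def split_syllables(text):
    """Split text at each tsek, dropping whitespace and empty pieces."""
    return [piece.strip() for piece in text.split(TSEK) if piece.strip()]


def segment(text, dictionary, lookahead=20):
    """Greedy "biggest word" segmentation.

    Returns a list of (word, known) pairs. A syllable that starts no
    dictionary word is emitted alone with known=False, which is where a
    spell checker would draw its red line.
    """
    syllables = split_syllables(text)
    words = []
    i = 0
    while i < len(syllables):
        best = 0
        for n in range(1, min(lookahead, len(syllables) - i) + 1):
            if TSEK.join(syllables[i:i + n]) in dictionary:
                best = n  # remember the longest hit, keep looking ahead
        if best == 0:
            words.append((syllables[i], False))
            i += 1
        else:
            words.append((TSEK.join(syllables[i:i + best]), True))
            i += best
    return words


def word(*syllables):
    """Helper for writing multi-syllable dictionary entries."""
    return TSEK.join(syllables)


DICTIONARY = {
    word("འབྲུག"),                       # dragon / Bhutan
    word("གི"),                          # genitive particle (to be added)
    word("རྒྱལ", "ཡོངས"),                # national
    word("རྒྱལ", "ཡོངས", "སྐད", "ཡིག"),  # national language
    word("རྫོང"),                        # fortress
    word("རྫོང", "ཁ"),                   # Dzongkha
    word("འདི"),                         # demonstrative (to be added)
    word("དུས"),                         # particle (to be added)
}

TEXT = "འབྲུག་གི་རྒྱལ་ཡོངས་སྐད་ཡིག་རྫོང་ཁ་འདི་ དུས་"

result = segment(TEXT, DICTIONARY)
for found, known in result:
    print(found, "ok" if known else "NOT FOUND")
```

Trying every length up to the window on each step is wasteful for a real dictionary. A prefix structure such as a trie could stop as soon as no longer word is possible, but the simple loop is enough to show the behaviour.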

Thanks,

Tom

Discussion