Re: Letter frequency in unicode (Was Re: [Indic-computing-devel] Free UCS outline font)
From: Arun S. <ar...@sh...> - 2002-03-11 06:47:25
On Sat, Mar 09, 2002 at 05:24:45PM -0800, Arun Sharma wrote:
> TODO: to count the frequency on a per-syllable basis, rather than a per
> character basis. Will need libraries to do the consonant-vowel
> composition and then run it through lf.py.

I finished this work today. Please review the state machine I used to do
the composition:

http://www.sharma-home.net/~adsharma/languages/scripts/state-machine.jpg

The code:

http://www.sharma-home.net/~adsharma/languages/scripts/lf.py
http://www.sharma-home.net/~adsharma/languages/scripts/kannada.py
http://www.sharma-home.net/~adsharma/languages/scripts/indian.py

The result of running the above code on:

http://www.sharma-home.net/~adsharma/languages/kannada/shivarama-karant.html

is here:

http://www.sharma-home.net/~adsharma/languages/scripts/freq.txt

The above data can be used to:

(a) Design keyboards based on an analysis of which syllables are more
    frequent, which syllables often occur next to each other, etc.

(b) Publish simplified keyboards and fonts that contain smaller, more
    manageable, but incomplete subsets of the language/script.

The above code is easily extensible to other Indian languages. All you
need to do is copy and modify kannada.py to indicate the vowels,
consonants and matras in your language.

The code is not very efficient yet. I'm focusing on getting the code
right first.

Python-specific issues:

1. Python assumes that the input .py file is ASCII. Specifying unicode
   literals requires this idiom:

       x = unicode("foobar", "utf8")

2. Printing unicode text is done as follows:

       print x.encode("utf8")

If there is enough interest, I can collect all this code (and other
language-specific modules that you may contribute) and try to get them
included in the standard Python distribution.

I'd love to run these scripts on large bodies of unicode text in Indian
languages. Any suggestions on where to get such text?

-Arun
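
For anyone who wants to see the shape of the per-syllable counting without fetching the scripts, here is a minimal sketch. It is not lf.py or kannada.py; the function names and the code point ranges are assumptions made for illustration, and the ranges are a simplification of the Kannada block (they ignore the nukta, digits and the gaps between assigned code points). It is written for Python 3, where str is already Unicode, so the unicode()/encode() idioms above are not needed. The rule it implements: a consonant or independent vowel starts a new syllable, unless the previous character was a virama, in which case the consonant joins the conjunct; matras, virama, anusvara and visarga attach to the current syllable.

```python
from collections import Counter

# Assumed, simplified Kannada code point classes (illustrative only).
VIRAMA = "\u0ccd"


def is_vowel(ch):
    # Independent vowels, roughly U+0C85..U+0C94.
    return "\u0c85" <= ch <= "\u0c94"


def is_consonant(ch):
    # Consonants, roughly U+0C95..U+0CB9.
    return "\u0c95" <= ch <= "\u0cb9"


def is_dependent(ch):
    # Anusvara, visarga, matras and the virama attach to a preceding base.
    return ch in "\u0c82\u0c83" or "\u0cbe" <= ch <= "\u0ccd"


def split_syllables(text):
    """Split text into syllables; non-Kannada characters act as separators."""
    syllables = []
    current = ""
    for ch in text:
        if is_consonant(ch):
            if current.endswith(VIRAMA):
                # Consonant after a virama continues the conjunct.
                current += ch
            else:
                if current:
                    syllables.append(current)
                current = ch
        elif is_vowel(ch):
            # An independent vowel always starts a new syllable.
            if current:
                syllables.append(current)
            current = ch
        elif is_dependent(ch) and current:
            current += ch
        else:
            # Anything else (spaces, punctuation, Latin text) ends the syllable.
            if current:
                syllables.append(current)
            current = ""
    if current:
        syllables.append(current)
    return syllables


def syllable_frequency(text):
    """Count how often each syllable occurs in text."""
    return Counter(split_syllables(text))


if __name__ == "__main__":
    sample = "\u0c95\u0ca8\u0ccd\u0ca8\u0ca1"  # the word "Kannada" in Kannada script
    for syllable, count in syllable_frequency(sample).most_common():
        print(syllable, count)
```

Adapting this to another script would mean changing only the three classification functions, which is roughly the role kannada.py plays relative to indian.py in the scripts linked above.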