Re: Letter frequency in unicode (Was Re: [Indic-computing-devel] Free UCS outline font)
From: Arun S. <ar...@sh...> - 2002-03-11 07:10:09
On Sun, Mar 10, 2002 at 10:53:05PM -0800, Arun Sharma wrote:
> The above data can be used to
>
> (a) Design keyboards based on the analysis of which syllables are more
> frequent and which syllables often occur next to each other etc.
> (b) Publish simplified keyboards and fonts, which contain smaller, more
> manageable, but incomplete subsets of the language/script.

(c) Cryptanalysis of course :)

> The above code is easily extensible to other Indian languages. All you
> need to do is copy and modify kannada.py to indicate the vowels,
> consonants and matras in your language.

I've added devanagari.py now.

> The code is not very efficient yet. I'm focussing on getting the code
> right.

Took 4 mins on an 800 MHz Duron to process 20,000 lines of text.

> I'd love to run these scripts on large bodies of unicode text in Indian
> languages. Any suggestions on where to get such text?

I ran it on the last 20,000 lines of a UTF-8 encoded English-Hindi dictionary.

http://www.sharma-home.net/~adsharma/languages/scripts/dict.txt

For those who can't read unicode, top 5 syllables:

1. ra
2. ka
3. nA
4. ta
5. pa

	-Arun
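
For anyone who wants a feel for what a per-language table of vowels, consonants and matras buys you, here is a minimal sketch of syllable counting for Devanagari. It is not the code from kannada.py or devanagari.py. The character ranges are assumptions read off the Unicode Devanagari block, the names (CONSONANT, MATRA, count_syllables and so on) are made up for illustration, and the syllable rule is deliberately simplified: a trailing halant, nukta and other marks are ignored.

```python
import re
from collections import Counter

# Assumed character classes, taken from the Unicode Devanagari block layout.
# A real per-language module may list these differently.
CONSONANT = "[\u0915-\u0939]"   # ka .. ha
VOWEL = "[\u0905-\u0914]"       # independent vowels a .. au
MATRA = "[\u093e-\u094c]"       # dependent vowel signs
VIRAMA = "\u094d"               # halant, joins consonants into a conjunct
MODIFIER = "[\u0901-\u0903]"    # candrabindu, anusvara, visarga

# Simplified syllable: either a consonant cluster with an optional matra,
# or an independent vowel, each optionally followed by one modifier.
SYLLABLE = re.compile(
    f"(?:(?:{CONSONANT}{VIRAMA})*{CONSONANT}{MATRA}?|{VOWEL}){MODIFIER}?"
)


def count_syllables(text):
    """Return a Counter mapping each syllable found in text to its frequency."""
    return Counter(SYLLABLE.findall(text))


if __name__ == "__main__":
    counts = count_syllables("राम का नाम")
    for syllable, n in counts.most_common():
        print(syllable, n)
```

Supporting another script in this sketch would mean swapping in that script's ranges, which is roughly the "copy and modify" step described above. Counting pairs of adjacent syllables, for the keyboard-layout use case, could be done by feeding zip(syllables, syllables[1:]) to a second Counter.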