Re: Letter frequency in unicode (Was Re: [Indic-computing-devel] Free UCS outline font)


On Sun, Mar 10, 2002 at 10:53:05PM -0800, Arun Sharma wrote:
> The above data can be used to 
> 
> (a) Design keyboards based on the analysis of which syllables are more
>     frequent and which syllables often occur next to each other etc.
> (b) Publish simplified keyboards and fonts, which contain smaller, more
>     manageable, but incomplete subsets of the language/script. 

(c) Cryptanalysis of course :)

> 
> The above code is easily extensible to other Indian languages. All you
> need to do is copy and modify kannada.py to indicate the vowels,
> consonants and matras in your language.

I've added devanagari.py now.

> 
> The code is not very efficient yet. I'm focussing on getting the code 
> right.
> 

It took 4 minutes on an 800 MHz Duron to process 20,000 lines of text.

> I'd love to run these scripts on large bodies of unicode text in Indian
> languages. Any suggestions on where to get such text ?

I ran it on the last 20,000 lines of a UTF-8 encoded 
English-Hindi dictionary.

http://www.sharma-home.net/~adsharma/languages/scripts/dict.txt

For those who can't read Unicode, here are the top 5 syllables:

1. ra
2. ka
3. nA
4. ta
5. pa
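
For anyone who wants to try a similar count without the scripts, here
is a rough sketch of the idea. It is not the code in devanagari.py:
the regular expression and the names SYLLABLE and count_syllables are
made up for illustration, and the character ranges are taken from the
Unicode Devanagari block. It assumes a syllable is either a consonant
cluster with an optional matra, or an independent vowel, followed by
an optional anusvara/visarga/candrabindu.

```python
import re
from collections import Counter

# Character classes from the Unicode Devanagari block (assumed subset;
# nukta-composed letters U+0958..U+095F and rarer signs are ignored).
C = "\u0915-\u0939"   # consonants ka..ha
V = "\u0905-\u0914"   # independent vowels a..au
M = "\u093e-\u094c"   # dependent vowel signs (matras)
H = "\u094d"          # virama (halant)
N = "\u093c"          # nukta
X = "\u0901-\u0903"   # candrabindu, anusvara, visarga

# One syllable: consonant (+ virama + consonant)* with an optional
# matra or trailing virama, or a single independent vowel; either
# form may carry one trailing modifier sign.
SYLLABLE = re.compile(
    f"(?:[{C}]{N}?(?:{H}[{C}]{N}?)*(?:[{M}]|{H})?|[{V}])[{X}]?"
)


def count_syllables(text):
    """Return a Counter mapping each syllable found in text to its count."""
    return Counter(SYLLABLE.findall(text))
```

Calling count_syllables(text).most_common(5) on the decoded contents
of a UTF-8 file gives a top-5 list like the one above, though the
exact ranking depends on how syllables are split, so it need not match
these numbers.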

	-Arun