Letter frequency in unicode (Was Re: [Indic-computing-devel] Free UCS outline font)


On Fri, Mar 08, 2002 at 10:51:40AM -0800, Arun Sharma wrote:

[ Context: on the topic of coming up with a "common minimum" glyph set
  for Indian languages ]

> 
> If we had large amounts of representative unicode text available in
> Indian languages, we could've done a frequency analysis to figure out
> which ones were more common.
> 
> I'll try to write something up later today. While we're on the topic,
> any opinions on how programs like "wc" should behave for Indian
> languages ? Should they not count the combination of a consonant and a
> vowel as a character ?

OK, I wrote up a script:

http://www.sharma-home.net/~adsharma/languages/scripts/lf.py

On running the script on this page:

[ <meta http-equiv="content-type" content="text/html; charset=UTF-8">
  <meta http-equiv="content-language" content="kn-IN"> ]
http://www.sharma-home.net/~adsharma/languages/kannada/shivarama-karant.html

I get this:

http://www.sharma-home.net/~adsharma/languages/scripts/freq.txt

Interesting stats:

1. How often the halant was used. I guess this is because every
   "vattu" needs one.

2. The dependent vowel "e" came in second (this might be similar to
   English, where e is the most frequent letter).
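
For reference, a per-code-point count of this kind can be sketched in a
few lines of Python. This is not the lf.py linked above, only a minimal
illustration; the function name letter_frequency, the non-ASCII filter
and the sample word are assumptions made for the example.

```python
import collections
import unicodedata


def letter_frequency(text):
    """Count each non-ASCII, non-whitespace code point in text.

    Returns (character, count) pairs, most frequent first.
    Skipping ASCII is an assumption, meant to ignore HTML markup.
    """
    counts = collections.Counter(
        ch for ch in text if ord(ch) > 127 and not ch.isspace()
    )
    return counts.most_common()


# Sample word "Kannada": KA, NA, halant (virama), NA, DDA
sample = "\u0c95\u0ca8\u0ccd\u0ca8\u0ca1"

for ch, n in letter_frequency(sample):
    print(n, "U+%04X" % ord(ch), unicodedata.name(ch, "UNKNOWN"))
```

Each halant, dependent vowel sign and consonant is counted as a separate
character here, which is why the halant can rank so high.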

TODO: count the frequency on a per-syllable basis rather than a
per-character basis. This will need libraries to do the consonant-vowel
composition first, with the result then run through lf.py.
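
A rough sketch of what that grouping step might look like, assuming a
simplified rule: a combining mark (vowel sign, halant, anusvara) joins
the cluster before it, and a letter directly after a halant joins it
too. The names syllables and syllable_frequency are hypothetical, and
this is not a full implementation of Indic syllable rules (ZWJ/ZWNJ, for
instance, are treated as separators).

```python
import collections
import unicodedata

VIRAMA_CCC = 9  # canonical combining class of the halant (virama)


def syllables(text):
    """Split text into rough orthographic syllables (simplified)."""
    clusters = []
    current = ""
    for ch in text:
        cat = unicodedata.category(ch)
        if cat.startswith("M") and current:
            # Vowel sign, halant, etc.: attach to the current cluster.
            current += ch
        elif cat.startswith("L"):
            if current and unicodedata.combining(current[-1]) == VIRAMA_CCC:
                # Letter right after a halant: part of a conjunct.
                current += ch
            else:
                if current:
                    clusters.append(current)
                current = ch
        else:
            # Space, digit, punctuation, stray mark: ends the cluster.
            if current:
                clusters.append(current)
            current = ""
    if current:
        clusters.append(current)
    return clusters


def syllable_frequency(text):
    """Count clusters, most frequent first."""
    return collections.Counter(syllables(text)).most_common()


# "Kannada" splits into KA, NA+halant+NA, DDA under this rule.
print(syllables("\u0c95\u0ca8\u0ccd\u0ca8\u0ca1"))
```

With this grouping the halant no longer shows up as a character of its
own; it is folded into the conjunct it forms.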

I see some code in Emacs Lisp that does this kind of composition:

http://www.mit.edu/afs/athena.mit.edu/project/ptest/emacs/emacs-20.5/lisp/language/devan-util.el

	-Arun