Letter frequency in unicode (Was Re: [Indic-computing-devel] Free UCS outline font)
From: Arun S. <ar...@sh...> - 2002-03-10 01:19:14
On Fri, Mar 08, 2002 at 10:51:40AM -0800, Arun Sharma wrote:

[ Context: on the topic of coming up with a "common minimum" glyph set
  for Indian languages ]

>
> If we had large amounts of representative unicode text available in
> Indian languages, we could've done a frequency analysis to figure out
> which ones were more common.
>
> I'll try to write something up later today. While we're on the topic,
> any opinions on how programs like "wc" should behave for Indian
> languages ? Should they not count the combination of a consonant and a
> vowel as a character ?

OK, I wrote up a script:

http://www.sharma-home.net/~adsharma/languages/scripts/lf.py

On running the script on this page:

[ <meta http-equiv="content-type" content="text/html; charset=UTF-8">
  <meta http-equiv="content-language" content="kn-IN"> ]

http://www.sharma-home.net/~adsharma/languages/kannada/shivarama-karant.html

I get this:

http://www.sharma-home.net/~adsharma/languages/scripts/freq.txt

Interesting stats

1. The number of times the halant was used - I guess this is because
   every "vattu" needs one.
2. The dependent vowel "e" came in second (might be similar to English,
   where e is the most frequent letter).

TODO: count the frequency on a per-syllable basis, rather than on a
per-character basis. This will need libraries to do the consonant-vowel
composition, whose output can then be run through lf.py. I see some code
in Emacs Lisp that does such computation:

http://www.mit.edu/afs/athena.mit.edu/project/ptest/emacs/emacs-20.5/lisp/language/devan-util.el

	-Arun
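
To make the per-character vs. per-syllable distinction concrete, here is a minimal sketch in Python 3. It is not lf.py (whose contents are not reproduced here); the names letter_frequency and syllables are made up for illustration. The syllable rule is a deliberate simplification and an assumption on my part: a cluster starts at a Kannada character, absorbs any combining marks that follow, and continues across a halant into the next letter. It is not the full Unicode text segmentation algorithm and will mishandle some edge cases.

```python
import unicodedata
from collections import Counter

KANNADA = range(0x0C80, 0x0D00)   # Unicode Kannada block
HALANT = "\u0CCD"                 # KANNADA SIGN VIRAMA


def letter_frequency(text):
    """Count every Kannada code point on its own (per-character basis)."""
    return Counter(ch for ch in text if ord(ch) in KANNADA)


def syllables(text):
    """Yield rough orthographic syllables (hypothetical helper).

    Simplified rule, labeled as an assumption: combining marks (vowel
    signs, halant, anusvara, ...) attach to the current cluster, and a
    letter that directly follows a halant joins it as a conjunct.
    """
    cluster = ""
    for ch in text:
        if ord(ch) not in KANNADA:
            # Anything outside the Kannada block ends the current cluster.
            if cluster:
                yield cluster
            cluster = ""
            continue
        category = unicodedata.category(ch)
        is_mark = category in ("Mn", "Mc")
        joins_conjunct = cluster.endswith(HALANT) and category == "Lo"
        if cluster and (is_mark or joins_conjunct):
            cluster += ch
        else:
            if cluster:
                yield cluster
            cluster = ch
    if cluster:
        yield cluster


if __name__ == "__main__":
    word = "\u0C95\u0CA8\u0CCD\u0CA8\u0CA1"   # the word "Kannada"
    for ch, count in letter_frequency(word).most_common():
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {count}")
    # 5 code points, but only 3 clusters under the simplified rule
    print(len(word), len(list(syllables(word))))
```

Feeding Counter(syllables(text)) instead of letter_frequency(text) would give the per-syllable counts from the TODO. The same split is one possible answer to the "wc" question: counting clusters rather than code points treats a consonant plus its dependent vowel as a single character.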