[Indic-computing-devel] Generic transliteration rule format for Indian languages

Hi all,

In an attempt to understand how transliteration could be done for the
majority of the Indian languages using generic principles, I have studied
some of the South Indian languages, and of course Hindi, and come up with a
generic framework for specifying the transliteration rules (as a set of
grammar rules) for transliterating one 'word'. In this mail, I explain the
format of the transliteration rule file. I have the following files, which I
can send by email to anyone who would like to take a look (I will put
them up on SourceForge in a few days).

The fontmap files and the transliteration rule files :
	cdac-kan-fontmap.txt - fontmap file for CDAC's KN-TTUma-Normal font
	cdac-tam-fontmap.txt - fontmap file for CDAC's TM-TTValluvar-Normal (Tamil)
	cdac-tel-fontmap.txt - fontmap file for CDAC's TL-TTHema-Normal (Telugu)
	kan-translit.txt     - transliteration rule file for Kannada script 
	tam-translit.txt     - transliteration rule file for Tamil script
	tel-translit.txt     - transliteration rule file for Telugu script

(the above files together are 11K in total after zipping).

Generic transliteration tool (stand-alone)
	translit.c  - reads a translit rule file and transliterates input
	              words (from stdin or a file) into glyph names as
	              specified in the fontmap file for that language.

(the above file is 16K after zipping).

Please go through this and let me know your comments. Send me mail if you
would like the above files. Thanks.

Format for transliteration rules files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Most of the Indian languages (or at least the South Indian ones, Hindi and
Marathi) consist of various classes of 'letters' such as vowels and
consonants. These are the base classes. There are several derived or
dependent classes of letters (or letter forms or ligatures), such as
dependent vowel and consonant signs, that are 'combined' with the base
consonants to form various syllables (or 'gunintha's, as they are called in
Kannada and Telugu). Base vowels are never modified or combined with any
other forms. Depending on the particular language, there may be 'letters'
other than base vowels that stand on their own or can't be 'modified'.

Most of the complications in rendering the letters on display come from
finding out how to 'combine' the letter forms ([consonant]+ + vowel).
That's the motivation behind designing this 'generic' transliteration rule
format for Indian languages. I have defined the rules for Telugu, Tamil and
Kannada. The same can easily be done for Hindi.

In any of the Indian languages, one 'unit' for transliteration is one
'word'. So, in the rule file, one rule will match one syllable or sub-word
and several such rules will complete transliteration of one input word.

The basic transliteration rules are usually simple. However, complications
arise due to special cases for several vowel/consonant combinations.
At times, one has to 'reorder', so to say, the input sequence so as to be
able to generate a linear sequence of glyphs to correctly represent the
input sequence.

'Half-consonants' or mathras or dependent consonant signs : There are no
code points for these in Unicode. Whenever an input sequence represents a
compound letter ('samyukthakshara'), the transliteration rule has to be able
to take the context into account. The same goes for 'start of word', 'end of
word' and many other special cases.

Some consonants and vowels always have special cases, with multiple versions
of the same letter for different contexts. The transliteration rules have to
be written per context.

Input reordering : instead of re-ordering the input, I have decided to use
recursion to transliterate sub-words and thus create the right order for
the glyph output sequence.

Input keyboard : regular English keyboard. It can actually be any keyboard
as long as each letter in the script can be represented by a 'unique' key
sequence from that keyboard.

One major note : Since there don't seem to be any standards for defining
glyphs for Indian languages, no two font files for the same language/script
seem to have the same set of glyph definitions, unlike Latin languages (I
checked a couple of Kannada TTF files and a couple of Telugu TTF files and
found that each has some extra definitions or lacks some). So, except
for the basic glyphs, you may have to revisit the translit rule file for each
font that you want to support.

What needs to be done before 'defining' the translit rules :
------------------------------------------------------------

1. Choose a sample font file (I chose CDAC's TTF files for Telugu, Kannada
   and Tamil - there are some deficiencies in them).

2. Study the font file, identify various classes of letters and forms.
3. Give names to whatever letters you feel should be named (glyph names).
4. Define a file, as suggested by Koshy, in which you define glyph names
   and give the indexes of the glyphs (0-255) that correspond to this glyph
   name (such as base vowels, base consonants, mathras, dependent vowel
   signs, specials such as 'anuswara', 'visarga' etc etc).

   This is the letter-form file.

Now back to the transliteration rule file.

The file is composed of several sections, each introduced by a line starting
with '=' followed by the name of the section. Each section holds its rules
in the format prescribed for that section.

Glyph definitions : max length 31, matching [a-zA-Z][a-zA-Z0-9_]*
Variable names (e.g. classes) : same as glyph definitions
Class member names (such as vowels 'a' 'aa' etc.) : max length 7

Rough grammar for the rule file (not BNF) :

File : [section]+

section :
        =<section-name>
        {rules}+

Comments : ;.*$

Input stream : only printable English ASCII characters plus blank.
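
As an illustration only (this is not taken from translit.c), the rough
grammar above could be read with a few lines of Python. The function name
split_sections is hypothetical, and the sketch assumes that ';' always
starts a comment and never occurs as an input literal inside a rule:

```python
import re

def split_sections(text):
    """Split rule-file text into {section header: [rule lines]}."""
    sections = {}
    current = None
    for raw in text.splitlines():
        # Comments run from ';' to end of line (grammar: ;.*$).
        line = re.sub(r";.*$", "", raw).rstrip()
        if not line.strip():
            continue                      # blank or comment-only line
        if line.startswith("="):
            current = line[1:].strip()    # e.g. "class vowel", "reorder"
            sections.setdefault(current, [])
        elif current is not None:
            sections[current].append(line.strip())
    return sections

sample = """\
; illustrative rule file fragment
=class vowel
vowel       a|aa|A|i   ; alternate spellings are separate members
=reorder
"""
print(split_sections(sample))
```

Rule lines are kept in file order within each section, which matters
because rule order encodes precedence.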

1.  Section to define classes of input symbols (e.g. vowels, consonants)
    (As of now, I have pre-defined some classes. User-defined classes, say
    the 'chillu's of Malayalam, can also be defined).

    The format of the rules in this section is :

    =class <class-name>    
    <class-name>    <str>[|str]*

    e.g 
    =class vowel
    vowel       a|aa|A|i|ee|u|oo|e|E|ai|o|O|au

    Note : alternate input sequences that denote the same letter (e.g. A and
    aa) are also listed as separate members in the above class definition.

    You basically define what input kb sequence will correspond to what
    'letter' of this class.
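
A minimal sketch of how such a class rule might be parsed and used to
recognise a member in the input. The helper names are hypothetical, and
preferring the longest member (so that 'aa' wins over 'a') is an
assumption of this sketch, not something the format prescribes:

```python
def parse_class_rule(line):
    """Parse '<class-name>  <str>[|str]*' into (name, [members])."""
    name, members = line.split(None, 1)
    return name, members.strip().split("|")

def match_member(members, text, pos=0):
    """Return the longest class member found at text[pos:], or None."""
    best = None
    for m in members:
        if text.startswith(m, pos) and (best is None or len(m) > len(best)):
            best = m
    return best

name, vowels = parse_class_rule("vowel       a|aa|A|i|ee|u|oo|e|E|ai|o|O|au")
print(name, match_member(vowels, "kaa", 1))
```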

2.  Section for defining rules to re-order input to suit left-to-right 
    sequential rendering of glyphs. The input will not be consumed after the
    matching of any of the regexp's defined in this section.

    =reorder

    <inputstr>      <reordered-inputstr>

    As of now, I haven't really used this section. I managed to write the
    transliteration rules using recursion instead.

3.  Section for defining transliteration rules

        <regexp-for-inputstr>       [<glyph-name>|<function>(parm[,parm]*)]+

    Note that the regexp here is not as versatile as a Unix regexp. I have
    defined a cut-down version that suffices for the purpose here.

    regexp (without any white-space) may consist of literals (such as 'a')
    or variables (such as %<classname>).

    Special chars in input spec :
        ^       start of word
        $       end of word
        _       Zero-width separator, specially used to impose generation
                of consonant+halanth
        |       to group alternate subexpressions of input
                (<str1>|<str2>|...)
                e.g (aa|A)
        %<classname>        input str that matches one of the members of
                            a particular class of letters
                            e.g %vowel, %cons
                            Class name should be one of defined classes in
                            the classes section.
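
One possible way to interpret the input spec described above is to
translate it into an ordinary regular expression. The Python sketch below
is an assumption-laden illustration, not the matcher used in translit.c:
the name compile_input_spec is hypothetical, each %<classname> is assumed
to count as a numbered sub-expression, class members are tried longest
first, and '_' is treated as a literal character typed in the input.

```python
import re

def compile_input_spec(spec, classes):
    """Translate a cut-down input spec into a compiled Python regex."""
    out = []
    i = 0
    while i < len(spec):
        ch = spec[i]
        if ch == "%":
            m = re.match(r"[a-zA-Z][a-zA-Z0-9_]*", spec[i + 1:])
            name = m.group(0) if m else ""
            # Trim trailing characters until a defined class name remains,
            # so that "%cons_" is read as class "cons" followed by "_".
            while name and name not in classes:
                name = name[:-1]
            if not name:
                raise ValueError("unknown class in: " + spec)
            members = sorted(classes[name], key=len, reverse=True)
            out.append("(" + "|".join(re.escape(s) for s in members) + ")")
            i += 1 + len(name)
        elif ch in "^$|()":
            out.append(ch)             # same meaning in Python's re
            i += 1
        else:
            out.append(re.escape(ch))  # literal input character
            i += 1
    return re.compile("".join(out))

classes = {"vowel": ["a", "aa", "A", "i"], "cons": ["k", "kh", "g"]}
pattern = compile_input_spec("^%cons(aa|A)$", classes)
print(pattern.match("khaa").groups())
```

With this reading, %1 and %2 on the glyph side correspond to the match
groups 1 and 2 of the compiled pattern.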

    Special functions on glyph side
        translit(parmlist)  This is a special function to specify
                            transliteration of a sub-word of the input
                            stream matched by this rule.

                            Parameters can be :
                                %n      where n is the index (1-base) of
                                        input sub-expression matched
                                %<variable> such as %cons or %vowel
                                <literal> such as a member of a class,
                                          like 'aa' or 'E' but without the 
                                          quotes

                            This function will 'form' a word, picking out
                            the sub-strings from the matched input and then
                            do the transliteration from the start,
                            recursively.

        glyph(%<var>)       function to return the glyph(s) of a variable
                            which holds the value of matched input str.
                            e.g %vowel
        glyph(%1,%2,...)    glyph() will find out and return the glyph
                            names for the input sub-expressions 1, 2 etc

    On glyph side :         %1, %2 etc will denote matched sub-expressions
                            of input stream (numbered from 1).
    On glyph side : 
        unget(%<n>)         instruction to push characters after the 
                            specified numbered subexpression back onto 
                            the input stream
                            e.g unget(%1) will push back all chars after
                            the first subexpression onto the input stream.

    On glyph side :
        dep_vowel_sign(%n or %vowel or <vowel name>)
                            Here, %vowel should be the only vowel matched
                            in the input stream. Or you could give the name
                            of a vowel, which is defined in the vowel class
                            (here the name would be same as the input
                            sub-expression that would match that vowel).
                            Or, you could give the sub-expression number
                            and that should be a member of the vowel class.

        dep_cons_sign(%cons or %n or <consonant name>)
                            Similar to dep_vowel_sign. Here you can
                            specify a particular consonant by its order
                            of occurrence in the input stream, e.g. %1, %2.

4.  Section to specify dependent vowel signs  (=dep_vowel_signs)

    <vowel name>            [glyph name]+

5.  Section to specify dependent consonant signs or mathras (=dep_cons_signs)

    <consonant name>        [glyph name]+

6.  Section to specify base vowel signs  (=vowel)

    <vowel name>             [glyph name]+

7.  Section to specify base consonant signs  (=cons)

    <consonant name>             [glyph name]+

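Sections 4 to 7 above all share the same '<name> [glyph name]+' layout, so
one small reader could serve all of them. The Python sketch below is
illustrative only: parse_glyph_table and the glyph names sign_aa and
sign_e are made up, and the name check simply applies the glyph-name
limits stated earlier (31 characters, [a-zA-Z][a-zA-Z0-9_]*).

```python
import re

# A letter followed by up to 30 more characters: at most 31 in total.
GLYPH_NAME = re.compile(r"[a-zA-Z][a-zA-Z0-9_]{0,30}")

def parse_glyph_table(lines):
    """Map each vowel/consonant name to its list of glyph names."""
    table = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        name, glyphs = fields[0], fields[1:]
        for g in glyphs:
            if not GLYPH_NAME.fullmatch(g):
                raise ValueError("bad glyph name: " + g)
        table[name] = glyphs
    return table

dep_vowel_signs = parse_glyph_table([
    "aa      sign_aa",
    "o       sign_e sign_aa",   # one vowel sign built from two glyphs
])
print(dep_vowel_signs["o"])
```
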
All rules should be listed in the order of precedence in which you want them
applied, highest precedence first.
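
To show how rule precedence and the recursive translit() function could
work together, here is a toy Python sketch. It is not the logic of
translit.c. The rules are written as Python functions instead of being
read from a rule file, all glyph names are invented, and the example
relies on the Tamil convention that the 'e' sign is drawn before the
consonant and the 'o' sign is the 'e' sign plus the 'aa' sign.

```python
import re

# Hypothetical glyph tables (as from the =cons and =dep_vowel_signs sections).
CONS = {"k": ["ka"], "m": ["ma"]}
DEP_VOWEL = {"a": [], "aa": ["sign_aa"], "e": ["sign_e"]}
C = "(k|m)"  # stands in for %cons

def rule_o(m):      # %cons o       -> dep_vowel_sign(e) translit(%1,aa)
    return DEP_VOWEL["e"] + translit(m.group(1) + "aa")

def rule_e(m):      # %cons e       -> dep_vowel_sign(e) glyph(%1)
    return DEP_VOWEL["e"] + CONS[m.group(1)]

def rule_vowel(m):  # %cons (aa|a)  -> glyph(%1) dep_vowel_sign(%2)
    return CONS[m.group(1)] + DEP_VOWEL[m.group(2)]

def rule_bare(m):   # %cons         -> glyph(%1) halant
    return CONS[m.group(1)] + ["halant"]

# Highest precedence first: the bare-consonant rule must come last,
# otherwise it would match before any consonant+vowel rule is tried.
RULES = [
    (re.compile(C + "o"), rule_o),
    (re.compile(C + "e"), rule_e),
    (re.compile(C + "(aa|a)"), rule_vowel),
    (re.compile(C), rule_bare),
]

def translit(word):
    """Apply the first matching rule at each position until input is used up."""
    glyphs, pos = [], 0
    while pos < len(word):
        for pattern, action in RULES:
            m = pattern.match(word, pos)
            if m:
                glyphs += action(m)
                pos = m.end()
                break
        else:
            raise ValueError("no rule matches at %r" % word[pos:])
    return glyphs

print(translit("komaa"))
```

The rule for 'o' never reorders the input. It emits the left-hand sign
and then calls translit() on a newly formed sub-word, which is how
recursion produces the glyphs in left-to-right order.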

--
cheers,

Nagarajan

________________K. Nagarajan_________________________________________________
Hewlett-Packard,                    Phone (W)        : +91-80-286-3394 x1182
Indian Express Building, Queens Road
Hewlett Packard - ISO,              Fax:             : +91-80-226 4107
                                                       +91-80-226 4108
Bangalore - 560 052,                HP Telnet (India): 847-1182 
                                    Internet         : kn...@in...
_____________________________________________________________________________