[Indic-computing-devel] Generic transliteration rule format for Indian languages
From: K N. <kn...@wi...> - 2002-08-09 00:43:10
Hi all,

In an attempt to understand how transliteration could be done for the
majority of the Indian languages with generic principles, I have studied
some of the South Indian languages and of course Hindi, and came up with
a generic framework to specify the transliteration rules (in the form of
a set of grammar rules) for transliterating one 'word'. In this mail, I
explain the format of the transliteration rule file.

I have the following files, which I can send through email to anyone who
would like to take a look (I will put them up on SourceForge in a few
days).

The fontmap files and the transliteration rule files:

  cdac-kan-fontmap.txt - fontmap file for CDAC's KN-TTUma-Normal font
  cdac-tam-fontmap.txt - fontmap file for CDAC's TM-TTValluvar-Normal (Tamil)
  cdac-tel-fontmap.txt - fontmap file for CDAC's TL-TTHema-Normal (Telugu)
  kan-translit.txt     - transliteration rule file for Kannada script
  tam-translit.txt     - transliteration rule file for Tamil script
  tel-translit.txt     - transliteration rule file for Telugu script

(The above files together are 11K in total after zip'ing.)

Generic transliteration tool (stand-alone):

  translit.c - This will read in a translit rule file and transliterate
               input words (stdin or a file) into glyph names as
               specified in the fontmap file for that language.

(The above file is 16K after zip'ing.)

Please go through this and let me know your comments. Send me mail for
the above files. Thanks.

Format for transliteration rules files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Most of the Indian languages (or at least the South Indian ones, Hindi
and Marathi) consist of various classes of 'letters' such as vowels and
consonants. These are the base classes.
There are several derived or dependent classes of letters (or letter
forms or ligatures), such as dependent vowel and consonant signs, that
are 'combined' with the base consonants (base vowels are never modified
or combined with any other forms) to form various phonemes (or
'gunintha's, as they are called in Kannada and Telugu). There may be
'letters' other than base vowels that stand on their own or can't be
'modified'. It depends on the particular language.

Most of the complications in rendering the letters on display come from
finding out how to 'combine' the letter forms ([consonant]+ + vowel).
That's the motivation behind designing this 'generic' transliteration
rule format for Indian languages. I have defined the rules for Telugu,
Tamil and Kannada. The same can easily be done for Hindi.

In any of the Indian languages, one 'unit' for transliteration is one
'word'. So, in the rule file, one rule will match one syllable or
sub-word, and several such rules will complete the transliteration of
one input word.

The basic transliteration rules are usually simple. However,
complications arise due to special cases for several vowel/consonant
combinations. At times, one has to 'reorder', so to say, the input
sequence so as to be able to generate a linear sequence of glyphs that
correctly represents the input sequence.

'Half-consonants' or mathras or dependent consonant signs: There are no
code points for these in Unicode. Whenever an input sequence represents
a compound letter ('samyukthakshara'), the transliteration rule has to
be able to take the context into account. The same goes for 'start of
word', 'end of word' and many special cases. There are always special
cases for some consonants and vowels, in the form of multiple versions
of the same letter for different contexts. Write the transliteration
rules as per the context.

Input reordering: instead of re-ordering the input, I have decided to
use recursion to transliterate sub-words and thus create the right order
for the glyph output sequence.
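
To make the recursion idea concrete, here is a rough sketch in Python
(not the actual translit.c code). It assumes a script in which one
dependent vowel sign is drawn to the left of its consonant, so the glyph
order differs from the typing order. All class members, glyph names and
the 'halanth' glyph below are made up for illustration. Each pass of the
loop plays the role of one rule matching one syllable, and a consonant
plus vowel is handled by forming the sub-word consonant+'a' and
transliterating that recursively.

```python
# All class members and glyph names below are made up for illustration.
CONSONANTS = ("k", "m", "t")
VOWELS = ("aa", "a", "e")   # longer alternatives first, so "aa" wins over "a"
BASE_GLYPH = {"k": "ka", "m": "ma", "t": "ta", "a": "a", "aa": "aa", "e": "e"}
# Dependent vowel sign per vowel: (glyph name, drawn left of the consonant?)
DEP_VOWEL_SIGN = {"aa": ("sign_aa", False), "e": ("sign_e", True)}


def match_prefix(word, pos, members):
    """Return the first member that occurs in word at pos, or None."""
    for member in members:
        if word.startswith(member, pos):
            return member
    return None


def translit(word):
    """Transliterate one input word into a list of glyph names."""
    glyphs = []
    pos = 0
    while pos < len(word):
        cons = match_prefix(word, pos, CONSONANTS)
        if cons is None:
            vowel = match_prefix(word, pos, VOWELS)
            if vowel is None:
                raise ValueError(f"no rule matches {word[pos:]!r}")
            glyphs.append(BASE_GLYPH[vowel])         # stand-alone base vowel
            pos += len(vowel)
            continue
        pos += len(cons)
        vowel = match_prefix(word, pos, VOWELS)
        if vowel is None:
            glyphs += [BASE_GLYPH[cons], "halanth"]  # consonant without vowel
            continue
        pos += len(vowel)
        if vowel == "a":
            glyphs.append(BASE_GLYPH[cons])          # base glyph already carries 'a'
            continue
        sign, drawn_left = DEP_VOWEL_SIGN[vowel]
        base = translit(cons + "a")                  # recursion on a formed sub-word
        glyphs += [sign] + base if drawn_left else base + [sign]
    return glyphs


result = translit("kemaa")
print(result)
```

With these made-up tables, the typed order is k, e, m, aa, but the sign
for 'e' comes out before the glyph for 'k' without the input itself ever
being reordered.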
Input keyboard: regular English keyboard. It can actually be any
keyboard, as long as each letter in the script can be represented by a
'unique' key sequence from that keyboard.

One major note: since there don't seem to be any standards for defining
glyphs for Indian languages, no two font files for the same
language/script seem to have the same set of glyph definitions, unlike
Latin languages (I checked a couple of Kannada TTF files and a couple of
Telugu TTF files and found that each has some extra definitions or lacks
some). So, except for the basic glyphs, you may have to revisit the
translit rule file for each font that you want to support.

What needs to be done before 'defining' the translit rules :
------------------------------------------------------------

1. Choose a sample font file (I chose CDAC's TTF files for Telugu,
   Kannada and Tamil - there are some deficiencies in them).
2. Study the font file, identify various classes of letters and forms.
3. Give names to whatever letters you feel should be named (glyph
   names).
4. Define a file, as suggested by Koshy, in which you define glyph names
   and give the indexes of the glyphs (0-255) that correspond to each
   glyph name (such as base vowels, base consonants, mathras, dependent
   vowel signs, and specials such as 'anuswara', 'visarga' etc). This is
   the letter-form file.

Now back to the transliteration rule file. The file is composed of
several sections, separated by lines starting with '=' followed by the
name of the section. Each section will have rules in the format
prescribed for that section.

  Glyph definitions : max length 31, [a-zA-Z][a-zA-Z0-9_]*
  Variable names (e.g classes) : same as glyph definitions
  Class member names (such as vowels 'a' 'aa' etc) : max length 7

Rough grammar for the rule file (not BNF):

  File     : [section]+
  section  : =<section-name>
             {rules}+
  Comments : ;.*$

Input stream: only printable English ASCII characters plus blank.

1. Section to define classes of input symbols (e.g vowels, consonants)

   (As of now, I have pre-defined some classes. User-defined classes,
   say the 'chillu's of Malayalam, can be defined.)

   The format of the rules in this section is:

     =class <class-name>
     <class-name> <str>[|str]*

   e.g

     =class vowel
     vowel a|aa|A|i|ee|u|oo|e|E|ai|o|O|au

   Note: alternate input sequences that denote the same letter (e.g A
   and aa) are also listed as separate members in the above class
   definition. You basically define which input keyboard sequence
   corresponds to which 'letter' of this class.

2. Section for defining rules to re-order input to suit left-to-right
   sequential rendering of glyphs. The input will not be consumed after
   the matching of any of the regexps defined in this section.

     =reorder
     <inputstr> <reordered-inputstr>

   As of now, I haven't really used this section. I managed to write the
   transliteration rules with recursion.

3. Section for defining transliteration rules

     <regexp-for-inputstr> [[glyph-name]|<function(parm[,parm]*)>]+

   By the way, the regexp here is not really as versatile as the Unix
   regexp. I have defined a cut-down version that suffices for the
   purpose here.

   The regexp (without any white-space) may consist of literals (such as
   'a') or variables (such as %<classname>).

   Special chars in the input spec:

     ^             start of word
     $             end of word
     _             zero-width separator, specially used to impose
                   generation of consonant+halanth
     |             to group alternate subexpressions of input
                   (<str1>|<str2>|...) e.g (aa|A)
     %<classname>  input str that matches one of the members of a
                   particular class of letters e.g %vowel, %cons. The
                   class name should be one of the classes defined in
                   the classes section.

   Special functions on the glyph side:

     translit(parmlist)
        This is a special function to specify transliteration of a
        sub-word of the input stream matched by this rule.
        Parameters can be:
          %n           where n is the index (1-based) of the input
                       sub-expression matched
          %<variable>  such as %cons or %vowel
          <literal>    such as a member of a class, like 'aa' or 'E',
                       but without the quotes
        This function will 'form' a word, picking out the sub-strings
        from the matched input, and then do the transliteration from the
        start, recursively.

     glyph(%<var>)
        Function to return the glyph(s) of a variable which holds the
        value of the matched input str, e.g %vowel.

     glyph(%1,%2,...)
        glyph() will find out and return the glyph names for the input
        sub-expressions 1, 2 etc.

     %1, %2 etc
        Denote matched sub-expressions of the input stream (numbered
        from 1).

     unget(%<n>)
        Instruction to push the characters after the specified numbered
        sub-expression back onto the input stream, e.g unget(%1) will
        push back all chars after the first sub-expression onto the
        input stream.

     dep_vowel_sign(%n or %vowel or <vowel name>)
        Here, %vowel should be the only vowel matched in the input
        stream. Or you could give the name of a vowel which is defined
        in the vowel class (here the name would be the same as the input
        sub-expression that would match that vowel). Or you could give
        the sub-expression number, and that should be a member of the
        vowel class.

     dep_cons_sign(%cons or %n or <consonant name>)
        Similar to dep_vowel_sign. Here you can specify a particular
        consonant by its sequence of occurrence in the input stream,
        e.g %1, %2 etc.

4. Section to specify dependent vowel signs (=dep_vowel_signs)

     <vowel name> [glyph name]+

5. Section to specify dependent consonant signs or mathras
   (=dep_cons_signs)

     <consonant name> [glyph name]+

6. Section to specify base vowel signs (=vowel)

     <vowel name> [glyph name]+

7. Section to specify base consonant signs (=cons)

     <consonant name> [glyph name]+

All rules should be ordered as per the precedence that you want them to
be applied with, highest precedence first.

--
cheers,
Nagarajan

________________K. Nagarajan_________________________________________________
Hewlett-Packard,                      Phone (W) : +91-80-286-3394 x1182
Indian Express Building, Queens Road
Hewlett Packard - ISO,                Fax : +91-80-226 4107
                                            +91-80-226 4108
Bangalore - 560 052,                  HP Telnet (India): 847-1182
Internet : kn...@in...
_____________________________________________________________________________
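
As a rough illustration of the rule file layout described in the mail, a
reader for the '=' section lines, the ';' comments and the class
definitions might look like the Python sketch below. This is not taken
from translit.c. It assumes that a class section is written as a
'=class <class-name>' line followed by a '<class-name> <str>[|str]*'
line, and that ';' never occurs as an input literal. The function names
and the 'cons' members in the sample are hypothetical.

```python
import re

NAME_RE = re.compile(r"[a-zA-Z][a-zA-Z0-9_]*")  # glyph and variable names
MAX_NAME_LEN = 31     # limit stated for glyph definitions and variable names
MAX_MEMBER_LEN = 7    # limit stated for class member names


def parse_sections(text):
    """Split rule-file text into a list of (header words, rule lines)."""
    sections = []
    for raw in text.splitlines():
        line = re.sub(r";.*$", "", raw).strip()   # drop ';' comments
        if not line:
            continue
        if line.startswith("="):
            sections.append((line[1:].split(), []))
        elif sections:
            sections[-1][1].append(line)
        else:
            raise ValueError(f"rule before any section: {line!r}")
    return sections


def parse_classes(sections):
    """Map each class name to its list of input strings."""
    classes = {}
    for header, rules in sections:
        if not header or header[0] != "class":
            continue
        for rule in rules:
            parts = rule.split()
            if len(parts) != 2:
                raise ValueError(f"bad class rule: {rule!r}")
            name, members = parts[0], parts[1].split("|")
            if not NAME_RE.fullmatch(name) or len(name) > MAX_NAME_LEN:
                raise ValueError(f"bad class name: {name!r}")
            for member in members:
                if not member or len(member) > MAX_MEMBER_LEN:
                    raise ValueError(f"bad class member: {member!r}")
            classes[name] = members
    return classes


# Hypothetical fragment; only the vowel line is taken from the mail.
SAMPLE = """\
; sample fragment, for illustration only
=class vowel
vowel a|aa|A|i|ee|u|oo|e|E|ai|o|O|au   ; A and aa denote the same letter
=class cons
cons k|kh|g
"""

classes = parse_classes(parse_sections(SAMPLE))
print(classes["vowel"])
```

Keeping the members in file order matters if a matcher later tries them
one by one, since the mail asks for rules to be listed highest
precedence first.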