Re: [Indic-computing-devel] Regexp and Indian languages ?
From: Krishnamurthy N. <kn...@ya...> - 2004-11-26 10:09:06
Hi Arun,

Perhaps you could take a look at the generic transliteration library for Indian languages that I developed quite some time back. It's on SourceForge at
http://indic-computing.sourceforge.net/projects/miscellaneous.html
(under 'Other infrastructural projects', as 'translib').

I had come up with a kind of regular expression syntax to express the syllables in Indian words, and I developed sample transliteration rules for four languages (Hindi, Telugu, Kannada and Tamil). A snippet from the ruleset for Hindi, just to raise your curiosity:

    ^%vowel             glyph(%vowel)
    _%vowel             glyph(%vowel)
    r%cons%vowel        translit(%2,%vowel) HALF_R_POST
    (%cons)a            translit(%1,a)
    (%cons)(A|aa)       translit(%1,a) VOWEL_SIGN_AA
    %cons%vowel         translit(%1,a) dep_vowel_sign(%vowel)
    %cons%cons%vowel    dep_cons_sign(%1) translit(%2,%3)
    .....

(I use ^ to denote the beginning of a word, $ for the end of a word, _ for a forced ZWNJ, etc.)

Here, the LHS matches a part of a word (usually a syllable) and the RHS gives the action to take: output the glyphs, or perform other actions, including a recursive call to the main transliteration function translit(). An input word is made up of one or more such sub-expressions.

Btw, I didn't use the regular Unix regexp syntax. With the framework and syntax I developed, it's quite feasible to write a regexp parser for Indian languages (transliterated using US-English or even direct UTF-8 or other forms) using such rules.

I hope my answer is relevant to your question.

cheers,
Nagarajan
Indic-computing project

--- Arun Sharma <ar...@sh...> wrote:
> So I was thinking about how one would go about using regular expressions
> with an Indian language while I was brushing my teeth this morning.
>
> The current syntax seems to be "character" oriented.
> For eg, f.o matches foo.
> However, if I want to write a regexp such as:
>
> su . la .
>
> that matches
>
> su bbu la xmi
>
> we need to introduce a new concept of a syllable into the regexp
> syntax. For eg: "_" might mean one syllable as opposed to "." which
> means one character.
...
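
To make the syllable wildcard from the quoted message concrete, here is a minimal Python sketch. It is not translib's syntax or code. It assumes lowercase romanized input, a deliberately rough definition of a syllable (zero or more consonants followed by one vowel), and illustrative consonant and vowel sets. The names VOWELS, CONSONANT, compile_syllable_pattern and matches are hypothetical. Note that "_" here means "one syllable", as in the quoted proposal, and not the forced ZWNJ of the Hindi ruleset.

```python
import re

# Assumed romanization, for illustration only (not taken from translib).
# Longer vowels are listed first so the alternation prefers them.
VOWELS = ["aa", "ai", "au", "ii", "uu", "a", "e", "i", "o", "u"]
CONSONANT = "[b-df-hj-np-tv-z]"  # every lowercase letter except a, e, i, o, u

VOWEL_RE = "(?:" + "|".join(VOWELS) + ")"
# Rough syllable model: any run of consonants followed by one vowel.
SYLLABLE_RE = "(?:" + CONSONANT + "*" + VOWEL_RE + ")"


def compile_syllable_pattern(pattern):
    """Compile a space-separated pattern into an ordinary regex.

    The token "_" matches exactly one syllable; every other token
    is matched literally.
    """
    parts = []
    for token in pattern.split():
        parts.append(SYLLABLE_RE if token == "_" else re.escape(token))
    return re.compile("".join(parts))


def matches(pattern, word):
    """Return True if the whole word matches the syllable-level pattern."""
    return compile_syllable_pattern(pattern).fullmatch(word) is not None


if __name__ == "__main__":
    print(matches("su _ la _", "subbulaxmi"))
    print(matches("su _ la _", "sula"))
    # For contrast: the character-oriented "." from ordinary regexps.
    print(re.fullmatch("su.la.", "subbulaxmi") is not None)
```

The sketch simply rewrites the syllable-level pattern into a character-level one, so the existing regexp engine does the matching. A rule-driven definition of a syllable, like the %cons and %vowel macros in the ruleset above, could replace the hard-coded SYLLABLE_RE.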