[Indic-computing-devel] Regexp and Indian languages ?
From: Arun S. <ar...@sh...> - 2004-11-26 00:06:17
While brushing my teeth this morning, I was thinking about how one would go about using regular expressions with an Indian language. The current syntax seems to be "character" oriented. For example, f.o matches foo. However, if I want to write a regexp such as

    su . la .

that matches

    su bbu la xmi

we need to introduce a new concept of a syllable into the regexp syntax. For example, "_" might mean one syllable, as opposed to "." which means one character. In other words, "su_la_" would match subbulaxmi.

This simple-minded proposal would break the zillions of existing regexps that use "_" without suspecting it to be a special character.

This might be a good undergrad project for the linguistically inclined (hence the crosspost to the Linux and BSD mailing lists, which often get such queries). If there is existing literature on this topic, I'd love to find out more.

-Arun
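
A rough sketch of the "_" idea in Python, to make the proposal concrete. Everything here is an assumption for illustration: the text is romanized, a syllable is approximated as zero or more consonant letters followed by one or more vowel letters, and the helper name compile_syllable_regex and its wildcard parameter are hypothetical. It handles only literal characters plus the syllable wildcard, not full regexp syntax, and a real implementation for Indic scripts would need a proper definition of a syllable.

```python
import re

# Assumption: romanized text. A "syllable" is approximated as zero or more
# consonant letters followed by one or more vowel letters (e.g. "bbu", "xmi").
SYLLABLE = r"[b-df-hj-np-tv-z]*[aeiou]+"


def compile_syllable_regex(pattern, wildcard="_"):
    """Translate a pattern in which `wildcard` means "one syllable"
    into an ordinary `re` pattern. All other characters are literals."""
    parts = []
    for ch in pattern:
        if ch == wildcard:
            # Non-capturing group standing in for exactly one syllable.
            parts.append("(?:" + SYLLABLE + ")")
        else:
            # Escape everything else so it matches itself.
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


rx = compile_syllable_regex("su_la_")
print(bool(rx.fullmatch("subbulaxmi")))  # su + bbu + la + xmi
print(bool(rx.fullmatch("sula")))        # no syllables where "_" expects them
```

Making the wildcard a parameter is one way around the compatibility worry: existing regexps that use "_" as an ordinary character are untouched unless a caller opts in to the syllable meaning.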