Thread: [Indic-computing-devel] Regexp and Indian languages ?
From: Arun S. <ar...@sh...> - 2004-11-26 00:06:17
|
So I was thinking about how one would go about using regular expressions with an Indian language while I was brushing my teeth this morning.

The current syntax seems to be "character" oriented. For example, f.o matches foo. However, if I want to write a regexp such as:

su . la .

that matches

su bbu la xmi

we need to introduce a new concept of a syllable into the regexp syntax. For example, "_" might mean one syllable, as opposed to "." which means one character.

In other words, "su_la_" would match subbulaxmi. This simple-minded proposal would mean that the zillions of existing regexps which use "_" without suspecting it to be a special character would be broken.

This might be a good undergrad project for the linguistically inclined (and hence the crosspost to Linux and BSD mailing lists, which often get such queries).

If there is existing literature on this topic, I'd love to find out more.

-Arun
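One way to try the proposal without touching the regexp engine is to translate "_" into an ordinary sub-pattern before compiling. The sketch below is only an illustration: the helper name compile_syllable_pattern and the definition of a syllable (zero or more consonants followed by one or more vowels, in lowercase romanized text) are assumptions made for this example, not something taken from the post or from an existing library.

```python
import re

# Assumption: a romanized "syllable" is zero or more consonants followed by
# one or more vowels. Real Indic syllabification is more involved than this.
SYLLABLE = r"[b-df-hj-np-tv-z]*[aeiou]+"


def compile_syllable_pattern(pattern):
    """Compile a pattern in which "_" stands for one syllable (hypothetical syntax)."""
    # Escape the literal pieces so that only "_" carries a special meaning.
    literals = [re.escape(piece) for piece in pattern.split("_")]
    return re.compile(SYLLABLE.join(literals))


matcher = compile_syllable_pattern("su_la_")
print(bool(matcher.fullmatch("subbulaxmi")))  # su + bbu + la + xmi
print(bool(matcher.fullmatch("sula")))        # nothing left to fill the "_" slots
```

Because the translation happens in a separate step, existing regexps that use "_" literally stay untouched; only patterns passed through the helper get the syllable meaning.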
From: Krishnamurthy N. <kn...@ya...> - 2004-11-26 10:09:06
|
Hi Arun,

Perhaps you could take a look at the generic transliteration library for Indian languages that I developed quite some time back. It's on sourceforge at
http://indic-computing.sourceforge.net/projects/miscellaneous.html
(under 'Other infrastructural projects', as 'translib')

I had come up with some kind of regular expression syntax to express the syllables in Indian words. I developed sample transliteration rules for four languages (Hindi, Telugu, Kannada and Tamil).

A snippet from the ruleset for Hindi, just to raise your curiosity:

    ^%vowel              glyph(%vowel)
    _%vowel              glyph(%vowel)
    r%cons%vowel         translit(%2,%vowel) HALF_R_POST
    (%cons)a             translit(%1,a)
    (%cons)(A|aa)        translit(%1,a) VOWEL_SIGN_AA
    %cons%vowel          translit(%1,a) dep_vowel_sign(%vowel)
    %cons%cons%vowel     dep_cons_sign(%1) translit(%2,%3)
    .....

(I use ^ to denote beginning of word, $ for end of word, _ for forced ZWNJ etc.)

Here, the LHS corresponds to a subset of a word (a syllable, usually) and the RHS denotes the action: output the glyphs, or take other actions (including a recursive call to the main transliteration function translit()). One or more such sub-expressions would constitute an input word.

btw, I didn't use the regular Unix regexp syntax. With the framework and syntax I developed, it's quite feasible to write a regexp parser for Indian languages (transliterated using US-English or even direct UTF-8 or other forms) using such rules.

I hope my answer is relevant to your question.

cheers,
Nagarajan
Indic-computing project
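The LHS/RHS scheme described in this message can be sketched roughly as follows. This is not translib code: the macro definitions, the first-match-wins rule ordering, and the actions (which only label each piece instead of emitting glyphs) are all assumptions made for illustration, since the thread shows only the rule snippet and not the library itself.

```python
import re

# Hypothetical macro definitions; the real character classes are not shown
# in the thread.
MACROS = {"%cons": "[kgcjtdnpbmyrlvsh]", "%vowel": "[aeiou]"}


def expand(lhs):
    """Turn an LHS such as "%cons%vowel" into a compiled regular expression."""
    for name, char_class in MACROS.items():
        lhs = lhs.replace(name, "(" + char_class + ")")
    return re.compile(lhs)


# Ordered rules (assumed: first match wins). Each action receives the match.
RULES = [
    (expand("%cons%cons%vowel"), lambda m: "CCV(" + m.group(0) + ")"),
    (expand("%cons%vowel"), lambda m: "CV(" + m.group(0) + ")"),
    (expand("%vowel"), lambda m: "V(" + m.group(0) + ")"),
]


def segment(word):
    """Consume the word left to right, applying the first rule that matches."""
    pos, out = 0, []
    while pos < len(word):
        for pattern, action in RULES:
            m = pattern.match(word, pos)
            if m:
                out.append(action(m))
                pos = m.end()
                break
        else:
            raise ValueError("no rule matches at position %d" % pos)
    return out


print(segment("namaste"))
```

A real ruleset would replace the labelling actions with glyph output, and would need the word-boundary markers (^, $, _) that the snippet mentions.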
From: <jit...@nc...> - 2004-11-26 14:49:40
|
Dear Krishnamurthy Nagarajan,

We at janabhaaratii feel indebted to the pioneering start your efforts have given to Indic computing (the indic-computing developers team in general, and some of you named in the email addresses here in particular).

Under the C-DAC project janabhaaratii, funded by TDIL, we wish to take this forward in a collaborative and fully sharing mode. Your suggestions and ideas will be most appreciated. Kindly give us your current coordinates (address, phones, affiliations etc.) so that we can contact you, and even invite you, whenever we wish. Please also keep us informed about your current project.

On our side, we intend to work exclusively on GPL/LGPL software and will put up our contributions/compilations on our project website for 'free' access. Since we just started the project last month, our project website is under construction, but our mission statement is on our corporate website, www.cdacindia.com

regards
jitendra
From: Arun S. <ar...@sh...> - 2004-11-26 17:47:52
|
On Fri, Nov 26, 2004 at 02:09:19PM +0530, Sayamindu Dasgupta wrote:
> This link may be of interest
> http://www.unicode.org/reports/tr18/

Thank you! This was exactly what I was looking for. Grapheme clusters (sec 2.2 and 3.2) seem to be meant for just this.

> For example, an implementation could interpret "\X" as matching any
> default grapheme cluster, while interpreting "." as matching any single
> code point. It could interpret "\h" as a zero-width match against any
> grapheme cluster boundary, and "\H" as the negation of that.

Now, are there any open source implementations of these specs for C/C++ and Java? What about std::string and java.lang.String? They need to have iterators to iterate over grapheme clusters as well.

-Arun
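To make the "." versus "\X" distinction concrete, here is a small sketch of iterating over clusters instead of code points. Python's standard re module has no "\X", so the function below approximates a cluster as a base code point plus the combining marks that follow it. That is much coarser than the default grapheme clusters in the Unicode specs (no Hangul, joiner or conjunct handling), and the function name simple_clusters is made up for this example.

```python
import unicodedata


def simple_clusters(text):
    """Group each code point with the combining marks that follow it.

    A rough approximation only, not the full Unicode segmentation rules.
    """
    clusters = []
    for ch in text:
        # General categories Mn, Mc and Me are combining marks.
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# "hindi" in Devanagari: ha, vowel sign i, na, virama, da, vowel sign ii.
word = "\u0939\u093f\u0928\u094d\u0926\u0940"
print(len(word))                   # code points, which is what "." steps over
print(len(simple_clusters(word)))  # approximate clusters
```

Note that this grouping attaches the virama to the preceding consonant, so the conjunct in the sample word is split in two. Matching whole syllables, as the original post asks for, would need a further rule that joins consonants across a virama.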
From: Sayamindu D. <say...@cl...> - 2004-11-27 04:56:22
|
On Fri, 2004-11-26 at 09:47 -0800, Arun Sharma wrote:
> Now, are there any open source implementations of these specs for C/C++
> and Java? What about std::string and java.lang.String? They need to
> have iterators to iterate over grapheme clusters as well.

IBM ICU probably implements at least a subset of these specs.
http://oss.software.ibm.com/icu/userguide/regexp.html
There are bindings for Java, as well as C/C++.

-thanks-
Sayamindu
From: Sayamindu D. <say...@cl...> - 2004-11-26 08:40:02
|
On Thu, 2004-11-25 at 16:06 -0800, Arun Sharma wrote:
> we need to introduce a new concept of a syllable into the regexp
> syntax. For eg: "_" might mean one syllable as opposed to "." which
> means one character.

This link may be of interest
http://www.unicode.org/reports/tr18/

-thanks-
Sayamindu
From: B G. <bg...@gm...> - 2004-11-26 18:16:22
|
Greetings,

On Thu, 25 Nov 2004 16:06:09 -0800, Arun Sharma <ar...@sh...> wrote:
> So I was thinking about how one would go about using regular expressions
> with an Indian language while I was brushing my teeth this morning.

The IIT-Madras Multilingual editor has a Perl module that does this. Prof Kalyana Krishnan has released the full sources for everything (including the multilingual editor) under the GPL. Work has begun at http://imli.sf.net

We have a linguist on the team working with him full time to sort out the nitty-gritty details. There's also a version that speaks out the content (developed for the blind). If you are interested, let me know and I'll send more info :)

cheers
BGa
--
We will find a way, or we will make one - Hannibal
From: B G. <bg...@gm...> - 2004-11-26 18:18:13
|
Arun,

The linguist I'd mentioned in the previous mail is Indrani Roy, and I've copied her... in case you need more info, she'd be the best person to ask...

cheers
BGa
--
We will find a way, or we will make one - Hannibal