Thread: [Indic-computing-devel] Regexp and Indian languages ?

[Indic-computing-devel] Regexp and Indian languages ?

From: Arun S. <ar...@sh...> - 2004-11-26 00:06:17

So I was thinking about how one would go about using regular expressions
with an Indian language while I was brushing my teeth this morning.

The current syntax seems to be "character" oriented. For example, f.o matches foo.
However, if I want to write a regexp such as:

su . la .

that matches 

su bbu la xmi 

we need to introduce a new concept of a syllable into the regexp
syntax. For example, "_" might mean one syllable as opposed to "." which
means one character.

In other words "su_la_" would match subbulaxmi. This simple-minded
proposal would, however, break the zillions of existing regexps that use
"_" without suspecting it to be a special character.

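A minimal sketch of the "_" idea in Python, for romanized text only. It
assumes a syllable is zero or more consonant letters followed by one or more
vowels, which happens to fit "su bbu la xmi" but is not a linguistic
definition. The names SYLLABLE and compile_syllable_pattern are made up for
this illustration.

```python
import re

# Assumption: a romanized syllable is optional consonants plus a vowel run.
SYLLABLE = r"[b-df-hj-np-tv-z]*[aeiou]+"

def compile_syllable_pattern(pattern):
    """Treat "_" as 'one syllable'; every other character is a literal."""
    literals = pattern.split("_")
    # Escape the literal pieces and put the syllable regex where "_" was.
    return re.compile(SYLLABLE.join(re.escape(part) for part in literals))

matcher = compile_syllable_pattern("su_la_")
print(bool(matcher.fullmatch("subbulaxmi")))  # su + bbu + la + xmi
print(bool(matcher.fullmatch("sula")))        # nothing left to fill the slots
```

Because "_" is translated before the pattern is compiled, ordinary regexps
that use "_" literally are only affected if they are passed through this
wrapper.
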
This might be a good undergrad project for the linguistically inclined
(and hence the crosspost to Linux and BSD mailing lists which often get
such queries).

If there is existing literature on this topic, I'd love to find out more.

	-Arun

Re: [Indic-computing-devel] Regexp and Indian languages ?

From: Krishnamurthy N. <kn...@ya...> - 2004-11-26 10:09:06

Hi Arun,

Perhaps you could take a look at the generic
transliteration library for Indian languages that I
developed quite some time back. It's on SourceForge at
http://indic-computing.sourceforge.net/projects/miscellaneous.html

(under 'Other infrastructural projects', as
'translib')

I had come up with some kind of regular expression
syntax to express the syllables in Indian words. I
developed sample transliteration rules for four
languages (Hindi, Telugu, Kannada and Tamil). 

A snippet from the ruleset for Hindi, just to raise
your curiosity :

^%vowel                 glyph(%vowel)
_%vowel                 glyph(%vowel)
r%cons%vowel            translit(%2,%vowel) HALF_R_POST
(%cons)a                translit(%1,a)
(%cons)(A|aa)           translit(%1,a) VOWEL_SIGN_AA
%cons%vowel             translit(%1,a) dep_vowel_sign(%vowel)
%cons%cons%vowel        dep_cons_sign(%1) translit(%2,%3)
.....

(I use ^ to denote the beginning of a word, $ for the
end of a word, _ for a forced ZWNJ, etc.)

Here, the LHS corresponds to a subset of a word
(usually a syllable) and the RHS denotes the action:
outputting the glyphs or performing other actions
(including a recursive call to the main transliteration
function translit()). One or more such sub-expressions
would constitute an input word.
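
To make the LHS/RHS idea concrete, here is a toy rule-driven transliterator
in Python. It is not translib's code or syntax: the rule table, the
four-consonant alphabet, and the names RULES and transliterate are
assumptions made up for illustration. Rules are tried in order at the
current position, and the first LHS that matches fires its RHS action.

```python
import re

# Tiny illustrative Devanagari tables (assumed, far from complete).
CONSONANTS = {"k": "\u0915", "m": "\u092e", "r": "\u0930", "l": "\u0932"}
INDEPENDENT = {"a": "\u0905", "aa": "\u0906", "i": "\u0907", "u": "\u0909"}
DEPENDENT = {"a": "", "aa": "\u093e", "i": "\u093f", "u": "\u0941"}
VIRAMA = "\u094d"

CONS = "(" + "|".join(CONSONANTS) + ")"
VOWEL = "(aa|a|i|u)"  # longest alternative first

# Each rule is (LHS pattern, RHS action); order matters.
RULES = [
    # A vowel with no consonant to attach to: emit the independent glyph.
    (re.compile(VOWEL), lambda m: INDEPENDENT[m.group(1)]),
    # Consonant + vowel: consonant glyph plus dependent vowel sign.
    (re.compile(CONS + VOWEL),
     lambda m: CONSONANTS[m.group(1)] + DEPENDENT[m.group(2)]),
    # Bare consonant: add a virama so it joins the next consonant.
    (re.compile(CONS), lambda m: CONSONANTS[m.group(1)] + VIRAMA),
]

def transliterate(word):
    out, pos = [], 0
    while pos < len(word):
        for pattern, action in RULES:
            m = pattern.match(word, pos)
            if m:
                out.append(action(m))
                pos = m.end()
                break
        else:
            raise ValueError(f"no rule matches at {word[pos:]!r}")
    return "".join(out)

print(transliterate("raama"))
print(transliterate("amma"))
```

An ordered list keeps the matching strategy obvious (first match wins), at
the cost of having to place more specific rules before more general ones.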

btw, I didn't use the regular Unix regexp syntax. With
the framework and syntax I developed, it's quite
feasible to write a regexp parser for Indian 
languages (transliterated using US-English or even
direct UTF-8 or other forms) using such rules.

I hope my answer is relevant to your question.

cheers,
Nagarajan
Indic-computing project

Re: [Indic-computing-devel] Regexp and Indian languages ?

From: <jit...@nc...> - 2004-11-26 14:49:40

Dear Krishnamurthy Nagarajan,
We at janabhaaratii feel indebted to the pioneering start made by your efforts
in Indic computing (those of the indic-computing developers team in general,
and of some of you named in the email addresses here in particular).
Under the C-DAC project janabhaaratii, funded by TDIL, we wish to take this
forward in a collaborative and fully sharing mode.
Your suggestions and ideas will be most appreciated.
Kindly give us your current coordinates (address, phone numbers, affiliations,
etc.) so that we can contact you, and even invite you, whenever we wish.

Please also keep us informed about your current project.
On our side, we intend to work exclusively on GPL/LGPL software and will put up
our contributions/compilations on our project website for 'free' access.
Since we started the project only last month, our project website is still
under construction, but our mission statement is on our corporate website:
www.cdacindia.com

regards
jitendra


[Indic-computing-devel] Re: [LIG] Regexp and Indian languages ?

From: Arun S. <ar...@sh...> - 2004-11-26 17:47:52

On Fri, Nov 26, 2004 at 02:09:19PM +0530, Sayamindu Dasgupta wrote:
> This link may be of interest
> http://www.unicode.org/reports/tr18/

Thank you! This was exactly what I was looking for. Grapheme 
clusters (sec 2.2 and 3.2) seem to be meant for just this.

> For example, an implementation could interpret "\X" as matching any
> default grapheme cluster, while interpreting "." as matching any single
> code point. It could interpret "\h" as a zero-width match against any
> grapheme cluster boundary, and "\H" as the negation of that.

Now, are there any open source implementations of these specs for C/C++
and Java?  What about std::string and java.lang.String? They need to
have iterators to iterate over grapheme clusters as well.
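
As a rough illustration of what such an iterator has to do, here is a Python
sketch using only the standard unicodedata module. It is a simplification,
not an implementation of the default grapheme clusters from the Unicode
reports: it attaches combining marks to the preceding character and,
optionally, joins a letter that follows a virama (canonical combining
class 9) so that conjuncts stay together. It ignores ZWJ/ZWNJ, Hangul, and
everything else. The function name clusters and the join_virama flag are
hypothetical.

```python
import unicodedata

def clusters(text, join_virama=True):
    """Split text into approximate grapheme clusters (simplified sketch)."""
    result = []
    for ch in text:
        category = unicodedata.category(ch)
        is_mark = category.startswith("M")
        # True when the cluster built so far ends in a virama (ccc == 9).
        after_virama = bool(result) and unicodedata.combining(result[-1][-1]) == 9
        joins_conjunct = join_virama and after_virama and category.startswith("L")
        if result and (is_mark or joins_conjunct):
            result[-1] += ch   # extend the current cluster
        else:
            result.append(ch)  # start a new cluster
    return result

word = "सुब्बुलक्ष्मी"  # "subbulaxmi" in Devanagari
print(clusters(word))                     # conjuncts kept together
print(clusters(word, join_virama=False))  # split after each virama
```

With join_virama enabled the clusters line up with the syllables
su / bbu / la / xmi from the original question; without it, each
consonant-plus-virama pair becomes a cluster of its own.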

	-Arun

[Indic-computing-devel] Re: [LIG] Regexp and Indian languages ?

From: Sayamindu D. <say...@cl...> - 2004-11-27 04:56:22

On Fri, 2004-11-26 at 09:47 -0800, Arun Sharma wrote:
> On Fri, Nov 26, 2004 at 02:09:19PM +0530, Sayamindu Dasgupta wrote:
> > This link may be of interest
> > http://www.unicode.org/reports/tr18/
> 
> Thank you! This was exactly what I was looking for. Grapheme 
> clusters (sec 2.2 and 3.2) seem to be meant for just this.
> 
> > For example, an implementation could interpret "\X" as matching any
> > default grapheme cluster, while interpreting "." as matching any single
> > code point. It could interpret "\h" as a zero-width match against any
> > grapheme cluster boundary, and "\H" as the negation of that.
> 
> Now, are there any open source implementations of these specs for C/C++
> and Java?  What about std::string and java.lang.String? They need to
> have iterators to iterate over grapheme clusters as well.

IBM ICU probably implements at least a subset of these specs.
http://oss.software.ibm.com/icu/userguide/regexp.html

There are bindings for Java, as well as C/C++

-thanks-
Sayamindu

[Indic-computing-devel] Re: [LIG] Regexp and Indian languages ?

From: Sayamindu D. <say...@cl...> - 2004-11-26 08:40:02

On Thu, 2004-11-25 at 16:06 -0800, Arun Sharma wrote:
> So I was thinking about how one would go about using regular expressions
> with an Indian language while I was brushing my teeth this morning.
> 
> The current syntax seems to be "character" oriented. For eg, f.o matches foo.
> However, if I want to write a regexp such as:
> 
> su . la .
> 
> that matches 
> 
> su bbu la xmi 
> 
> we need to introduce a new concept of a syllable into the regexp
> syntax. For eg: "_" might mean one syllable as opposed to "." which
> means one character.
> 
> In other words "su_la_" would match subbulaxmi. This simple minded
> proposal would mean that the zillions of existing regexps which use
> "_" without suspecting it to be a special character would be broken.

This link may be of interest
http://www.unicode.org/reports/tr18/

-thanks-
Sayamindu

[Indic-computing-devel] Re: [BSD-INDIA] Regexp and Indian languages ?

From: B G. <bg...@gm...> - 2004-11-26 18:16:22

Greetings,

On Thu, 25 Nov 2004 16:06:09 -0800, Arun Sharma <ar...@sh...> wrote:
> So I was thinking about how one would go about using regular expressions
> with an Indian language while I was brushing my teeth this morning.

The IIT-Madras Multilingual editor has a Perl module that does this.
Prof Kalyana Krishnan has released the full sources for everything
(including the multilingual editor) under the GPL.

Work has begun at http://imli.sf.net We have a linguist on the team
working with him full time to sort out the nitty-gritty details. There's
also a version that speaks out the content (developed for the blind).
If you are interested let me know and I'll send more info :)

cheers
BGa
-- 
We will find a way, or we will make one - Hannibal

[Indic-computing-devel] Re: [BSD-INDIA] Regexp and Indian languages ?

From: B G. <bg...@gm...> - 2004-11-26 18:18:13

Arun,

  The linguist I'd mentioned in the previous mail is Indrani Roy, and
I've copied her... in case you need more info, she'd be the best
person to ask...

cheers
BGa

