Re: [Indic-computing-devel] Script specific features

On Mon, 25 Feb 2002, Keyur Shroff wrote:

> For example, Malayalam has "Chillaksharam"

I have been experimenting with the IndiX patch for Malayalam support and
have had some initial success with it. My first impression of
"Chillaksharam" was that it is a unique feature of Malayalam and that it
would require some modifications to the Unicode standard to accommodate
it. But later Apurva Joshi <ap...@mi...>, who deals with Indic scripts at
MS, clarified that:

<quote>
I assume that "chillu form" above means the "chillaksharam form" that only
a few consonants in Malayalam take. If so, the following explanation is
how this form is currently implemented: If the last consonant in a
Malayalam word is capable of forming a chillaksharam, and it is followed
by a Halant/Virama followed by a word delimiter [in most Indic scripts
this is the space]; this sequence is displayed as consonant Halant. Thus:
Kha Ka Halant is displayed as Kha Ka_Chillaksharam.

In the above case if you would like to convert the chillaksharam to its
consonant+Halant form you need to insert a ZWJ after the Halant; thus: Kha
Ka Halant ZWJ This will display as Kha Ka Halant.

And for input sequences like those given below, where the consonant
capable of forming a chillaksharam is not the last consonant in a
syllable, the following is done: Kha Na Halant Kha; the final display will
be Kha Na Halant Kha.

If the Na Halant, in the above case, which does not occur at the end of
the word; needs to be retained as the chillaksharam form, you need to
insert a ZWNJ thus: Kha Na Halant ZWNJ Kha; this will display as Kha
Na_chillaksharam Kha.
</quote>
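
The four cases in the quote can be sketched roughly as below. This is only
an illustration of the rules as quoted, working on token names rather than
real code points. The function name display_word and the CHILLU_CAPABLE set
are my own assumptions (the quote only says "a few consonants"), and nothing
here is taken from an actual shaping engine.

```python
HALANT, ZWJ, ZWNJ = "Halant", "ZWJ", "ZWNJ"

# Assumption: illustrative set only; the quote's examples use Ka and Na.
CHILLU_CAPABLE = {"Ka", "Na"}


def display_word(tokens):
    """Return the display tokens for one word, following the quoted rules."""
    out = []
    i = 0
    n = len(tokens)
    while i < n:
        tok = tokens[i]
        if tok in CHILLU_CAPABLE and i + 1 < n and tokens[i + 1] == HALANT:
            after = tokens[i + 2] if i + 2 < n else None
            if after is None:
                # Word-final consonant + Halant: shown as the chillaksharam.
                out.append(tok + "_Chillaksharam")
                i += 2
            elif after == ZWJ:
                # ZWJ after the Halant keeps the consonant + Halant form.
                out.extend([tok, HALANT])
                i += 3
            elif after == ZWNJ:
                # ZWNJ after the Halant forces the chillaksharam mid-word.
                out.append(tok + "_Chillaksharam")
                i += 3
            else:
                # Mid-word, no joiner: stays as consonant + Halant.
                out.extend([tok, HALANT])
                i += 2
            continue
        out.append(tok)
        i += 1
    return out


print(display_word(["Kha", "Ka", "Halant"]))
print(display_word(["Kha", "Na", "Halant", "ZWNJ", "Kha"]))
```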

> Devanagari has "Akhand",

The "Akhand" feature is confusing to me. I am having a dialogue about it
with various people here (I am a native Malayalam speaker) and with
Apurva. So far I have never come across this feature in Malayalam grammar,
but it seems that the two Akhand ligatures get priority over other
ligatures when rendering. I tested this by asking the following question:

Given a font that has a glyph for each of the conjuncts KaKa, Kssa, Dnya
and Nnya, and a lookup with the following substitution rules:

Ka Halant Ka -> KaKa
Ka Halant Ssa -> Kssa
Ja Halant Nya -> Dnya
Nya Halant Nya -> Nnya

Now, given the theoretical sequences Ka Halant Ka Halant Ssa Halant Ma and
Ja Halant Nya Halant Nya Halant Ma, how would you render them?

All of them answered that they would give priority to the Akhand ligature.
But later, when I explained the concept of Akhand, they were surprised.
Even now I don't know if there is any linguistic basis for clustering
priorities.
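
To make the question concrete, here is a small sketch comparing "Akhand
first" with a plain left-to-right pass over the same four rules. I am
assuming that the two Akhand ligatures are Kssa and Dnya (as in
Devanagari). The function names and the two-pass scheme are just my own
illustration, not how any particular renderer is known to work.

```python
# Assumption: these two rules are the Akhand ligatures.
AKHAND = {
    ("Ka", "Halant", "Ssa"): "Kssa",
    ("Ja", "Halant", "Nya"): "Dnya",
}
OTHER = {
    ("Ka", "Halant", "Ka"): "KaKa",
    ("Nya", "Halant", "Nya"): "Nnya",
}


def apply_rules(tokens, rules):
    """Scan left to right, replacing any three-token match with its ligature."""
    out = []
    i = 0
    while i < len(tokens):
        key = tuple(tokens[i:i + 3])
        if key in rules:
            out.append(rules[key])
            i += 3
        else:
            out.append(tokens[i])
            i += 1
    return out


def akhand_first(tokens):
    """Apply the Akhand rules over the whole sequence, then the other rules."""
    return apply_rules(apply_rules(tokens, AKHAND), OTHER)


def single_pass(tokens):
    """Apply all four rules together in one left-to-right scan."""
    return apply_rules(tokens, {**AKHAND, **OTHER})


seq = ["Ka", "Halant", "Ka", "Halant", "Ssa", "Halant", "Ma"]
print(akhand_first(seq))  # ['Ka', 'Halant', 'Kssa', 'Halant', 'Ma']
print(single_pass(seq))   # ['KaKa', 'Halant', 'Ssa', 'Halant', 'Ma']
```

With these rules the first sequence comes out differently under the two
schemes, while the second sequence gives the same result either way because
the Akhand match is already the leftmost one.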

> Tamil has "two-side split matra", etc. Detailed discussion of these
> features is required.

For Malayalam I take this to mean U+0D4A, U+0D4B and U+0D4C. From what I
understand, these have to be split into their component marks, i.e.

0D15 0D4A is first split into 0D15 0D46 0D3E

It is then reordered to

0D46 0D15 0D3E for rendering. The Tamil section of the Unicode standard
gives more information about this.
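
The split-and-reorder step might be sketched as follows. The split relies
on the canonical decompositions of U+0D4A, U+0D4B and U+0D4C, which
Python's unicodedata.normalize("NFD", ...) applies. The reordering is my
own simplification: it only handles a single consonant followed by a vowel
sign, not conjuncts, and the function name and PRE_BASE set are
assumptions. The result is in visual order for rendering and is not meant
to be stored as text.

```python
import unicodedata

# Malayalam vowel signs E, EE and AI, which are drawn to the left of the base.
PRE_BASE = {"\u0d46", "\u0d47", "\u0d48"}


def split_and_reorder(text):
    """Split two-part vowel signs, then move the left part before its base."""
    decomposed = unicodedata.normalize("NFD", text)
    out = []
    for ch in decomposed:
        if ch in PRE_BASE and out:
            # Simplification: place the mark before the previous character only.
            out.insert(len(out) - 1, ch)
        else:
            out.append(ch)
    return "".join(out)


visual = split_and_reorder("\u0d15\u0d4a")
print(" ".join(f"{ord(c):04X}" for c in visual))  # 0D46 0D15 0D3E
```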

Does anyone have any idea about sorting Unicode Indic data, especially in
the context of a database? Are there any Unicode-aware DBs out there?

raj