[Indic-computing-cvs-logs] SF.net SVN: indic-computing: [322] doc/trunk/en_US.ISO8859-1/books/handb
From: <jk...@us...> - 2007-12-30 08:24:16
Revision: 322
          http://indic-computing.svn.sourceforge.net/indic-computing/?rev=322&view=rev
Author:   jkoshy
Date:     2007-12-30 00:24:20 -0800 (Sun, 30 Dec 2007)

Log Message:
-----------
List a few of the problems currently reported as affecting the Unicode standard.

Modified Paths:
--------------
    doc/trunk/en_US.ISO8859-1/books/handbook/std.unicode/chapter.sgml

Modified: doc/trunk/en_US.ISO8859-1/books/handbook/std.unicode/chapter.sgml
===================================================================
--- doc/trunk/en_US.ISO8859-1/books/handbook/std.unicode/chapter.sgml	2007-12-30 08:00:04 UTC (rev 321)
+++ doc/trunk/en_US.ISO8859-1/books/handbook/std.unicode/chapter.sgml	2007-12-30 08:24:20 UTC (rev 322)
@@ -16,9 +16,78 @@
 <secondary>Unicode</secondary>
 </indexterm>
 
- <para>This chapter will contain a discussion about the Unicode
- standard and its support for Indian languages and scripts.</para>
+ <para>This chapter contains a discussion about the Unicode standard
+ and its support for Indian languages and scripts.</para>
 
+ <sect1>
+ <title>Introduction</title>
+ <para>
+ </para>
+ </sect1>
+
+ <sect1 id="unicode-problems">
+ <title>Problems with the standard</title>
+
+ <para>The Unicode standard has seen a fair degree of criticism
+ from Indian linguists and researchers.</para>
+
+ <sect2>
+ <title>Disputed Characters in the Standard</title>
+ <simpara>
+ The standard defines code point <codepoint
+ character-set="unicode" codepoint-name="DEVANAGARI LETTER
+ SHORT A">0904</codepoint>; however, some linguists dispute
+ the existence of this character in the Devanagari script (see
+ <xref linkend="unicode-gautam-sengupta">).</simpara>
+ </sect2>
+
+ <sect2>
+ <title>Missing Characters</title>
+
+ <simpara>The Marathi script uses a grapheme that is a
+ combination of a <charactername>DEVANAGARI LETTER
+ A</charactername> and a <charactername>CANDRA</charactername>
+ mark. This grapheme is missing from the Unicode standard for
+ the Devanagari script, though a related grapheme <codepoint
+ character-set="unicode" codepoint-name="DEVANAGARI LETTER
+ CANDRA E">090D</codepoint> is present in the standard (see
+ <xref linkend="unicode-gautam-sengupta">).</simpara>
+ </sect2>
+
+ <sect2>
+ <title>Inconsistent Semantics</title>
+
+ <simpara>The published policy of the Unicode consortium is to
+ disallow use of the <codepoint character-set="unicode"
+ codepoint-name="ZERO WIDTH JOINER">200D</codepoint> (ZWJ)
+ character to encode semantic differences. The original
+ purpose of the ZWJ was to signal possible script ligation, so
+ the underlying meaning of a sequence of Unicode characters was
+ to be independent of the presence or absence of the ZWJ
+ character inside it.</simpara>
+
+ <simpara>However, this published policy was violated for the
+ Devanagari script; for this script ZWJ was defined as encoding
+ display variants of conjunct consonants. Encoding display
+ variants was a major deviation from the display-independent
+ nature of the Unicode standard.</simpara>
+
+ <simpara>Subsequently, for Indic scripts alone, the consortium
+ chose to define the ZWJ character as (sometimes) causing a
+ semantic distinction.</simpara>
+
+ <simpara>This implies that for Indic scripts two sequences of
+ Unicode codepoints that are identical except for the presence
+ of ZWJ codepoints could sometimes represent two different
+ words and could at other times represent an alternate display
+ form of the same word. This inconsistency makes processing
+ Indic text difficult; see <xref
+ linkend="unicode-gautam-sengupta"> for an example of the
+ complications faced when implementing a Marathi spell
+ checker.</simpara>
+ </sect2>
+
+ </sect1>
 </chapter>
 
 <!--
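
The "Inconsistent Semantics" text in the diff above describes pairs of codepoint sequences that are identical except for ZWJ. The following is a minimal Python sketch of how a text-processing tool might detect such pairs. The sample sequences (KA, VIRAMA, SSA with and without U+200D) and the helper names codepoints, strip_zwj and differ_only_by_zwj are illustrative assumptions; they are not taken from the commit or from the reference it cites. The sketch only detects the pair. It cannot say whether the ZWJ marks a different word or an alternate display form, which is exactly the ambiguity the text describes.

```python
import unicodedata

ZWJ = "\u200d"  # U+200D ZERO WIDTH JOINER


def codepoints(text):
    """List each character as 'U+XXXX NAME' (hypothetical helper)."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


def strip_zwj(text):
    """Return text with every ZWJ removed (hypothetical helper)."""
    return text.replace(ZWJ, "")


def differ_only_by_zwj(a, b):
    """True if a and b are unequal but match once ZWJ is removed."""
    return a != b and strip_zwj(a) == strip_zwj(b)


# Illustrative sample data, chosen for this sketch only.
plain = "\u0915\u094d\u0937"         # KA, VIRAMA, SSA
joined = "\u0915\u094d\u200d\u0937"  # KA, VIRAMA, ZWJ, SSA

if __name__ == "__main__":
    for label, seq in (("plain", plain), ("joined", joined)):
        print(label, codepoints(seq))
    print("differ only by ZWJ:", differ_only_by_zwj(plain, joined))
```

A spell checker that compares strings codepoint by codepoint treats the two sequences as different entries, while one that strips ZWJ first treats them as the same entry; neither choice is safe if ZWJ is sometimes semantic and sometimes a display hint.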