Revision: 322
http://indic-computing.svn.sourceforge.net/indic-computing/?rev=322&view=rev
Author: jkoshy
Date: 2007-12-30 00:24:20 -0800 (Sun, 30 Dec 2007)
Log Message:
-----------
List a few of the problems currently reported as affecting the
Unicode standard.
Modified Paths:
--------------
doc/trunk/en_US.ISO8859-1/books/handbook/std.unicode/chapter.sgml
Modified: doc/trunk/en_US.ISO8859-1/books/handbook/std.unicode/chapter.sgml
===================================================================
--- doc/trunk/en_US.ISO8859-1/books/handbook/std.unicode/chapter.sgml 2007-12-30 08:00:04 UTC (rev 321)
+++ doc/trunk/en_US.ISO8859-1/books/handbook/std.unicode/chapter.sgml 2007-12-30 08:24:20 UTC (rev 322)
@@ -16,9 +16,78 @@
<secondary>Unicode</secondary>
</indexterm>
- <para>This chapter will contain a discussion about the Unicode
- standard and its support for Indian languages and scripts.</para>
+ <para>This chapter contains a discussion about the Unicode standard
+ and its support for Indian languages and scripts.</para>
+ <sect1>
+ <title>Introduction</title>
+ <para>
+ </para>
+ </sect1>
+
+ <sect1 id="unicode-problems">
+ <title>Problems with the standard</title>
+
+ <para>The Unicode standard has seen a fair degree of criticism
+ from Indian linguists and researchers.</para>
+
+ <sect2>
+ <title>Disputed Characters in the Standard</title>
+ <simpara>
+ The standard defines code point <codepoint
+ character-set="unicode" codepoint-name="DEVANAGARI LETTER
+ SHORT A">0904</codepoint>; however, some linguists dispute
+ the existence of this character in the Devanagari script (see
+ <xref linkend="unicode-gautam-sengupta">).</simpara>
+ </sect2>
+
+ <sect2>
+ <title>Missing Characters</title>
+
+ <simpara>The Marathi script uses a grapheme that is a
+ combination of a <charactername>DEVANAGARI LETTER
+ A</charactername> and a <charactername>CANDRA</charactername>
+ mark. This grapheme is missing from the Unicode standard for
+ the Devanagari script, though a related grapheme <codepoint
+ character-set="unicode" codepoint-name="DEVANAGARI LETTER
+ CANDRA E">090D</codepoint> is present in the standard (see
+ <xref linkend="unicode-gautam-sengupta">).</simpara>
+ </sect2>
+
+ <sect2>
+ <title>Inconsistent Semantics</title>
+
+ <simpara>The published policy of the Unicode Consortium is to
+ disallow use of the <codepoint character-set="unicode"
+ codepoint-name="ZERO WIDTH JOINER">200D</codepoint> (ZWJ)
+ character to encode semantic differences. The original
+ purpose for the ZWJ was to signal possible script ligation; so
+ the underlying meaning of a sequence of Unicode characters was
+ to be independent of the presence or absence of the ZWJ
+ character inside it.</simpara>
+
+ <simpara>However, this published policy was violated for the
+ Devanagari script; for this script ZWJ was defined as encoding
+ display variants of conjunct consonants. Encoding display
+ variants was a major deviation from the display-independent
+ nature of the Unicode standard.</simpara>
+
+ <simpara>Subsequently, for Indic scripts alone, the consortium
+ chose to define the ZWJ character as (sometimes) causing a
+ semantic distinction.</simpara>
+
+ <simpara>This implies that for Indic scripts two sequences of
+ Unicode code points that are identical except for the presence
+ of ZWJ code points could sometimes represent two different
+ words and could at other times represent an alternate display
+ form of the same word. This inconsistency makes processing
+ Indic text difficult; see <xref
+ linkend="unicode-gautam-sengupta"> for an example of the
+ complications faced when implementing a Marathi spell
+ checker.</simpara>
+ </sect2>
+
+ </sect1>
</chapter>
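
Note on the "Inconsistent Semantics" section above: the following is a minimal Python sketch, not part of the commit, of why text-processing code cannot treat ZWJ uniformly. The Devanagari sequence KA + VIRAMA + SSA and the helper names `strip_zwj` and `differ_only_by_zwj` are illustrative assumptions chosen for this example; they do not come from the handbook or from the cited reference.

```python
import unicodedata

ZWJ = "\u200D"  # ZERO WIDTH JOINER

# Illustrative Devanagari code points (chosen for this sketch only).
KA, VIRAMA, SSA = "\u0915", "\u094D", "\u0937"

# Two sequences that are identical except for the presence of ZWJ.
conjunct = KA + VIRAMA + SSA         # requests the conjunct form
half_form = KA + VIRAMA + ZWJ + SSA  # requests a half-form display variant


def strip_zwj(text: str) -> str:
    """Remove every ZWJ from text (hypothetical helper)."""
    return text.replace(ZWJ, "")


def differ_only_by_zwj(a: str, b: str) -> bool:
    """True when a and b are distinct but become equal once ZWJ is removed."""
    return a != b and strip_zwj(a) == strip_zwj(b)


if __name__ == "__main__":
    # A plain code point comparison treats the two sequences as different.
    print(conjunct == half_form)
    # Unicode normalization does not fold the difference away either.
    print(unicodedata.normalize("NFC", half_form) == half_form)
    # Only explicit ZWJ stripping makes them compare equal.
    print(differ_only_by_zwj(conjunct, half_form))
```

A spell checker that calls `strip_zwj` before a dictionary lookup is correct only where ZWJ selects a display variant. Where ZWJ carries a semantic distinction, the same step would merge two different words, so the code needs per-script or per-word knowledge to decide.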