[Indic-computing-cvs-logs] SF.net SVN: indic-computing: [322] doc/trunk/en_US.ISO8859-1/books/handb

Revision: 322
          http://indic-computing.svn.sourceforge.net/indic-computing/?rev=322&view=rev
Author:   jkoshy
Date:     2007-12-30 00:24:20 -0800 (Sun, 30 Dec 2007)

Log Message:
-----------
List a few of the problems currently reported as affecting the
Unicode standard.

Modified Paths:
--------------
    doc/trunk/en_US.ISO8859-1/books/handbook/std.unicode/chapter.sgml

Modified: doc/trunk/en_US.ISO8859-1/books/handbook/std.unicode/chapter.sgml
===================================================================

--- doc/trunk/en_US.ISO8859-1/books/handbook/std.unicode/chapter.sgml	2007-12-30 08:00:04 UTC (rev 321)
+++ doc/trunk/en_US.ISO8859-1/books/handbook/std.unicode/chapter.sgml	2007-12-30 08:24:20 UTC (rev 322)
@@ -16,9 +16,78 @@
     <secondary>Unicode</secondary>
   </indexterm>
 
-  <para>This chapter will contain a discussion about the Unicode
-    standard and its support for Indian languages and scripts.</para>
+  <para>This chapter contains a discussion about the Unicode standard
+    and its support for Indian languages and scripts.</para>
 
+  <sect1>
+    <title>Introduction</title>
+    <para>
+    </para>
+  </sect1>
+
+  <sect1 id="unicode-problems">
+    <title>Problems with the standard</title>
+
+    <para>The Unicode standard has seen a fair degree of criticism
+      from Indian linguists and researchers.</para>
+
+    <sect2>
+      <title>Disputed Characters in the Standard</title>
+      <simpara>
+        The standard defines code point <codepoint
+          character-set="unicode" codepoint-name="DEVANAGARI LETTER
+          SHORT A">0904</codepoint>; however, some linguists dispute
+          the existence of this character in the Devanagari script (see
+          <xref linkend="unicode-gautam-sengupta">).</simpara>
+    </sect2>
+
+    <sect2>
+      <title>Missing Characters</title>
+
+      <simpara>Marathi, which is written in the Devanagari script, uses
+        a grapheme that combines <charactername>DEVANAGARI LETTER
+        A</charactername> with a <charactername>CANDRA</charactername>
+        mark.  This grapheme is missing from the Unicode standard for
+        the Devanagari script, though a related grapheme <codepoint
+        character-set="unicode" codepoint-name="DEVANAGARI LETTER
+        CANDRA E">090D</codepoint> is present in the standard (see
+        <xref linkend="unicode-gautam-sengupta">).</simpara>
+    </sect2>
+
+    <sect2>
+      <title>Inconsistent Semantics</title>
+
+      <simpara>The published policy of the Unicode Consortium is to
+        disallow use of the <codepoint character-set="unicode"
+        codepoint-name="ZERO WIDTH JOINER">200D</codepoint> (ZWJ)
+        character to encode semantic differences.  The original
+        purpose for the ZWJ was to signal possible script ligation; so
+        the underlying meaning of a sequence of Unicode characters was
+        to be independent of the presence or absence of the ZWJ
+        character inside it.</simpara>
+
+      <simpara>However, this published policy was violated for the
+        Devanagari script; for this script, ZWJ was defined as encoding
+        display variants of conjunct consonants.  Encoding display
+        variants was a major deviation from the display-independent
+        nature of the Unicode standard.</simpara>
+
+      <simpara>Subsequently, for Indic scripts alone, the Consortium
+        chose to define the ZWJ character as (sometimes) causing a
+        semantic distinction.</simpara>
+
+      <simpara>This implies that, for Indic scripts, two sequences of
+        Unicode code points that are identical except for the presence
+        of ZWJ code points could sometimes represent two different
+        words and could at other times represent alternate display
+        forms of the same word.  This inconsistency makes processing
+        Indic text difficult; see <xref
+        linkend="unicode-gautam-sengupta"> for an example of the
+        complications faced when implementing a Marathi spell
+        checker.</simpara>
+    </sect2>
+
+  </sect1>
 </chapter>
 
 <!--

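The ZWJ issue described under "Inconsistent Semantics" can be illustrated with a minimal Python sketch. It is not part of the commit. The sample sequence (KA, VIRAMA, SSA) and the helper name strip_zwj are assumptions chosen for illustration only, and the sketch does not decide whether a given ZWJ is semantic or merely a display hint, which is exactly the difficulty the text describes.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER

# Illustrative sequence only (an assumption, not taken from the commit):
# DEVANAGARI LETTER KA + SIGN VIRAMA + LETTER SSA, without and with a ZWJ.
plain = "\u0915\u094d\u0937"
with_zwj = "\u0915\u094d" + ZWJ + "\u0937"


def strip_zwj(text: str) -> str:
    """Hypothetical helper: remove every ZWJ from text.

    This is only safe when the ZWJ is a pure display hint; where it
    carries a semantic distinction, stripping it changes the word.
    """
    return text.replace(ZWJ, "")


# The two sequences are different strings to any code point comparison.
sequences_differ = plain != with_zwj

# Unicode normalization does not remove ZWJ, so it cannot reconcile them.
nfc_keeps_zwj = ZWJ in unicodedata.normalize("NFC", with_zwj)

# Only an explicit, lossy stripping step makes them compare equal.
equal_after_strip = strip_zwj(with_zwj) == plain

# Character names of the code points mentioned in the chapter.
names = {cp: unicodedata.name(chr(cp)) for cp in (0x0904, 0x090D, 0x200D)}
```

A spell checker or search index has to choose between comparing raw sequences (and treating display variants as different words) or stripping ZWJ (and possibly merging words that differ in meaning); neither choice is correct in every case.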
