From: Gerd S. <in...@ge...> - 2001-11-18 22:01:59
On 2001.11.16 21:04 YAMAGATA yoriyuki wrote:

> First, I'd say that I don't propose that ocamlnet, or any other
> protocol implementation, has to be based on UCS-4. In my
> understanding, most protocols are designed for byte streams, not
> Unicode streams. The data types in camomile are for higher-level
> manipulation. (If posting to ocamlnet causes confusion, I apologize
> for that.)

Okay, but we need some basic data type for the interface between the
protocol layer and the higher layers.

> From: Gerd Stolpmann <in...@ge...>
> Subject: Re: [Ocamlnet-devel] an experimental implementation of Unicode support.
> Date: Fri, 16 Nov 2001 01:48:12 +0100
>
> > I must admit that I do not have very much experience with multi-byte
> > encodings,
>
> Me neither. My experience is almost entirely that of a user, not an
> implementer. Well, I feel my previous comments were a bit too strong.
> But I still don't agree that UTF-8 is the way.
>
> There are two different issues involved, I think. First, I'd like to
> object to the idea that we keep using the string type for Unicode,
> implicitly assuming that it is encoded as UTF-8. Whether simple or
> hard, in this case programmers have to be well aware of the fact that
> they are manipulating UTF-8 encoded strings. It is much safer to
> provide a new abstract data type for Unicode strings, and whenever one
> wants to work with UTF-8, one has to explicitly encode the given
> Unicode strings as UTF-8 strings.
>
> The second issue is: why not use UTF-8 as the internal representation
> of Unicode strings? This is because:
>
> 1) Memory management: We cannot predict how much space is needed from
>    the number of Unicode characters.

Is this really a problem?

> 2) In-place update, indexing: Both become inefficient, while they are
>    widely used in current string manipulation in OCaml.

I cannot definitely say how often in-place updates are really used, but
my impression is that they are relatively seldom (and very low-level).
Most code modifying strings makes copies.
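To illustrate the memory-management point with a sketch (my own
illustration, not code from camomile or ocamlnet): the number of bytes
a code point occupies in UTF-8 depends on its value, so the byte
length of a string cannot be derived from its character count. The
helper below assumes the 4-byte upper bound of modern UTF-8:

```ocaml
(* Hypothetical helper: bytes needed to encode one code point in
   UTF-8, assuming the modern 4-byte maximum. *)
let utf8_width (code : int) : int =
  if code < 0x80 then 1          (* ASCII *)
  else if code < 0x800 then 2    (* e.g. accented Latin letters *)
  else if code < 0x10000 then 3  (* e.g. most CJK characters *)
  else 4                         (* supplementary planes *)

let () =
  (* Four one-character strings, four different byte lengths. *)
  List.iter
    (fun c -> Printf.printf "U+%04X -> %d byte(s)\n" c (utf8_width c))
    [ 0x41; 0xE9; 0x3042; 0x1D11E ]
```

So allocating a UTF-8 buffer for n characters means either scanning the
input first or over-allocating; with UCS-4 the size is simply 4*n.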
> 3) Presence of combined characters: To properly handle combined
>    characters, manipulation (regex, sorting, etc.) of Unicode strings
>    becomes hard, regardless of which representation we use. (See
>    below for more discussion.)

This is definitely a problem, and I hope it only concerns the higher
layers and not ocamlnet.

> Anyway, camomile is just an experiment. We can change the internal
> representation in the future.
>
> As for Gerd's analysis, I must admit that I don't know much about
> UTF-8 handling, especially regex. (UTF-8, or Unicode itself, is
> currently not widely used in Japan.) In fact, I don't have a Unicode
> book, though I referred to the ISO standard while implementing
> camomile. But here are my comments about it.
>
> > - Simple: sort strings alphabetically by code points ("C" locale)
>
> It's simple for UCS-4, too. And if we take composed characters into
> account, things become hard anyway, since they have several
> representations as code points.
>
> > - Simple: Regular expressions that do not contain character ranges
> >   (i.e. do not use [c1-c2]). In this case existing regexp engines
> >   work.
>
> .... Really? Correct me if I misunderstand. I don't know much about
> the practice of Unicode regex.
>
> Take the pattern "." matching every single Unicode character. It
> becomes "[\0x00-\0x7f]|\([\0xc0-\0xdf].\)| ....". If we take combined
> characters into account, things become more complicated.
> Theoretically such a translation is possible (I think), but I myself
> have no confidence that I could implement it without bugs. Now we
> have Vouillon's RE, so I have some hope of making an OCaml-native
> Unicode regex engine.

"." is in most cases simply "." because you have delimiters. For
example, if you want to extract everything between < and >, the regexp
"<.*>" still works. Fortunately, this is true for most regexps using
".". However, if the number of characters counts, the more accurate
translation "[\0x00-\0x7f]|\([\0xc0-\0xdf].\)| ...." must be used.
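The byte-class dispatch behind that translation can be sketched
directly in OCaml (my own illustration, not code from camomile or RE):
each branch below corresponds to one alternative of
"[\x00-\x7f]|[\xc0-\xdf].|[\xe0-\xef]..|...", and character counting
becomes "repeatedly match one character":

```ocaml
(* How many bytes the UTF-8 character starting at byte i occupies,
   decided by the lead byte alone. *)
let utf8_char_bytes (s : string) (i : int) : int =
  match Char.code s.[i] with
  | b when b < 0x80 -> 1   (* [\x00-\x7f]              *)
  | b when b < 0xC0 -> invalid_arg "utf8_char_bytes: continuation byte"
  | b when b < 0xE0 -> 2   (* [\xc0-\xdf] plus 1 byte  *)
  | b when b < 0xF0 -> 3   (* [\xe0-\xef] plus 2 bytes *)
  | _ -> 4                 (* [\xf0-\xf7] plus 3 bytes *)

(* Character count = number of times "one character" matches. *)
let utf8_length (s : string) : int =
  let rec go i n =
    if i >= String.length s then n
    else go (i + utf8_char_bytes s i) (n + 1)
  in
  go 0 0
```

For example, utf8_length "abc" is 3, while a single three-byte
Japanese character such as "\xE3\x81\x82" also counts as 1. This is
exactly why "...." (four bytes) and "four characters" diverge in
UTF-8.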
For instance, if you want to match exactly four characters, "...." no
longer works, for obvious reasons.

My experience is: Simple regular expressions work in UTF-8 without
causing headaches. Complicated regular expressions are either
difficult to formulate, or they have bad runtime behaviour.

What about the idea of having a basic ustring type that supports both
encodings? This could be modeled with phantom types like in the
Bigarray module:

type utf8
type ucs4
type 'a ustring    (* 'a is either utf8 or ucs4 *)

This makes it possible to reflect the representation in the type if
needed, or to omit these details if they do not matter.

val utf8_of_string : ?len:int -> string -> utf8 ustring
val ucs4_of_string : string -> ucs4 ustring
val force_utf8 : 'a ustring -> utf8 ustring
val force_ucs4 : 'a ustring -> ucs4 ustring
val string_of_utf8 : utf8 ustring -> string
val string_of_ucs4 : ucs4 ustring -> string
val string_of_any : 'a ustring -> [ `UTF8 of string | `UCS4 of string ]

These functions allow arbitrary conversions. Furthermore, it is
possible to access the representation directly (i.e. the underlying
buffers), which is necessary for I/O and for adding missing low-level
operators outside the core module.

val length : 'a ustring -> int       (* length in characters *)
val byte_length : 'a ustring -> int  (* length of the representation *)

val create_cursor : 'a ustring -> 'a ucursor
val incr_position : 'a ucursor -> unit
val decr_position : 'a ucursor -> unit
val get : 'a ucursor -> uchar
val set : 'a ucursor -> uchar -> unit         (* slow for UTF-8 *)
val get_position : 'a ucursor -> int
val set_position : 'a ucursor -> int -> unit  (* slow for UTF-8 *)
val byte_position : 'a ucursor -> int

The idea of cursors is to have a method that allows us to refer to
individual characters without using character positions.
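To show how the phantom-type trick could work, here is a minimal,
self-contained sketch of a few of the proposed functions. The concrete
representations (a plain string for UTF-8, an int array for UCS-4) and
the constructor names Utf8_rep/Ucs4_rep are my own assumptions, not
part of the proposal:

```ocaml
module Ustring : sig
  type utf8
  type ucs4
  type 'a ustring   (* 'a is a phantom parameter: utf8 or ucs4 *)
  val utf8_of_string : string -> utf8 ustring
  val ucs4_of_array : int array -> ucs4 ustring
  val string_of_utf8 : utf8 ustring -> string
  val byte_length : 'a ustring -> int
end = struct
  type utf8
  type ucs4
  (* One concrete type behind the phantom parameter; the signature
     above restricts which constructor each phantom index can hold. *)
  type 'a ustring =
    | Utf8_rep of string
    | Ucs4_rep of int array
  let utf8_of_string s = Utf8_rep s     (* no UTF-8 validation here *)
  let ucs4_of_array a = Ucs4_rep a
  let string_of_utf8 = function
    | Utf8_rep s -> s
    | Ucs4_rep _ -> assert false  (* unreachable: type forbids it *)
  let byte_length = function
    | Utf8_rep s -> String.length s
    | Ucs4_rep a -> 4 * Array.length a
end
```

With this ascription, string_of_utf8 (ucs4_of_array [|65|]) is rejected
by the type checker, while byte_length works on both representations,
just as intended by the "omit the details if they do not matter" style.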
Of course, there should also be an iterator:

val iter : ?from:'a ucursor -> ?upto:'a ucursor ->
           ('a ucursor -> unit) -> 'a ustring -> unit

val make_ucs4 : int -> uchar -> ucs4 ustring
val make_utf8 : int -> uchar -> utf8 ustring
val make_like : 'a ustring -> int -> uchar -> 'a ustring

For constructors, it is necessary to select a representation.
make_like creates a string with the same representation as an already
existing string. I wouldn't add something like String.create (without
initialization) because this might result in invalid strings (for both
representations).

val sub : 'a ustring -> 'a ucursor -> 'a ucursor -> 'a ustring

This returns everything between the two cursors. Similar interfaces
that use cursors instead of integer positions are possible for all of
the other string functions (of course, the UTF-8 representation is
sometimes slower).

This design would have the advantage that one can select one of three
styles and mix them in the same program:

- use only UTF-8 and profit from existing libraries
- use only UCS-4 and get faster code
- do not specify the representation, and get better interoperability

What do you think about this?

Gerd

--
----------------------------------------------------------------------------
Gerd Stolpmann      Telefon: +49 6151 997705 (privat)
Viktoriastr. 45
64293 Darmstadt     EMail:   ge...@ge...
Germany
----------------------------------------------------------------------------