Re: [Ocaml-lib-devel] A proposal for Unicode character and UTF-8 modules.

From: John Max Skaller <sk...@oz...>
Subject: Re: [Ocaml-lib-devel] A proposal for Unicode character and UTF-8 modules.
Date: Thu, 19 Jun 2003 22:51:32 +1000

> > let look s i =
> >   let n' =
> >     let n = Char.code s.[i] in
> >     if n < 0x80 then n else
> >     if 0xc2 <= n && n <= 0xdf then
> >       look_code s (i + 1) 1 (n - 0xc0)
> >     else if 0xe0 <= n && n <= 0xef then
> >       look_code s (i + 1) 2 (n - 0xe0)
> >     else if 0xf0 <= n && n <= 0xf7 then
> >       look_code s (i + 1) 3 (n - 0xf0)
> >     else if 0xf8 <= n && n <= 0xfb then
> >       look_code s (i + 1) 4 (n - 0xf8)
> >     else if 0xfc <= n && n <= 0xfd then
> >       look_code s (i + 1) 5 (n - 0xfc)
> >     else invalid_arg "UTF8"
> >   in
> >   uchar_of_int n'
> 
> 
> This is inefficient. you want:
> 
> 	if n <= 0x7F then n else
> 	if n <= 0xc1 then invalid_arg "UTF8" else
> 	if n <= 0xdf then ...

It was intentional, mostly for documentation.  But you may be right.
Unrolling look_code would also help performance, and I think the
other functions can be made faster, too.  (I am gradually recalling
the thinking behind the code, which is about a year old.)
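
To make the two suggestions concrete, here is a rough sketch (not the
code under discussion) that combines the single-comparison cascade with
an unrolled continuation read.  The names decode_at and cont are
hypothetical, the 5- and 6-byte forms handled by the quoted code are
left out, and overlong 3- and 4-byte sequences are not rejected.

```ocaml
(* Sketch only: decode_at and cont are hypothetical names.
   Each test is a single upper-bound comparison, because the earlier
   branches have already excluded every smaller value. *)
let decode_at (s : string) (i : int) : int =
  (* Read continuation byte number k after the lead byte and return
     its low six bits; reject anything that is not 10xxxxxx. *)
  let cont k =
    let c = Char.code s.[i + k] in
    if c land 0xc0 <> 0x80 then invalid_arg "UTF8" else c land 0x3f
  in
  let n = Char.code s.[i] in
  if n <= 0x7f then n
  else if n <= 0xc1 then invalid_arg "UTF8" (* stray continuation or overlong lead *)
  else if n <= 0xdf then ((n - 0xc0) lsl 6) lor cont 1
  else if n <= 0xef then
    ((n - 0xe0) lsl 12) lor (cont 1 lsl 6) lor cont 2
  else if n <= 0xf7 then
    ((n - 0xf0) lsl 18) lor (cont 1 lsl 12) lor (cont 2 lsl 6) lor cont 3
  else invalid_arg "UTF8" (* 5- and 6-byte forms omitted in this sketch *)
```

For example, decode_at "\xc3\xa9" 0 yields 0xe9, and a lead byte of
0xc0 or 0xc1 is rejected by the second comparison alone.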

From: John Max Skaller <sk...@oz...>
Subject: Re: [Ocaml-lib-devel] A proposal for Unicode character and UTF-8 modules.
Date: Thu, 19 Jun 2003 23:07:42 +1000

> 	ustring -- string of UCS-4 (32 bit values)

I also have a UCS-4 implementation based on an OCaml int array, which I
can contribute.  As for int32 vs int, is it acceptable to use in Extlib?
UTF-16 needs a 16-bit int array, and implementing UCS-4 with a 32-bit
int array would have an advantage for the C FFI.
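
The FFI point can be illustrated as follows.  An OCaml int array holds
tagged machine words, so C code cannot read it as an array of 32-bit
integers, whereas a Bigarray of int32 elements is a flat buffer of
32-bit values.  This is only a sketch: the use of Bigarray is my
assumption (the message does not name a representation), and
ucs4_of_codes and codes_of_ucs4 are hypothetical helpers.

```ocaml
(* Sketch only: Bigarray is an assumed representation, and the two
   conversion functions are hypothetical helpers. *)
type ucs4 = (int32, Bigarray.int32_elt, Bigarray.c_layout) Bigarray.Array1.t

(* Copy code points from an int array into a flat 32-bit buffer. *)
let ucs4_of_codes (codes : int array) : ucs4 =
  let a =
    Bigarray.Array1.create Bigarray.int32 Bigarray.c_layout
      (Array.length codes)
  in
  Array.iteri (fun i c -> a.{i} <- Int32.of_int c) codes;
  a

(* Copy the 32-bit buffer back into an ordinary int array. *)
let codes_of_ucs4 (a : ucs4) : int array =
  Array.init (Bigarray.Array1.dim a) (fun i -> Int32.to_int a.{i})
```

The int array version is simpler to use from OCaml; the flat buffer
costs a boxed Int32 on each access but can be handed to C unchanged.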

> In addition, I need to be able to convert literals.
> These things also have various C like escapes in them,
> including
> 
> 	\uXXXX and \UXXXXXXXX
> 
> escapes. My constant folder must also be able to
> concatenate strings, etc.

Is this style of escaping used in the ISO standard?  I think the Unicode
book (Ver 3.2) uses \uXXXX and \vXXXXXXXX.  If yours is the ISO standard,
then I will use yours.  Concatenation is easy, because the UTF-8 type is
currently just an OCaml string.  But I could add more to the API.  Maybe
all the functions that appear in String, except the ones that depend on
the locale (casing, I mean)?
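
As an illustration of the literal conversion being asked for, here is a
minimal sketch that expands \uXXXX and \UXXXXXXXX into UTF-8 and copies
every other byte unchanged.  The names add_utf8 and expand_escapes are
hypothetical, the other C-like escapes are not handled, and only code
points up to 0x10ffff (excluding surrogates) are accepted, which is an
assumption of this sketch.

```ocaml
(* Sketch only: add_utf8 and expand_escapes are hypothetical names. *)

(* Append the UTF-8 encoding of code point u to buf. *)
let add_utf8 buf u =
  let add n = Buffer.add_char buf (Char.chr n) in
  if u < 0 || u > 0x10ffff || (0xd800 <= u && u <= 0xdfff) then
    invalid_arg "add_utf8"
  else if u <= 0x7f then add u
  else if u <= 0x7ff then begin
    add (0xc0 lor (u lsr 6));
    add (0x80 lor (u land 0x3f))
  end else if u <= 0xffff then begin
    add (0xe0 lor (u lsr 12));
    add (0x80 lor ((u lsr 6) land 0x3f));
    add (0x80 lor (u land 0x3f))
  end else begin
    add (0xf0 lor (u lsr 18));
    add (0x80 lor ((u lsr 12) land 0x3f));
    add (0x80 lor ((u lsr 6) land 0x3f));
    add (0x80 lor (u land 0x3f))
  end

(* Expand \uXXXX (4 hex digits) and \UXXXXXXXX (8 hex digits) in s. *)
let expand_escapes s =
  let len = String.length s in
  let buf = Buffer.create len in
  (* Parse exactly n hex digits starting at offset i. *)
  let hex i n =
    if i + n > len then invalid_arg "expand_escapes";
    let digits = String.sub s i n in
    String.iter
      (function
        | '0' .. '9' | 'a' .. 'f' | 'A' .. 'F' -> ()
        | _ -> invalid_arg "expand_escapes")
      digits;
    int_of_string ("0x" ^ digits)
  in
  let rec go i =
    if i >= len then ()
    else if s.[i] = '\\' && i + 1 < len && (s.[i + 1] = 'u' || s.[i + 1] = 'U')
    then begin
      let n = if s.[i + 1] = 'u' then 4 else 8 in
      add_utf8 buf (hex (i + 2) n);
      go (i + 2 + n)
    end else begin
      Buffer.add_char buf s.[i];
      go (i + 1)
    end
  in
  go 0;
  Buffer.contents buf
```

Because the result is an ordinary string, a constant folder can
concatenate two expanded literals with ^ and the result is still valid
UTF-8, provided both inputs were.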

> I don't actually need
> the code to be fast, but I do need a way to enforce
> the typing properly -- I don't want to mix up
> the string and bytestring.

Ideally, I would agree with you.  But in current OCaml, a string literal
is a bytestring, and pattern matching doesn't work on an abstract type.
So I think it is better to retain the equality UTF8.t = string at this
stage.  Anyway, I think that coinciding with the ASCII string when all
characters are ASCII is the only advantage of UTF-8.  If we can accept
an abstract Unicode string, then it would be better to use UTF-16 or
UCS-4.
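
For comparison, the abstract alternative might look like the sketch
below.  It is not a proposed interface: the module name Utf8 and the
functions of_string, to_string and append are hypothetical, and the
validation only checks lead bytes and continuation bytes for sequences
of up to four bytes.

```ocaml
(* Sketch only: a hypothetical abstract UTF-8 type.  Outside the
   module, t is not known to be equal to string. *)
module Utf8 : sig
  type t
  val of_string : string -> t   (* raises Invalid_argument if malformed *)
  val to_string : t -> string
  val append : t -> t -> t
end = struct
  type t = string

  (* Number of bytes announced by lead byte n. *)
  let width n =
    if n <= 0x7f then 1
    else if n <= 0xc1 then invalid_arg "UTF8"
    else if n <= 0xdf then 2
    else if n <= 0xef then 3
    else if n <= 0xf4 then 4
    else invalid_arg "UTF8"

  let of_string s =
    let len = String.length s in
    let rec check i =
      if i < len then begin
        let w = width (Char.code s.[i]) in
        if i + w > len then invalid_arg "UTF8";
        for k = 1 to w - 1 do
          if Char.code s.[i + k] land 0xc0 <> 0x80 then invalid_arg "UTF8"
        done;
        check (i + w)
      end
    in
    check 0;
    s

  let to_string s = s

  (* Concatenating two validated strings gives a valid string. *)
  let append a b = a ^ b
end
```

With this signature the type checker keeps bytestrings and UTF-8
strings apart, but a literal has to go through Utf8.of_string and a
Utf8.t can no longer be matched against a string pattern, which is the
cost described above.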

--
Yamagata Yoriyuki