From: John Max Skaller <skaller@...>
Subject: Re: [Ocaml-lib-devel] A proposal for Unicode character and UTF-8 modules.
Date: Thu, 19 Jun 2003 22:51:32 +1000
> > let look s i =
> >   let n' =
> >     let n = Char.code s.[i] in
> >     if n < 0x80 then n else
> >     if 0xc2 <= n && n <= 0xdf then
> >       look_code s (i + 1) 1 (n - 0xc0)
> >     else if 0xe0 <= n && n <= 0xef then
> >       look_code s (i + 1) 2 (n - 0xe0)
> >     else if 0xf0 <= n && n <= 0xf7 then
> >       look_code s (i + 1) 3 (n - 0xf0)
> >     else if 0xf8 <= n && n <= 0xfb then
> >       look_code s (i + 1) 4 (n - 0xf8)
> >     else if 0xfc <= n && n <= 0xfd then
> >       look_code s (i + 1) 5 (n - 0xfc)
> >     else invalid_arg "UTF8"
> >   in
> >   uchar_of_int n'
>
>
> This is inefficient. You want:
>
> if n <= 0x7F then n else
> if n <= 0xc1 then invalid_arg "UTF8" else
> if n <= 0xdf then ...
It was intentional, mostly for documentation. But you may be right.
Unrolling look_code would also be better for performance. I think the
performance of the other functions can be improved, too. (I am
gradually recalling the reasoning behind the code, which is about a
year old.)
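
A minimal sketch of what the two suggestions could look like together:
the lead byte is classified by an ordered chain of upper-bound tests,
and the continuation bytes are read inline instead of through a
look_code loop. The names cont and look_int are hypothetical, the
result is a plain int rather than a uchar, and only sequences of up to
4 bytes are handled (the quoted code also accepted the 5- and 6-byte
forms). Overlong 3- and 4-byte forms are not rejected here.

```ocaml
(* Sketch only; [cont] and [look_int] are illustrative names, not the
   library's API. *)

(* Read the continuation byte at index [j] and return its low 6 bits.
   A truncated sequence raises Invalid_argument from the string access. *)
let cont s j =
  let c = Char.code s.[j] in
  if c land 0xc0 <> 0x80 then invalid_arg "UTF8" else c land 0x3f

(* Decode the UTF-8 sequence starting at byte index [i] of [s]. *)
let look_int s i =
  let n = Char.code s.[i] in
  (* Each test relies on the earlier ones having failed, so one
     upper-bound comparison per branch is enough. *)
  if n <= 0x7f then n
  else if n <= 0xc1 then invalid_arg "UTF8"
  else if n <= 0xdf then
    ((n land 0x1f) lsl 6) lor (cont s (i + 1))
  else if n <= 0xef then
    ((n land 0x0f) lsl 12)
    lor ((cont s (i + 1)) lsl 6)
    lor (cont s (i + 2))
  else if n <= 0xf7 then
    ((n land 0x07) lsl 18)
    lor ((cont s (i + 1)) lsl 12)
    lor ((cont s (i + 2)) lsl 6)
    lor (cont s (i + 3))
  else invalid_arg "UTF8"
```

Whether the unrolled version is actually faster would need to be
measured; the sketch only shows the shape of the code.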
From: John Max Skaller <skaller@...>
Subject: Re: [Ocaml-lib-devel] A proposal for Unicode character and UTF-8 modules.
Date: Thu, 19 Jun 2003 23:07:42 +1000
> ustring -- string of UCS-4 (32 bit values)
I also have a UCS-4 implementation based on OCaml int arrays, and I can
contribute it. As for int32 vs int, is it acceptable to use in Extlib?
UTF-16 needs a 16-bit int array, and implementing UCS-4 with a 32-bit
int array would have an advantage for the C FFI.
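
To make the int vs int32 trade-off concrete, here is a rough sketch of
the two representations. Using Bigarray for the 32-bit variant is my
assumption for illustration (the message does not name a particular
module), and ucs4_of_list and to_int32_array are hypothetical names.

```ocaml
(* Sketch only: two possible representations of a UCS-4 string. *)

(* Plain OCaml int array: one native int per code point. *)
let ucs4_of_list (l : int list) : int array = Array.of_list l

(* The same data as a Bigarray of 32-bit integers in C layout.
   The elements are stored contiguously as raw 32-bit values, which is
   the property that matters when passing the data to C. *)
let to_int32_array (a : int array) =
  let ba =
    Bigarray.Array1.create Bigarray.int32 Bigarray.c_layout (Array.length a)
  in
  Array.iteri (fun i u -> Bigarray.Array1.set ba i (Int32.of_int u)) a;
  ba
```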
> In addition, I need to be able to convert literals.
> These things also have various C like escapes in them,
> including
>
> \uXXXX and \UXXXXXXXX
>
> escapes. My constant folder must also be able to
> concatenate strings, etc.
Is this style of escaping used in the ISO standard? I think the Unicode
book (Ver 3.2) uses \uXXXX and \vXXXXXXXX. If yours is the ISO standard,
then I would use yours. Concatenation is easy, because the UTF-8 type is
currently just an OCaml string. But I could add more API. Maybe all the
functions that appear in String, except the ones that depend on locale
(casing, I mean)?
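
As a rough sketch of how such literals could be converted while the
UTF-8 type is still a plain string: each \uXXXX or \UXXXXXXXX escape is
replaced by the UTF-8 encoding of its code point, and concatenation is
then ordinary string concatenation. The names add_utf8 and
expand_escapes are hypothetical, and other C-like escapes are passed
through untouched. The hex parsing leans on int_of_string, which raises
Failure on bad digits.

```ocaml
(* Sketch only; not a proposed API. *)

(* Append the UTF-8 encoding of code point [u] to buffer [b]. *)
let add_utf8 b u =
  if u < 0 || u > 0x10ffff || (0xd800 <= u && u <= 0xdfff) then
    invalid_arg "add_utf8"
  else if u <= 0x7f then Buffer.add_char b (Char.chr u)
  else if u <= 0x7ff then begin
    Buffer.add_char b (Char.chr (0xc0 lor (u lsr 6)));
    Buffer.add_char b (Char.chr (0x80 lor (u land 0x3f)))
  end else if u <= 0xffff then begin
    Buffer.add_char b (Char.chr (0xe0 lor (u lsr 12)));
    Buffer.add_char b (Char.chr (0x80 lor ((u lsr 6) land 0x3f)));
    Buffer.add_char b (Char.chr (0x80 lor (u land 0x3f)))
  end else begin
    Buffer.add_char b (Char.chr (0xf0 lor (u lsr 18)));
    Buffer.add_char b (Char.chr (0x80 lor ((u lsr 12) land 0x3f)));
    Buffer.add_char b (Char.chr (0x80 lor ((u lsr 6) land 0x3f)));
    Buffer.add_char b (Char.chr (0x80 lor (u land 0x3f)))
  end

(* Replace \uXXXX and \UXXXXXXXX in [src] by UTF-8 byte sequences. *)
let expand_escapes src =
  let len = String.length src in
  let b = Buffer.create len in
  (* Parse [digits] hex digits starting at index [pos]. *)
  let hex pos digits =
    if pos + digits > len then invalid_arg "expand_escapes"
    else int_of_string ("0x" ^ String.sub src pos digits)
  in
  let rec go i =
    if i < len then
      if src.[i] = '\\' && i + 1 < len && src.[i + 1] = 'u' then begin
        add_utf8 b (hex (i + 2) 4);
        go (i + 6)
      end else if src.[i] = '\\' && i + 1 < len && src.[i + 1] = 'U' then begin
        add_utf8 b (hex (i + 2) 8);
        go (i + 10)
      end else begin
        Buffer.add_char b src.[i];
        go (i + 1)
      end
  in
  go 0;
  Buffer.contents b
```

With this, expand_escapes "a" ^ expand_escapes "\\u20ac" is just the
byte string for "a" followed by the three bytes of U+20AC.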
> I don't actually need
> the code to be fast, but I do need a way to enforce
> the typing properly -- I don't want to mix up
> the string and bytestring.
Ideally, I would agree with you. But in current OCaml, a string literal
is a bytestring, and pattern matching does not work on an abstract type.
So I think it is better that we retain the equality UTF8.t = string at
this stage. Anyway, I think that being identical to an ASCII string when
all characters are ASCII is the only advantage of UTF-8. If we can be
satisfied with an abstract Unicode string, then it would be better to
use UTF-16 or UCS-4.
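
The trade-off can be sketched with two signatures over the same
implementation. All module and function names below are hypothetical,
and of_string does no validation here, which a real version would need.

```ocaml
(* Sketch only: transparent vs abstract UTF-8 string types. *)

module type TRANSPARENT = sig
  type t = string               (* the equality is visible to clients *)
  val length : t -> int         (* number of code points *)
end

module type ABSTRACT = sig
  type t                        (* representation hidden *)
  val of_string : string -> t
  val to_string : t -> string
  val length : t -> int
end

module Impl = struct
  type t = string
  (* A byte starts a character unless it is a continuation byte. *)
  let is_lead c = Char.code c land 0xc0 <> 0x80
  let length s =
    let n = ref 0 in
    String.iter (fun c -> if is_lead c then incr n) s;
    !n
  let of_string s = s           (* a real version would validate [s] *)
  let to_string s = s
end

module Open_utf8 : TRANSPARENT = Impl
module Closed_utf8 : ABSTRACT = Impl

(* With the transparent type, literals and pattern matching just work. *)
let a = Open_utf8.length "caf\xc3\xa9"
let is_empty (s : Open_utf8.t) = match s with "" -> true | _ -> false

(* With the abstract type, every literal needs an explicit conversion,
   and a pattern such as "" cannot be matched against Closed_utf8.t. *)
let b = Closed_utf8.length (Closed_utf8.of_string "caf\xc3\xa9")
```

The abstract version does give the typing guarantee asked for, since
Closed_utf8.length "abc" is rejected by the type checker.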
--
Yamagata Yoriyuki