Thread: [Ocaml-lib-devel] A proposal for Unicode character and UTF-8 modules.

[Ocaml-lib-devel] A proposal for Unicode character and UTF-8 modules.

From: Yamagata Y. <yor...@mb...> - 2003-06-17 15:13:55

Attachments: uChar.mli uChar.ml uTF8.mli uTF8.ml

Here are proposals for Unicode character and UTF-8 modules.  They were
part of my camomile library, and passed the random tests.

I do not know the coding conventions of ExtLib, so I am virtually
certain that there are many points that must be fixed.

--
Yamagata Yoriyuki

Re: [Ocaml-lib-devel] A proposal for Unicode character and UTF-8 modules.

From: Nicolas C. <war...@fr...> - 2003-06-19 05:16:07

> Here are proposals for Unicode character and UTF-8 modules.  They were
> part of my camomile library, and passed the random tests.
>
> I do not know the coding conventions of ExtLib, so I am virtually
> certain that there are many points that must be fixed.

Thanks!
I think one thing that could be done is to merge the two modules
into one, and to work a little on function/type naming and better exception
handling. Once that is done, it should be ready to be added to ExtLib.
Since I'm not familiar with Unicode/UTF-8 encoding, I can't comment on the
interface design or on the implementation. People who have already used such
features (in OCaml or in other languages) are welcome to comment on this proposal.

Nicolas Cannasse

Re: [Ocaml-lib-devel] A proposal for Unicode character and UTF-8 modules.

From: Remi V. <van...@la...> - 2003-06-19 08:52:16

"Nicolas Cannasse" <war...@fr...> writes:

>> Here are proposals for Unicode character and UTF-8 modules.  They were
>> part of my camomile library, and passed the random tests.
>>
>> I do not know the coding conventions of ExtLib, so I am virtually
>> certain that there are many points that must be fixed.
>
> Thanks!
> I think one thing that could be done is to merge the two modules
> into one, and to work a little on function/type naming and better exception
> handling. Once that is done, it should be ready to be added to ExtLib.
> Since I'm not familiar with Unicode/UTF-8 encoding, I can't comment on the
> interface design or on the implementation. People who have already used such
> features (in OCaml or in other languages) are welcome to comment on this
> proposal.

By the way, lablgtk2 and pcre-ocaml already have some UTF-8 handling
code, so maybe one can begin by looking there.
-- 
Rémi Vanicat
va...@la...
http://dept-info.labri.u-bordeaux.fr/~vanicat

Re: [Ocaml-lib-devel] A proposal for Unicode character and UTF-8 modules.

From: John M. S. <sk...@oz...> - 2003-06-19 13:07:56

Remi Vanicat wrote:

> "Nicolas Cannasse" <war...@fr...> writes:
> 
> 
>>>Here are proposals for Unicode character and UTF-8 modules.  They were
>>>part of my camomile library, and passed the random tests.

> By the way, lablgtk2 and pcre-ocaml already have some UTF-8 handling
> code, so maybe one can begin by looking there.

So too does Felix. The routines look good.

I'm going to check with my compiler to see
if I can do what I want. I need several concepts:

	bytestring -- string of bytes

	string -- string of text, ISO-10646/Unicode
		using UTF-8 encoding

	ustring -- string of UCS-4 (32 bit values)

In addition, I need to be able to convert literals.
These things also have various C-like escapes in them,
including

	\uXXXX and \UXXXXXXXX

escapes. My constant folder must also be able to
concatenate strings, etc. I don't actually need
the code to be fast, but I do need a way to enforce
the typing properly -- I don't want to mix up
the string and bytestring.
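
A minimal sketch of how such typing could be enforced in OCaml with
abstract types. The module names Bytestring and Text are hypothetical
(they are not part of the proposed modules), and no UTF-8 validation is
done here:

```ocaml
(* Hypothetical modules: both are strings underneath, but the signatures
   hide that fact, so the two kinds of value cannot be mixed up. *)
module Bytestring : sig
  type t
  val of_string : string -> t
  val to_string : t -> string
  val concat : t -> t -> t
end = struct
  type t = string
  let of_string s = s
  let to_string s = s
  let concat a b = a ^ b
end

module Text : sig
  type t                        (* text, assumed to be UTF-8 encoded *)
  val of_utf8 : string -> t     (* no validation in this sketch *)
  val to_utf8 : t -> string
  val concat : t -> t -> t
end = struct
  type t = string
  let of_utf8 s = s
  let to_utf8 s = s
  let concat a b = a ^ b
end

(* Concatenating two texts is fine. *)
let greeting = Text.concat (Text.of_utf8 "caf\xc3\xa9") (Text.of_utf8 "!")

(* By contrast, Text.concat greeting (Bytestring.of_string "x") would be
   rejected by the type checker. *)
```

The cost of this design is that every literal has to go through an
explicit conversion function.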

-- 
John Max Skaller, mailto:sk...@oz...
snail:10/1 Toxteth Rd, Glebe, NSW 2037, Australia.
voice:61-2-9660-0850

Re: [Ocaml-lib-devel] A proposal for Unicode character and UTF-8 modules.

From: Yamagata Y. <yor...@mb...> - 2003-06-19 18:45:47

Thank you for the comment.

From: "Nicolas Cannasse" <war...@fr...>
Subject: Re: [Ocaml-lib-devel] A proposal for Unicode character and UTF-8 modules.
Date: Thu, 19 Jun 2003 14:14:31 +0900

> I think one thing that could be done is to merge the two modules
> into one

I made UChar and UTF-8 two different modules because

  * Char and String are different modules, and I want to make the Unicode
    modules as similar as possible to the string API of stdlib, and

  * Users, or eventually we, may provide other Unicode string types (UTF-16,
    UCS-4, BOCU, or whatever).  Making UTF-8 an independent module
    would make changing the Unicode string implementation easier.

If nested modules are acceptable, we can pack them into one module, of
course.
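
A rough sketch of the idea, under stated assumptions: the signature
UNICODE_STRING, the functor Count and the packing module Unicode are
hypothetical names, and UChar.t is taken to be a plain int here, which
need not match the attached uChar.mli:

```ocaml
(* Character module, kept separate from any string representation. *)
module UChar = struct
  type t = int
  let of_int n = if n < 0 then invalid_arg "UChar.of_int" else n
  let to_int (u : t) = u
end

(* Hypothetical common interface that each string encoding could satisfy. *)
module type UNICODE_STRING = sig
  type t
  val length : t -> int            (* number of characters, not bytes *)
  val get : t -> int -> UChar.t
end

(* One possible implementation: UCS-4 as an int array. *)
module UCS4 : UNICODE_STRING with type t = int array = struct
  type t = int array
  let length = Array.length
  let get a i = UChar.of_int a.(i)
end

(* Client code written against the interface works with any encoding. *)
module Count (S : UNICODE_STRING) = struct
  let count_ascii s =
    let n = ref 0 in
    for i = 0 to S.length s - 1 do
      if UChar.to_int (S.get s i) < 0x80 then incr n
    done;
    !n
end

module CountUCS4 = Count (UCS4)

(* Packing the independent modules into one, as nested modules. *)
module Unicode = struct
  module UChar = UChar
  module UCS4 = UCS4
end
```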

> work a little on function/type naming and better exception
> handling.

Is it better to define a module-specific exception than to use invalid_arg?
Currently, the UTF-8 module doesn't do bounds checking itself, for
performance reasons.  Is it better to do bounds checking in UTF-8 and
raise its own exception rather than Invalid_argument?

--
Yamagata Yoriyuki

Re: [Ocaml-lib-devel] A proposal for Unicode character and UTF-8 modules.

From: Nicolas C. <war...@fr...> - 2003-06-20 00:07:06

> > I think one thing that could be done is to merge the two modules
> > into one
>
> I made UChar and UTF-8 two different modules because
>
>   * Char and String are different modules, and I want to make the Unicode
>     modules as similar as possible to the string API of stdlib, and
>
>   * Users, or eventually we, may provide other Unicode string types (UTF-16,
>     UCS-4, BOCU, or whatever).  Making UTF-8 an independent module
>     would make changing the Unicode string implementation easier.
>
> If nested modules are acceptable, we can pack them into one module, of
> course.

No it's ok this way :-)

> > work a little on function/type naming and better exception
> > handling.
>
> Is it better to define a module-specific exception than to use invalid_arg?
> Currently, the UTF-8 module doesn't do bounds checking itself, for
> performance reasons.  Is it better to do bounds checking in UTF-8 and
> raise its own exception rather than Invalid_argument?

It depends.
The ExtLib exception policy is the following:
- if the exception is meant to be caught, use a module-specific exception
- if the exception signals a programmer error, you can use invalid_arg or
  failwith
These rules come from the fact that invalid_arg and failwith are
type-unsafe: you need to match on the right string in order to
catch them correctly. Using module-specific exceptions, even
with a string parameter, reduces the number of bad catches.

In the case of parsing (XML, Unicode, etc.) the data often comes from an
external source, and the programmer will most of the time want to handle
parsing failures. These should then be considered "exceptions worth
catching" and declared as module-specific exceptions.
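
As a rough illustration of this policy (the names Malformed_code,
check_lead and safe_lead are hypothetical, not taken from the attached
modules), a bad index is treated as a programmer error while bad input
data raises a module-specific exception:

```ocaml
(* Hypothetical module-specific exception: carries the byte offset of the
   offending byte, so it can be caught without matching on a string. *)
exception Malformed_code of int

(* Return the byte at s.[i], rejecting bytes that cannot start a sequence. *)
let check_lead s i =
  if i < 0 || i >= String.length s then
    invalid_arg "check_lead"          (* programmer error: bad index *)
  else
    let n = Char.code s.[i] in
    if (0x80 <= n && n <= 0xc1) || n >= 0xfe then
      raise (Malformed_code i)        (* bad external data: worth catching *)
    else n

(* A caller handles only the data error; Invalid_argument is left to
   propagate, since it indicates a bug in the calling code. *)
let safe_lead s i =
  try Some (check_lead s i) with Malformed_code _ -> None
```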

Nicolas Cannasse

Re: [Ocaml-lib-devel] A proposal for Unicode character and UTF-8 modules.

From: John M. S. <sk...@oz...> - 2003-06-19 12:51:42

Yamagata Yoriyuki wrote:

> Here are proposals for Unicode character and UTF-8 modules.  They were
> part of my camomile library, and passed the random tests.


Good start here.

> let look s i =
>   let n' =
>     let n = Char.code s.[i] in
>     if n < 0x80 then n else
>     if 0xc2 <= n && n <= 0xdf then
>       look_code s (i + 1) 1 (n - 0xc0)
>     else if 0xe0 <= n && n <= 0xef then
>       look_code s (i + 1) 2 (n - 0xe0)
>     else if 0xf0 <= n && n <= 0xf7 then
>       look_code s (i + 1) 3 (n - 0xf0)
>     else if 0xf8 <= n && n <= 0xfb then
>       look_code s (i + 1) 4 (n - 0xf8)
>     else if 0xfc <= n && n <= 0xfd then
>       look_code s (i + 1) 5 (n - 0xfc)
>     else invalid_arg "UTF8"
>   in
>   uchar_of_int n'


This is inefficient. You want:

	if n <= 0x7F then n else
	if n <= 0xc1 then invalid_arg "UTF8" else
	if n <= 0xdf then ...

That is, you can assume the previous range test
failed, so just check the upper endpoint of each range
in ascending order.
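
Spelled out in full, the suggestion might look like the sketch below.
Two assumptions: look_code is a stand-in written for this example (the
original helper is not shown in the thread), and the result is returned
as a plain int rather than passed through uchar_of_int:

```ocaml
(* Hypothetical stand-in for look_code: fold [count] continuation bytes,
   starting at s.[i], into the accumulator [acc]. *)
let rec look_code s i count acc =
  if count = 0 then acc
  else
    let n = Char.code s.[i] in
    if n < 0x80 || n > 0xbf then invalid_arg "UTF8"
    else look_code s (i + 1) (count - 1) ((acc lsl 6) lor (n - 0x80))

(* Each test relies on all earlier tests having failed, so a single
   comparison against the upper endpoint of each range is enough. *)
let look s i =
  let n = Char.code s.[i] in
  if n <= 0x7f then n
  else if n <= 0xc1 then invalid_arg "UTF8"
  else if n <= 0xdf then look_code s (i + 1) 1 (n - 0xc0)
  else if n <= 0xef then look_code s (i + 1) 2 (n - 0xe0)
  else if n <= 0xf7 then look_code s (i + 1) 3 (n - 0xf0)
  else if n <= 0xfb then look_code s (i + 1) 4 (n - 0xf8)
  else if n <= 0xfd then look_code s (i + 1) 5 (n - 0xfc)
  else invalid_arg "UTF8"
```

As in the original, there is no explicit bounds check: a truncated
sequence fails inside s.[i] with Invalid_argument.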
-- 
John Max Skaller, mailto:sk...@oz...
snail:10/1 Toxteth Rd, Glebe, NSW 2037, Australia.
voice:61-2-9660-0850

Re: [Ocaml-lib-devel] A proposal for Unicode character and UTF-8 modules.

From: Yamagata Y. <yor...@mb...> - 2003-06-19 18:45:47

From: John Max Skaller <sk...@oz...>
Subject: Re: [Ocaml-lib-devel] A proposal for Unicode character and UTF-8 modules.
Date: Thu, 19 Jun 2003 22:51:32 +1000

> > let look s i =
> >   let n' =
> >     let n = Char.code s.[i] in
> >     if n < 0x80 then n else
> >     if 0xc2 <= n && n <= 0xdf then
> >       look_code s (i + 1) 1 (n - 0xc0)
> >     else if 0xe0 <= n && n <= 0xef then
> >       look_code s (i + 1) 2 (n - 0xe0)
> >     else if 0xf0 <= n && n <= 0xf7 then
> >       look_code s (i + 1) 3 (n - 0xf0)
> >     else if 0xf8 <= n && n <= 0xfb then
> >       look_code s (i + 1) 4 (n - 0xf8)
> >     else if 0xfc <= n && n <= 0xfd then
> >       look_code s (i + 1) 5 (n - 0xfc)
> >     else invalid_arg "UTF8"
> >   in
> >   uchar_of_int n'
> 
> 
> This is inefficient. You want:
> 
> 	if n <= 0x7F then n else
> 	if n <= 0xc1 then invalid_arg "UTF8" else
> 	if n <= 0xdf then ...

It was intentional, mostly for documentation.  But you may be right.
Also, it would be better to unroll look_code for performance.
The performance of other functions can be improved, too, I think.  (I
am gradually recalling the thinking behind the code, which is about a year
old.)

From: John Max Skaller <sk...@oz...>
Subject: Re: [Ocaml-lib-devel] A proposal for Unicode character and UTF-8 modules.
Date: Thu, 19 Jun 2003 23:07:42 +1000

> 	ustring -- string of UCS-4 (32 bit values)

I have a UCS-4 implementation based on OCaml int arrays, too.  I can contribute
it.  As for int32 vs int, is it acceptable to use in ExtLib?  UTF-16
needs a 16-bit int array, and implementing UCS-4 with a 32-bit int
array would have an advantage for the C FFI.

From: John Max Skaller <sk...@oz...>
Subject: Re: [Ocaml-lib-devel] A proposal for Unicode character and UTF-8 modules.
Date: Thu, 19 Jun 2003 23:07:42 +1000

> In addition, I need to be able to convert literals.
> These things also have various C-like escapes in them,
> including
> 
> 	\uXXXX and \UXXXXXXXX
> 
> escapes. My constant folder must also be able to
> concatenate strings, etc.

Is this style of escaping used in the ISO standard?  I think the Unicode
book (Ver 3.2) uses \uXXXX and \vXXXXXXXX.  If yours is the ISO standard,
then I would use yours.  Concatenation is easy, because the UTF-8 type is
currently just an OCaml string.  But I could add more API.  Maybe all the
functions that appear in String except the ones depending on locale
(casing, I mean)?
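
A minimal sketch of expanding \uXXXX and \UXXXXXXXX escapes into UTF-8.
The function names add_utf8 and expand_escapes are hypothetical, the
four-byte limit (code points up to 0x10ffff) is an assumption of this
sketch, and validation is minimal (surrogates are not rejected):

```ocaml
(* Append the UTF-8 encoding of code point [u] to [buf]. *)
let add_utf8 buf u =
  let add n = Buffer.add_char buf (Char.chr n) in
  if u < 0 || u > 0x10ffff then invalid_arg "add_utf8"
  else if u <= 0x7f then add u
  else if u <= 0x7ff then begin
    add (0xc0 lor (u lsr 6));
    add (0x80 lor (u land 0x3f))
  end else if u <= 0xffff then begin
    add (0xe0 lor (u lsr 12));
    add (0x80 lor ((u lsr 6) land 0x3f));
    add (0x80 lor (u land 0x3f))
  end else begin
    add (0xf0 lor (u lsr 18));
    add (0x80 lor ((u lsr 12) land 0x3f));
    add (0x80 lor ((u lsr 6) land 0x3f));
    add (0x80 lor (u land 0x3f))
  end

(* Expand \uXXXX (4 hex digits) and \UXXXXXXXX (8 hex digits) escapes in
   [src]; every other byte is copied unchanged. *)
let expand_escapes src =
  let len = String.length src in
  let buf = Buffer.create len in
  let rec go i =
    if i >= len then ()
    else if src.[i] = '\\' && i + 1 < len
            && (src.[i + 1] = 'u' || src.[i + 1] = 'U') then begin
      let digits = if src.[i + 1] = 'u' then 4 else 8 in
      if i + 2 + digits > len then invalid_arg "expand_escapes";
      (* int_of_string raises Failure on non-hex digits. *)
      add_utf8 buf (int_of_string ("0x" ^ String.sub src (i + 2) digits));
      go (i + 2 + digits)
    end else begin
      Buffer.add_char buf src.[i];
      go (i + 1)
    end
  in
  go 0;
  Buffer.contents buf
```

Since the result is an ordinary string, concatenating two expanded
literals is just the usual ^ operator.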

> I don't actually need
> the code to be fast, but I do need a way to enforce
> the typing properly -- I don't want to mix up
> the string and bytestring.

Ideally, I would agree with you.  But in current OCaml, a string literal is
a bytestring, and pattern matching doesn't work on an abstract type.  So
I think it is better that we retain the equality UTF8.t = string at
this stage.  Anyway, I think equivalence to ASCII strings in the
case of ASCII characters is the only advantage of UTF-8.  If we can
be satisfied with an abstract Unicode string type, then it would be better
to use UTF-16 or UCS-4.
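
A small sketch of what keeping UTF8.t = string buys. The module body
here is reduced to the type alone, and the function classify is a
hypothetical example:

```ocaml
module UTF8 = struct
  type t = string     (* the equality is visible to client code *)
end

(* A plain string literal is accepted where a UTF8.t is expected. *)
let word : UTF8.t = "caf\xc3\xa9"

(* Pattern matching against literals works, which it would not if
   UTF8.t were abstract. *)
let classify (s : UTF8.t) =
  match s with
  | "" -> "empty"
  | "caf\xc3\xa9" -> "coffee"
  | _ -> "other"
```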

--
Yamagata Yoriyuki