From: Yamagata Y. <yor...@mb...> - 2003-06-17 15:13:55
|
Here are proposals for Unicode character and UTF-8 modules. They were part of my camomile library, and passed the random tests.

I do not know the coding convention of Extlib. So, I am virtually certain that there are many points that must be fixed.

-- Yamagata Yoriyuki |
From: Nicolas C. <war...@fr...> - 2003-06-19 05:16:07
|
> Here are proposals for Unicode character and UTF-8 modules. They were
> part of my camomile library, and passed the random tests.
>
> I do not know the coding convention of Extlib. So, I am virtually
> certain that there are many points that must be fixed.

Thanks!

I think that one of the things that could be done is to merge the two modules into one, and work a little on function/type naming and better exception handling. Once done, it should be ready to be added to ExtLib.

Since I'm not familiar with Unicode/UTF-8 encoding, I can't comment on the interface design or on the implementation. People who have already used such features (in OCaml or in other languages) are welcome to comment on this proposal.

Nicolas Cannasse |
From: Remi V. <van...@la...> - 2003-06-19 08:52:16
|
"Nicolas Cannasse" <war...@fr...> writes:

>> Here are proposals for Unicode character and UTF-8 modules. They were
>> part of my camomile library, and passed the random tests.
>>
>> I do not know the coding convention of Extlib. So, I am virtually
>> certain that there are many points that must be fixed.
>
> Thanks!
> I think that one of the things that could be done is to merge the two modules
> into one, and work a little on function/type naming and better exception
> handling. Once done, it should be ready to be added to ExtLib.
> Since I'm not familiar with Unicode/UTF-8 encoding, I can't comment on the
> interface design or on the implementation. People who have already used such
> features (in OCaml or in other languages) are welcome to comment on this
> proposal.

By the way, lablgtk2 and pcre-ocaml already have some UTF-8 handling code, so maybe one can begin by looking there.

-- Rémi Vanicat va...@la... http://dept-info.labri.u-bordeaux.fr/~vanicat |
From: John M. S. <sk...@oz...> - 2003-06-19 13:07:56
|
Remi Vanicat wrote:

> "Nicolas Cannasse" <war...@fr...> writes:
>
>>> Here are proposals for Unicode character and UTF-8 modules. They were
>>> part of my camomile library, and passed the random tests.

> by the way, lablgtk2 and pcre-ocaml already have some UTF-8 code
> handling, so may be one can begin by looking there.

So too does Felix.

The routines look good. I'm going to check with my compiler to see if I can do what I want. I need several concepts:

bytestring -- string of bytes
string -- string of text, ISO-10646/Unicode using UTF-8 encoding
ustring -- string of UCS-4 (32 bit values)

In addition, I need to be able to convert literals. These things also have various C like escapes in them, including

\uXXXX and \UXXXXXXXX

escapes. My constant folder must also be able to concatenate strings, etc.

I don't actually need the code to be fast, but I do need a way to enforce the typing properly -- I don't want to mix up the string and bytestring.

-- John Max Skaller, mailto:sk...@oz... snail:10/1 Toxteth Rd, Glebe, NSW 2037, Australia. voice:61-2-9660-0850 |
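One way to get the separation asked for above is to seal each representation behind an abstract type, so the type checker refuses to mix them. The following is only a sketch under assumptions: the module names `Bytes_str` and `Text` and all of their functions are hypothetical and are not part of ExtLib or camomile.

```ocaml
(* Hypothetical modules, for illustration only. *)

(* Raw bytes: abstract, so a Bytes_str.t cannot be passed where text is expected. *)
module Bytes_str : sig
  type t
  val of_string : string -> t
  val to_string : t -> string
  val length : t -> int
end = struct
  type t = string
  let of_string s = s
  let to_string s = s
  let length = String.length
end

(* UTF-8 text: also abstract, with its own constructors. *)
module Text : sig
  type t
  val of_ascii : string -> t          (* accepts 7-bit input only *)
  val concat : t -> t -> t
  val to_bytes : t -> Bytes_str.t     (* explicit, one-way conversion *)
end = struct
  type t = string
  let of_ascii s =
    String.iter
      (fun c -> if Char.code c >= 0x80 then invalid_arg "Text.of_ascii")
      s;
    s
  let concat a b = a ^ b
  let to_bytes s = Bytes_str.of_string s
end

let greeting = Text.concat (Text.of_ascii "hello, ") (Text.of_ascii "world")
let raw = Text.to_bytes greeting
(* Text.concat greeting raw   <- rejected by the type checker *)
```

The cost of this design is that string literals and pattern matching no longer apply directly to `Text.t`; every literal has to go through a constructor such as `of_ascii`.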
From: Yamagata Y. <yor...@mb...> - 2003-06-19 18:45:47
|
Thank you for the comment.

From: "Nicolas Cannasse" <war...@fr...>
Subject: Re: [Ocaml-lib-devel] A proposal for Unicode character and UTF-8 modules.
Date: Thu, 19 Jun 2003 14:14:31 +0900

> I think that one of the things that could be done is to merge the two modules
> into one

I made UChar and UTF-8 two different modules because

* Char and String are different modules, and I want to make the Unicode modules as similar as possible to the string API of stdlib, and

* Users, or eventually we, may provide other Unicode strings (UTF-16, UCS-4, BOCU, or whatever). Making UTF-8 an independent module would make changing the Unicode string implementation easier.

If nested modules are acceptable, we can pack them into one module, of course.

> work a little on function/type naming and better exception
> handling.

Is it better to define its own exception than to use invalid_arg? Currently, the UTF-8 module doesn't do bounds checking itself, for performance reasons. Is it better to do bounds checking in UTF-8, and raise its own exception rather than Invalid_argument?

-- Yamagata Yoriyuki |
From: Nicolas C. <war...@fr...> - 2003-06-20 00:07:06
|
> > I think that one of the things that could be done is to merge the two modules
> > into one
>
> I made UChar and UTF-8 two different modules because
>
> * Char and String are different modules, and I want to make the Unicode
> modules as similar as possible to the string API of stdlib, and
>
> * Users, or eventually we, may provide other Unicode strings (UTF-16,
> UCS-4, BOCU, or whatever). Making UTF-8 an independent module
> would make changing the Unicode string implementation easier.
>
> If nested modules are acceptable, we can pack them into one module, of
> course.

No, it's OK this way :-)

> > work a little on function/type naming and better exception
> > handling.
>
> Is it better to define its own exception than to use invalid_arg?
> Currently, the UTF-8 module doesn't do bounds checking itself, for
> performance reasons. Is it better to do bounds checking in UTF-8, and
> raise its own exception rather than Invalid_argument?

Depends. The ExtLib exception policy is the following:

- if the exception should be caught, then use a module-specific exception
- if the exception indicates a programmer failure, then you can use invalid_arg or failwith

These statements come from the fact that invalid_arg and failwith are type-unsafe, since you need to match on the right string in order to catch them correctly. Using module-specific exceptions, even with a string parameter, reduces the number of bad catches.

In the case of parsing (XML, Unicode, etc.), the data often comes from an external source, and the programmer will most of the time want to handle parsing failures; these should then be considered as "exceptions worth catching" and declared as module-specific exceptions.

Nicolas Cannasse |
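The policy described above can be illustrated with a small sketch. Everything here is an assumption made for illustration: the module name `Utf8_check`, the exception name `Malformed_code`, and both functions are hypothetical, and the lead-byte table stops at four-byte sequences as a simplification.

```ocaml
(* Hypothetical module illustrating the two kinds of failure. *)
module Utf8_check = struct
  (* Bad input data: callers are expected to catch this. *)
  exception Malformed_code

  (* Length of the sequence introduced by lead byte [c]. *)
  let first_byte_length c =
    let n = Char.code c in
    if n <= 0x7f then 1
    else if n <= 0xc1 then raise Malformed_code  (* continuation or overlong lead *)
    else if n <= 0xdf then 2
    else if n <= 0xef then 3
    else if n <= 0xf7 then 4
    else raise Malformed_code

  (* A bad index is a programmer failure, so invalid_arg is used. *)
  let length_at s i =
    if i < 0 || i >= String.length s then invalid_arg "Utf8_check.length_at";
    first_byte_length s.[i]
end

(* The handler names the exception itself; no string comparison is needed. *)
let describe s =
  try Printf.sprintf "sequence of %d byte(s)" (Utf8_check.length_at s 0)
  with Utf8_check.Malformed_code -> "malformed input"
```

With `Invalid_argument` or `Failure`, the handler would have to compare the message string to tell this failure apart from unrelated ones, which is the type-unsafety the policy is meant to avoid.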
From: John M. S. <sk...@oz...> - 2003-06-19 12:51:42
|
Yamagata Yoriyuki wrote:

> Here are proposals for Unicode character and UTF-8 modules. They were
> part of my camomile library, and passed the random tests.

Good start here.

> let look s i =
>   let n' =
>     let n = Char.code s.[i] in
>     if n < 0x80 then n else
>     if 0xc2 <= n && n <= 0xdf then
>       look_code s (i + 1) 1 (n - 0xc0)
>     else if 0xe0 <= n && n <= 0xef then
>       look_code s (i + 1) 2 (n - 0xe0)
>     else if 0xf0 <= n && n <= 0xf7 then
>       look_code s (i + 1) 3 (n - 0xf0)
>     else if 0xf8 <= n && n <= 0xfb then
>       look_code s (i + 1) 4 (n - 0xf8)
>     else if 0xfc <= n && n <= 0xfd then
>       look_code s (i + 1) 5 (n - 0xfc)
>     else invalid_arg "UTF8"
>   in
>   uchar_of_int n'

This is inefficient. You want:

  if n <= 0x7F then n else
  if n <= 0xc1 then invalid_arg "UTF8" else
  if n <= 0xdf then ...

i.e. you can assume the previous range test failed, so just check for endpoints of ranges in ascending order.

-- John Max Skaller, mailto:sk...@oz... snail:10/1 Toxteth Rd, Glebe, NSW 2037, Australia. voice:61-2-9660-0850 |
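Spelled out in full, the ascending-endpoint version might look like the sketch below. The body of `look_code` does not appear in the thread, so the helper here is an assumed reconstruction; the function also returns a plain `int` instead of calling `uchar_of_int`, and it does not reject overlong three-byte or longer forms.

```ocaml
(* Assumed helper: fold [count] continuation bytes (10xxxxxx), starting at
   index [i], into the accumulator [acc]. *)
let rec look_code s i count acc =
  if count = 0 then acc
  else
    let n = Char.code s.[i] in
    if n < 0x80 || n > 0xbf then invalid_arg "UTF8"
    else look_code s (i + 1) (count - 1) ((acc lsl 6) lor (n - 0x80))

(* Each test relies on the previous one having failed, so only the upper
   endpoint of each range is compared. *)
let look s i =
  let n = Char.code s.[i] in
  if n <= 0x7f then n
  else if n <= 0xc1 then invalid_arg "UTF8"
  else if n <= 0xdf then look_code s (i + 1) 1 (n - 0xc0)
  else if n <= 0xef then look_code s (i + 1) 2 (n - 0xe0)
  else if n <= 0xf7 then look_code s (i + 1) 3 (n - 0xf0)
  else if n <= 0xfb then look_code s (i + 1) 4 (n - 0xf8)
  else if n <= 0xfd then look_code s (i + 1) 5 (n - 0xfc)
  else invalid_arg "UTF8"
```

For example, `look "\xe2\x82\xac" 0` evaluates to `0x20ac` (the euro sign). A truncated sequence raises `Invalid_argument` from the bounds-checked `s.[i]`.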
From: Yamagata Y. <yor...@mb...> - 2003-06-19 18:45:47
|
From: John Max Skaller <sk...@oz...>
Subject: Re: [Ocaml-lib-devel] A proposal for Unicode character and UTF-8 modules.
Date: Thu, 19 Jun 2003 22:51:32 +1000

> > let look s i =
> >   let n' =
> >     let n = Char.code s.[i] in
> >     if n < 0x80 then n else
> >     if 0xc2 <= n && n <= 0xdf then
> >       look_code s (i + 1) 1 (n - 0xc0)
> >     else if 0xe0 <= n && n <= 0xef then
> >       look_code s (i + 1) 2 (n - 0xe0)
> >     else if 0xf0 <= n && n <= 0xf7 then
> >       look_code s (i + 1) 3 (n - 0xf0)
> >     else if 0xf8 <= n && n <= 0xfb then
> >       look_code s (i + 1) 4 (n - 0xf8)
> >     else if 0xfc <= n && n <= 0xfd then
> >       look_code s (i + 1) 5 (n - 0xfc)
> >     else invalid_arg "UTF8"
> >   in
> >   uchar_of_int n'
>
>
> This is inefficient. you want:
>
> if n <= 0x7F then n else
> if n <= 0xc1 then invalid_arg "UTF8" else
> if n <= 0xdf then ...

It was intentional, mostly for documentation. But you may be right. Also, it would be better to unroll look_code for performance. The performance of other functions can be improved too, I think. (I am gradually recalling the thinking behind the code, which is about one year old.)

From: John Max Skaller <sk...@oz...>
Subject: Re: [Ocaml-lib-devel] A proposal for Unicode character and UTF-8 modules.
Date: Thu, 19 Jun 2003 23:07:42 +1000

> ustring -- string of UCS-4 (32 bit values)

I have a UCS-4 implementation based on an OCaml int array, too. I can contribute it. As for int32 vs int, is it acceptable to use in Extlib? UTF-16 needs a 16-bit int array, and an implementation of UCS-4 by a 32-bit int array would have an advantage for the C FFI.

From: John Max Skaller <sk...@oz...>
Subject: Re: [Ocaml-lib-devel] A proposal for Unicode character and UTF-8 modules.
Date: Thu, 19 Jun 2003 23:07:42 +1000

> In addition, I need to be able to convert literals.
> These things also have various C like escapes in them,
> including
>
> \uXXXX and \UXXXXXXXX
>
> escapes. My constant folder must also be able to
> concatenate strings, etc.

Is this style of escaping used in the ISO standard? I think the Unicode book (Ver 3.2) uses \uXXXX and \vXXXXXXXX. If yours is the ISO standard, then I would use yours.

Concatenation is easy, because the UTF-8 type is currently just an OCaml string. But I could add more API. Maybe all the functions that appear in String, except the ones depending on locale (casing, I mean)?

> I don't actually need
> the code to be fast, but I do need a way to enforce
> the typing properly -- I don't want to mix up
> the string and bytestring.

Ideally, I would agree with you. But in current OCaml, a string literal is a bytestring, and pattern matching doesn't work for an abstract type. So, I think it is better that we retain the equality UTF8.t = string at this stage. Anyway, I think the equivalence to ASCII strings in the case of ASCII characters is the only advantage of UTF-8. If we can be satisfied with an abstract Unicode string, then it would be better to use UTF-16 or UCS-4.

-- Yamagata Yoriyuki |
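As a rough illustration of the literal conversion discussed above, the sketch below expands `\uXXXX` and `\UXXXXXXXX` escapes into UTF-8 bytes and relies on plain string concatenation for the result. The names `add_utf8` and `expand_escapes` are hypothetical, the code-point range is assumed to end at 0x10FFFF, and surrogates and malformed hex digits are not handled specially (bad digits surface as `Failure` from `int_of_string`).

```ocaml
(* Append the UTF-8 encoding of code point [u] to [buf]. *)
let add_utf8 buf u =
  let add n = Buffer.add_char buf (Char.chr n) in
  if u < 0 || u > 0x10ffff then invalid_arg "add_utf8"
  else if u <= 0x7f then add u
  else if u <= 0x7ff then begin
    add (0xc0 lor (u lsr 6));
    add (0x80 lor (u land 0x3f))
  end else if u <= 0xffff then begin
    add (0xe0 lor (u lsr 12));
    add (0x80 lor ((u lsr 6) land 0x3f));
    add (0x80 lor (u land 0x3f))
  end else begin
    add (0xf0 lor (u lsr 18));
    add (0x80 lor ((u lsr 12) land 0x3f));
    add (0x80 lor ((u lsr 6) land 0x3f));
    add (0x80 lor (u land 0x3f))
  end

(* Expand \uXXXX (4 hex digits) and \UXXXXXXXX (8 hex digits) in [s];
   every other byte is copied unchanged. *)
let expand_escapes s =
  let len = String.length s in
  let buf = Buffer.create len in
  let rec go i =
    if i >= len then ()
    else if s.[i] = '\\' && i + 1 < len
            && (s.[i + 1] = 'u' || s.[i + 1] = 'U') then begin
      let digits = if s.[i + 1] = 'u' then 4 else 8 in
      if i + 2 + digits > len then invalid_arg "expand_escapes";
      add_utf8 buf (int_of_string ("0x" ^ String.sub s (i + 2) digits));
      go (i + 2 + digits)
    end else begin
      Buffer.add_char buf s.[i];
      go (i + 1)
    end
  in
  go 0;
  Buffer.contents buf

(* Because the result is an ordinary string, folding constants is just (^). *)
let folded = expand_escapes "caf\\u00e9" ^ expand_escapes " \\U0001F600"
```

Keeping the result as a plain `string` is what makes the concatenation trivial here; with an abstract text type the same fold would need a dedicated `concat` function.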