From: Yamagata Y. <yor...@mb...> - 2003-06-19 18:45:47
From: John Max Skaller <sk...@oz...>
Subject: Re: [Ocaml-lib-devel] A proposal for Unicode character and UTF-8 modules.
Date: Thu, 19 Jun 2003 22:51:32 +1000

> > let look s i =
> >   let n' =
> >     let n = Char.code s.[i] in
> >     if n < 0x80 then n else
> >     if 0xc2 <= n && n <= 0xdf then
> >       look_code s (i + 1) 1 (n - 0xc0)
> >     else if 0xe0 <= n && n <= 0xef then
> >       look_code s (i + 1) 2 (n - 0xe0)
> >     else if 0xf0 <= n && n <= 0xf7 then
> >       look_code s (i + 1) 3 (n - 0xf0)
> >     else if 0xf8 <= n && n <= 0xfb then
> >       look_code s (i + 1) 4 (n - 0xf8)
> >     else if 0xfc <= n && n <= 0xfd then
> >       look_code s (i + 1) 5 (n - 0xfc)
> >     else invalid_arg "UTF8"
> >   in
> >   uchar_of_int n'
>
> This is inefficient. you want:
>
> if n <= 0x7F then n else
> if n <= 0xc1 then invalid_arg "UTF8" else
> if n <= 0xdf then ...

It was intentional, mostly for documentation. But you may be right.
Also, it would be better to unroll look_code for performance. The
performance of other functions can be improved, too, I think. (I am
gradually recalling the thinking behind the code, which is about one
year old.)

From: John Max Skaller <sk...@oz...>
Subject: Re: [Ocaml-lib-devel] A proposal for Unicode character and UTF-8 modules.
Date: Thu, 19 Jun 2003 23:07:42 +1000

> ustring -- string of UCS-4 (32 bit values)

I have a UCS-4 implementation based on an ocaml int array, too. I can
contribute it. As for int32 vs int, is it acceptable to use in Extlib?
UTF-16 needs a 16-bit int array, and an implementation of UCS-4 by a
32-bit int array would have an advantage for the C FFI.

From: John Max Skaller <sk...@oz...>
Subject: Re: [Ocaml-lib-devel] A proposal for Unicode character and UTF-8 modules.
Date: Thu, 19 Jun 2003 23:07:42 +1000

> In addition, I need to be able to convert literals.
> These things also have various C like escapes in them,
> including
>
> \uXXXX and \UXXXXXXXX
>
> escapes. My constant folder must also be able to
> concatenate strings, etc.

Is this style of escaping used in the ISO standard?
I think the Unicode book (Ver 3.2) uses \uXXXX and \vXXXXXXXX. If yours
is the ISO standard, then I would use yours.

Concatenation is easy, because the UTF-8 type is currently just an
ocaml string. But I could add more API. Maybe all the functions that
appear in String, except the ones that depend on locale (casing, I
mean)?

> I don't actually need
> the code to be fast, but I do need a way to enforce
> the typing properly -- I don't want to mix up
> the string and bytestring.

Ideally, I would agree with you. But in current ocaml, a string literal
is a bytestring, and pattern matching doesn't work for an abstract
type. So I think it is better that we retain the equality
UTF8.t = string at this stage.

Anyway, I think the equivalence to an ASCII string in the case of ASCII
characters is the only advantage of UTF-8. If we can be satisfied with
an abstract unicode string, then it would be better to use UTF-16 or
UCS-4.

--
Yamagata Yoriyuki
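
For reference, the reordering suggested in the first quoted message
(one comparison per branch instead of a pair of range tests) might look
roughly like the sketch below. This is only an illustration. The
look_code helper here is a hypothetical stand-in, since the original is
not shown in this thread, and the function returns a plain int rather
than calling uchar_of_int. It keeps the original ranges (lead bytes up
to 0xfd) and does not reject overlong forms or surrogates.

```ocaml
(* Hypothetical helper, NOT the original look_code: fold [n]
   continuation bytes starting at s.[i] into the accumulator [c]. *)
let rec look_code s i n c =
  if n = 0 then c
  else
    let b = Char.code s.[i] in
    (* A continuation byte must be in the range 0x80..0xbf. *)
    if b < 0x80 || b > 0xbf then invalid_arg "UTF8"
    else look_code s (i + 1) (n - 1) ((c lsl 6) lor (b land 0x3f))

(* Lead-byte dispatch: each branch only tests the upper bound,
   because the lower bound is implied by the branches before it. *)
let look s i =
  let n = Char.code s.[i] in
  if n <= 0x7f then n
  else if n <= 0xc1 then invalid_arg "UTF8"
  else if n <= 0xdf then look_code s (i + 1) 1 (n - 0xc0)
  else if n <= 0xef then look_code s (i + 1) 2 (n - 0xe0)
  else if n <= 0xf7 then look_code s (i + 1) 3 (n - 0xf0)
  else if n <= 0xfb then look_code s (i + 1) 4 (n - 0xf8)
  else if n <= 0xfd then look_code s (i + 1) 5 (n - 0xfc)
  else invalid_arg "UTF8"
```

With this sketch, look "\xc3\xa9" 0 evaluates to 0xe9, and a stray
continuation byte such as "\x80" in lead position raises
Invalid_argument "UTF8".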