From: Gerd S. <in...@ge...> - 2001-11-18 22:01:59
On 2001.11.16 21:04 YAMAGATA yoriyuki wrote:

> First, I'd say that I don't propose that ocamlnet, or any other
> protocol implementation, has to be based on UCS-4. In my
> understanding, most protocols are designed for byte streams, not
> Unicode streams. The data types in camomile are for higher-level
> manipulation. (If posting to ocamlnet causes confusion, I apologize
> for that.)

Okay, but we need some basic data type for the interface between the
protocol layer and the higher layers.

> From: Gerd Stolpmann <in...@ge...>
> Subject: Re: [Ocamlnet-devel] an experimental implementation of Unicode support.
> Date: Fri, 16 Nov 2001 01:48:12 +0100
>
> > I must admit that I do not have very much experience with multi-byte
> > encodings,
>
> Me neither. My experience is almost entirely that of a user, not an
> implementer. Well, I feel my previous comments were a bit too strong.
> But I still don't agree that UTF-8 is the way.
>
> There are two different issues involved, I think. First, I'd like to
> object to the idea that we keep using the string type for Unicode,
> implicitly assuming that it is encoded as UTF-8. Whether simple or
> hard, in this case programmers have to be well aware of the fact that
> they are manipulating UTF-8 encoded strings. It is much safer to
> provide a new abstract data type for Unicode strings, and whenever one
> wants to work with UTF-8, one has to explicitly encode the given
> Unicode strings as UTF-8 strings.
>
> The second issue is: why not use UTF-8 as the internal representation
> of Unicode strings? This is because:
>
> 1) Memory management: We cannot predict how much space is needed from
>    the number of Unicode characters.

Is this really a problem?

> 2) In-place update, indexing: Both become inefficient, while they are
>    widely used in current string manipulation in OCaml.

I cannot definitely say how often in-place updates are really used, but
my impression is that they are relatively seldom (and very low-level).
Most code modifying strings makes copies.
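To illustrate the memory-management point with a sketch (my own
illustration, not code from camomile or ocamlnet): the number of bytes
a code point occupies in UTF-8 depends on its value, so the byte
length of a string cannot be derived from its character count. The
helper below assumes the 4-byte upper bound of modern UTF-8:

```ocaml
(* Hypothetical helper: bytes needed to encode one code point in
   UTF-8, assuming the modern 4-byte maximum. *)
let utf8_width (code : int) : int =
  if code < 0x80 then 1          (* ASCII *)
  else if code < 0x800 then 2    (* e.g. accented Latin letters *)
  else if code < 0x10000 then 3  (* e.g. most CJK characters *)
  else 4                         (* supplementary planes *)

let () =
  (* Four one-character strings, four different byte lengths. *)
  List.iter
    (fun c -> Printf.printf "U+%04X -> %d byte(s)\n" c (utf8_width c))
    [ 0x41; 0xE9; 0x3042; 0x1D11E ]
```

So allocating a UTF-8 buffer for n characters means either scanning the
input first or over-allocating; with UCS-4 the size is simply 4*n.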
> 3) Presence of combined characters: To properly handle combined
>    characters, manipulation (regex, sorting, etc.) of Unicode strings
>    becomes hard, regardless of which representation we use. (See
>    below for more discussion.)

This is definitely a problem, and I hope it only concerns the higher
layers and not ocamlnet.

> Anyway, camomile is just an experiment. We can change the internal
> representation in the future.
>
> As for Gerd's analysis, I must admit that I don't know much about
> UTF-8 handling, especially regex. (UTF-8, or Unicode itself, is
> currently not widely used in Japan.) In fact, I don't have a Unicode
> book, though I referred to the ISO standard while implementing
> camomile. But here are my comments about it.
>
> > - Simple: sort strings alphabetically by code points ("C" locale)
>
> It's simple for UCS-4, too. And if we take composed characters into
> account, things become hard anyway, since they have several
> representations as code points.
>
> > - Simple: Regular expressions that do not contain character ranges
> >   (i.e. do not use [c1-c2]). In this case existing regexp engines
> >   work.
>
> .... Really? Correct me if I misunderstand. I don't know much about
> the practice of Unicode regex.
>
> Take the pattern "." matching every single Unicode character. It
> becomes "[\0x00-\0x7f]|\([\0xc0-\0xdf].\)| ....". If we take combined
> characters into account, things become more complicated.
> Theoretically such a translation is possible (I think), but I myself
> have no confidence that I could implement it without bugs. Now we
> have Vouillon's RE, so I have some hope of making an OCaml-native
> Unicode regex engine.

"." is in most cases simply "." because you have delimiters. For
example, if you want to extract everything between < and >, the regexp
"<.*>" still works. Fortunately, this is true for most regexps using
".". However, if the number of characters counts, the more accurate
translation "[\0x00-\0x7f]|\([\0xc0-\0xdf].\)| ...." must be used.
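The byte-class dispatch behind that translation can be sketched
directly in OCaml (my own illustration, not code from camomile or RE):
each branch below corresponds to one alternative of
"[\x00-\x7f]|[\xc0-\xdf].|[\xe0-\xef]..|...", and character counting
becomes "repeatedly match one character":

```ocaml
(* How many bytes the UTF-8 character starting at byte i occupies,
   decided by the lead byte alone. *)
let utf8_char_bytes (s : string) (i : int) : int =
  match Char.code s.[i] with
  | b when b < 0x80 -> 1   (* [\x00-\x7f]              *)
  | b when b < 0xC0 -> invalid_arg "utf8_char_bytes: continuation byte"
  | b when b < 0xE0 -> 2   (* [\xc0-\xdf] plus 1 byte  *)
  | b when b < 0xF0 -> 3   (* [\xe0-\xef] plus 2 bytes *)
  | _ -> 4                 (* [\xf0-\xf7] plus 3 bytes *)

(* Character count = number of times "one character" matches. *)
let utf8_length (s : string) : int =
  let rec go i n =
    if i >= String.length s then n
    else go (i + utf8_char_bytes s i) (n + 1)
  in
  go 0 0
```

For example, utf8_length "abc" is 3, while a single three-byte
Japanese character such as "\xE3\x81\x82" also counts as 1. This is
exactly why "...." (four bytes) and "four characters" diverge in
UTF-8.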
For instance, if you want to match exactly four characters, "...." no
longer works, for obvious reasons.

My experience is: Simple regular expressions work in UTF-8 without
causing headaches. Complicated regular expressions are either
difficult to formulate, or they have bad runtime behaviour.

What about the idea of having a basic ustring type that supports both
encodings? This could be modeled with phantom types like in the
Bigarray module:

type utf8
type ucs4
type 'a ustring    (* 'a is either utf8 or ucs4 *)

This makes it possible to reflect the representation in the type if
needed, or to omit these details if they do not matter.

val utf8_of_string : ?len:int -> string -> utf8 ustring
val ucs4_of_string : string -> ucs4 ustring
val force_utf8 : 'a ustring -> utf8 ustring
val force_ucs4 : 'a ustring -> ucs4 ustring
val string_of_utf8 : utf8 ustring -> string
val string_of_ucs4 : ucs4 ustring -> string
val string_of_any : 'a ustring -> [ `UTF8 of string | `UCS4 of string ]

These functions allow arbitrary conversions. Furthermore, it is
possible to access the representation directly (i.e. the underlying
buffers), which is necessary for I/O and for adding missing low-level
operators outside the core module.

val length : 'a ustring -> int       (* length in characters *)
val byte_length : 'a ustring -> int  (* length of the representation *)

val create_cursor : 'a ustring -> 'a ucursor
val incr_position : 'a ucursor -> unit
val decr_position : 'a ucursor -> unit
val get : 'a ucursor -> uchar
val set : 'a ucursor -> uchar -> unit         (* slow for UTF-8 *)
val get_position : 'a ucursor -> int
val set_position : 'a ucursor -> int -> unit  (* slow for UTF-8 *)
val byte_position : 'a ucursor -> int

The idea of cursors is to have a method that allows us to refer to
individual characters without using character positions.
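To show how the phantom-type trick could work, here is a minimal,
self-contained sketch of a few of the proposed functions. The concrete
representations (a plain string for UTF-8, an int array for UCS-4) and
the constructor names Utf8_rep/Ucs4_rep are my own assumptions, not
part of the proposal:

```ocaml
module Ustring : sig
  type utf8
  type ucs4
  type 'a ustring   (* 'a is a phantom parameter: utf8 or ucs4 *)
  val utf8_of_string : string -> utf8 ustring
  val ucs4_of_array : int array -> ucs4 ustring
  val string_of_utf8 : utf8 ustring -> string
  val byte_length : 'a ustring -> int
end = struct
  type utf8
  type ucs4
  (* One concrete type behind the phantom parameter; the signature
     above restricts which constructor each phantom index can hold. *)
  type 'a ustring =
    | Utf8_rep of string
    | Ucs4_rep of int array
  let utf8_of_string s = Utf8_rep s     (* no UTF-8 validation here *)
  let ucs4_of_array a = Ucs4_rep a
  let string_of_utf8 = function
    | Utf8_rep s -> s
    | Ucs4_rep _ -> assert false  (* unreachable: type forbids it *)
  let byte_length = function
    | Utf8_rep s -> String.length s
    | Ucs4_rep a -> 4 * Array.length a
end
```

With this ascription, string_of_utf8 (ucs4_of_array [|65|]) is rejected
by the type checker, while byte_length works on both representations,
just as intended by the "omit the details if they do not matter" style.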
Of course, there should also be an iterator:

val iter : ?from:'a ucursor -> ?upto:'a ucursor ->
           ('a ucursor -> unit) -> 'a ustring -> unit

val make_ucs4 : int -> uchar -> ucs4 ustring
val make_utf8 : int -> uchar -> utf8 ustring
val make_like : 'a ustring -> int -> uchar -> 'a ustring

For constructors, it is necessary to select a representation.
make_like creates a string with the same representation as an already
existing string. I wouldn't add something like String.create (without
initialization) because this might result in invalid strings (for both
representations).

val sub : 'a ustring -> 'a ucursor -> 'a ucursor -> 'a ustring

This returns everything between the two cursors. Similar interfaces
that use cursors instead of integer positions are possible for all of
the other string functions (of course, the UTF-8 representation is
sometimes slower).

This design would have the advantage that one can select one of three
styles and mix them in the same program:

- use only UTF-8 and profit from existing libraries
- use only UCS-4 and get faster code
- do not specify the representation, and get better interoperability

What do you think about this?

Gerd

--
----------------------------------------------------------------------------
Gerd Stolpmann      Telefon: +49 6151 997705 (privat)
Viktoriastr. 45
64293 Darmstadt     EMail:   ge...@ge...
Germany
----------------------------------------------------------------------------