From: Patrick M D. <pa...@wa...> - 2001-11-25 19:28:54
I have given some more thought to event-based dispatch and feel that it is too important to ignore. Threading is not fully supported on Windows, and event-based programs can be easier to reason about. I believe that Gerd was previously in favor of an event-based interface. Any other opinions?

Here are some other thoughts related to event-based dispatch. Currently, the pop interface has a general form like this for transactions:

    val xact : 'request -> 'response

The event-based model would look like this:

    val xact' : 'request -> ('response -> unit) -> unit

where the transaction returns immediately. So a block of code written in this style:

    let rsp1 = xact "req1" in
    let rsp2 = xact "req2" in
    ...

becomes:

    xact' "req1" (fun rsp1 ->
      xact' "req2" (fun rsp2 ->
        ...
      );
    );

I notice that the type for xact' is very similar to the monadic bind operator. Maybe it would be useful to write a camlp4 extension similar to the syntactic sugar in Haskell for monads?

Patrick
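The continuation-passing shape of xact' and its similarity to monadic bind can be sketched as follows. This is an illustrative stand-in, not the real Ocamlnet interface: xact' here is simulated synchronously with a fabricated response so that only the control flow is shown; a real implementation would register the continuation with an event loop and return immediately.

```ocaml
(* Hypothetical stand-in for the proposed event-based transaction:
   instead of talking to a server, it calls the continuation at once
   with a fake response, so the chaining style can be demonstrated. *)
let xact' (req : string) (k : string -> unit) : unit =
  k ("response to " ^ req)

(* The CPS type ('a -> unit) -> unit composes exactly like monadic
   bind, which is what makes syntactic sugar attractive. *)
let ( >>= ) (m : ('a -> unit) -> unit)
            (f : 'a -> ('b -> unit) -> unit) : ('b -> unit) -> unit =
  fun k -> m (fun x -> f x k)

let () =
  (* The nested-callback example from the message, written with bind. *)
  (xact' "req1" >>= fun rsp1 ->
   xact' ("req2 after " ^ rsp1))
    print_endline
```

With such a bind operator, deeply nested callbacks flatten into a linear chain, which is the effect the proposed camlp4 sugar (or, in modern OCaml, a `let*` binding operator) would give syntactically.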
From: Guillaume V. <gui...@va...> - 2001-11-25 11:56:13
hi, I have a project to do in OCaml about the DNS protocol. I made something dirty in order to learn the protocol, and it works, so it can be done in a clean way. I like the approach taken in the Net::DNS module of Perl, but I'd like to hear other opinions. It will look like this:

    let query = new dns_query "sieste.org A" in
    query#send ();
    print_string (query#answer_to_string ());;

Not only queries will be supported but responses too, so it should be possible to build a tiny DNS server. The work will be finished in mid-January.

bye, guillaume
--
mailto:gui...@va...
ICQ uin: 1752110
Page ouebe: http://guillaume.valadon.net
"No! Try not. Do. Or do not. There is no try." - Yoda
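The class shape Guillaume describes could look like the sketch below. The class name and the two methods come from his example; everything else (the stored answer, the stub body) is an assumption for illustration — no DNS packet is actually built or sent here.

```ocaml
(* Hedged sketch of the proposed Net::DNS-like interface. [send]
   would normally encode [q] as a DNS packet, send it over UDP, and
   parse the reply; this stub merely records a placeholder answer. *)
class dns_query (q : string) = object
  val mutable answer : string option = None

  method send () =
    answer <- Some ("stub answer for " ^ q)

  method answer_to_string () =
    match answer with
    | None -> failwith "query not sent"
    | Some a -> a
end

let () =
  let query = new dns_query "sieste.org A" in
  query#send ();
  print_string (query#answer_to_string ())
```

Keeping the query as a mutable object matches the Perl Net::DNS usage pattern, where a query is constructed, sent, and then interrogated for its answer.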
From: YAMAGATA y. <yor...@ms...> - 2001-11-20 23:44:25
From: Gerd Stolpmann <in...@ge...>
Subject: Re: [Ocamlnet-devel] an experimental implementation of Unicode support.
Date: Sun, 18 Nov 2001 23:01:42 +0100

> Okay, but we need some basic data type for the interface between the
> protocol layer and the higher layers.

Some protocols, say a domain name or URL, are going to require Unicode for internationalization. But for most protocols, encoding * string, as you proposed, is sufficient. Or just string * string (the first string being the encoding name) would be better, to avoid the case where a message is rejected just because Ocamlnet doesn't know its encoding.

As for the idea of using Unicode for everything, I think this has some risk, because there are some character sets not currently supported by Unicode. For example, JIS0213, the recent Japanese standard, contains a lot of characters not encoded in Unicode (registration is now in progress). I have heard something similar about the Chinese Big-5 encoding. I don't think such extensions are widely used, though. In addition, in some rare cases translation to Unicode loses information from the original encoded string. For example, iso-2022-jp-2 makes a distinction between Chinese, Japanese, and Korean ideographs, while in Unicode they are unified. This is undesirable behaviour for a protocol layer. Though, again, this usually doesn't cause a problem, because iso-2022-jp-2 is used mainly for Japanese texts.

As Patrick said, for many applications we don't need to decode encoded strings. So, if there is no particular reason to prefer Unicode, just throw the decoding task to higher layers. This leaves more choice to the user.

> What about the idea to have a basic ustring type that supports both
> encodings? This could be modeled with phantom types like in the
> Bigarray module.

I reached a similar conclusion, but I thought about using OOP. Say, provide the virtual classes:

    ustorage -- for all UCS string-like data types;
                only allows access using cursors.
    uindexed -- virtual class in which indexing is possible.
    umutable -- virtual class allowing in-place update.

and make the hierarchy

    ustorage -> uindexed -> umutable
       |            |           |
       -> utf8,     -> utext    -> ustring
          utf16

and a similar hierarchy for cursors. (I'm not sure this is possible, though. I don't know the OCaml object system well.) What is the advantage of a phantom type?

-- yori
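The virtual-class hierarchy sketched above is indeed expressible in OCaml. The following is a minimal, hypothetical rendering: the method names, the `uchar = int` representation, and the array-backed concrete class are illustrative assumptions, not camomile's actual interface.

```ocaml
(* Illustrative code point type; a real library would likely make
   this abstract and validated. *)
type uchar = int

(* ustorage: the most general string-like storage. *)
class virtual ustorage = object
  method virtual length : int                 (* length in characters *)
end

(* uindexed: adds positional read access. *)
class virtual uindexed = object
  inherit ustorage
  method virtual get : int -> uchar
end

(* umutable: adds in-place update. *)
class virtual umutable = object
  inherit uindexed
  method virtual set : int -> uchar -> unit
end

(* A UCS-4-style concrete string backed by an int array: here both
   indexing and in-place update are constant time, which is why it
   can sit at the [umutable] end of the hierarchy. *)
class ustring (data : uchar array) = object
  inherit umutable
  method length = Array.length data
  method get i = data.(i)
  method set i c = data.(i) <- c
end
```

A UTF-8-backed class, by contrast, could honestly implement only the `ustorage` (cursor-based) interface cheaply, since indexing into a variable-width encoding is linear time; that is exactly the distinction the hierarchy encodes.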
From: Patrick M D. <pa...@wa...> - 2001-11-19 17:29:24
I tried using ocamldoc on the Ocamlnet sources yesterday. This is much easier to do now that comments after identifiers are supported. It lacks support for '*' in multiline comments, but I checked in with Maxence and this should be supported in the future. This means that we should be able to make use of the tool simply by changing comments to begin with (** instead of (* while still keeping the current formatting conventions. Once that functionality is available, I think we should start modifying the sources as appropriate to use it.

Patrick
From: Gerd S. <in...@ge...> - 2001-11-18 22:01:59
On 2001.11.16 21:04 YAMAGATA yoriyuki wrote:
> First, I'd say that I don't propose that Ocamlnet, nor other protocol
> implementations, have to be based on UCS-4. In my understanding, most
> protocols are designed for byte streams, not Unicode streams. The data
> types in camomile are for more high-level manipulation. (If posting to
> ocamlnet causes confusion, I apologize for that.)

Okay, but we need some basic data type for the interface between the protocol layer and the higher layers.

> From: Gerd Stolpmann <in...@ge...>
> Subject: Re: [Ocamlnet-devel] an experimental implementation of Unicode support.
> Date: Fri, 16 Nov 2001 01:48:12 +0100
>
> > I must admit that I do not have very much experience with multi-byte
> > encodings,
>
> Me neither. My experience is almost limited to that of a user, not an
> implementer. Well, I feel my previous comments were a bit too strong.
> But I still don't agree that UTF-8 is the way.
>
> There are two different issues involved, I think. First, I'd like to
> object to the idea that we continue to use the string type for
> Unicode, implicitly assuming that it is encoded as UTF-8. It is much
> safer to provide a new abstract data type for Unicode strings, and
> whenever one wants to work with UTF-8, one has to explicitly encode
> the given Unicode strings to UTF-8 strings.
>
> The second issue is: why not use UTF-8 as the internal representation
> of Unicode strings? This is because
>
> 1) Memory management: We cannot predict how much space is needed from
> the number of Unicode characters.

Is this really a problem?

> 2) In-place update, indexing: Both become inefficient, while they are
> widely used in current string manipulation in OCaml.

I cannot definitely say how often in-place updates are really used, but my impression is that they are relatively seldom (and very low-level). Most code modifying strings makes copies.

> 3) Presence of combined characters: To properly handle combined
> characters, manipulation (regex, sorting etc.) of Unicode strings
> becomes hard, regardless of what representation we use. (See below
> for more discussion.)

This is definitely a problem, and I hope this only concerns the higher layers and not ocamlnet.

> Anyways, camomile is just an experiment. We can change the internal
> representation in the future.
>
> For Gerd's analysis, I must admit that I don't know much about UTF-8
> handling, especially regex. (UTF-8, or Unicode itself, is currently
> not widely used in Japan.) In fact, I don't have a Unicode book,
> though I referred to the ISO standard while implementing camomile.
> But here are my comments about it.
>
> > - Simple: sort strings alphabetically by code points ("C" locale)
>
> It's simple for UCS-4, too. And if we count on compositional
> characters, since they have several representations as code points,
> things become hard anyway.
>
> > - Simple: Regular expressions that do not contain character ranges
> > (i.e. do not use [c1-c2]). In this case existing regexp engines work.
>
> .... Really? Correct me if I misunderstand. I don't know much about
> the practice of Unicode regexes.
>
> Say the pattern "." matches every single Unicode character. It becomes
> "[\0x00-\0x7f]|\([\0xc0-\0xdf].\)| ....". If we count on combined
> characters, things become more complicated. Theoretically such a
> translation is possible (I think), but I myself have no confidence to
> implement that without bugs. Now we have Vouillon's RE, so I have some
> hope for an OCaml-native Unicode regex engine.

"." is in most cases simply "." because you have delimiters. For example, if you want to extract everything between < and >, the regexp "<.*>" still works. Fortunately, this is true for most regexps using ".". However, if the number of characters counts, the more accurate translation "[\0x00-\0x7f]|\([\0xc0-\0xdf].\)| ...." must be used. For instance, if you want to match exactly four characters, "...." no longer works for obvious reasons.

My experience is: simple regular expressions work in UTF-8 without causing headaches. Complicated regular expressions are either difficult to formulate, or they have bad runtime behaviour.

What about the idea to have a basic ustring type that supports both encodings? This could be modeled with phantom types like in the Bigarray module.

    type utf8
    type ucs4
    type 'a ustring   (* 'a either utf8 or ucs4 *)

This makes it possible to reflect the representation in the type if needed, or to omit these details if they do not matter.

    val utf8_of_string : ?len:int -> string -> utf8 ustring
    val ucs4_of_string : string -> ucs4 ustring
    val force_utf8 : 'a ustring -> utf8 ustring
    val force_ucs4 : 'a ustring -> ucs4 ustring
    val string_of_utf8 : utf8 ustring -> string
    val string_of_ucs4 : ucs4 ustring -> string
    val string_of_any : 'a ustring -> [ `UTF8 of string | `UCS4 of string ]

These functions allow arbitrary conversions. Furthermore, it is possible to access the representation directly (i.e. the underlying buffers), which is necessary for I/O and to add missing low-level operators outside the core module.

    val length : 'a ustring -> int        (* length in characters *)
    val byte_length : 'a ustring -> int   (* length of the representation *)

    val create_cursor : 'a ustring -> 'a ucursor
    val incr_position : 'a ucursor -> unit
    val decr_position : 'a ucursor -> unit
    val get : 'a ucursor -> uchar
    val set : 'a ucursor -> uchar -> unit          (* slow for UTF-8 *)
    val get_position : 'a ucursor -> int
    val set_position : 'a ucursor -> int -> unit   (* slow for UTF-8 *)
    val byte_position : 'a ucursor -> int

The idea of cursors is to have a method that allows us to refer to individual characters without using character positions. Of course, there should also be an iterator:

    val iter : ?from:'a ucursor -> ?upto:'a ucursor ->
               ('a ucursor -> unit) -> 'a ustring -> unit

    val make_ucs4 : int -> uchar -> ucs4 ustring
    val make_utf8 : int -> uchar -> utf8 ustring
    val make_like : 'a ustring -> int -> uchar -> 'a ustring

For constructors, it is necessary to select a representation. make_like creates a string with the same representation as an already existing string. I wouldn't add something like String.create (without initialization) because this might result in invalid strings (for both representations).

    val sub : 'a ustring -> 'a ucursor -> 'a ucursor -> 'a ustring

This returns everything between the two cursors. Similar interfaces that use cursors instead of integer positions are possible for all of the other string functions (of course, the UTF-8 representation is sometimes slower).

This design would have the advantage that one can select one of three styles and mix them in the same program:

- use only UTF-8 and profit from existing libraries
- use only UCS-4 and get faster code
- do not specify the representation, and get better interoperability

What do you think about this?

Gerd
--
Gerd Stolpmann      Telefon: +49 6151 997705 (privat)
Viktoriastr. 45
64293 Darmstadt     EMail: ge...@ge...
Germany
From: Patrick M D. <pa...@wa...> - 2001-11-18 15:07:26
I would like to make sure the design of the POP module seems appropriate. If so, I can easily bring over code for SMTP and NNTP that follows the same model.

It occurs to me that returning an in_obj_channel for the methods 'retr' and 'top' may not be the best idea. In particular, there is no way to guarantee that the user will read all the data from the channel. This doesn't matter for the current implementation, since it reads the entire message before constructing the channel, but I would certainly want to change that in the future. An alternative option is to pass a callback function to the method like this:

    method retr : msgno:int -> (string -> unit) -> unit

The function would be called for every line of text in the message. It could be nice to have a more fold-like interface:

    method retr : msgno -> (string -> 'a -> 'b) -> 'a -> 'b

but this of course would not work in the OO framework. Perhaps the module-based interface is better?

There is also the issue of an event-based interface. IMAP will require an event-based interface because unsolicited data can arrive at almost any time.

I apologize for taking so much time on a rather simple protocol, but my hope is to establish a good design approach that can extend to many other protocols, giving a consistent feel for the user. Thanks for any help or comments!

Patrick
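The trade-off between the callback and fold interfaces can be sketched as below. This is not the Ocamlnet POP implementation: the message store is a stand-in association list, and the fold is written with the accumulator threaded `'a -> 'a` (the `'a -> 'b` shape in the message cannot chain across lines). It shows that a fold can be recovered on top of a plain per-line callback by closing over a reference, which is one way around the OO restriction on polymorphic methods.

```ocaml
(* Hypothetical message store standing in for a POP mailbox. *)
let messages = [ (1, [ "Subject: hi"; ""; "first line"; "second line" ]) ]

(* Callback style: the function is invoked once per line of the message. *)
let retr ~msgno (f : string -> unit) : unit =
  List.iter f (List.assoc msgno messages)

(* Fold style, emulated on top of [retr] with a mutable accumulator.
   This keeps the underlying interface monomorphic (OO-friendly) while
   still offering a functional surface. *)
let retr_fold ~msgno (f : string -> 'a -> 'a) (init : 'a) : 'a =
  let acc = ref init in
  retr ~msgno (fun line -> acc := f line !acc);
  !acc

(* Example: count the lines of message 1. *)
let line_count = retr_fold ~msgno:1 (fun _ n -> n + 1) 0
```

The callback never hands control of the connection to the user, so the "channel not fully read" hazard of in_obj_channel disappears: the library drains the message itself and the user merely observes each line.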
From: Patrick M D. <pa...@wa...> - 2001-11-18 14:50:24
On Sun, 18 Nov 2001, Gerd Stolpmann wrote:
> Ok, there is now a release called ocamlnet-0.91. It is currently
> hidden, and you find it only on the admin page. Can you check it, too?

The release looks right to me, so I have changed its status to active. Good work!

Patrick
From: Gerd S. <in...@ge...> - 2001-11-18 14:21:17
On 2001.11.18 01:13 Patrick M Doane wrote:
> On Sun, 18 Nov 2001, Gerd Stolpmann wrote:
> > If this is accepted, I would suggest to release the current code
> > before adding new features.
>
> Since the primary chunk of code that is stable is CGI and netstring,
> I'll let you make the call on when to finalize a release.
>
> If you haven't made a release before on SourceForge, and would like me
> to upload the files, I'm familiar with doing that. One confusing thing
> is that files do not show up for a release immediately (they are
> refreshed every half hour).

Ok, there is now a release called ocamlnet-0.91. It is currently hidden, and you find it only on the admin page. Can you check it, too?

Gerd
--
Gerd Stolpmann      Telefon: +49 6151 997705 (privat)
Viktoriastr. 45
64293 Darmstadt     EMail: ge...@ge...
Germany
From: Patrick M D. <pa...@wa...> - 2001-11-18 00:13:59
On Sun, 18 Nov 2001, Gerd Stolpmann wrote:
> Hi,
>
> I have changed the signature of the [url] method (Netcgi_types):
> where other_url_spec = [ `Env | `This of string | `None ] (instead of
> just bool). `This is the new third value, and it simplifies making
> URLs with different path_info or script_name.

I'm in favor of this change. It makes the method much more useful, and code is trivial to change to meet the new interface.

> If this is accepted, I would suggest to release the current code
> before adding new features.

Since the primary chunk of code that is stable is CGI and netstring, I'll let you make the call on when to finalize a release.

If you haven't made a release before on SourceForge, and would like me to upload the files, I'm familiar with doing that. One confusing thing is that files do not show up for a release immediately (they are refreshed every half hour).

Patrick
From: Gerd S. <in...@ge...> - 2001-11-17 23:32:44
Hi,

I have changed the signature of the [url] method (Netcgi_types):

    method url : ?protocol:Netcgi_env.protocol ->        (* default: from environment *)
                 ?with_authority:other_url_spec ->       (* default: `Env *)
                 ?with_script_name:other_url_spec ->     (* default: `Env *)
                 ?with_path_info:other_url_spec ->       (* default: `Env *)
                 ?with_query_string:query_string_spec -> (* default: `None *)
                 unit -> string

where

    other_url_spec = [ `Env | `This of string | `None ]

(instead of just bool). `This is the new third value, and it simplifies making URLs with different path_info or script_name.

If this is accepted, I would suggest to release the current code before adding new features.

Gerd
--
Gerd Stolpmann      Telefon: +49 6151 997705 (privat)
Viktoriastr. 45
64293 Darmstadt     EMail: ge...@ge...
Germany
From: YAMAGATA y. <yor...@mb...> - 2001-11-16 20:00:20
From: Patrick M Doane <pa...@wa...>
Subject: Re: [Ocamlnet-devel] an experimental implementation of Unicode support.
Date: Wed, 14 Nov 2001 16:56:45 -0500 (EST)

> I'm not aware of the implementation details for the Netmapping module.
> Would it be possible to add Japanese encodings to that? If that is
> possible to do, I'm sure it would be a welcome addition to the
> Netstring code.

The Netmapping module only works for charsets in which every character is one byte, so it's not possible to add Japanese encodings there. Adding Japanese encodings to the Netconversion module, though, is possible. However, the code for Japanese encoding support is very large (> 1 MB of source), due to the large conversion tables. And we would need to restrict the usage of some functions:

1) makechar doesn't work for iso-2022-*, because they are stateful encodings. (The representation of the same character differs with its location.)

2) Using the recode function for stateful encodings would be dangerous, because recode doesn't report the final state of the encoding. It is possible that someone writes code which isn't aware of the issue of statefulness.

3) The current form of substitution would be dangerous for a similar reason.

If these restrictions and problems are acceptable, I'm willing to work on porting the Japanese support to netstring. Currently I'm desperately busy (writing my PhD thesis), so I can't begin real work instantly.

-- yori
From: YAMAGATA y. <yor...@mb...> - 2001-11-16 19:59:57
First, I'd say that I don't propose that Ocamlnet, nor other protocol implementations, have to be based on UCS-4. In my understanding, most protocols are designed for byte streams, not Unicode streams. The data types in camomile are for more high-level manipulation. (If posting to ocamlnet causes confusion, I apologize for that.)

From: Gerd Stolpmann <in...@ge...>
Subject: Re: [Ocamlnet-devel] an experimental implementation of Unicode support.
Date: Fri, 16 Nov 2001 01:48:12 +0100

> I must admit that I do not have very much experience with multi-byte
> encodings,

Me neither. My experience is almost limited to that of a user, not an implementer. Well, I feel my previous comments were a bit too strong. But I still don't agree that UTF-8 is the way.

There are two different issues involved, I think. First, I'd like to object to the idea that we continue to use the string type for Unicode, implicitly assuming that it is encoded as UTF-8. It is much safer to provide a new abstract data type for Unicode strings, and whenever one wants to work with UTF-8, one has to explicitly encode the given Unicode strings to UTF-8 strings.

The second issue is: why not use UTF-8 as the internal representation of Unicode strings? This is because

1) Memory management: We cannot predict how much space is needed from the number of Unicode characters.

2) In-place update, indexing: Both become inefficient, while they are widely used in current string manipulation in OCaml.

3) Presence of combined characters: To properly handle combined characters, manipulation (regex, sorting etc.) of Unicode strings becomes hard, regardless of what representation we use. (See below for more discussion.)

Anyways, camomile is just an experiment. We can change the internal representation in the future.

For Gerd's analysis, I must admit that I don't know much about UTF-8 handling, especially regex. (UTF-8, or Unicode itself, is currently not widely used in Japan.) In fact, I don't have a Unicode book, though I referred to the ISO standard while implementing camomile. But here are my comments about it.

> - Simple: sort strings alphabetically by code points ("C" locale)

It's simple for UCS-4, too. And if we count on compositional characters, since they have several representations as code points, things become hard anyway.

> - Simple: Regular expressions that do not contain character ranges
> (i.e. do not use [c1-c2]). In this case existing regexp engines work.

.... Really? Correct me if I misunderstand. I don't know much about the practice of Unicode regexes.

Say the pattern "." matches every single Unicode character. It becomes "[\0x00-\0x7f]|\([\0xc0-\0xdf].\)| ....". If we count on combined characters, things become more complicated. Theoretically such a translation is possible (I think), but I myself have no confidence to implement that without bugs. Now we have Vouillon's RE, so I have some hope to make an OCaml-native Unicode regex engine.

-- yori
From: YAMAGATA y. <yor...@ms...> - 2001-11-16 19:59:51
From: Gerd Stolpmann <in...@ge...>
Subject: Re: [Ocamlnet-devel] an experimental implementation of Unicode support.
Date: Fri, 16 Nov 2001 01:48:12 +0100

> the need for internationalization within ocamlnet is quite limited.

Agreed.

> As the type for the internationalized string I would suggest
> (encoding * string).

Fine.

> Do you know a certain protocol that requires more?

Content negotiation for HTTP, and CGI (I'm not sure it is possible), would be handy.
From: Patrick M D. <pa...@wa...> - 2001-11-16 16:58:11
On Fri, 16 Nov 2001, Gerd Stolpmann wrote:
> On 2001.11.15 17:32 YAMAGATA yoriyuki wrote:
> > I'm interested in a plan about internationalization support of
> > ocamlnet. Is it just out of the scope?
>
> It's not, but the need for internationalization within ocamlnet is
> quite limited. Today it is possible for many network protocols to
> specify a certain character set, and good implementations will
> recognize the character set and signal it to the user, but that's it
> already for the protocol library. For example, there are several
> mechanisms for emails to specify the character set (one for the
> message body, and two for the message header (RFC2047 and RFC2184)).
> It is enough for the protocol library to switch to pairs
> (character_set X contents) instead of using strings with undefined
> encodings. In Mimestring, RFC2047 has been implemented by adding
> EncodedWord as a triple (character_set X transfer_encoding X
> contents). The worst case is that it is necessary to convert such
> strings to an ASCII-compatible encoding in the case it contains some
> keywords to be recognized. Of course, if you write a mail user agent
> you have to do more, but this is out of scope of ocamlnet.

This is a good example to discuss. While developing an e-mail user agent, I needed to handle encoded content in many headers (e.g. addresses). The address parsing code in Ocamlnet is based on that e-mail agent. My design was to keep the address parsing code ignorant of encoded content. This design simplified the parser and data structures, as well as the code that wants to manipulate the addresses. When my GUI finally wants to display an address, it uses Mimestring again to handle the encoding. Now, if there had been good support for Unicode, I might have adopted a different design with the parser being aware of EncodedWords. Both approaches do seem to have their advantages.

> Do you know a certain protocol that requires more?

I'm not aware of any, at least within the scope of what I feel Ocamlnet should cover (i.e. IETF-like protocols and specifications). This might become much more of an issue if we were to look at W3-like protocols, but that is probably best suited for a different project.

> So I think ocamlnet should only contain the very basic definitions:
>
> - A type enumerating the available encodings
>
> - Conversion between encodings
>
> It is unlikely that we need more for ocamlnet, and these definitions
> may also be seen as the minimal interface to other libraries that
> support several encodings, too. As the type for the internationalized
> string I would suggest (encoding * string).

I agree. How feasible would it be to add the Japanese encodings to the Netmappings table?

Patrick
From: Gerd S. <in...@ge...> - 2001-11-16 00:48:36
On 2001.11.15 17:32 YAMAGATA yoriyuki wrote:
> From: Gerd Stolpmann <in...@ge...>
> Subject: Re: [Ocamlnet-devel] an experimental implementation of Unicode support.
> Date: Wed, 14 Nov 2001 22:42:03 +0100
>
> > Do you think this is the way to go? There is very much code that
> > does not work with UCS characters, e.g. ocaml string literals,
> > ocamllex, regexp engines etc. and everything built from these
> > tools. Isn't it better to use UTF-8?
>
> Perhaps. Most ocaml programs assume a string consists of Latin-1, so
> UTF-8 can still be dangerous. For example, many programmers write
> string literals using the Latin-1 encoding, which turn into malformed
> UTF-8 strings or wrong strings if they contain non-ASCII European
> characters.

I think most ocaml programs do not assume anything about the code points >= 0x80. In the standard library, only String.lowercase and String.uppercase assume Latin-1, and everything else behaves in a transparent way. Programs with user interfaces require localization anyway, so let's talk about libraries. They are more interesting because they can theoretically be shared by programs assuming different character sets. I think this is already true for many of them if the encoding is ASCII-compatible (meaning: the bytes from 0x00 to 0x7f always denote the ASCII code points, regardless of where they occur). For example, the Mimestring module of ocamlnet can parse MIME headers if the encoding is ASCII-compatible. This works because the regular expression

    let header_re =
      Str.regexp "([^ \t\r\n:]+):([ \t]*.*\n([ \t].*\n)*)";;

does not assume any structure for non-ASCII characters. This is why I think that UTF-8 (being ASCII-compatible) is the simplest way to make Unicode available for a not-so-small set of libraries, including especially networking libraries. This might not be true for tasks that require real text processing, because regular expressions for UTF-8 can be expensive (see below).

> And supporting non-ASCII characters using UTF-8 seems quite tricky.

This depends on the basic operations you need:

- Simple: concatenation
- Simple: sort strings alphabetically by code points ("C" locale)
- Simple: Regular expressions that do not contain character ranges (i.e. do not use [c1-c2]). In this case existing regexp engines work.
- Medium: Iterate over the characters of a string (could be simplified by a utf8_cursor abstract type)
- Medium: Input/output. Often requires conversion.
- Hard: Access strings by index.
- Hard: Regular expressions with character ranges. It is not only complicated to formulate them (see PXP for a solution), but the resulting automatons become huge and almost unusable. But there are workarounds for exactly this case (wlex).
- Hard: Sort strings by locale

> In my opinion, there is no other way than implementing everything
> from scratch, and if we do so, we can freely choose the internal
> representation, not limited to UTF-8, UTF-16 or UCS4. (In my library,
> Ustring uses just UCS4, but Utext and Ubuffer do some optimization.
> For example, an ASCII text consumes only 1 byte/char.)

I must admit that I do not have very much experience with multi-byte encodings, so my knowledge is quite theoretical, especially regarding text processing outside artificial languages. The only bigger library I wrote that required some Unicode design is PXP, my XML parser. It works with UTF-8, and although there were some problems, it was always possible to find a good solution. And I could use a lot of existing tools (e.g. lex).

> I'm interested in a plan about internationalization support of
> ocamlnet. Is it just out of the scope?

It's not, but the need for internationalization within ocamlnet is quite limited. Today it is possible for many network protocols to specify a certain character set, and good implementations will recognize the character set and signal it to the user, but that's it already for the protocol library. For example, there are several mechanisms for emails to specify the character set (one for the message body, and two for the message header (RFC2047 and RFC2184)). It is enough for the protocol library to switch to pairs (character_set X contents) instead of using strings with undefined encodings. In Mimestring, RFC2047 has been implemented by adding EncodedWord as a triple (character_set X transfer_encoding X contents). The worst case is that it is necessary to convert such strings to an ASCII-compatible encoding in the case it contains some keywords to be recognized. Of course, if you write a mail user agent you have to do more, but this is out of scope of ocamlnet.

Do you know a certain protocol that requires more?

So I think ocamlnet should only contain the very basic definitions:

- A type enumerating the available encodings
- Conversion between encodings

It is unlikely that we need more for ocamlnet, and these definitions may also be seen as the minimal interface to other libraries that support several encodings, too. As the type for the internationalized string I would suggest (encoding * string).

Gerd
--
Gerd Stolpmann      Telefon: +49 6151 997705 (privat)
Viktoriastr. 45
64293 Darmstadt     EMail: ge...@ge...
Germany
From: Gerd S. <in...@ge...> - 2001-11-15 22:27:52
On 2001.11.15 17:27 YAMAGATA yoriyuki wrote:
> As posted in the Bug Tracker, the computation of UTF-16 in
> Netconversion is, I think, wrong. The subtraction by the offset is
> left out.

You are right. I have fixed the problem. Sorry that nobody had looked into the bug tracker. I have now configured it so that a mail is sent to ocamlnet-devel for every new bug, so that we notice them.

Gerd
--
Gerd Stolpmann      Telefon: +49 6151 997705 (privat)
Viktoriastr. 45
64293 Darmstadt     EMail: ge...@ge...
Germany
From: Florian H. <fl...@ha...> - 2001-11-15 17:44:50
|
On Wed, Nov 14, 2001 at 04:56:45PM -0500, Patrick M Doane wrote: > I also agree with Gerd's analysis regarding UCS and UTF-8. Any particular > reasons for UCS? One of the advantages of UCS4 vs UTF-8 and UTF-16 is that you can get at the character in position n in O(1) instead of O(n), and replacing that character will never require you to move the rest of the string by some bytes (so a string of n characters will always require 4*n bytes, instead of anything between n and 6*n (UTF-8) or 2*n and 4*n (UTF-16; note that this encoding does not allow the encoding of the whole Unicode codespace, you can't use most of the unused parts)). Yours, Florian. |
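[Editor's note: the byte-count argument above can be made concrete with a small sketch. The helper names are made up for illustration; the UTF-8 ranges follow the original ISO 10646 scheme of up to 6 bytes per code point, which is what the "6*n" figure in the message refers to.]

```ocaml
(* Bytes needed to encode one code point in UTF-8 (original 6-byte
   scheme for the full ISO 10646 range, as allowed in 2001). *)
let utf8_bytes cp =
  if cp < 0x80 then 1
  else if cp < 0x800 then 2
  else if cp < 0x10000 then 3
  else if cp < 0x200000 then 4
  else if cp < 0x4000000 then 5
  else 6

(* UCS4 is fixed-width: every code point costs 4 bytes, which is what
   makes O(1) indexing and in-place replacement possible. *)
let ucs4_bytes _cp = 4
```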
From: Patrick M D. <pa...@wa...> - 2001-11-15 16:49:53
|
On Fri, 16 Nov 2001, YAMAGATA yoriyuki wrote: > I'm interested in a plan about internationalization support of > ocamlnet. Is it just out of the scope? I think this would be very good to add. It wasn't clear to me from your initial message if you'd like to help make that happen for Ocamlnet. It seems that most of us have had experience primarily with single-byte languages, so we would certainly need some help to make sure it is done correctly. Patrick |
From: YAMAGATA y. <yor...@ms...> - 2001-11-15 16:34:59
|
From: Gerd Stolpmann <in...@ge...> Subject: Re: [Ocamlnet-devel] an experimental implementation of Unicode support. Date: Wed, 14 Nov 2001 22:42:03 +0100 > Do you think this is the way to go? There is very much code that does > not > work with UCS characters, e.g. ocaml string literals, ocamllex, regexp > engines etc. and everything built from this tools. Isn't it better to > use UTF-8? Perhaps most ocaml programs assume a string consists of latin-1 characters, so UTF-8 can still be dangerous. For example, many programmers write string literals using the latin-1 encoding, which turn into malformed UTF-8 strings or wrong strings if they contain non-ASCII European characters. And supporting non-ASCII characters using UTF-8 seems quite tricky. In my opinion, there is no other way than implementing everything from scratch, and if we do so, we can freely choose the internal representation, not limited to UTF-8, UTF-16 or UCS4. (In my library, Ustring uses just UCS4, but Utext and Ubuffer do some optimization. For example, an ASCII text consumes only 1 byte/char.) > Personally, I am not interested, but I am quite sure this project finds > its > audience. I'm interested in a plan about internationalization support of ocamlnet. Is it just out of the scope? -- yori |
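[Editor's note: the string-literal hazard described above can be shown directly. The Latin-1 byte for 'é' is 0xE9, which in UTF-8 would announce a 3-byte sequence, so on its own it is malformed UTF-8; UTF-8 encodes the same character (U+00E9) as the two bytes 0xC3 0xA9. The variable names below are illustrative.]

```ocaml
(* A Latin-1 literal versus the UTF-8 encoding of the same character. *)
let latin1_e_acute = "\xE9"        (* 1 byte; malformed as UTF-8 *)
let utf8_e_acute   = "\xC3\xA9"    (* 2 bytes; valid UTF-8 for U+00E9 *)

(* A byte below 0x80 is a complete UTF-8 character by itself; 0xE9 is
   not, which is why such literals break when reinterpreted as UTF-8. *)
let is_complete_utf8_byte c = Char.code c < 0x80
```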
From: YAMAGATA y. <yor...@ms...> - 2001-11-15 16:34:44
|
As posted in Bug Tracker, computation of UTF-16 in Netconversion is, I think, wrong. Subtraction by offset is left out. -- yori |
From: Patrick M D. <pa...@wa...> - 2001-11-14 21:57:01
|
On Wed, 14 Nov 2001, Gerd Stolpmann wrote: > On 2001.11.14 00:10 YAMAGATA yoriyuki wrote: > > It is not directly related to ocamlnet, but I think you may be > > interest in it, especially because it provides Japanese encoding > > support. > > I left this out in netstring because I don't know very much about > Japanese encodings and hoped that somebody would do it. > > > 5) Char_encoding : an implementation of various character encodings, > > as > > UTF-8, UTF16-BE/LE, All encodings provided Netmapping module > > in > > Netstring library, All major Japanese encodings (iso-2022-jp, > > euc-jp, sjis). > > > If someone is interested in development of this library, I will make a > > Sourceforge project. Contribution and comments are welcome. > > Personally, I am not interested, but I am quite sure this project finds > its audience. I'm not aware of the implementation details for the Netmapping module. Would it be possible to add Japanese encodings to that? If that is possible to do, I'm sure it would be a welcome addition to the Netstring code. I also agree with Gerd's analysis regarding UCS and UTF-8. Any particular reasons for UCS? Patrick |
From: Gerd S. <in...@ge...> - 2001-11-14 21:42:51
|
On 2001.11.14 00:10 YAMAGATA yoriyuki wrote: > It is not directly related to ocamlnet, but I think you may be > interest in it, especially because it provides Japanese encoding > support. I left this out in netstring because I don't know very much about Japanese encodings and hoped that somebody would do it. > I have made an experimental library for Unicode support, which > currently provides the following modules. > > 1) Uchar : a bare minimal implementation of ISO-UCS characters. > 2) Ustring : a mutable string for UCS characters. > 3) Utext : an immutable string for UCS characters. > 4) Ubuffer : an extensible buffer for UCS characters. Do you think this is the way to go? There is very much code that does not work with UCS characters, e.g. ocaml string literals, ocamllex, regexp engines etc. and everything built from these tools. Isn't it better to use UTF-8? > 5) Char_encoding : an implementation of various character encodings, > as > UTF-8, UTF16-BE/LE, All encodings provided Netmapping module > in > Netstring library, All major Japanese encodings (iso-2022-jp, > euc-jp, sjis). Cool. > you can obtain this from > http://www.ms.u-tokyo.ac.jp/~yoriyuki/camomile.tar.gz > > If someone is interested in development of this library, I will make a > Sourceforge project. Contribution and comments are welcome. Personally, I am not interested, but I am quite sure this project finds its audience. Gerd -- ---------------------------------------------------------------------------- Gerd Stolpmann Telefon: +49 6151 997705 (privat) Viktoriastr. 45 64293 Darmstadt EMail: ge...@ge... Germany ---------------------------------------------------------------------------- |
From: YAMAGATA y. <yor...@ma...> - 2001-11-14 10:25:57
|
(* Apologies if someone receives this mail twice. I have sent this from a wrong address. *) It is not directly related to ocamlnet, but I think you may be interested in it, especially because it provides Japanese encoding support. I have made an experimental library for Unicode support, which currently provides the following modules. 1) Uchar : a bare minimal implementation of ISO-UCS characters. 2) Ustring : a mutable string for UCS characters. 3) Utext : an immutable string for UCS characters. 4) Ubuffer : an extensible buffer for UCS characters. 5) Char_encoding : an implementation of various character encodings, such as UTF-8, UTF16-BE/LE, all encodings provided by the Netmapping module in the Netstring library, and all major Japanese encodings (iso-2022-jp, euc-jp, sjis). You can obtain this from http://www.ms.u-tokyo.ac.jp/~yoriyuki/camomile.tar.gz If someone is interested in the development of this library, I will make a Sourceforge project. Contributions and comments are welcome. -- YAMAGATA, yoriyuki (doctoral student) Department of Mathematical Science, University of Tokyo. |
From: Florian H. <fl...@ha...> - 2001-11-13 07:57:16
|
On Sun, Nov 11, 2001 at 03:44:25PM -0500, Patrick M Doane wrote: > On Sun, 11 Nov 2001, Gerd Stolpmann wrote: > > The license is still open. > Well timed given the recent discussion on the caml list. People interested in these questions might also find this article: http://www.advogato.org/article/376.html interesting. Yours, Florian Hars. |