From: Colin P. A. <co...@co...> - 2007-07-26 09:32:49
I am going to add valid_utf16le and make_from_utf16le routines.

A very straightforward implementation would be to pass
Little_endian + argument to valid_utf16 and make_from_utf16
respectively.

However, as this involves creating a temporary, which could be large
and costly, my feeling is to forget about code sharing and write the
implementations out in full.
--
Colin Adams
Preston Lancashire
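The trade-off described above can be sketched as follows. This is only an illustrative Python analogue, not the Gobo Eiffel code: `make_from_utf16` here is a stand-in for the BOM-sniffing routine, and the `_shared` and `_direct` names are hypothetical, chosen to contrast the two options.

```python
import codecs


def make_from_utf16(data: bytes) -> str:
    """Stand-in for the BOM-sniffing routine (assumed behaviour):
    strip a leading BOM if present, otherwise assume big-endian."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    return data.decode("utf-16-be")


def make_from_utf16le_shared(data: bytes) -> str:
    """Code-sharing variant: prepend a little-endian BOM and delegate.
    The concatenation builds a temporary copy of the whole input."""
    return make_from_utf16(codecs.BOM_UTF16_LE + data)


def make_from_utf16le_direct(data: bytes) -> str:
    """Written-out variant: decode as little-endian, no BOM handling
    and no concatenated temporary."""
    return data.decode("utf-16-le")


payload = "Gobo".encode("utf-16-le")
# Both variants yield the same string; only the intermediate work differs.
print(make_from_utf16le_shared(payload) == make_from_utf16le_direct(payload))
```

The shared variant is shorter to maintain, but its cost grows with the input, because the BOM cannot be prepended without copying the argument.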
From: Colin P. A. <co...@co...> - 2007-07-26 13:59:12
>>>>> "Colin" == Colin Paul Adams <co...@co...> writes:

    Colin> I am going to add valid_utf16le and make_from_utf16le
    Colin> routines. A very straight-forward implementation would be
    Colin> to pass Little_endian + argument to valid_utf16 and
    Colin> make_from_utf16 respectively.

    Colin> However, as this involves creating a temporary, which could
    Colin> be large and costly, my feeling is to forget about code
    Colin> sharing and write the implementations out in full.

I've done this, and also added valid_utf16be/make_from_utf16be, for
similar reasons. It's a pretty obscure use case, but if you have a
UTF-16BE string whose first character is a zero-width no-break space
(U+FEFF), then you can't use make_from_utf16 directly: that routine
will assume the first character is a BOM and discard it, so you have
to prepend a BOM yourself.

I've also added tests, which pass on all three compilers.

May I check these in?
--
Colin Adams
Preston Lancashire
From: Emmanuel S. [ES] <ma...@ei...> - 2007-07-26 14:57:02
Looks strange to me that the BOM is part of the string. In my opinion
it should not be.

If I understand you well, you want to have a smart `make_from_utf16'
that uses the BOM to identify whether it is little endian or big
endian? You are better off having
`make_from_utf16 (a_content: STRING; a_bom: STRING)' (assuming your
make_from_utf16 takes a string).

Regards,
Manu

> -----Original Message-----
> From: gob...@li... [mailto:gob...@li...] On Behalf Of Colin Paul Adams
> Sent: Thursday, July 26, 2007 6:58 AM
> To: gob...@li...
> Subject: Re: [gobo-eiffel-develop] UTF-16LE
>
> >>>>> "Colin" == Colin Paul Adams <co...@co...> writes:
>
> Colin> I am going to add valid_utf16le and make_from_utf16le
> Colin> routines. A very straight-forward implementation would be
> Colin> to pass Little_endian + argument to valid_utf16 and
> Colin> make_from_utf16 respectively.
>
> Colin> However, as this involves creating a temporary, which could
> Colin> be large and costly, my feeling is to forget about code
> Colin> sharing and write the implementations out in full.
>
> I've done this, and also added
> valid_utf16be/make_from_utf16be too, for similar reasons
> (it's a pretty obscure use-case, but if you have a UTF-16BE
> string whose first character is zero-width-unbreakable-space,
> then you can't use make_from_utf16, as that routine will
> assume the first character is a BOM, and discard it, so you
> have to prepend the BOM yourself).
>
> I've also added tests which pass on all three compilers.
>
> May I check these in?
> --
> Colin Adams
> Preston Lancashire
From: Colin P. A. <co...@co...> - 2007-07-26 15:21:39
>>>>> "Emmanuel" == Emmanuel Stapf [ES] <Emmanuel> writes:

    Emmanuel> Looks strange to me that the BOM is part of the
    Emmanuel> string. In my opinion it should not.

The Unicode Consortium decide these matters, not us.

    Emmanuel> well, you want to have a smart `make_from_utf16' that
    Emmanuel> uses the BOM to identify whether it is little endian or
    Emmanuel> big endian?

No, we already have that.

    Emmanuel> You are better off having `make_from_utf16
    Emmanuel> (a_content: STRING; a_bom: STRING)' (assuming your
    Emmanuel> make_from_utf16 takes a string).

No, we are not better off. There are three situations:

1) The byte-string that we receive from wherever is in Unicode
   Encoding Scheme UTF-16. In this case it may start with a BOM, or it
   may not. If the latter, then big-endian is assumed, as per the
   Unicode specification. If the former, then the BOM is used to
   determine the endianness, and then discarded. This is creation
   procedure make_from_utf16.

2) The byte-string that we receive from wherever is in Unicode
   Encoding Scheme UTF-16BE. In this case no BOM is permitted. If the
   first two bytes indicate a zero-width non-breaking space, then that
   will be the first character in the string - it is not treated as a
   BOM and is not discarded. This is creation procedure
   make_from_utf16be.

3) The byte-string that we receive from wherever is in Unicode
   Encoding Scheme UTF-16LE. In this case no BOM is permitted. If the
   first two bytes indicate a zero-width non-breaking space, then that
   will be the first character in the string - it is not treated as a
   BOM and is not discarded. This is creation procedure
   make_from_utf16le.
--
Colin Adams
Preston Lancashire
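The three situations above can be illustrated with a small Python sketch. It borrows the routine names from this thread, but it is only an analogue written for illustration; the real creation procedures are Eiffel and their code is not shown here. Python's own "utf-16" codec is deliberately avoided for case 1, since it falls back to the platform's byte order rather than big-endian when no BOM is present.

```python
import codecs


def make_from_utf16(data: bytes) -> str:
    """Case 1, UTF-16: an optional BOM selects the byte order and is
    discarded; without a BOM, big-endian is assumed."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    return data.decode("utf-16-be")


def make_from_utf16be(data: bytes) -> str:
    """Case 2, UTF-16BE: no BOM handling; a leading FE FF is kept as U+FEFF."""
    return data.decode("utf-16-be")


def make_from_utf16le(data: bytes) -> str:
    """Case 3, UTF-16LE: no BOM handling; a leading FF FE is kept as U+FEFF."""
    return data.decode("utf-16-le")


be_bytes = b"\xfe\xff\x00A"   # FE FF followed by 'A', big-endian
le_bytes = b"\xff\xfeA\x00"   # FF FE followed by 'A', little-endian

print(repr(make_from_utf16(be_bytes)))    # 'A'       (BOM consumed)
print(repr(make_from_utf16be(be_bytes)))  # '\ufeffA' (kept as a character)
print(repr(make_from_utf16(le_bytes)))    # 'A'       (BOM consumed)
print(repr(make_from_utf16le(le_bytes)))  # '\ufeffA' (kept as a character)
print(repr(make_from_utf16(b"\x00A")))    # 'A'       (no BOM: big-endian)
```

The same bytes therefore produce different strings depending on which encoding scheme the caller declares, which is why the choice of creation procedure is left to the caller.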
From: Emmanuel S. [ES] <ma...@ei...> - 2007-07-26 16:01:34
> The Unicode Consortium decide these matters, not us.

But I've heard that for text files, sometimes one has to figure out
oneself what encoding is used, and usually you have plenty of choices
(UTF-8, UTF-16 or whatever else there is).

If I understand what you just did, your class assumes that you knew
that you were handling a UTF-16 string. My feeling is that in such a
case you also know the variant being used. If I pass a UTF-16 string
without a BOM and you reject it, I think that's too strong.

I'll let you decide, just wanted to give you my input on what I think
should not be part of UC_STRING but part of a separate class that does
conversion to and from various encodings. UC_STRING should be like
STRING_32, i.e. immune to any encoding.

Regards,
Manu
From: Colin P. A. <co...@co...> - 2007-07-26 17:22:38
>>>>> "Emmanuel" == Emmanuel Stapf [ES] <Emmanuel> writes:

    Emmanuel> Not when I see conversions in UC_STRING. Agreed the code
    Emmanuel> is not in UC_STRING, but what I'm saying is that it
    Emmanuel> should not even mention it. You have the
    Emmanuel> UC_UTF16_ROUTINES class for that and I think it will be
    Emmanuel> better left there.

How? This is impossible. You cannot in general determine what the
encoding of a string is.

    >> -----Original Message-----
    >> From: Colin Paul Adams [mailto:co...@co...]
    >> Sent: Thursday, July 26, 2007 9:53 AM
    >> To: ma...@ei...
    >> Subject: Re: [gobo-eiffel-develop] UTF-16LE
    >>
    >> >>>>> "Emmanuel" == Emmanuel Stapf [ES] <Emmanuel> writes:
    >>
    >> >> The Unicode Consortium decide these matters, not us.
    >>
    >> Emmanuel> I'll let you decide, just wanted to give you my input on
    >> Emmanuel> what I think should not be part of UC_STRING but part of
    >> Emmanuel> a separate class that does conversion to and from
    >> Emmanuel> various encodings. UC_STRING should be like STRING_32,
    >> Emmanuel> i.e. immune to any encoding.
    >>
    >> It is immune.
    >> --
    >> Colin Adams
    >> Preston Lancashire

--
Colin Adams
Preston Lancashire
From: Eric B. <er...@go...> - 2007-07-26 14:14:19
Colin Paul Adams wrote:
>>>>>> "Colin" == Colin Paul Adams <co...@co...> writes:
>
> Colin> I am going to add valid_utf16le and make_from_utf16le
> Colin> routines. A very straight-forward implementation would be
> Colin> to pass Little_endian + argument to valid_utf16 and
> Colin> make_from_utf16 respectively.
>
> Colin> However, as this involves creating a temporary, which could
> Colin> be large and costly, my feeling is to forget about code
> Colin> sharing and write the implementations out in full.
>
> I've done this, and also added valid_utf16be/make_from_utf16be too,
> for similar reasons (it's a pretty obscure use-case, but if you have a
> UTF-16BE string whose first character is zero-width-unbreakable-space,
> then you can't use make_from_utf16, as that routine will assume the
> first character is a BOM, and discard it, so you have to prepend the
> BOM yourself).
>
> I've also added tests which pass on all three compilers.
>
> May I check these in?

Fine with me.

--
Eric Bezault
mailto:er...@go...
http://www.gobosoft.com
From: Colin P. A. <co...@co...> - 2007-07-26 14:28:44
>>>>> "Eric" == Eric Bezault <er...@go...> writes:

    Eric> Colin Paul Adams wrote:
    >>>>>>> "Colin" == Colin Paul Adams <co...@co...> writes:

    Colin> I am going to add valid_utf16le and make_from_utf16le
    Colin> routines. A very straight-forward implementation would be
    Colin> to pass Little_endian + argument to valid_utf16 and
    Colin> make_from_utf16 respectively. However, as this involves
    Colin> creating a temporary, which could be large and costly, my
    Colin> feeling is to forget about code sharing and write the
    Colin> implementations out in full.

    >> I've done this, and also added valid_utf16be/make_from_utf16be
    >> too, for similar reasons (it's a pretty obscure use-case, but
    >> if you have a UTF-16BE string whose first character is
    >> zero-width-unbreakable-space, then you can't use
    >> make_from_utf16, as that routine will assume the first
    >> character is a BOM, and discard it, so you have to prepend the
    >> BOM yourself). I've also added tests which pass on all three
    >> compilers. May I check these in?

    Eric> Fine with me.

Done.

Developers, remember to do a geant install after your next update.
--
Colin Adams
Preston Lancashire
From: Emmanuel S. [ES] <ma...@ei...> - 2007-07-26 17:27:56
> This is impossible. You cannot in general determine what the
> encoding of a string is.

Either you know it or you don't. If you do not know, then the smart
feature should try the various possible heuristics to see which one
works. If none seems to work then you simply fail the conversion.

Manu
From: Colin P. A. <co...@co...> - 2007-07-26 17:32:03
>>>>> "Emmanuel" == Emmanuel Stapf [ES] <Emmanuel> writes:

    >> This is impossible. You cannot in general determine what the
    >> encoding of a string is.

    Emmanuel> Either you know it or you don't.

The creator of a UC_STRING must know it, so he or she can call the
appropriate creation routine.

    Emmanuel> If you do not know
    Emmanuel> then the smart feature should try the various possible
    Emmanuel> heuristics to see which one works. If none seems to work
    Emmanuel> then you simply fail the conversion.

How do you fail? Raise an exception in the creation routine?

More importantly, what if more than one works? This is quite possible.

And STRING_32, as you have implemented it, is UTF-32. So it is not
independent of encodings - it can't be.
--
Colin Adams
Preston Lancashire
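The point that more than one candidate may "work" is easy to demonstrate. The sketch below is a hypothetical Python helper, not anything from Gobo: `plausible_encodings` and its default candidate list are made up for illustration, and "works" is taken to mean only "decodes without error".

```python
def plausible_encodings(data: bytes,
                        candidates=("utf-8", "utf-16-be", "utf-16-le")):
    """Return every candidate encoding under which `data` decodes cleanly."""
    matches = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # this candidate rejects the bytes
        matches.append(name)
    return matches


data = "hi".encode("utf-16-be")      # the four bytes 00 68 00 69
print(plausible_encodings(data))     # ['utf-8', 'utf-16-be', 'utf-16-le']
print(repr(data.decode("utf-16-le")))  # two CJK ideographs, U+6800 U+6900
```

All three candidates accept these four bytes, yet each yields a different string, so a trial-and-error heuristic has no principled way to pick one without outside knowledge of the encoding.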