From: Carl W. B. <cb...@xn...> - 2001-09-26 23:30:33
Yves,

> -----Original Message-----
> From: icu...@ww... [mailto:icu...@ww...] On Behalf Of Yves Arrouye
> Sent: Tuesday, September 25, 2001 8:21 PM
> To: icu list
> Subject: RE: UTF-8 Macros (Was: number of chars in a UTF-8 string)
>
> > The most interesting extension that I have found potentially
> > useful is a set of macros operating on string pointers rather
> > than array+index.
>
> Even though I complain, in petto, every time I have to use an
> index variable instead of my nifty pointers ;-), I am not sure
> that the additional value of these macros would outweigh the
> potential confusion of ICU users due to the large number of UTF
> macros. (I just discovered (thanks Markus) UTF8_FWD_1_SAFE, for
> instance, having been pretty happy with a combination of
> UTF8_NEXT_CHAR_*SAFE and UTF8_CHAR_LENGTH so far.)

My concern is Markus's statement:

- you don't check trail bytes, which is necessary for the (upcoming)
  minimum-length check

What is the minimum-length check?

I avoid the SAFE macros because they are so slooooooooooow. I can always live with the minor hit of adding a zero index to my pointer, but the safe routines do so much checking that I do not need. The safe macros do nothing but protect themselves from doing stupid things when encountering bad data. They don't report bad data, but they can return erroneous results. What is the point of checking bytes in the data that do not affect the operation? Will you get a valid character count if one of the trail bytes is bad?

Using the safe version is not necessarily better. When UTF8_FWD_1_SAFE finds a non-trail byte, it assumes that the character it is processing has the wrong number of trail bytes and that this byte is really the start of the next character. That is just as much a guess as assuming that the lead byte is good but that the character contains a bad trail byte (UTF8_FWD_1_UNSAFE).

Suppose that I use xiu8_strtok to parse a string. The lead bytes are good but there is a bad trail byte.
When I find the token I want, I will collate it. If I don't process the token that is bad, my UTF-16 transforms will work just fine. If I use the safe macros in my strtok, they will trash my data and give me no indication. If I have a bad lead byte, then it will actually report an error, and the speed is as fast as the unsafe version. If I do a xui8_strcoll, it will transform the characters to UTF-16; this will report both lead and trail errors, because they matter, and the application will know that the operation failed.

You apparently use the UTF-8 manipulation macros in your code. I don't know what kind of application would use these macros and not need a full set of UTF-8 support services. It would seem to me that applications would convert all input immediately to UTF-16 and then convert back to UTF-8 when writing output data. If you keep the data in UTF-8, I would presume that you need a full set of UTF-8 services.

Carl