From: Carl W. B. <cb...@xn...> - 2001-09-26 23:30:33
Yves,

> -----Original Message-----
> From: icu...@ww... [mailto:icu...@ww...] On Behalf Of Yves Arrouye
> Sent: Tuesday, September 25, 2001 8:21 PM
> To: icu list
> Subject: RE: UTF-8 Macros (Was: number of chars in a UTF-8 string)
>
> > The most interesting extension that I have found potentially
> > useful is a set of macros operating on string pointers rather
> > than array+index.
>
> Even though I complain, in petto, every time I have to use an
> index variable instead of my nifty pointers ;-), I am not sure
> that the additional value of these macros would outweigh the
> potential confusion of ICU users due to the large number of UTF
> macros. (I just discovered (thanks Markus) UTF8_FWD_1_SAFE, for
> instance, having been pretty happy with a combination of
> UTF8_NEXT_CHAR_*SAFE and UTF8_CHAR_LENGTH so far.)

My concern is Markus's statement:

- you don't check trail bytes, which is necessary for the (upcoming)
  minimum-length check

What is the minimum-length check?

I avoid the SAFE macros because they are so slooooooooooow. I can always live with the minor hit of adding a zero index to my pointer, but the safe routines do so much checking that I do not need. The safe macros do nothing but protect themselves from doing stupid things when encountering bad data. They don't report bad data, but they can return erroneous results. What is the point of checking bytes in the data that do not affect the operation? Will you get a valid character count if one of the trail bytes is bad?

Using the safe version is not necessarily better. When UTF8_FWD_1_SAFE finds a non-trail byte, it assumes that the character it is processing has the wrong number of trail bytes and that this byte is really the start of the next character. That is just as much a guess as assuming that the lead byte is good but that the character contains a bad trail byte (UTF8_FWD_1_UNSAFE).

Suppose that I use xiu8_strtok to parse a string. The lead bytes are good but there is a bad trail byte.
When I find the token I want, I will collate it. If I don't process the token that is bad, my UTF-16 transforms will work just fine. If I use the safe macros in my strtok, they will trash my data and give me no indication. If I have a bad lead byte, then it will actually report an error, and the speed is as fast as the unsafe version. If I do a xui8_strcoll, it will transform the characters to UTF-16; this will report both lead and trail errors, because they matter, and the application will know that the operation failed.

You apparently use the UTF-8 manipulation macros in your code. I don't know what kind of application would use these macros and not need a full set of UTF-8 support services. It would seem to me that applications would convert all input immediately to UTF-16 and then convert back to UTF-8 when writing output data. If you keep the data in UTF-8, I would presume that you need a full set of UTF-8 services.

Carl