From: Ram V. <ra...@jt...> - 2001-05-25 18:48:38
|
Alan,

The docs do say the \U notation has to be padded with leading zeros, but makeconv accepts the <Uhhhhhh> to <Uhhhhhhhh> formats, which led me to believe that we should accept 5 to 8 characters. But according to your email, it seems the right thing to do is to keep the width fixed at 8.

Regards,
Ram

Alan Liu wrote:

> That's correct. This was by design, in order to conform to the C/C++ spec. To quote from Markus' original proposal to this list (dated exactly one year ago today):
>
> Markus wrote:
> >- The new (draft) C and C++ standards specify
> >    \uhhhh (fixed-length, 4 hex digits) (like Java)
> >    \Uhhhhhhhh (fixed-length, 8 hex digits)
> >- Perl and Kermit use
> >    \x{hh...h} (variable-length, 1..8 hex digits)
> >
> >I propose that we add both to our parsing.
>
> We ended up implementing the \u and \U escapes, as described, but not the Perl/Kermit stuff. So both \u and \U expect fixed-length hex numbers. It's trivial to change \U to support 5..8 (or even 1..8) hex digits if that is, in fact, the right thing to do.
>
> Alan
>
> At 12:43 PM 5/24/2001 -0700, Carl W. Brown wrote:
> >Ram,
> >
> >It seems that \Uhhhhhhhh is the format, not \Uhhhhhh, according to the docs.
> >Thus you always have to have two leading zeros.
> >
> >Carl
> >
> >-----Original Message-----
> >From: Ram Viswanadha [mailto:ra...@jt...]
> >Sent: Thursday, May 24, 2001 10:11 AM
> >To: Alan Liu
> >Cc: Carl W. Brown; icu list
> >Subject: Re: Proposal: Unicode Hex representations
> >
> >
> >Alan,
> >you are right, but I just verified that u_unescapeAt has a bug and does not
> >handle \Uhhhhhh notation correctly. I am in the process of fixing the bug.
> >
> >Carl,
> >You are right that genrb does not handle non-BMP code points, but that is a
> >bug: genrb casts UChar32 to UChar in a few places, and we need to change that.
> >
> >Regards.
> >Ram
> >
> >Alan Liu wrote:
> >
> >> There is already underlying support for this functionality in icu4c:
> >>
> >>   UnicodeString UnicodeString::unescape() const;
> >>
> >>   UChar32 UnicodeString::unescapeAt(int32_t &offset) const;
> >>
> >>   U_CAPI int32_t U_EXPORT2
> >>   u_unescape(const char *src,
> >>              UChar *dest, int32_t destCapacity);
> >>
> >>   U_CAPI UChar32 U_EXPORT2
> >>   u_unescapeAt(UNESCAPE_CHAR_AT charAt,
> >>                int32_t *offset,
> >>                int32_t length,
> >>                void *context);
> >>
> >> The syntax it supports is \uxxxx and \Uxxxxxxxx, as well as \xhh and \ooo
> >> and the standard ANSI C escapes like \n.
> >>
> >> Alan Liu - IBM
> >>
> >> At 08:57 AM 5/24/2001 -0700, Carl W. Brown wrote:
> >> >Expires 5/31/01
> >> >
> >> >Some hex representations like .ucm files will have no problem supporting
> >> >non-plane0 Unicode characters:
> >> >
> >> ><U215C> \xA8\xFC |0
> >> >
> >> >This is because the hex fields are delimited or extensible.
> >> >
> >> >The locale resource files, however, use the \uxxxx format. You can not
> >> >encode \uxxxxxx because the format implies 4 hex digits only.
> >> >
> >> >Proposal:
> >> >
> >> >Use an uppercase 'U' to designate a 6 hex digit format and a lowercase 'u'
> >> >for the 4 hex digit format. The characters from 0000 to FFFF will be
> >> >encoded as \uxxxx and 10000 to 10FFFF will be encoded as \Uxxxxxx. Encoding
> >> >0000 to FFFF as \U00xxxx is allowed but encoding 10000 to 10FFFF characters
> >> >as a pair of \uxxxx\uxxxx surrogate codes should be strongly discouraged.
> >> >
> >> >I am developing code that I want to be ICU compatible, thus I would like to
> >> >see this proposal adopted for ICU.
> >> >
> >> >Carl
> >> >
> >> >
> >> >_______________________________________________
> >> >icu mailing list
> >> >ic...@os...
> >> >http://oss.software.ibm.com/developerworks/opensource/mailman/listinfo/icu
> >>
> >> _______________________________________________
> >> icu mailing list
> >> ic...@os...
> >> http://oss.software.ibm.com/developerworks/opensource/mailman/listinfo/icu |
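
The fixed-width rule discussed in this thread (\u followed by exactly 4 hex digits, \U followed by exactly 8) can be illustrated with a small standalone parser. This is only a sketch under stated assumptions: `hex_value` and `parse_fixed_escape` are hypothetical names invented for this example, the code does not use or reproduce ICU's `u_unescapeAt`, and rejecting values above 10FFFF is a choice made here, not something the thread specifies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of ICU): value of one hex digit, or -1. */
static int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/*
 * Hypothetical sketch (not ICU's u_unescapeAt): parse one \uhhhh or
 * \Uhhhhhhhh escape starting at s[*pos], where len is the length of s.
 * On success, returns the code point and advances *pos past the escape.
 * On failure, returns -1 and leaves *pos unchanged.
 */
static int32_t parse_fixed_escape(const char *s, size_t len, size_t *pos) {
    assert(s != NULL && pos != NULL);

    size_t i = *pos;
    if (len < 2 || i > len - 2 || s[i] != '\\') return -1;

    /* The escape letter alone decides the width: 4 for 'u', 8 for 'U'. */
    int digits;
    if (s[i + 1] == 'u') digits = 4;
    else if (s[i + 1] == 'U') digits = 8;
    else return -1;
    i += 2;

    /* Fixed width: fewer digits than required is an error, not a short form. */
    if (len - i < (size_t)digits) return -1;

    uint32_t cp = 0;
    for (int k = 0; k < digits; ++k) {
        int v = hex_value(s[i + k]);
        if (v < 0) return -1;
        cp = (cp << 4) | (uint32_t)v;
    }

    /* Assumption made for this sketch: reject values beyond the Unicode range. */
    if (cp > 0x10FFFF) return -1;

    *pos = i + (size_t)digits;
    return (int32_t)cp;
}
```

With this sketch, the input `\U00010000` yields code point 10000, while the six-digit form `\U010000` is rejected because only six hex digits follow the `U`. That matches the point made above that supplementary code points always need two leading zeros under the fixed-width rule.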