Re: Proposal: Unicode Hex representations

Alan,
The docs do say that the \U notation has to be padded with leading zeros, but makeconv accepts the <Uhhhhhh> to <Uhhhhhhhh>
formats, which led me to believe that we should accept 5 to 8 hex digits. According to your email, though, it seems the right thing to do is to keep the width fixed at 8.

Regards,
Ram
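
[Editor's illustration] The fixed-width rule discussed in this thread (\u followed by exactly 4 hex digits, \U followed by exactly 8) can be sketched in plain C as below. This is only a minimal sketch under stated assumptions: the names hex_value and parse_fixed_escape are hypothetical, the code does not reproduce ICU's u_unescapeAt, and it handles only these two escape forms.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not an ICU function): value of one hex digit, or -1. */
static int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/*
 * Hypothetical sketch (not ICU's implementation): parse one escape that
 * starts at s[*offset]. Accepted forms are "\u" + exactly 4 hex digits and
 * "\U" + exactly 8 hex digits. On success, returns the code point and moves
 * *offset past the escape. On failure, returns -1 and leaves *offset alone.
 */
int32_t parse_fixed_escape(const char *s, size_t length, size_t *offset) {
    size_t i = *offset;
    size_t width, k;
    uint32_t value = 0;

    /* Need at least a backslash and a marker character. */
    if (i >= length || length - i < 2 || s[i] != '\\') return -1;

    if (s[i + 1] == 'u') width = 4;
    else if (s[i + 1] == 'U') width = 8;
    else return -1;
    i += 2;

    /* Fixed width: fewer digits than required is an error, not a short form. */
    if (length - i < width) return -1;
    for (k = 0; k < width; ++k) {
        int d = hex_value(s[i + k]);
        if (d < 0) return -1;
        value = (value << 4) | (uint32_t)d;
    }

    /* Reject values beyond the Unicode code space. */
    if (value > 0x10FFFF) return -1;

    *offset = i + width;
    return (int32_t)value;
}
```

With this sketch, "\U0001D11E" parses as one supplementary code point, while the 6-digit form "\U01D11E" is rejected rather than guessed at. Accepting 5 to 8 digits instead would mean replacing the fixed width check with a loop that stops at the first non-hex character.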

Alan Liu wrote:

> That's correct.  This was by design, in order to conform to the C/C++ spec.  To quote from Markus' original proposal to this list (dated exactly one year ago today):
>
> Markus wrote:
> >- The new (draft) C and C++ standards specify
> >  \uhhhh (fixed-length, 4 hex digits) (like Java)
> >  \Uhhhhhhhh (fixed-length, 8 hex digits)
> >- Perl and Kermit use
> >  \x{hh...h} (variable-length, 1..8 hex digits)
> >
> >I propose that we add both to our parsing.
>
> We ended up implementing the \u and \U escapes, as described, but not the Perl/Kermit stuff.  So both \u and \U expect fixed-length hex numbers.  It's trivial to change \U to support 5..8 (or even 1..8) hex digits if that is, in fact, the right thing to do.
>
> Alan
>
> At 12:43 PM 5/24/2001 -0700, Carl W. Brown wrote:
> >Ram,
> >
> >It seems that \Uhhhhhhhh is the format not \Uhhhhhh according to the docs.
> >Thus you always have to have two leading zeros.
> >
> >Carl
> >
> >-----Original Message-----
> >From: Ram Viswanadha [mailto:ra...@jt...]
> >Sent: Thursday, May 24, 2001 10:11 AM
> >To: Alan Liu
> >Cc: Carl W. Brown; icu list
> >Subject: Re: Proposal: Unicode Hex representations
> >
> >
> >Alan,
> >you are right, but I just verified that u_unescapeAt has a bug and does not
> >handle the \Uhhhhhh notation correctly. I am in the process of fixing the bug.
> >
> >Carl,
> >You are right that genrb does not handle non-BMP code points, but that is a
> >bug: genrb casts UChar32 to UChar in a few places, and we need to change that.
> >
> >Regards.
> >Ram
> >
> >Alan Liu wrote:
> >
> >> There is already underlying support for this functionality in icu4c:
> >>
> >>   UnicodeString UnicodeString::unescape() const;
> >>
> >>   UChar32 UnicodeString::unescapeAt(int32_t &offset) const;
> >>
> >>   U_CAPI int32_t U_EXPORT2
> >>   u_unescape(const char *src,
> >>              UChar *dest, int32_t destCapacity);
> >>
> >>   U_CAPI UChar32 U_EXPORT2
> >>   u_unescapeAt(UNESCAPE_CHAR_AT charAt,
> >>                int32_t *offset,
> >>                int32_t length,
> >>                void *context);
> >>
> >> The syntax it supports is \uxxxx and \Uxxxxxxxx, as well as \xhh and \ooo
> >> and the standard ANSI C escapes like \n.
> >>
> >> Alan Liu - IBM
> >>
> >> At 08:57 AM 5/24/2001 -0700, Carl W. Brown wrote:
> >> >Expires 5/31/01
> >> >
> >> >Some hex representations like .ucm files will have no problem supporting
> >> >non-plane0 Unicode characters:
> >> >
> >> ><U215C> \xA8\xFC |0
> >> >
> >> >This is because the hex fields are delimited or extensible.
> >> >
> >> >The locale resource files, however, use the \uxxxx format.  You cannot
> >> >encode \uxxxxxx because the format implies 4 hex digits only.
> >> >
> >> >Proposal:
> >> >
> >> >Use an uppercase 'U' to designate a 6 hex digit format and a lowercase 'u'
> >> >for the 4 hex digit format.  The characters from 0000 to FFFF will be
> >> >encoded as \uxxxx and 10000 to 10FFFF will be encoded as \Uxxxxxx.  Encoding
> >> >0000 to FFFF as \U00xxxx is allowed, but encoding 10000 to 10FFFF characters
> >> >as a pair of \uxxxx\uxxxx surrogate codes should be strongly discouraged.
> >> >
> >> >I am developing code that I want to be ICU compatible, thus I would like to
> >> >see this proposal adopted for ICU.
> >> >
> >> >Carl
> >> >
> >> >
> >> >_______________________________________________
> >> >icu mailing list
> >> >ic...@os...
> >>
> >>http://oss.software.ibm.com/developerworks/opensource/mailman/listinfo/icu
> >>
> >> _______________________________________________
> >> icu mailing list
> >> ic...@os...
> >> http://oss.software.ibm.com/developerworks/opensource/mailman/listinfo/icu
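
[Editor's illustration] The encoding side of the proposal (BMP characters as \uxxxx, characters from 10000 to 10FFFF as a single \U escape rather than a surrogate pair) might look roughly like the C sketch below. It assumes the 8-digit \Uhhhhhhhh width that the thread settles on, not the 6-digit width first proposed, and format_escape is a hypothetical name, not an ICU function.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical sketch (not an ICU function): write the escape for code
 * point cp into dest as a NUL-terminated string.
 *   0000..FFFF     -> \uhhhh      (4 hex digits)
 *   10000..10FFFF  -> \Uhhhhhhhh  (8 hex digits, zero padded)
 * Returns the number of characters written (excluding the NUL), or -1 if
 * cp is outside the Unicode range or dest is too small. Surrogate code
 * points (D800..DFFF) are not treated specially in this sketch.
 */
int format_escape(uint32_t cp, char *dest, size_t capacity) {
    int n;

    if (cp > 0x10FFFF) return -1;

    if (cp <= 0xFFFF) {
        n = snprintf(dest, capacity, "\\u%04X", (unsigned)cp);
    } else {
        /* One \U escape instead of a \uxxxx\uxxxx surrogate pair. */
        n = snprintf(dest, capacity, "\\U%08X", (unsigned)cp);
    }

    /* snprintf reports the length it needed; treat truncation as failure. */
    if (n < 0 || (size_t)n >= capacity) return -1;
    return n;
}
```

A caller needs a buffer of at least 11 bytes to hold the longest form plus the terminating NUL.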
