Re: [icu-support] Western chars in Asian text

This question more properly belongs on the Unicode mailing list, but in short:

Unicode inherited many characters from legacy character sets that
might not have been included otherwise. The ones you list are
"fullwidth ASCII" characters from East Asian character sets. East Asian
characters tend to have much wider display glyphs than Latin ones, and
in some legacy character sets they were encoded with pairs of bytes
instead of single bytes. The fullwidth ASCII characters simply
correspond to the double-byte clones of ASCII characters in such
character sets.
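
As a rough illustration of the "double-byte clone" point, the sketch
below encodes the three characters from the question in Shift_JIS, one
such legacy character set, and compares byte lengths. It assumes
Python's built-in "shift_jis" codec; the helper name sjis_len is
made up for this example.

```python
# Compare how many bytes ASCII and fullwidth characters take in Shift_JIS.
# sjis_len is a hypothetical helper name used only for this illustration.

def sjis_len(ch: str) -> int:
    """Return the number of bytes ch occupies when encoded as Shift_JIS."""
    return len(ch.encode("shift_jis"))

# (ASCII character, fullwidth counterpart) pairs from the question.
pairs = [("H", "\uff28"), ("2", "\uff12"), ("(", "\uff08")]

for ascii_ch, fullwidth_ch in pairs:
    print(
        f"U+{ord(ascii_ch):04X}: {sjis_len(ascii_ch)} byte(s)   "
        f"U+{ord(fullwidth_ch):04X}: {sjis_len(fullwidth_ch)} byte(s)"
    )
```

Each ASCII character should take one byte and each fullwidth
counterpart two, which is why Unicode needed separate code points to
round-trip such text.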

Whether you need to treat these the same as the regular ASCII
characters, or fold them to their ASCII equivalents, depends on what
you are doing. If you use collation to sort text, they will already
sort next to their normal cousins. If you just want to fold these,
then simply do a range check (U+FF01..U+FF5E) and subtract 0xFEE0;
for example, 0xFF28 - 0xFEE0 = 0x48. If you want to handle related
cases as well, for example for loose identifier matching, you could
use NFKC or NFKD normalization.
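
The two folding options can be sketched roughly as follows. This is a
minimal illustration in Python using the standard unicodedata module,
not ICU's own normalization API. It assumes the fullwidth ASCII range
is U+FF01..U+FF5E and that the offset is 0xFEE0, which follows from
the examples in the question (0xFF28 - 0x48). The function names
fold_fullwidth_ascii and fold_compat are made up for this example.

```python
import unicodedata

# Assumed bounds of the fullwidth ASCII range and its distance from ASCII.
FULLWIDTH_FIRST = 0xFF01  # FULLWIDTH EXCLAMATION MARK
FULLWIDTH_LAST = 0xFF5E   # FULLWIDTH TILDE
OFFSET = 0xFEE0           # 0xFF28 - 0x48, per the examples in the question


def fold_fullwidth_ascii(text: str) -> str:
    """Range check and subtract: fold only fullwidth ASCII characters."""
    out = []
    for ch in text:
        cp = ord(ch)
        if FULLWIDTH_FIRST <= cp <= FULLWIDTH_LAST:
            out.append(chr(cp - OFFSET))
        else:
            out.append(ch)  # leave everything else untouched
    return "".join(out)


def fold_compat(text: str) -> str:
    """Broader option: NFKC folds many compatibility characters."""
    return unicodedata.normalize("NFKC", text)


sample = "\uff28\uff12\uff08"  # the three characters from the question
print(fold_fullwidth_ascii(sample))
print(fold_compat(sample))
```

Both calls should yield plain ASCII for the sample. The difference
shows up on related cases: NFKC also rewrites other compatibility
characters, such as halfwidth katakana, while the range check leaves
them alone. Which behavior is preferable depends on the application.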

See the Unicode Standard about identifiers (UAX #31), East Asian Width
(UAX #11), and compatibility characters, and see the annotations in
the standard for these characters (the Halfwidth and Fullwidth Forms
block, U+FF00..U+FFEF).

markus

On 10/28/05, shef <sh...@ya...> wrote:
> I'm a bit confused by some western characters in
> Japanese text. I have some text to process that
> contains western letters and numbers that look like
> their ASCII equivalents, but aren't coded that way.
> For example:
>
> Ｈ (hex ff28, looks like capital H, hex 48)
> ２ (hex ff12, looks like digit 2, hex 32)
> （ (hex ff08, looks like an open parenthesis,
> hex 28)
> ...
> I thought that the same letters in different languages
> always mapped to the same unicode characters? How do I
> map all of these characters to their ASCII-range
> equivalents?

--
Opinions expressed here may not reflect my company's positions unless
otherwise noted.
