From: Markus S. <mar...@gm...> - 2005-10-28 17:02:36
This question more properly belongs on the Unicode mailing list. Unicode inherited many characters from legacy character sets that might not have been included otherwise.

The ones you list are "fullwidth ASCII" characters from East Asian character sets. East Asian characters tend to have much wider display glyphs than Latin ones, and used to be encoded in some legacy character sets with pairs of bytes instead of single bytes. The fullwidth ASCII characters simply correspond to double-byte clones of ASCII characters in such character sets.

Whether you need or want to treat these the same as the regular ASCII characters, or whether you fold them to their equivalents, depends on what you do. If you use collation for sorting of text, they will already sort next to their normal cousins.

If you just want to fold these, then simply do a range check (U+FF01..U+FF5E) and subtract 0xFEE0. If you want to handle related cases as well, for example for loose identifier matching, you could use NFKC or NFKD normalization.

See the Unicode Standard about identifiers, East Asian Width, and compatibility characters, and see the annotations in the standard for these characters.

markus

On 10/28/05, shef <sh...@ya...> wrote:
> I'm a bit confused by some western characters in
> Japanese text. I have some text to process that
> contains western letters and numbers that look like
> their ASCII equivalents, but aren't coded that way.
> For example:
>
> Ｈ (hex ff28, looks like capital H, hex 48)
> ２ (hex ff12, looks like digit 2, hex 32)
> （ (hex ff08, looks like an open parenthesis,
> hex 28)
> ...
> I thought that the same letters in different languages
> always mapped to the same unicode characters? How do I
> map all of these characters to their ASCII-range
> equivalents?

--
Opinions expressed here may not reflect my company's positions unless otherwise noted.
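
For reference, here is a minimal sketch of the two folding approaches mentioned in the reply, written in Python purely for illustration (the thread itself names no language). The function names are hypothetical. The range fold assumes the fullwidth ASCII block runs from U+FF01 to U+FF5E, so the offset to the ASCII range is 0xFF01 - 0x0021 = 0xFEE0. The second function uses the standard library's unicodedata.normalize for NFKC.

```python
import unicodedata

# Fullwidth ASCII variants: U+FF01 (fullwidth '!') .. U+FF5E (fullwidth '~').
FULLWIDTH_FIRST = 0xFF01
FULLWIDTH_LAST = 0xFF5E
# Distance from a fullwidth form to its ASCII twin: 0xFF01 - 0x0021.
FULLWIDTH_OFFSET = 0xFEE0


def fold_fullwidth_ascii(text):
    """Range check and subtract: touches only the fullwidth ASCII block."""
    folded = []
    for ch in text:
        cp = ord(ch)
        if FULLWIDTH_FIRST <= cp <= FULLWIDTH_LAST:
            folded.append(chr(cp - FULLWIDTH_OFFSET))
        else:
            # Everything else (kana, kanji, plain ASCII) passes through.
            folded.append(ch)
    return "".join(folded)


def fold_nfkc(text):
    """Broader fold: NFKC replaces all compatibility characters."""
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    sample = "\uFF28\uFF12\uFF08"  # the three characters from the question
    print(fold_fullwidth_ascii(sample))
    print(fold_nfkc(sample))
```

The range fold leaves everything outside U+FF01..U+FF5E alone, while NFKC also rewrites other compatibility characters (halfwidth katakana, for example, become regular katakana), so which one fits depends on how loose the matching should be.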