#30 multibyte code character sets support... (japanese, chin...)

Status: open
Owner: nobody
Priority: 5
Updated: 2003-06-21
Created: 2003-06-21
Creator: Anonymous
Private: No

I'd like to be able to read Unicode or multibyte character sets...

Multibyte character sets (MBCS) are an alternative to Unicode for
supporting character sets, like Japanese and Chinese, that cannot be
represented in a single byte. If you are programming for an
international market, consider using either Unicode or MBCS, or
enabling your program so you can build it for either by changing a
switch.

The most common MBCS implementation is double-byte character sets
(DBCS). Visual C++ in general, and MFC in particular, is fully enabled
for DBCS.

For samples, see the MFC source code files.

For platforms used in markets whose languages use large character
sets, the best alternative to Unicode is MBCS. MFC supports MBCS by
using "internationalizable" data types and C run-time functions. You
should do the same in your code.

Under MBCS, characters are encoded in either one or two bytes. In
two-byte characters, the first byte, or "lead byte," signals that both
it and the following byte are to be interpreted as one character. The
first byte comes from a range of codes reserved for use as lead bytes.
Which ranges of bytes can be lead bytes depends on the code page in
use. For example, Japanese code page 932 uses the ranges 0x81 through
0x9F and 0xE0 through 0xFC as lead bytes, but Korean code page 949
uses a different range.
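
As a rough illustration of the lead-byte rule above, the following sketch counts characters rather than bytes in a code page 932 string. The function names are made up for this example, and the lead-byte ranges (0x81 through 0x9F and 0xE0 through 0xFC) are hard-coded assumptions for code page 932 only. It is not how MFC or the C run-time library does it.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: lead-byte test for Japanese code page 932.
// Assumption: the ranges are hard-coded; other code pages use other ranges.
inline bool is_cp932_lead_byte(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Hypothetical helper: counts characters (not bytes) in a
// NUL-terminated code page 932 string.
inline std::size_t cp932_char_count(const char* s) {
    assert(s != nullptr);
    std::size_t count = 0;
    while (*s != '\0') {
        const unsigned char b = static_cast<unsigned char>(*s);
        // A lead byte followed by another byte consumes two bytes.
        // A dangling lead byte at the end is counted as one character.
        s += (is_cp932_lead_byte(b) && s[1] != '\0') ? 2 : 1;
        ++count;
    }
    return count;
}
```

For the four-byte string 'A', 0x83, 0x5C, 'B' this counts three characters, because 0x83 is a lead byte and takes the following 0x5C with it.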

Consider all of the following in your MBCS programming:

MBCS characters in the environment
MBCS characters can appear in strings such as file and directory
names.

Editing operations
Editing operations in MBCS applications should operate on characters,
not bytes. The caret should not split a character, the RIGHT ARROW key
should move right one character, and so on. Delete should delete a
character; Undo should reinsert it.
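
A minimal sketch of character-based caret movement and deletion, assuming a code page 932 buffer held in a std::string. All names here are hypothetical, and the lead-byte ranges are hard-coded for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Assumed lead-byte ranges for code page 932; other code pages differ.
inline bool is_lead_byte(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Byte width (1 or 2) of the character that starts at offset pos.
inline std::size_t char_width_at(const std::string& text, std::size_t pos) {
    assert(pos < text.size());
    const bool two_bytes =
        is_lead_byte(static_cast<unsigned char>(text[pos])) &&
        pos + 1 < text.size();
    return two_bytes ? 2 : 1;
}

// RIGHT ARROW: advance the caret by one character, never into its middle.
inline std::size_t caret_right(const std::string& text, std::size_t caret) {
    return caret < text.size() ? caret + char_width_at(text, caret) : caret;
}

// LEFT ARROW: walk forward from the start of the buffer, because a trail
// byte can have the same value as a lead byte or a single-byte character.
inline std::size_t caret_left(const std::string& text, std::size_t caret) {
    std::size_t prev = 0;
    for (std::size_t pos = 0; pos < caret && pos < text.size();
         pos += char_width_at(text, pos)) {
        prev = pos;
    }
    return prev;
}

// DELETE: remove the whole character at the caret, both bytes if needed.
inline void delete_at_caret(std::string& text, std::size_t caret) {
    if (caret < text.size()) {
        text.erase(caret, char_width_at(text, caret));
    }
}
```

Scanning from the start on every LEFT ARROW is linear in the buffer length. A real editor would probably cache character boundaries, but the sketch keeps the rule itself visible.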

String handling
In an application that uses MBCS, string handling poses special
problems. Characters of both widths are mixed in a single string;
therefore you must remember to check for lead bytes.
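
One place where forgetting the lead-byte check tends to bite is searching for a path separator: in code page 932 the byte 0x5C (backslash) can also occur as the second byte of a two-byte character. The sketch below uses hypothetical names and hard-coded code page 932 ranges to find the last real backslash in a path.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Assumed lead-byte ranges for code page 932; other code pages differ.
inline bool is_lead_byte(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Returns the offset of the last occurrence of the single-byte character
// ch, ignoring bytes that are the second half of a two-byte character.
// Returns std::string::npos if ch does not occur.
inline std::size_t find_last_single_byte(const std::string& s, char ch) {
    // Precondition: ch is itself a single-byte character.
    assert(!is_lead_byte(static_cast<unsigned char>(ch)));
    std::size_t found = std::string::npos;
    std::size_t pos = 0;
    while (pos < s.size()) {
        const unsigned char b = static_cast<unsigned char>(s[pos]);
        if (is_lead_byte(b) && pos + 1 < s.size()) {
            pos += 2;  // skip both bytes of a two-byte character
        } else {
            if (s[pos] == ch) {
                found = pos;
            }
            pos += 1;
        }
    }
    return found;
}
```

A byte-wise search such as std::string::rfind would report the trail byte of a character like 0x83 0x5C as a separator and split the file name in the wrong place; the character-aware scan skips it.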

Run-time library support
The C run-time library and MFC support single-byte, MBCS, and Unicode
programming. Single-byte strings are processed with the str family of
run-time functions, MBCS strings are processed with corresponding _mbs
functions, and Unicode strings are processed with corresponding wcs
functions. MFC class member function implementations use portable
run-time functions that map, under the right circumstances, to the
normal str family of functions, the MBCS functions, or the Unicode
functions, as described in "MBCS/Unicode portability."

MBCS/Unicode portability
Using the header file TCHAR.H, you can build single-byte, MBCS, and
Unicode applications from the same sources. TCHAR.H defines macros
prefixed with _tcs, which map to str, _mbs, or wcs functions, as
appropriate. To build MBCS, define the symbol _MBCS. To build Unicode,
define the symbol _UNICODE. By default, _MBCS is defined for MFC
applications. For more information, see Generic-Text Mappings in
TCHAR.H.
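
To show the idea behind such a build switch without depending on TCHAR.H itself, here is a small portable imitation. DEMO_UNICODE, demo_tchar, DEMO_T, and demo_tcslen are invented names standing in for the real symbols; the actual header covers far more functions and its MBCS mappings are more involved than this.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical stand-ins for the generic-text character type, literal
// macro, and string-length mapping. Define DEMO_UNICODE to switch the
// whole program to wide characters.
#if defined(DEMO_UNICODE)
typedef wchar_t demo_tchar;
#define DEMO_T(x) L##x
inline std::size_t demo_tcslen(const demo_tchar* s) { return std::wcslen(s); }
#else
typedef char demo_tchar;
#define DEMO_T(x) x
inline std::size_t demo_tcslen(const demo_tchar* s) { return std::strlen(s); }
#endif

// The same source line compiles to a narrow or a wide string call,
// depending only on the switch above.
inline std::size_t greeting_length() {
    const demo_tchar* greeting = DEMO_T("hello");
    assert(greeting != nullptr);
    return demo_tcslen(greeting);
}
```

Code written only against the generic names never mentions char or wchar_t directly, which is what lets one set of sources serve both builds.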

Note
Behavior is undefined if you define both _UNICODE and _MBCS.

The MBCTYPE.H and MBSTRING.H header files define MBCS-specific
functions and macros, which you may need in some cases. For example,
_ismbblead tells you whether a specific byte in a string is a lead
byte.

For international portability, code your program with Unicode or
multibyte character sets (MBCS).
