RE: [GD-General] Unicode


I can't claim to be an expert by any means; I've only just started digging
into it all myself. The actual implementation of Unicode support is extremely
compiler-dependent
(http://oss.software.ibm.com/icu/docs/papers/unicode_wchar_t.html). GCC and
VC++ both provide a wchar_t data type that you use for working with
Unicode strings. A wide string literal is declared with a leading 'L':

const wchar_t* str = L"This is my fancy string";

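From what I understand so far, both compilers use a fixed size for wchar_t
(typically 32-bit with GCC, and 16-bit with VC++), so pointer arithmetic and
sizeof(wchar_t) are still reliable for stepping through code units. Only the
32-bit version is big enough to hold any code point in a single unit, though;
with a 16-bit wchar_t, characters above U+FFFF take two units (a surrogate
pair).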
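
Here's a minimal sketch of what that means in practice. The function names
(count_code_units, count_code_points) are my own for illustration, not
anything from a library. It assumes a 16-bit wchar_t holds UTF-16 and a wider
wchar_t holds one code point per unit, and it does no validation. Note that a
16-bit wchar_t can't hold a code point above U+FFFF in one unit, so the two
counts can differ there.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>

// Number of wchar_t code units before the terminating L'\0'.
std::size_t count_code_units(const wchar_t* s)
{
    assert(s != nullptr);
    return std::wcslen(s);
}

// Number of Unicode code points in a wide string.
// Assumption: a 16-bit wchar_t holds UTF-16, so a code point above U+FFFF is
// stored as a surrogate pair and its trailing (low) unit is not counted.
std::size_t count_code_points(const wchar_t* s)
{
    assert(s != nullptr);
    std::size_t count = 0;
    for (; *s != L'\0'; ++s) {
        const unsigned long unit = static_cast<unsigned long>(*s);
        const bool is_low_surrogate =
            sizeof(wchar_t) == 2 && unit >= 0xDC00 && unit <= 0xDFFF;
        if (!is_low_surrogate) {
            ++count;
        }
    }
    return count;
}
```

For plain ASCII text the two functions agree. For something like
L"a\U0001D11Eb" they agree only where wchar_t is 32-bit; with a 16-bit
wchar_t there is one more code unit than there are code points.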

There's lots more info about the Unicode standard at
http://www.unicode.org/standard/principles.html:

"Character encoding standards define not only the identity of each character
and its numeric value, or code point, but also how this value is represented
in bits.

The Unicode Standard defines three encoding forms that allow the same data
to be transmitted in a byte, word or double word oriented format (i.e. in 8,
16 or 32-bits per code unit). All three encoding forms encode the same
common character repertoire and can be efficiently transformed into one
another without loss of data. The Unicode Consortium fully endorses the use
of any of these encoding forms as a conformant way of implementing the
Unicode Standard.

UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of
transforming all Unicode characters into a variable length encoding of
bytes. It has the advantages that the Unicode characters corresponding to
the familiar ASCII set have the same byte values as ASCII, and that Unicode
characters transformed into UTF-8 can be used with much existing software
without extensive software rewrites.

UTF-16 is popular in many environments that need to balance efficient access
to characters with economical use of storage. It is reasonably compact and
all the heavily used characters fit into a single 16-bit code unit, while
all other characters are accessible via pairs of 16-bit code units.

UTF-32 is popular where memory space is no concern, but fixed width, single
code unit access to characters is desired. Each Unicode character is
encoded in a single 32-bit code unit when using UTF-32.

All three encoding forms need at most 4 bytes (or 32-bits) of data for each
character."
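
To make the "pairs of 16-bit code units" part concrete, here is a rough
sketch of encoding one code point as UTF-16. The name encode_utf16 is
hypothetical, and the asserts are just my way of stating the preconditions
(a valid code point that isn't itself a surrogate).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Encodes one code point as UTF-16. Writes one or two code units to `out`
// and returns how many were written.
std::size_t encode_utf16(std::uint32_t code_point, std::uint16_t out[2])
{
    assert(code_point <= 0x10FFFF);                      // Unicode range
    assert(code_point < 0xD800 || code_point > 0xDFFF);  // not a surrogate

    if (code_point < 0x10000) {
        // Fits in a single 16-bit code unit.
        out[0] = static_cast<std::uint16_t>(code_point);
        return 1;
    }

    // Subtract 0x10000, leaving 20 bits, and split them 10/10 across a
    // high surrogate (0xD800..0xDBFF) and a low surrogate (0xDC00..0xDFFF).
    const std::uint32_t v = code_point - 0x10000;
    out[0] = static_cast<std::uint16_t>(0xD800 + (v >> 10));
    out[1] = static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF));
    return 2;
}
```

So U+20AC (the euro sign) comes out as the single unit 0x20AC, while
U+1D11E (a musical G clef) comes out as the pair 0xD834 0xDD1E, which is 4
bytes, same as it would need in UTF-32.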

-----Original Message-----
From: gam...@li...
[mailto:gam...@li...]On Behalf Of
Garett Bass
Sent: Wednesday, November 19, 2003 9:58 AM
To: gam...@li...
Subject: RE: [GD-General] Unicode

Paul,

	It was after reading Joel's article that I understood Unicode to use an
indeterminate number of bytes per character.  Specifically:

"In UTF-8, every code point from 0-127 is stored in a single byte. Only code
points 128 and above are stored using 2, 3, in fact, up to 6 bytes."

Which leaves me wondering, how do you figure out where one character ends
and the next begins?
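
As far as I can tell, the answer is in the bit pattern of each byte: the
first byte of a sequence says how long the sequence is, and every following
byte starts with the bits 10. A minimal sketch, with function names that are
just mine for illustration. It assumes well-formed input and the current
4-byte limit (RFC 3629 restricts UTF-8 to 4 bytes, although older
descriptions allow up to 6), and it doesn't reject overlong or out-of-range
forms.

```cpp
#include <cassert>
#include <cstddef>

// Returns the number of bytes in the UTF-8 sequence that starts with `lead`,
// or 0 if `lead` cannot start a sequence (e.g. a 10xxxxxx continuation byte).
std::size_t utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII, one byte
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                             // continuation or unsupported byte
}

// Counts the characters (code points) in a NUL-terminated UTF-8 string by
// counting every byte that is not a continuation byte.
std::size_t utf8_count_code_points(const char* s)
{
    assert(s != nullptr);
    std::size_t count = 0;
    for (; *s != '\0'; ++s) {
        const unsigned char byte = static_cast<unsigned char>(*s);
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

Because continuation bytes are always recognizable, you can also land in the
middle of a string and back up to the start of the current character by
skipping over 10xxxxxx bytes.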

Thanks in advance,
Garett

-----Original Message-----
From: gam...@li...
[mailto:gam...@li...]On Behalf Of
Paul Reynolds
Sent: Wednesday, November 19, 2003 11:31 AM
To: gam...@li...
Subject: RE: [GD-General] Feedback wanted on POSH

This is a pretty good overview of text encoding*:
http://www.joelonsoftware.com/articles/Unicode.html

I'd say everyone working on a shipping game should seriously evaluate whether
raw char* strings are a good idea. If you've ever had to localize a 7-bit
ASCII game, you'll know what I'm talking about. Other software industries
have been embracing Unicode for quite some time.

* - For the record, I'm not a Joel Spolsky fanboy. I can usually take him or
leave him. ;o)

-----Original Message-----
From: gam...@li...
[mailto:gam...@li...]On Behalf Of
Garett Bass
Sent: Wednesday, November 19, 2003 9:13 AM
To: gam...@li...
Subject: RE: [GD-General] Feedback wanted on POSH

// Crosbie Fitch wrote:
// Hmmn maybe the chars should be like this:

You will notice that POSH doesn't provide a char typedef, presumably because
sizeof(char) == 1 in ANSI C, as mentioned in another post.  I imagine that
defining your own integer character type will require an explicit cast
anytime you want to use a string manipulation function, which seems a little
awkward.  Of course, if you use C++ and STL, then you can always create a
std::basic_string<char_utf8>, or whatever.

// typedef char8 char_ascii; // Unsized char able to contain 7bit ASCII
// typedef char8 char_utf8;  // Unsized char able to contain...
// typedef char16 char_ucs2; // Unsized char able to contain...

I'm not sure I understand what you mean by "Unsized" here.  If you're
defining char8 to be uint8, then its size is 8 bits.

// typedef char_utf8 char_unicode; // Unsized char suitable for Unicode
// typedef char_unicode character; // Unsized char suitable for any text

Not being too familiar with Unicode, I find this confusing.  I thought that
"Unicode" was a multibyte format with no set number of bytes per character,
i.e. a single Asian character may be represented by four bytes while the
subsequent character is represented by two.

Regards,
Garett

-------------------------------------------------------
This SF.net email is sponsored by: SF.net Giveback Program.
Does SourceForge.net help you be more productive?  Does it
help you create better code?  SHARE THE LOVE, and help us help
YOU!  Click Here: http://sourceforge.net/donate/
_______________________________________________
Gamedevlists-general mailing list
Gam...@li...
https://lists.sourceforge.net/lists/listinfo/gamedevlists-general
Archives:
http://sourceforge.net/mailarchive/forum.php?forum_id=557
