RE: [GD-General] Unicode
From: Paul R. <pa...@so...> - 2003-11-19 19:01:27
I can't claim to be an expert by any means. I've just started digging into it all myself.

The actual implementation of Unicode support is extremely compiler dependent (http://oss.software.ibm.com/icu/docs/papers/unicode_wchar_t.html). GCC and VC++ both have a data type called wchar_t that you use for working with Unicode strings. A string literal is declared with a leading 'L':

    const wchar_t* str = L"This is my fancy string";

From what I understand so far, both compilers use a fixed size for wchar_t (GCC is 32-bit, and VC++ is 16-bit), so pointer arithmetic and sizeof(wchar_t) are still reliable. Only the 32-bit version is big enough to hold any code point, though; with a 16-bit wchar_t, characters that don't fit in 16 bits take a pair of code units, as in the UTF-16 description quoted below.

There's lots more info about the Unicode standard at http://www.unicode.org/standard/principles.html ...

"Character encoding standards define not only the identity of each character and its numeric value, or code point, but also how this value is represented in bits. The Unicode Standard defines three encoding forms that allow the same data to be transmitted in a byte, word or double word oriented format (i.e. in 8, 16 or 32-bits per code unit). All three encoding forms encode the same common character repertoire and can be efficiently transformed into one another without loss of data. The Unicode Consortium fully endorses the use of any of these encoding forms as a conformant way of implementing the Unicode Standard.

UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of transforming all Unicode characters into a variable length encoding of bytes. It has the advantages that the Unicode characters corresponding to the familiar ASCII set have the same byte values as ASCII, and that Unicode characters transformed into UTF-8 can be used with much existing software without extensive software rewrites.

UTF-16 is popular in many environments that need to balance efficient access to characters with economical use of storage.
It is reasonably compact and all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units.

UTF-32 is popular where memory space is no concern, but fixed width, single code unit access to characters is desired. Each Unicode character is encoded in a single 32-bit code unit when using UTF-32.

All three encoding forms need at most 4 bytes (or 32-bits) of data for each character."

-----Original Message-----
From: gam...@li... [mailto:gam...@li...]On Behalf Of Garett Bass
Sent: Wednesday, November 19, 2003 9:58 AM
To: gam...@li...
Subject: RE: [GD-General] Unicode

Paul,

It was after reading Joel's article that I understood Unicode to use an indeterminate number of bytes per character. Specifically:

"In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes."

Which leaves me wondering, how do you figure out where one character ends and the next begins?

Thanks in advance,
Garett

-----Original Message-----
From: gam...@li... [mailto:gam...@li...]On Behalf Of Paul Reynolds
Sent: Wednesday, November 19, 2003 11:31 AM
To: gam...@li...
Subject: RE: [GD-General] Feedback wanted on POSH

This is a pretty good overview of text encoding*:

http://www.joelonsoftware.com/articles/Unicode.html

I'd say everyone working on a shipping game should really evaluate whether raw char* strings are a good idea. If you've ever had to localize a 7-bit ASCII game, you'll know what I'm talking about. Other software industries have been embracing Unicode for quite some time.

* - For the record, I'm not a Joel Spolsky fanboy. I can usually take him or leave him. ;o)

-----Original Message-----
From: gam...@li... [mailto:gam...@li...]On Behalf Of Garett Bass
Sent: Wednesday, November 19, 2003 9:13 AM
To: gam...@li...
Subject: RE: [GD-General] Feedback wanted on POSH

// Crosbie Fitch wrote:
// Hmmn maybe the chars should be like this:

You will notice that POSH doesn't provide a char typedef, presumably because sizeof(char) == 1 in ANSI C, as mentioned in another post. I imagine that defining your own integer character type will require an explicit cast anytime you want to use a string manipulation function, which seems a little awkward. Of course, if you use C++ and STL, then you can always create a std::basic_string<char_utf8>, or whatever.

// typedef char8 char_ascii; // Unsized char able to contain 7bit ASCII
// typedef char8 char_utf8; // Unsized char able to contain...
// typedef char16 char_ucs2; // Unsized char able to contain...

I'm not sure I understand what you mean by "Unsized" here. If you're defining char8 to be uint8, then its size is 8 bits.

// typedef char_utf8 char_unicode; // Unsized char suitable for Unicode
// typedef char_unicode character; // Unsized char suitable for any text

Not being too familiar with Unicode, I find this confusing. I thought that "Unicode" was a multibyte format with no set number of bytes per character, i.e. a single Asian character may be represented by four bytes while the subsequent character is represented by two.

Regards,
Garett

-------------------------------------------------------
This SF.net email is sponsored by: SF.net Giveback Program.
Does SourceForge.net help you be more productive? Does it help you create better code? SHARE THE LOVE, and help us help YOU! Click Here: http://sourceforge.net/donate/
_______________________________________________
Gamedevlists-general mailing list
Gam...@li...
https://lists.sourceforge.net/lists/listinfo/gamedevlists-general
Archives: http://sourceforge.net/mailarchive/forum.php?forum_id=557
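
On Garett's question of where one UTF-8 character ends and the next begins: the first byte of each sequence encodes the sequence length in its high bits, and every following (continuation) byte has the form 10xxxxxx, so a boundary can be found from any position. Below is a minimal C++ sketch of that idea. It assumes well-formed UTF-8 and the 4-byte maximum stated in the Unicode text quoted above, it does no validation, and the function names are hypothetical (they are not from POSH or any library mentioned in the thread).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of bytes in the UTF-8 sequence that starts
// with `lead`. Returns 0 if `lead` is a continuation byte (10xxxxxx) or
// does not match any lead-byte pattern. No further validation is done.
inline std::size_t utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80)           return 1;  // 0xxxxxxx: same as ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                             // continuation or unknown pattern
}

// Hypothetical helper: counts code points in a string assumed to hold
// well-formed UTF-8, by counting every byte that is NOT a continuation byte.
inline std::size_t utf8_count_code_points(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++count;  // this byte starts a new character
        }
    }
    return count;
}
```

For example, utf8_sequence_length(0xC3) gives 2, and the 6-byte string "a\xC3\xA9\xE2\x82\xAC" (one 1-byte, one 2-byte and one 3-byte sequence) counts as 3 code points. Real code that handles untrusted input would also need to reject truncated or otherwise malformed sequences.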