Re: [GD-General] Unicode
From: Jani K. <ka...@ga...> - 2003-11-19 19:46:55
wchar_t is platform dependent, but on the other hand you probably don't
need Unicode string literals at all. You can use ASCII-7 for the strings
in source code and read the (Unicode) translations of specific strings
in different languages from a file. That way you can store the strings
internally in whatever UTF format you want and still use only ASCII-7 in
source code, something like:

    String translatedUnicodeString = translator->translate( "File {0} not found", filename );

For a simple UTF-8/16/32/ASCII-7 converter, have a look at the
http://catmother.sourceforge.net source package and its source file
lang/UTFConverter.cpp. It's very simple and doesn't handle special
cases, such as malformed UTF data, the way the Unicode standard
prescribes, but it should serve as a simple encoder/decoder.

For a more complete Unicode implementation, ICU is the weapon of choice:
a very complete and very high-quality library, but also very
heavy-weight and probably overkill for a typical game. (Just my opinion,
of course.)

Regards,
Jani

----- Original Message -----
From: "Paul Reynolds" <pa...@so...>
To: <gam...@li...>
Sent: Wednesday, November 19, 2003 9:01 PM
Subject: RE: [GD-General] Unicode

> I can't claim to be an expert by any means. I've just started digging into
> it all myself. The actual implementation of Unicode support is extremely
> compiler dependent
> (http://oss.software.ibm.com/icu/docs/papers/unicode_wchar_t.html). GCC and
> VC++ both have a data type declared wchar_t that you use for working with
> Unicode strings. A string literal is declared with a leading 'L':
>
> wchar_t* str = L"This is my fancy string";
>
> From what I understand so far, both compilers use a fixed size for all
> characters that is big enough to hold any code point. (GCC is 32-bit, and
> VC++ is 16-bit). So pointer arithmetic and sizeof(wchar_t) are still
> reliable.
>
> There's lots more info about the Unicode standard at
> http://www.unicode.org/standard/principles.html...
>
> "Character encoding standards define not only the identity of each character
> and its numeric value, or code point, but also how this value is represented
> in bits.
>
> The Unicode Standard defines three encoding forms that allow the same data
> to be transmitted in a byte, word or double word oriented format (i.e. in 8,
> 16 or 32-bits per code unit). All three encoding forms encode the same
> common character repertoire and can be efficiently transformed into one
> another without loss of data. The Unicode Consortium fully endorses the use
> of any of these encoding forms as a conformant way of implementing the
> Unicode Standard.
>
> UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of
> transforming all Unicode characters into a variable length encoding of
> bytes. It has the advantages that the Unicode characters corresponding to
> the familiar ASCII set have the same byte values as ASCII, and that Unicode
> characters transformed into UTF-8 can be used with much existing software
> without extensive software rewrites.
>
> UTF-16 is popular in many environments that need to balance efficient access
> to characters with economical use of storage. It is reasonably compact and
> all the heavily used characters fit into a single 16-bit code unit, while
> all other characters are accessible via pairs of 16-bit code units.
>
> UTF-32 is popular where memory space is no concern, but fixed width, single
> code unit access to characters is desired. Each Unicode character is
> encoded in a single 32-bit code unit when using UTF-32.
>
> All three encoding forms need at most 4 bytes (or 32-bits) of data for each
> character."
>
> -----Original Message-----
> From: gam...@li...
> [mailto:gam...@li...]On Behalf Of
> Garett Bass
> Sent: Wednesday, November 19, 2003 9:58 AM
> To: gam...@li...
> Subject: RE: [GD-General] Unicode
>
>
> Paul,
>
> It was after reading Joel's article that I understood Unicode to use an
> indeterminate number of bytes per character. Specifically:
>
> "In UTF-8, every code point from 0-127 is stored in a single byte. Only code
> points 128 and above are stored using 2, 3, in fact, up to 6 bytes."
>
> Which leaves me wondering, how do you figure out where one character ends
> and the next begins?
>
> Thanks in advance,
> Garett
>
>
> -----Original Message-----
> From: gam...@li...
> [mailto:gam...@li...]On Behalf Of
> Paul Reynolds
> Sent: Wednesday, November 19, 2003 11:31 AM
> To: gam...@li...
> Subject: RE: [GD-General] Feedback wanted on POSH
>
>
> This is a pretty good overview of text encoding*:
> http://www.joelonsoftware.com/articles/Unicode.html
>
> I'd say everyone working on a shipping game should really evaluate if raw
> char* strings are really a good idea. If you've ever had to localize a 7-bit
> ASCII game, you'll know what I'm talking about. Other software industries
> have been embracing Unicode for quite some time.
>
> * - For the record, I'm not a Joel Spolsky fanboy. I can usually take him or
> leave him. ;o)
>
> -----Original Message-----
> From: gam...@li...
> [mailto:gam...@li...]On Behalf Of
> Garett Bass
> Sent: Wednesday, November 19, 2003 9:13 AM
> To: gam...@li...
> Subject: RE: [GD-General] Feedback wanted on POSH
>
>
> // Crosbie Fitch wrote:
> // Hmmn maybe the chars should be like this:
>
> You will notice that POSH doesn't provide a char typedef, presumably because
> sizeof(char) == 1 in ANSI C, as mentioned in another post. I imagine that
> defining your own integer character type will require an explicit cast
> anytime you want to use a string manipulation function, which seems a little
> awkward. Of course, if you use C++ and STL, then you can always create a
> std::basic_string<char_utf8>, or whatever.
>
> // typedef char8 char_ascii; // Unsized char able to contain 7bit ASCII
> // typedef char8 char_utf8; // Unsized char able to contain...
> // typedef char16 char_ucs2; // Unsized char able to contain...
>
> I'm not sure I understand what you mean by "Unsized" here. If you're
> defining char8 to be uint8, then its size is 8 bits.
>
> // typedef char_utf8 char_unicode; // Unsized char suitable for Unicode
> // typedef char_unicode character; // Unsized char suitable for any text
>
> Not being too familiar with Unicode, I find this confusing. I thought that
> "Unicode" was a multibyte format with no set number of bytes per character,
> i.e. a single Asian character may be represented by four bytes while the
> subsequent character is represented by two.
>
> Regards,
> Garett
>
>
>
> -------------------------------------------------------
> This SF.net email is sponsored by: SF.net Giveback Program.
> Does SourceForge.net help you be more productive? Does it
> help you create better code? SHARE THE LOVE, and help us help
> YOU! Click Here: http://sourceforge.net/donate/
> _______________________________________________
> Gamedevlists-general mailing list
> Gam...@li...
> https://lists.sourceforge.net/lists/listinfo/gamedevlists-general
> Archives:
> http://sourceforge.net/mailarchive/forum.php?forum_id=557
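
On Garett's question in the quoted thread (how do you figure out where one
UTF-8 character ends and the next begins?): the first byte of every sequence
carries the sequence length in its high bits, and every following byte has the
form 10xxxxxx, so the boundaries can be found from the bytes alone. Below is a
minimal C++ sketch of that idea. The names utf8SequenceLength and decodeUtf8
are made up for illustration and are not taken from UTFConverter.cpp or ICU.
The sketch assumes the 4-byte limit stated in the Unicode Standard text quoted
above, so the 5- and 6-byte forms of the older definition are treated as
invalid. It is also not a full validator: overlong forms and surrogate values
are not rejected.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Number of bytes in the UTF-8 sequence that starts with `lead`, or 0 if
// `lead` cannot start a sequence (a continuation byte or an invalid value).
inline std::size_t utf8SequenceLength(unsigned char lead)
{
    if (lead < 0x80) return 1;            // 0xxxxxxx: plain ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                             // 10xxxxxx or 0xF8..0xFF
}

// Decodes UTF-8 text into code points. A malformed or truncated sequence
// yields U+FFFD and decoding resumes one byte later. Overlong forms and
// surrogates are NOT detected (illustrative sketch only).
inline std::vector<std::uint32_t> decodeUtf8(const std::string& text)
{
    // Payload bits carried by the lead byte, indexed by sequence length.
    static const unsigned char leadMask[5] = { 0x00, 0x7F, 0x1F, 0x0F, 0x07 };

    std::vector<std::uint32_t> out;
    std::size_t pos = 0;
    while (pos < text.size()) {
        const unsigned char lead = static_cast<unsigned char>(text[pos]);
        const std::size_t len = utf8SequenceLength(lead);
        assert(len <= 4);  // leadMask has exactly five entries

        bool ok = (len != 0) && (pos + len <= text.size());
        std::uint32_t codePoint = 0;
        if (ok) {
            codePoint = static_cast<std::uint32_t>(lead & leadMask[len]);
            for (std::size_t i = 1; i < len; ++i) {
                const unsigned char c = static_cast<unsigned char>(text[pos + i]);
                if ((c & 0xC0) != 0x80) { ok = false; break; }  // not 10xxxxxx
                codePoint = (codePoint << 6) | (c & 0x3F);      // append 6 bits
            }
        }

        if (ok) {
            out.push_back(codePoint);
            pos += len;
        } else {
            out.push_back(0xFFFD);  // replacement character
            pos += 1;
        }
    }
    return out;
}
```

For example, decodeUtf8("A\xC3\xA9") yields the two code points U+0041 and
U+00E9. Because continuation bytes never look like lead bytes, the same bit
test also lets you resynchronise from an arbitrary position in a buffer.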
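
Jani's translate() suggestion at the top of the thread could look roughly like
the following. This is only a sketch under assumptions: the Translator class,
its add() method and the in-memory table are hypothetical (the post shows
neither the real interface nor a file format), and translations are assumed to
be stored as UTF-8 in std::string. Replacing {0}-style placeholders byte by
byte is safe in UTF-8 because the bytes for '{' and '}' never occur inside a
multibyte sequence.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical translator: maps ASCII source strings to UTF-8 translations.
class Translator {
public:
    // Registers one translation. A real game would load these from a file.
    void add(const std::string& key, const std::string& utf8Text)
    {
        assert(!key.empty());
        table_[key] = utf8Text;
    }

    // Looks up `key` and replaces {0}, {1}, ... with the given arguments.
    // Falls back to the ASCII key itself when no translation is registered.
    // Placeholders without a matching argument are left untouched.
    std::string translate(const std::string& key,
                          const std::vector<std::string>& args = {}) const
    {
        const std::map<std::string, std::string>::const_iterator it = table_.find(key);
        const std::string& pattern = (it != table_.end()) ? it->second : key;

        std::string out;
        std::size_t i = 0;
        while (i < pattern.size()) {
            const std::size_t close =
                (pattern[i] == '{') ? pattern.find('}', i) : std::string::npos;
            if (close != std::string::npos && isIndex(pattern, i + 1, close)) {
                const std::size_t index =
                    std::stoul(pattern.substr(i + 1, close - i - 1));
                if (index < args.size()) {
                    out += args[index];
                    i = close + 1;
                    continue;
                }
            }
            out += pattern[i++];  // ordinary byte, copied unchanged
        }
        return out;
    }

private:
    // True if s[first, last) consists of one to three ASCII digits.
    static bool isIndex(const std::string& s, std::size_t first, std::size_t last)
    {
        if (first >= last || last - first > 3) return false;
        for (std::size_t i = first; i < last; ++i)
            if (s[i] < '0' || s[i] > '9') return false;
        return true;
    }

    std::map<std::string, std::string> table_;
};
```

With a made-up entry such as add("File {0} not found", "Datei {0} nicht
gefunden"), translate("File {0} not found", {"a.txt"}) returns "Datei a.txt
nicht gefunden". Keying the table on the ASCII source string means an
untranslated string still shows up as readable English instead of failing.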