Re: [GD-General] Unicode


wchar_t is platform dependent, but on the other hand you probably don't need
Unicode string literals at all. You can use 7-bit ASCII strings in the source
code and read the (Unicode) translations of those strings for each language
from a file. That way you can store the strings internally in whatever UTF
format you want and still use only 7-bit ASCII in source code, something like:
String translatedUnicodeString =
      translator->translate( "File {0} not found", filename );
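
A minimal sketch of what such a translator could look like, assuming C++11.
The Translator class, its add() method and the in-memory table are all
hypothetical names made up for illustration (this is not the catmother API);
a real one would fill the table from a per-language file and might return its
own String type instead of a UTF-8 std::string.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical sketch: maps 7-bit ASCII source strings to UTF-8 translations.
class Translator {
public:
    // Register one translation. A real implementation would load these
    // pairs from a language file instead.
    void add(const std::string& key, const std::string& utf8Text) {
        table_[key] = utf8Text;
    }

    // Look up `key` (falling back to the key itself when no translation
    // exists) and replace {0}, {1}, ... with the matching entry of `args`.
    // Placeholders without a matching argument are copied through unchanged.
    std::string translate(const std::string& key,
                          const std::vector<std::string>& args = {}) const {
        const auto it = table_.find(key);
        const std::string& text = (it != table_.end()) ? it->second : key;
        std::string out;
        for (std::size_t i = 0; i < text.size(); ++i) {
            if (text[i] == '{') {
                const std::size_t close = text.find('}', i + 1);
                bool ok = (close != std::string::npos) && (close > i + 1);
                std::size_t index = 0;
                for (std::size_t j = i + 1; ok && j < close; ++j) {
                    if (text[j] < '0' || text[j] > '9') {
                        ok = false;  // not a numeric placeholder
                    } else {
                        index = index * 10 +
                                static_cast<std::size_t>(text[j] - '0');
                    }
                }
                if (ok && index < args.size()) {
                    out += args[index];
                    i = close;  // skip past the closing brace
                    continue;
                }
            }
            out += text[i];
        }
        return out;
    }

private:
    std::unordered_map<std::string, std::string> table_;
};
```

With a table entry added for "File {0} not found", a call such as
translate("File {0} not found", {"a.txt"}) returns the translated text with
the file name substituted. The template is scanned in a single pass, so an
argument that itself contains "{1}" is not expanded again.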

For a simple UTF-8/16/32/ASCII-7 converter, have a look at the
http://catmother.sourceforge.net source package and its source file
lang/UTFConverter.cpp. It's very simple and doesn't handle special cases
such as malformed UTF data the way the Unicode standard recommends, but it
should serve as a simple encoder/decoder. For a more complete Unicode
implementation, ICU is the weapon of choice: a very complete and very
high-quality library, but also very heavyweight and probably overkill for a
typical game. (Just my opinion, of course.)
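
To give an idea of how small the encoding half of such a converter can be,
here is a rough sketch of a UTF-8 encoder for a single code point. This is not
the code from lang/UTFConverter.cpp; the function name appendUtf8 is made up
for illustration, and the only validation it does is to reject surrogates and
values above U+10FFFF.

```cpp
#include <cstdint>
#include <string>

// Append the UTF-8 encoding of code point `cp` to `out`.
// Returns false and appends nothing if `cp` is a surrogate (U+D800..U+DFFF)
// or lies beyond U+10FFFF.
bool appendUtf8(std::uint32_t cp, std::string& out) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return false;
    }
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx + 3 continuation bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return true;
}
```

For example, U+20AC (the euro sign) comes out as the three bytes E2 82 AC.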

Regards,
Jani

----- Original Message ----- 
From: "Paul Reynolds" <pa...@so...>
To: <gam...@li...>
Sent: Wednesday, November 19, 2003 9:01 PM
Subject: RE: [GD-General] Unicode

> I can't claim to be an expert by any means. I've just started digging into
> it all myself. The actual implementation of Unicode support is extremely
> compiler dependent
> (http://oss.software.ibm.com/icu/docs/papers/unicode_wchar_t.html). GCC and
> VC++ both have a data type declared wchar_t that you use for working with
> unicode strings. A string literal is declared with a leading 'L':
>
> const wchar_t* str = L"This is my fancy string";
>
> From what I understand so far, both compilers use a fixed size for all
> characters that is big enough to hold any code point. (GCC is 32-bit, and
> VC++ is 16-bit.) So pointer arithmetic and sizeof(wchar_t) are still
> reliable.
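
A quick way to see what a particular compiler does is to ask it. The sketch
below uses only standard headers and assumes C++11; the helper names wcharBits
and wcharHoldsAnyCodePoint are made up for illustration. One caveat: a 16-bit
wchar_t cannot represent code points above U+FFFF in a single unit, so on such
a platform the second helper returns false.

```cpp
#include <climits>
#include <cstddef>
#include <limits>

// Width of wchar_t in bits on the current platform.
constexpr std::size_t wcharBits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// True if a single wchar_t can hold every Unicode code point,
// i.e. any value up to U+10FFFF.
constexpr bool wcharHoldsAnyCodePoint() {
    return static_cast<unsigned long long>(
               std::numeric_limits<wchar_t>::max()) >= 0x10FFFFull;
}
```

Both helpers are constexpr, so they can also be used in a static_assert when
code depends on one particular width.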
>
> There's lots more info about the Unicode standard at
> http://www.unicode.org/standard/principles.html...
>
> "Character encoding standards define not only the identity of each character
> and its numeric value, or code point, but also how this value is represented
> in bits.
>
> The Unicode Standard defines three encoding forms that allow the same data
> to be transmitted in a byte, word or double word oriented format (i.e. in 8,
> 16 or 32-bits per code unit). All three encoding forms encode the same
> common character repertoire and can be efficiently transformed into one
> another without loss of data. The Unicode Consortium fully endorses the use
> of any of these encoding forms as a conformant way of implementing the
> Unicode Standard.
>
> UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of
> transforming all Unicode characters into a variable length encoding of
> bytes. It has the advantages that the Unicode characters corresponding to
> the familiar ASCII set have the same byte values as ASCII, and that Unicode
> characters transformed into UTF-8 can be used with much existing software
> without extensive software rewrites.
>
> UTF-16 is popular in many environments that need to balance efficient access
> to characters with economical use of storage. It is reasonably compact and
> all the heavily used characters fit into a single 16-bit code unit, while
> all other characters are accessible via pairs of 16-bit code units.
>
> UTF-32 is popular where memory space is no concern, but fixed width, single
> code unit access to characters is desired. Each Unicode character is
> encoded in a single 32-bit code unit when using UTF-32.
>
> All three encoding forms need at most 4 bytes (or 32-bits) of data for each
> character."
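
The "pairs of 16-bit code units" mentioned for UTF-16 are surrogate pairs. As a
rough sketch of how one code point maps to UTF-16 (the function name toUtf16
is made up for illustration, and an empty result is just one possible way to
signal invalid input):

```cpp
#include <cstdint>
#include <vector>

// Encode one code point as UTF-16 code units.
// Returns an empty vector for surrogates and values above U+10FFFF.
std::vector<std::uint16_t> toUtf16(std::uint32_t cp) {
    std::vector<std::uint16_t> units;
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return units;
    }
    if (cp < 0x10000) {
        // Basic Multilingual Plane: one code unit, same value.
        units.push_back(static_cast<std::uint16_t>(cp));
    } else {
        // Supplementary planes: split the 20-bit offset into a high
        // (lead) surrogate and a low (trail) surrogate.
        const std::uint32_t v = cp - 0x10000;
        units.push_back(static_cast<std::uint16_t>(0xD800 + (v >> 10)));
        units.push_back(static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)));
    }
    return units;
}
```

For example, U+1D11E (musical G clef) becomes the two units D834 DD1E, which
is 4 bytes, the same as its UTF-8 and UTF-32 forms.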
>
> -----Original Message-----
> From: gam...@li...
> [mailto:gam...@li...]On Behalf Of
> Garett Bass
> Sent: Wednesday, November 19, 2003 9:58 AM
> To: gam...@li...
> Subject: RE: [GD-General] Unicode
>
>
> Paul,
>
> It was after reading Joel's article that I understood Unicode to use an
> indeterminate number of bytes per character.  Specifically:
>
> "In UTF-8, every code point from 0-127 is stored in a single byte. Only code
> points 128 and above are stored using 2, 3, in fact, up to 6 bytes."
>
> Which leaves me wondering, how do you figure out where one character ends
> and the next begins?
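
In UTF-8 the answer is in the bytes themselves: the top bits of the first byte
of a sequence say how long the sequence is, and every following byte has the
form 10xxxxxx. A small sketch, with function names made up for illustration.
It covers sequences of up to 4 bytes, which is the limit set by RFC 3629 and
enough for every code point up to U+10FFFF (the original UTF-8 design allowed
up to 6), and it does not check for overlong or otherwise malformed input.

```cpp
#include <cstddef>
#include <string>

// Length in bytes of the UTF-8 sequence that starts with `lead`.
// Returns 0 if `lead` is a continuation byte (10xxxxxx) or not a
// lead byte of a 1 to 4 byte sequence.
std::size_t utf8SequenceLength(unsigned char lead) {
    if (lead < 0x80) return 1;            // 0xxxxxxx
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}

// Count the code points in a UTF-8 string by counting every byte
// that is not a continuation byte.
std::size_t countCodePoints(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

The same bit test lets you resynchronize from the middle of a string: step
backwards until you reach a byte that is not of the form 10xxxxxx.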
>
> Thanks in advance,
> Garett
>
>
> -----Original Message-----
> From: gam...@li...
> [mailto:gam...@li...]On Behalf Of
> Paul Reynolds
> Sent: Wednesday, November 19, 2003 11:31 AM
> To: gam...@li...
> Subject: RE: [GD-General] Feedback wanted on POSH
>
>
> This is a pretty good overview of text encoding*:
> http://www.joelonsoftware.com/articles/Unicode.html
>
> I'd say everyone working on a shipping game should really evaluate if raw
> char* strings are really a good idea. If you've ever had to localize a 7-bit
> ascii game, you'll know what I'm talking about. Other software industries
> have been embracing unicode for quite some time.
>
> * - For the record, I'm not a Joel Spolsky fanboy. I can usually take him or
> leave him. ;o)
>
> -----Original Message-----
> From: gam...@li...
> [mailto:gam...@li...]On Behalf Of
> Garett Bass
> Sent: Wednesday, November 19, 2003 9:13 AM
> To: gam...@li...
> Subject: RE: [GD-General] Feedback wanted on POSH
>
>
> // Crosbie Fitch wrote:
> // Hmmn maybe the chars should be like this:
>
> You will notice that POSH doesn't provide a char typedef, presumably because
> sizeof(char) == 1 in ANSI C, as mentioned in another post.  I imagine that
> defining your own integer character type will require an explicit cast
> anytime you want to use a string manipulation function, which seems a little
> awkward.  Of course, if you use C++ and STL, then you can always create a
> std::basic_string<char_utf8>, or whatever.
>
> // typedef char8 char_ascii; // Unsized char able to contain 7bit ASCII
> // typedef char8 char_utf8;  // Unsized char able to contain...
> // typedef char16 char_ucs2; // Unsized char able to contain...
>
> I'm not sure I understand what you mean by "Unsized" here.  If you're
> defining char8 to be uint8, then its size is 8 bits.
>
> // typedef char_utf8 char_unicode; // Unsized char suitable for Unicode
> // typedef char_unicode character; // Unsized char suitable for any text
>
> Not being too familiar with unicode, I find this confusing.  I thought that
> "Unicode" was a multibyte format with no set number of bytes per character,
> ie. a single asian character may be represented by four bytes while the
> subsequent character is represented by two.
>
> Regards,
> Garett
>
>
>
> -------------------------------------------------------
> This SF.net email is sponsored by: SF.net Giveback Program.
> Does SourceForge.net help you be more productive?  Does it
> help you create better code?  SHARE THE LOVE, and help us help
> YOU!  Click Here: http://sourceforge.net/donate/
> _______________________________________________
> Gamedevlists-general mailing list
> Gam...@li...
> https://lists.sourceforge.net/lists/listinfo/gamedevlists-general
> Archives:
> http://sourceforge.net/mailarchive/forum.php?forum_id=557