From: SourceForge.net <no...@so...> - 2007-09-20 19:38:25
Bugs item #1797418, was opened at 2007-09-18 14:47
Message generated for change (Comment added) made by qwertie
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=101645&aid=1797418&group_id=1645

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.

Category: csharp
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: David Piepgrass (qwertie)
Assigned to: William Fulton (wsfulton)
Summary: C#: wchar_t should be marshalled as UnmanagedType.U2

Initial Comment:
wchar_t should be marshalled as UnmanagedType.U2, not the default of 1 byte.

----------------------------------------------------------------------

>Comment By: David Piepgrass (qwertie)
Date: 2007-09-20 13:38

Message:
Logged In: YES
user_id=171344
Originator: YES

Unicode as 32-bit ints is inevitably wasteful even if you use the full
Unicode range, which is only 20 bits. Further, AFAIK all "living language"
characters fit in 16 bits. If wchar_t were only used to represent single
characters it would be fine, but for strings (wstring) it's very wasteful.
I guess it's better to use std::basic_string<unsigned short> rather than
std::wstring, although on Windows wchar_t may be better because a debugger
understands that it represents characters.

As for whether to use 16-bit or 8-bit strings, it's a no-win situation.
UTF-8 is inefficient for representing languages like Chinese, while UTF-16
is inefficient for European languages. And the minimum addressing boundary
of our computers is 8 bits, so I'm afraid 5-bit character strings are out :P

----------------------------------------------------------------------

Comment By: Olly Betts (olly)
Date: 2007-09-19 17:15

Message:
Logged In: YES
user_id=14972
Originator: NO

A 32 bit type is the narrowest available integer type which can hold the
full Unicode range, so I guess that's why it was chosen.

Unicode as wide characters inevitably is wasteful if you don't actually use
that range. Restricting to the BMP and using a 16 bit type is wasteful if
you only have English text. If you only want upper case letters and 6
other characters, you only need 5 bits, so 8 bits per character is
wasteful!

Anyway, using a plain int sounds reasonable to me, but I don't really know
the innards of C# - William's your man for that.

----------------------------------------------------------------------

Comment By: David Piepgrass (qwertie)
Date: 2007-09-19 09:15

Message:
Logged In: YES
user_id=171344
Originator: YES

Here's an idea: perhaps wchar_t should be marshalled as a plain int in the
PINVOKE class, and the two wrappers can convert between char and wchar_t on
each end.

----------------------------------------------------------------------

Comment By: David Piepgrass (qwertie)
Date: 2007-09-19 09:13

Message:
Logged In: YES
user_id=171344
Originator: YES

sizeof(wchar_t) is 4??? That's amazing to me. What a waste of memory. I'll
be sure not to call my wide characters "wchar_t" if I get around to coding
on Linux.

Unfortunately, U4 doesn't work on Win32; the .NET framework throws an
exception with a message saying 'char' can only marshal as U1, U2, I1 or
I2.

----------------------------------------------------------------------

Comment By: Olly Betts (olly)
Date: 2007-09-19 08:25

Message:
Logged In: YES
user_id=14972
Originator: NO

On Linux at least, sizeof(wchar_t) is 4, so U2 will truncate characters
outside the BMP. That's better than the current situation, but should this
actually be U4?

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=101645&aid=1797418&group_id=1645
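
For reference, the truncation and size trade-offs discussed in this thread can be illustrated with a small C++ sketch. This is not SWIG code; the helper names (truncate_to_u16, utf8_bytes, utf16_bytes, total_bytes) are hypothetical and exist only for illustration. It assumes the input is a sequence of valid Unicode code points and does not validate surrogate values.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: keep only the low 16 bits of a code point, which is
// what squeezing a 4-byte wchar_t into a 2-byte slot would do. Code points
// outside the BMP (above U+FFFF) lose their high bits.
std::uint16_t truncate_to_u16(char32_t cp) {
    return static_cast<std::uint16_t>(cp & 0xFFFF);
}

// Bytes needed to encode one code point in UTF-8.
std::size_t utf8_bytes(char32_t cp) {
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

// Bytes needed to encode one code point in UTF-16:
// one 16-bit unit inside the BMP, a surrogate pair outside it.
std::size_t utf16_bytes(char32_t cp) {
    return cp < 0x10000 ? 2 : 4;
}

// Total encoded size of a UTF-32 string under the given per-code-point cost.
std::size_t total_bytes(const std::u32string& s,
                        std::size_t (*per_cp)(char32_t)) {
    std::size_t n = 0;
    for (char32_t cp : s) n += per_cp(cp);
    return n;
}
```

With these helpers, total_bytes(U"abc", utf8_bytes) is 3 while total_bytes(U"abc", utf16_bytes) is 6, and a CJK code point such as U+4E2D costs 3 bytes in UTF-8 but 2 in UTF-16, which is the "no-win" trade-off described above. truncate_to_u16(0x1F600) yields 0xF600, showing how a character outside the BMP is mangled by a 16-bit type. The value of sizeof(wchar_t) itself is platform-dependent, as noted in the thread.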