From: Alexandre P. N. <ale...@gm...> - 2015-06-30 17:36:45
Grasp this sentence: implementation dependent. wchar_t is a wide character type: it exists so that a single character unit can be wider than one byte. It was created before UTF-8 went mainstream, and because there were competing encodings at the time (fixed-length UCS-2 versus what ended up being UTF-16, with its variable-length encoding), the standardization committees didn't pick one.

UTF-8 is compatible with C's encoding of strings (NUL termination), but it uses a variable-length encoding, so strlen(buffer) doesn't work for UTF-8 the way it does for ASCII: it counts bytes, not characters. Otherwise UTF-8 is a superset of ASCII, like most of the "national" encodings Microsoft uses for its routines ending in "A" (as opposed to "W", for "wide"). Only UTF-32 has a fixed-length encoding (I overheard people saying that's not exactly true even there, because it has unrepresentable code points, but I never confirmed it).

Microsoft started with a fixed-length 2-byte encoding (UCS-2, IIRC), which was more or less a subset of UTF-16, but replaced it with UTF-16 (with bugs being slowly fixed over the years), because by the time UTF-16 was standardized people had already realized that 16 bits don't cover every symbol in every language; thus in UTF-16 a single visible character (a symbol) can take more than one 16-bit code unit. It was natural for Microsoft to define MSVC's wchar_t as 16 bits, since its revamped UCS-2 (now UTF-16) API supported that natively, but many other platforms define wchar_t as UTF-32.

Nowadays you have standardized converters between these encodings even in the C++ library, but conversion is expensive, and so is storage, so whether you convert sometimes depends on the data: long English plain text uses the same amount of space in UTF-8 as it would in plain ASCII, and in fact it would be byte-for-byte identical, since UTF-8 is an ASCII superset.
I feel the pain, trust me: I have a program written in C++ that interfaces with Windows (via 16-bit UTF-16), a huge third-party code base using UTF-32, and GTK (which uses UTF-8 for everything). If it weren't so easy (well, it isn't, but you get there over time) to use type safety in C++, I would be passing the wrong string type around more times than I could count. I ended up using a pre-C++11 converter to adapt the strings on demand, but today it would be easier.

Btw, I always swim against the mainstream, and when programming Windows I normally follow this advice (among other things): http://utf8everywhere.org/ But I *can't* and I *won't* suggest that you or anyone else do the same blindly; there are reasons for and against these tips. For me it was a win, but I can see why some people/organizations would pay a price that is too high for an unworthy return.

On Tue, 30 Jun 2015 at 14:04, LRN <lr...@gm...> wrote:

> On 30.06.2015 19:44, pa...@ar... wrote:
> > I have been reading that wchar_t, and therefore wstring, is neither
> > UTF-8 nor a UTF-16 character set. So, what is wstring good for then?
>
> Whether it's UTF-16 or UCS-2 depends on the implementation of the library
> that handles wstring.
>
> Sources, which I can't remember right now, claim that MS libraries were
> UCS-2 initially, then later quietly converted to UTF-16 under the hood.