
UTF-8 and wchar_t

2014-11-06
  • Martin Törnros

    Martin Törnros - 2014-11-06

    Hi,

    I'm using cJSON to parse a UTF-8 (without BOM) JSON file in a VS2012 C++ project. However, I am not able to parse the full character set properly, e.g. Cyrillic characters (above decimal code point 256) are parsed as decimal_code_point%256.

    To me, it seems that the char only stores 8 bits, hence it can't store the full character set. Can I specify the cJSON library to return "larger characters", e.g. char16_t or wchar_t? Or perhaps change a setting in VS2012 to always use 16-bit characters?

    I have tried setting the VS2012 Project Character Set to Unicode, Multi-Byte and Not Set. I have also tried saving the involved .h and .cpp files as Unicode, Codepage 65001. None of this seems to affect the parsing.

    Thanks for your help!

    Best regards,
    Martin

     

    Last edit: Martin Törnros 2014-11-06
  • Dave Gamble

    Dave Gamble - 2014-11-06

    I'm using cJSON to parse a UTF-8 (without BOM) JSON file in a VS2012 C++ project.

    You basically can't have a BOM in UTF-8 JSON. You're just not entitled to it, and it doesn't mean anything. It would be injected garbage. As you probably know, the very legitimacy of a BOM in UTF-8 is heavily contested.

    However, I am not able to parse the full character set properly, e.g. Cyrillic characters (above decimal code point 256) are parsed as decimal_code_point%256.

    Ok, so what I think you're misunderstanding is really:
    1. How Unicode works
    2. What a JSON parser's feasible level of involvement in the process is.

    You have a fair bit of reading ahead, but let me save you some time.

    1. All JSON is UTF-8. This is mandated by specification. If it's not UTF-8, it's not JSON.
    2. All strings received by a JSON parser will be presented to you, the user, as UTF-8 strings (there's a rough sketch of this just after this list).
    3. Your specific application appears to require conversion from UTF-8 to, say, UTF-16. Whilst you might not realise it, this is an entirely arbitrary conversion that makes sense to you, but not in any objective specified sense.
    4. Conversion from UTF-8 to UTF-16 is not a job for a JSON parser; it's a job for a Unicode library, because it's actually quite involved to do properly. There ARE quick hacks for this, and if your particular character set falls inside the range where the hack works, then good for you. But it's a hack, nothing more. The full spec is complex.
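
    Here's that sketch (illustration only: the "name" field and the little test program around it are invented, and error checks are omitted). The point is that cJSON hands you raw UTF-8 bytes in a plain char*:

        #include <stdio.h>
        #include <string.h>
        #include "cJSON.h"

        int main(void)
        {
            /* "\u0414" is Cyrillic capital De. cJSON decodes the escape into the
               two UTF-8 bytes 0xD0 0x94; a raw UTF-8 De in the file comes out
               exactly the same way.                                             */
            const char *json = "{\"name\":\"\\u0414\"}";
            cJSON *root = cJSON_Parse(json);
            cJSON *name = cJSON_GetObjectItem(root, "name");
            const unsigned char *p;

            /* One glyph, but two bytes: the value is UTF-8, not "characters". */
            printf("bytes: %u\n", (unsigned)strlen(name->valuestring));
            for (p = (const unsigned char *)name->valuestring; *p; p++)
                printf("0x%02X ", *p);
            printf("\n");

            cJSON_Delete(root);
            return 0;
        }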

    On the plus side, since you're using Windows, you can just call MultiByteToWideChar, which IS a UTF-8 to UTF-16 converter. The library you need here is built into Windows.

    http://msdn.microsoft.com/en-us/library/windows/desktop/dd319072(v=vs.85).aspx
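
    The usual shape is two calls: one to ask how many wchar_t's you need, one to do the conversion. Rough sketch only - the helper name utf8_to_utf16 is made up and the error handling is minimal:

        #include <windows.h>
        #include <stdlib.h>

        /* Sketch: convert a NUL-terminated UTF-8 string into a freshly
           malloc'd UTF-16 (wchar_t) string. Caller frees the result.    */
        static wchar_t *utf8_to_utf16(const char *utf8)
        {
            wchar_t *utf16;
            /* Passing -1 as the source length makes the terminating NUL
               part of the size MultiByteToWideChar reports.             */
            int len = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);
            if (len <= 0)
                return NULL;

            utf16 = (wchar_t *)malloc(len * sizeof(wchar_t));
            if (!utf16)
                return NULL;

            /* Second call actually writes the UTF-16 data. */
            if (MultiByteToWideChar(CP_UTF8, 0, utf8, -1, utf16, len) == 0) {
                free(utf16);
                return NULL;
            }
            return utf16;
        }

    Feed it whatever valuestring cJSON gave you, pass the result to whichever wide (W-suffixed) API you're targeting, and free() it when you're done.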

    To me, it seems that the char only stores 8 bits,

    You're absolutely right. It's a UTF-8 implementation. As specified.
    And a char is 8 bits. That's kinda a thing.

    hence it can't store the full character set.

    HAHAHAHAHAHAHA!!!! Nor can a wchar! Unicode has over a MILLION codepoints! Try getting that into a single 16-bit wchar!
    Seriously, you should actually learn how Unicode works. If you're using Unicode for something serious, you will need to understand it. And if you're going to be using UTF-16, you'll need to understand surrogates, or nothing will make any sense.

    http://www.joelonsoftware.com/articles/Unicode.html

    Read that ^. That's the ABSOLUTE MINIMUM you need to know. After that, digest all of the Wikipedia coverage.
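
    If you want a quick feel for what a surrogate pair is, here's a toy worked example (illustration only, nothing more):

        #include <stdio.h>

        int main(void)
        {
            /* A codepoint above U+FFFF can't fit in one 16-bit unit, so
               UTF-16 splits it across two "surrogate" units.             */
            unsigned int cp = 0x1F600;              /* GRINNING FACE emoji */
            unsigned int v  = cp - 0x10000;         /* 20 bits: 0x0F600    */
            unsigned int hi = 0xD800 + (v >> 10);   /* high:    0xD83D     */
            unsigned int lo = 0xDC00 + (v & 0x3FF); /* low:     0xDE00     */

            /* One glyph, two 16-bit units. */
            printf("U+%X -> 0x%04X 0x%04X\n", cp, hi, lo);
            return 0;
        }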

    Can I specify the cJSON library to return "larger characters", e.g. char16_t or wchar_t?

    NO. You need UTF-16 encoded data. That means you need a UTF-8 to UTF-16 converter. It's FAR more complicated than you imagine.

    Or perhaps change a setting in VS2012 to always use 16-bit characters?

    NO. Your data is in the wrong format for what you want. JSON +IS+ UTF-8. This library lives and breathes UTF-8. Everything is correct. You just don't realise that you need a conversion step.

    I have tried setting the VS2012 Project Character Set to Unicode, Multi-Byte and Not Set. I have also tried saving the involved .h and .cpp files as Unicode, Codepage 65001. None of this seems to affect the parsing.

    No, of course not. Though the idea of changing the encoding of the source code is unintentionally hilarious, and needs to go on a website somewhere.

    Dave.

     

    Last edit: Dave Gamble 2014-11-06
  • Martin Törnros

    Martin Törnros - 2014-11-06

    Thanks Dave for your informative - but slightly arrogant - answer. The conversion from UTF-8 to UTF-16 works, thanks!

    I was a bit confused by the following statement, that the char holds unsigned 16-bit (2-byte) code points (http://msdn.microsoft.com/en-us/library/7sx7t66b(v=vs.110).aspx), which made me believe that I could - somewhere - define VS2012 to always use 16 bits in a char, instead of 8.

    However, I am still slightly confused about the following: UTF-8 is variable length with 8-bit code units, which is - if I understand things right - why it can encode characters of up to 4 bytes and the complete Unicode set. I understand that I need a conversion to e.g. wchar_t for my Cyrillic characters, but I don't understand why a conversion to UTF-16 is necessary, if UTF-8 can in fact encode 4 bytes.
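
    For example - if I've got this right - a Cyrillic Д (U+0414) is the two UTF-8 bytes 0xD0 0x94, and after the MultiByteToWideChar conversion it becomes the single 16-bit unit 0x0414, so both seem to encode the same code point.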

    I already read the articles that you've mentioned (thanks!) but will do so again now that things are slightly clearer.

    Best,
    Martin

     
  • Dave Gamble

    Dave Gamble - 2014-11-06

    Thanks Dave for your informative - but slightly arrogant - answer.

    I wasn't being arrogant, I was being rude.
    I wasn't self-aggrandising, I was chastising you for not having done enough research on your own before submitting a support request.

    Showing my age, evidently, but the internet once had an etiquette that one would reach for support only once one had achieved a detailed understanding of the topic under discussion and remained stuck.

    The conversion from UTF-8 to UTF-16 works, thanks!

    Ace!

    I was a bit confused by the following statement, that the char holds unsigned 16-bit (2-byte) code points (http://msdn.microsoft.com/en-us/library/7sx7t66b(v=vs.110).aspx)

    The title of that page is "Char Data Type (Visual Basic)". Again, I am chastising you, not being arrogant.

    However, I am still slightly confused about the following: UTF-8 is variable length with 8-bit code units, which is - if I understand things right - why it can encode characters of up to 4 bytes and the complete Unicode set.

    Correct.

    I understand that I need a conversion to e.g. wchar_t for my Cyrillic characters, but I don't understand why a conversion to UTF-16 is necessary, if UTF-8 can in fact encode 4 bytes.

    Well, this is a very reasonable question, and not one I'm an expert in, but I'll tell you what I believe to be true.

    Historically we went from 7-bit ASCII to codepages, but Windows as a platform was expanding into territories where an 8-bit character set was totally inadequate. Rather than have to handle the possibility that strlen() != num_glyphs, the direction they went was a 16-bit wchar, which gave them scope to handle a lot more cases, particularly far-eastern glyph sets. The Unicode Consortium assembled in recognition that the 16-bit wchar strategy wasn't a workable long-term, universal solution. So they built up Unicode as a spec around codepoints, and then did their best to provide a range of encodings that could be integrated with what was actually around. It did take quite a few years before Unicode became the universal, undisputed standard for the expression of glyphs (and, in truth, there's still plenty of other encoded text around).

    It seems (and this is a question for a proper expert) that Microsoft's uptake of UTF-16 support evolved alongside UTF-16 itself, and the layout of the glyphs attempts to correspond, to some extent, with what Microsoft had already implemented with wchars. Now, there's a critical difference between UTF-16 and wchar, which is that in UTF-16, num_glyphs does not strictly equal strlen (surrogates), whilst I believe that early Microsoft implementations took every wchar as a glyph. So there was a limited set of glyphs, but it was a massive improvement on trying to get by with ASCII! I think the mindset is that, even to this day, it's /largely/ true that num_glyphs == strlen, although there are "rare" exceptions. So you can 'imagine' wchar as just being a char that handles "all" the characters. This is a dangerous and leaky abstraction, of course, but it'll probably get you through 99% of cases.

    So, given that Microsoft /had/ support for something that looked like a wchar, and that it allowed 99% of stuff to work OK, they probably kept Unicode support in the Win32 API as UTF-16 only because that's where the code mostly already was.

    TL;DR: Win32 doesn't support UTF-8. That's why you have to convert to UTF-16, which it does support, for what seem to be historical reasons.

     

    Last edit: Dave Gamble 2014-11-06
