Proper way of using Unicode?

2011-11-08
2013-06-12
  • Stephen Chu
    2011-11-08

    I just started experimenting with libjson and have some questions on how to properly use Unicode strings.

    The strings in my project are in 16-bit UniChar (Apple's term) array format, with functions to convert to and from UTF-8 encoded data. When I enable JSON_UNICODE, all the interfaces require wchar_t strings. My problem is that with Xcode on OS X, wchar_t is a 32-bit type, which is incompatible with my data. I can certainly convert my strings before passing them to libjson and after getting them back. But that's quite a performance hit, since the code does this A LOT.

    The other issue is that when JSON_UNICODE is enabled, both write functions produce strings of 4-byte characters, which I need to convert back to 8-bit strings before sending them.

    Is there a way to use 8-bit, UTF-8 strings with libjson directly? I tried disabling JSON_UNICODE, but then it pretty much just encodes each non-ASCII byte as \u(BYTEVALUE), which causes the other end to treat it as a 16-bit BMP character. And that is of course incorrect.

    Thanks.

  • Hello,

    Yes, there is a way to use your UTF strings in libjson directly.  json_string (the string type used everywhere inside libjson) is a typedef that can be altered by changing the JSON_STRING_HEADER option.  This allows you to declare your own string type.

    My recommendation is to create a wrapper class around your 16-bit UniChar array.  As long as it conforms to the same interface as std::string, your class will work.  You do not need to implement the full string interface (it's huge); you just need the subset that libjson uses.  You can find an example of a custom string in TestSuite/StringTest.h.  That is a fully custom string, and the TestSuite includes a string-testing section you can use to make sure everything is coded correctly.
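    Just to make the wiring concrete, the header you point JSON_STRING_HEADER at might look roughly like the sketch below.  This is only a sketch: the file name, the json_char16 typedef, and the choice of std::basic_string over a 16-bit code unit are placeholders (and assume your standard library tolerates char_traits for that type); a hand-written wrapper around your own UniChar buffers works the same way, and you should check the PDF's JSON_STRING_HEADER documentation and TestSuite/StringTest.h for any extra typedefs or macros the header is expected to supply.

        // my_json_string.h -- illustrative JSON_STRING_HEADER header
        #ifndef MY_JSON_STRING_H
        #define MY_JSON_STRING_H

        #include <string>

        // UniChar is a 16-bit code unit on Apple platforms.  basic_string
        // already exposes the std::string-style member functions libjson
        // calls; TestSuite/StringTest.h shows the subset a fully custom
        // class would need to provide.
        typedef unsigned short json_char16;                  // stands in for UniChar
        typedef std::basic_string<json_char16> json_string;

        #endif

        // JSONOptions.h would then point at it, e.g.:
        //   #define JSON_STRING_HEADER "my_json_string.h"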

    If you need assistance doing this, let me know; I will surely help you.

  • Stephen Chu
    2011-11-09

    Thanks for the help.

    I understand I can use whatever char type for the string. But the problem is that I need to transmit and receive 8-bit, UTF-8 encoded data. In fact, now that I think of it, my internal representation of the text doesn't really matter. I can switch to 16-bit chars with libjson, but that will require the input to the parser to be a 16-bit string, and the output will also be 16-bit. Both would require conversion on the way in and out of my code. It's not that bad, but we are dealing with some quite sizeable data very frequently.

    What I really need is to use 8-bit, UTF-8 encoded strings with libjson. Right now, when using std::string as the json_string type, libjson has problems with UTF-8 data. For example, if I assign a UTF-8 string that contains non-ASCII chars to a node, it will later write those chars with "\uXXXX" encoding, treating individual UTF-8 bytes as if they were single Unicode code points.

    Take, for example, the character Ü. Its UTF-8 sequence is \xC3\x9C. When assigning a string containing this sequence to a libjson node, the resulting output turns into "\u00C3\u009C", which is Ã followed by a non-printable character. Maybe the writer could just leave these characters as they are?

    Again, thanks for the help. I really appreciate it.

  • Oh, yes, you can make libjson not escape Unicode characters.  Doing so breaks the JSON standard, but as long as you don't give the output to servers that expect proper JSON, it will work fine.  Go into JSONOptions.h and comment out the JSON_ESCAPE_WRITES option.  This will leave your Unicode as it is, instead of escaping it.
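    For reference, the relevant line in JSONOptions.h looks roughly like this once it is commented out (the exact wording of the surrounding comments varies between libjson versions):

        // JSONOptions.h (excerpt)
        // With JSON_ESCAPE_WRITES disabled, the writer leaves characters
        // as they are instead of escaping them on output.
        //#define JSON_ESCAPE_WRITES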

  • lijo
    2012-04-24

    Hi,

    I am also facing the same issue.

    I am using libjson to construct a JSON string, which I am sending to a server that expects UTF-8 encoded strings. My input string to libjson is also encoded in UTF-8 (of type std::string). To properly encode the non-ASCII characters in the output JSON string, I have commented out JSON_ESCAPE_WRITES in JSONOptions.h.

    But due to this, special chars like '\n' are not escaped in the JSON string, and the server is rejecting it.
    Is there any way to solve this, so that the Unicode chars are properly encoded and the special chars are still escaped? Converting to wstring would be very costly, as I would also have to convert back to UTF-8 when sending to the server.

    Any help would be highly appreciated.
    Thanks,
    lijo

  • You will have to inject your own string class into libjson.  libjson has an option called JSON_STRING_HEADER; define it to be a header in which you define json_string to be whatever you want.  If you define it to be a UTF-8 string, libjson will use it both internally and in the interface.  It just has to implement the correct interface (a subset of STL std::string).  There is documentation for the string interface in the pdf under the JSON_STRING_HEADER section.
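    As a rough starting point, the header named by JSON_STRING_HEADER could hold a thin UTF-8 string class along these lines.  The class name Utf8String and the handful of members shown are purely illustrative; the full member subset is the one documented in the pdf and exercised by TestSuite/StringTest.h.

        // utf8_json_string.h -- sketch of a custom UTF-8 json_string
        #ifndef UTF8_JSON_STRING_H
        #define UTF8_JSON_STRING_H

        #include <string>

        class Utf8String {
        public:
            typedef std::string::size_type size_type;

            Utf8String() {}
            Utf8String(const char * s) : data_(s) {}
            Utf8String(const std::string & s) : data_(s) {}

            size_type length() const   { return data_.length(); }
            bool empty() const         { return data_.empty(); }
            void clear()               { data_.clear(); }
            const char * c_str() const { return data_.c_str(); }
            Utf8String & operator+=(const Utf8String & other) { data_ += other.data_; return *this; }
            bool operator==(const Utf8String & other) const   { return data_ == other.data_; }
            // ...plus append, substr, find, iterators, and the rest of the
            //    subset listed in the pdf's JSON_STRING_HEADER section.

        private:
            std::string data_;   // UTF-8 encoded bytes
        };

        typedef Utf8String json_string;

        #endif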

  • lijo
    2012-04-25

    Thank you very much for the prompt reply. I will try this option.

  • beniz
    2013-04-07

    Hi, does someone have a UTF-8 compatible string class to plug into libjson? This is a definite blocker for this otherwise good library. Thanks in advance!