#22 Bug in UnfixString and sub calls related to utf-8

closed-works-for-me
nobody
None
5
2012-03-10
2011-12-06
Anonymous
No

Hello.

Here is a snippet of code from json_string JSONWorker::UnfixString(const json_string & value_t, bool flag) json_nothrow {

if (json_unlikely(((json_uchar)(*p) < 32) || ((json_uchar)(*p) > 126))){
res += toUTF8((json_uchar)(*p));
} else {

Description of bug.
Assume you have a UTF-8 string of one symbol (to simplify the example) whose binary representation is D1 8F (the letter "я"). Then (json_uchar)(*p) > 126 is true, because 0xD1 = 209 in decimal. Instead of consuming the necessary number of bytes (2 bytes in this particular case), the code reads the string byte by byte, and the result is \u00D1\u008F instead of \u044F. Also, please check json_string JSONWorker::toUTF8(json_uchar p).
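To make the problem concrete, here is a small standalone sketch that mimics the byte-by-byte escaping described above. It is not libjson's code: escape_per_byte is a hypothetical helper written only for this illustration, and it assumes that each out-of-range byte is written as its own \u00XX escape.

```cpp
#include <cstdio>
#include <string>

// Hypothetical stand-in for the per-byte escaping described in this report.
// Every byte outside the printable ASCII range 32..126 gets its own \u00XX
// escape, so a multi-byte UTF-8 sequence is split into several escapes.
std::string escape_per_byte(const std::string& in) {
    std::string out;
    for (unsigned char c : in) {
        if (c < 32 || c > 126) {
            char buf[7];  // "\uXXXX" plus the terminating NUL
            std::snprintf(buf, sizeof buf, "\\u%04X", static_cast<unsigned>(c));
            out += buf;
        } else {
            out += static_cast<char>(c);
        }
    }
    return out;
}
```

Calling escape_per_byte("\xD1\x8F") returns two escapes, one per byte, rather than the single escape \u044F that the two bytes encode together.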
There are many ways to fix it.
The simplest one is to change it to
#ifndef JSON_UNICODE
if (json_unlikely(((json_uchar)(*p) < 32) || ((json_uchar)(*p) > 126))){
res += toUTF8((json_uchar)(*p));
} else {
#endif
res += *p;
#ifndef JSON_UNICODE
}
#endif
but I'm not sure that it's correct. Anyway, the JSON format allows UTF-8 inside quoted strings (it's the default encoding, see http://www.ietf.org/rfc/rfc4627). If 8-bit characters are used, then I guess there is no reason to escape them.
Other options are to use third-party libraries (ICU, UTF8CPP, MultiByteToWideChar on Windows) or to implement the conversion yourself, transforming the string into an escaped string like \uXXXX.... if that's required.
BTW, toUTF8 does not actually convert a string to UTF-8 form; it converts it to a string of UTF-16 code units.
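As a rough illustration of the "implement it yourself" option, the sketch below decodes each UTF-8 sequence to a code point before escaping it, and writes code points above U+FFFF as a UTF-16 surrogate pair. The names decode_utf8, append_u and escape_non_ascii are hypothetical and not part of libjson. The decoder is deliberately lenient: malformed input becomes U+FFFD, and overlong encodings are not rejected. Quotes and backslashes are not escaped here; a real JSON writer would have to handle those as well.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Decode one UTF-8 sequence starting at s[i] and advance i past it.
// Returns U+FFFD for malformed or truncated input (lenient sketch).
static std::uint32_t decode_utf8(const std::string& s, std::size_t& i) {
    const unsigned char b0 = static_cast<unsigned char>(s[i++]);
    if (b0 < 0x80) return b0;                       // plain ASCII
    int extra;
    std::uint32_t cp;
    if ((b0 & 0xE0) == 0xC0)      { extra = 1; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { extra = 2; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { extra = 3; cp = b0 & 0x07; }
    else return 0xFFFD;                             // stray continuation or invalid lead byte
    for (int k = 0; k < extra; ++k) {
        if (i >= s.size()) return 0xFFFD;           // truncated sequence
        const unsigned char b = static_cast<unsigned char>(s[i]);
        if ((b & 0xC0) != 0x80) return 0xFFFD;      // not a continuation byte
        cp = (cp << 6) | (b & 0x3F);
        ++i;
    }
    return cp > 0x10FFFF ? 0xFFFD : cp;
}

// Append one 16-bit unit as a \uXXXX escape.
static void append_u(std::string& out, std::uint32_t unit) {
    char buf[7];  // "\uXXXX" plus the terminating NUL
    std::snprintf(buf, sizeof buf, "\\u%04X", static_cast<unsigned>(unit));
    out += buf;
}

// Escape everything outside printable ASCII (32..126) as \uXXXX,
// working on whole code points instead of single bytes.
std::string escape_non_ascii(const std::string& utf8) {
    std::string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        std::uint32_t cp = decode_utf8(utf8, i);
        if (cp >= 32 && cp <= 126) {
            out += static_cast<char>(cp);
        } else if (cp < 0x10000) {
            append_u(out, cp);
        } else {
            cp -= 0x10000;                          // split into a surrogate pair
            append_u(out, 0xD800 + (cp >> 10));
            append_u(out, 0xDC00 + (cp & 0x3FF));
        }
    }
    return out;
}
```

With this approach the two bytes D1 8F are combined first (0x11 << 6 | 0x0F = 0x044F), so escape_non_ascii("\xD1\x8F") gives \u044F.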

Best regards,
Sergey

Discussion

  • Looking into it over the next few days.

     
  • libjson may not have any external dependencies. C++ lacks a native UTF string class; however, libjson has the option of plugging your own into it. See the JSON_STRING_HEADER option documentation. And I've tested a string with the я character; it seems to be both read and written correctly.

     
    • status: open --> open-works-for-me
     
    • status: open-works-for-me --> closed-works-for-me