Rob Wissmann - 2013-03-31

Hello,

I am using libjson for a project and I have a few questions/concerns.

First, about thread safety.  I noticed in the docs that your preferred method of achieving thread safety is the JSON_MUTEX_CALLBACKS mechanism.  It was not entirely clear to me where the thread safety problems actually arise (is it only concurrent access to a contended object like a single node, or is it every object, because the ref counting mechanism shares memory across threads?), so rather than implement those callbacks I disabled JSON_REF_COUNT.  Is that sufficient to guarantee thread safety for all of my usage of libjson?  Everything seems fine so far in testing, but we haven't really hammered on it.
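For concreteness, here is a minimal sketch of how I am sharing parsed data between threads.  It assumes that with JSON_REF_COUNT disabled a JSONNode copy is a deep copy and therefore private to the thread that holds it; that assumption is exactly what I would like confirmed.  The worker function and the example data are mine, not part of libjson.

    #include <libjson.h>
    #include <thread>
    #include <iostream>

    // Each worker receives the node BY VALUE.  With JSON_REF_COUNT off I am
    // assuming the copy is deep, so the thread only ever mutates its own data.
    static void worker(JSONNode node)
    {
        node.push_back(JSONNode("worker", "was here"));  // hypothetical field, for illustration
        std::cout << node.write() << std::endl;          // output may interleave; harmless here
    }

    int main()
    {
        // JSON_UNICODE is off in this build, so json_string is plain std::string.
        JSONNode root = libjson::parse("{\"name\":\"example\",\"value\":42}");

        std::thread a(worker, root);   // copies are made here, before the threads run
        std::thread b(worker, root);
        a.join();
        b.join();
        return 0;
    }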

Secondly, about unicode.  I have observed that turning on JSON_UNICODE changes json_char from char to wchar_t (I'm not sure whether it does anything else).  wchar_t is a poor container type for unicode data because its size varies from system to system: it is 16 bits on some platforms and 32 bits on others.  This means that programs written against libjson with JSON_UNICODE turned on will not be portable.  It would be better to pick a container type with a fixed width and a specific unicode encoding to go in it, e.g. char with UTF-8 or uint16_t with UTF-16.  Users of libjson would then be guaranteed that everything works portably as long as they provide properly encoded unicode in the matching container.
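To illustrate what I mean (the JSON_STORE_UTF16 macro below is made up for this sketch; it is not an existing libjson option), the code-unit type could be fixed regardless of platform:

    #include <stdint.h>

    // Hypothetical compile-time choice of storage encoding.  The point is that
    // the code-unit width never depends on the platform's wchar_t.
    #if defined(JSON_STORE_UTF16)
        typedef uint16_t json_char;     // UTF-16 code units: 16 bits everywhere
    #else
        typedef char json_char;         // UTF-8 code units: 8 bits everywhere
    #endif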

If you wanted to get fancy, it's possible to auto-detect the encoding of the data that the user provides.  You could set libjson up to accept a buffer through a type-agnostic pointer (like void *), autodetect the unicode encoding, and never transcode to a different format.  Users would get data back in the same format they provided it in.
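The detection itself is cheap; section 3 of RFC 4627 (quoted below) spells out the null-byte pattern in the first four octets.  A rough sketch, with names of my own choosing:

    #include <cstddef>

    enum json_encoding { ENC_UTF8, ENC_UTF16BE, ENC_UTF16LE, ENC_UTF32BE, ENC_UTF32LE };

    // Maps the null pattern of the first four octets to an encoding, per RFC 4627.
    static json_encoding detect_encoding(const unsigned char *p, std::size_t len)
    {
        if (len < 4) return ENC_UTF8;  // too short to tell; UTF-8 is the RFC default
        if (p[0] == 0 && p[1] == 0 && p[2] == 0 && p[3] != 0) return ENC_UTF32BE;
        if (p[0] == 0 && p[1] != 0 && p[2] == 0 && p[3] != 0) return ENC_UTF16BE;
        if (p[0] != 0 && p[1] == 0 && p[2] == 0 && p[3] == 0) return ENC_UTF32LE;
        if (p[0] != 0 && p[1] == 0 && p[2] != 0 && p[3] == 0) return ENC_UTF16LE;
        return ENC_UTF8;               // no nulls among the first four octets
    }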

JSON_ESCAPE_WRITES is unnecessarily limiting.  From what I have gathered from this thread, https://sourceforge.net/projects/libjson/forums/forum/1119661/topic/4800600, escapes are either all on or all off: if JSON_ESCAPE_WRITES is on, unicode characters and all control characters get escaped; if it's off, nothing gets escaped.  In that thread you state that not escaping unicode characters breaks the json standard.  This is not true.  From the json rfc (http://www.ietf.org/rfc/rfc4627.txt?number=4627):

2.5.  Strings

   The representation of strings is similar to conventions used in the C
   family of programming languages.  A string begins and ends with
   quotation marks.  All Unicode characters may be placed within the
   quotation marks except for the characters that must be escaped:
   quotation mark, reverse solidus, and the control characters (U+0000
   through U+001F).

   Any character may be escaped.  If the character is in the Basic
   Multilingual Plane (U+0000 through U+FFFF), then it may be
   represented as a six-character sequence: a reverse solidus, followed
   by the lowercase letter u, followed by four hexadecimal digits that
   encode the character's code point.  The hexadecimal letters A through
   F can be upper or lowercase.  So, for example, a string containing
   only a single reverse solidus character may be represented as
   "\u005C".

   Alternatively, there are two-character sequence escape
   representations of some popular characters.  So, for example, a
   string containing only a single reverse solidus character may be
   represented more compactly as "\\".

   To escape an extended character that is not in the Basic Multilingual
   Plane, the character is represented as a twelve-character sequence,
   encoding the UTF-16 surrogate pair.  So, for example, a string
   containing only the G clef character (U+1D11E) may be represented as
   "\uD834\uDD1E".

         string = quotation-mark *char quotation-mark

         char = unescaped /
                escape (
                    %x22 /          ; "    quotation mark  U+0022
                    %x5C /          ; \    reverse solidus U+005C
                    %x2F /          ; /    solidus         U+002F
                    %x62 /          ; b    backspace       U+0008
                    %x66 /          ; f    form feed       U+000C
                    %x6E /          ; n    line feed       U+000A
                    %x72 /          ; r    carriage return U+000D
                    %x74 /          ; t    tab             U+0009
                    %x75 4HEXDIG )  ; uXXXX                U+XXXX

         escape = %x5C              ; \

         quotation-mark = %x22      ; "

         unescaped = %x20-21 / %x23-5B / %x5D-10FFFF

3.  Encoding

   JSON text SHALL be encoded in Unicode.  The default encoding is
   UTF-8.

   Since the first two characters of a JSON text will always be ASCII
   characters, it is possible to determine whether an octet
   stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking
   at the pattern of nulls in the first four octets.

           00 00 00 xx  UTF-32BE
           00 xx 00 xx  UTF-16BE
           xx 00 00 00  UTF-32LE
           xx 00 xx 00  UTF-16LE
           xx xx xx xx  UTF-8

Emphasis mine.  The RFC requires escaping only the quotation mark, the reverse solidus, and the control characters U+0000 through U+001F; every other unicode character may appear unescaped, in any of the listed encodings.  As such, there ought to be a way to turn off unicode escapes but leave the required escapes turned on.

It would be more appropriate to break JSON_ESCAPE_WRITES into two options.  One option could control \u-escaping of unicode characters (since unescaped unicode is perfectly valid).  The other option could control escaping of the characters the json rfc says must be escaped (quotation mark, reverse solidus, and the control characters U+0000 through U+001F), although turning that one off would be a strange choice since it could produce non-conforming json.
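To make it concrete, here is roughly the writer behaviour I am after with the unicode option off and the required escapes on.  The function is my own sketch, not libjson code, and it assumes the input is already valid UTF-8:

    #include <cstddef>
    #include <cstdio>
    #include <string>

    // Escapes only what RFC 4627 requires; multi-byte UTF-8 sequences
    // (bytes >= 0x80) pass through untouched.
    static std::string write_escaped(const std::string &utf8)
    {
        std::string out;
        for (std::size_t i = 0; i < utf8.size(); ++i) {
            unsigned char c = static_cast<unsigned char>(utf8[i]);
            if (c == '"')        out += "\\\"";
            else if (c == '\\')  out += "\\\\";
            else if (c < 0x20) {                      // control characters must be escaped
                char buf[8];
                std::snprintf(buf, sizeof(buf), "\\u%04X", static_cast<unsigned>(c));
                out += buf;
            } else {
                out += utf8[i];                       // ASCII and UTF-8 bytes pass through
            }
        }
        return out;
    }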

I would very much like to use libjson to read and write utf-8 json without escaping any characters that don't need to be escaped.

The makefiles don't currently work on Linux.  I hacked together some symlinks and edited a few makefiles to get everything to build, but it would be really nice if the project could be reorganized a bit so that the Linux build works out of the box.

The int type you use should probably be int64_t (from stdint.h) for portability's sake.  From reading the RFC I gather that there is no limit to the size of an allowed integer.  Representing an arbitrarily large integer in C/C++ without involving a third-party library is a crappy problem, but int64_t is the widest type you can get and still be portable.
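To make the point concrete (a standalone illustration, not libjson code):

    #include <cstdint>
    #include <cstdlib>
    #include <iostream>

    int main()
    {
        // 2^53 + 1: fits in an int64_t on every platform, but overflows a
        // 32-bit int and silently loses precision if stored in a double.
        const char *text = "9007199254740993";
        int64_t value = std::strtoll(text, NULL, 10);
        std::cout << value << std::endl;
        return 0;
    }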

One last thing: I'm wondering why a lot of the library options are compile-time options instead of runtime options.  JSON_INDENT, JSON_ESCAPE_WRITES, JSON_NEWLINE, etc. could all be configured at runtime, which would make the library a bit more general purpose.
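Something along these lines is what I have in mind; this is only a sketch, and none of these names exist in libjson today:

    #include <string>

    // Per-call formatting options instead of compile-time #defines.
    struct json_write_options {
        bool        escape_non_ascii;   // what JSON_ESCAPE_WRITES decides at build time
        bool        newlines;           // JSON_NEWLINE
        std::string indent;             // JSON_INDENT, e.g. "\t" or "    "
    };

    // A writer could then take the options as an argument, e.g. (hypothetical):
    //   json_string JSONNode::write_formatted(const json_write_options &opts) const;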

Frankly, after writing all this I'm starting to wonder if I should even be using this library at all.  It can read and write json but I had to do an awful lot of testing and legwork to get it to build and to figure out what its unicode support was doing.  And now that I have figured things out I'm not terribly happy with what I've got.