
Thread for JSON backend discussions

2016-11-03
2017-03-29
  • Charlie Zender

    Charlie Zender - 2016-11-03

    Creating this thread for comments and suggestions on the JSON backend. The original message posted to netCDF follows below.

     
  • Charlie Zender

    Charlie Zender - 2016-11-03

    Greetings All,

    A few weeks ago I requested recommendations to convert netCDF->JSON.
    Thank you for your suggestions. Unfortunately none worked for us.
    So we added a JSON backend to NCO's ncks which already had CDL and XML.

    Exporting JSON required more design decisions than CDL and XML.
    Right now we output the NC_TYPEs of variables, not attributes.
    We'll add a switch to output attributes as objects with types.
    That's the only unfinished feature currently on our list.

    We would like to receive feedback on the JSON output of the current NCO
    snapshot (4.6.2-beta01 and counting).
    We can make simple changes before finalizing 4.6.2, while larger changes
    will be made during development of 4.6.3.
    The JSON backend accepts the same switches (-m -M -v -g --hdn) as CDL/XML.
    Sample output from ncks --json in.nc and in_grp.nc is viewable at:
    http://dust.ess.uci.edu/tmp/in.json
    http://dust.ess.uci.edu/tmp/in_grp.json

    Please post specific suggestions and comments to
    https://sourceforge.net/p/nco/discussion/9829/thread/8c4d7e72
    to avoid using the netCDF email list.

    Charlie

     
  • Christopher Barker

    Taking a quick look, and a few thoughts:

    I'd like to see it nested a bit more -- i.e. putting variables in an object:

    { "dimensions": {"Lat": 2,
                     "Lon": 4,
                     ....
                     },
      "variables": {"Lat": {"dims": ["Lat"],
                            "type": "double",
                            "long_name": "Latitude",
                            "units": "degrees_north",
                            "purpose": "Latitude paired with Longitude coordinate originally stored as -180 to 180.",
                            "data": [-45.0, 45.0]
                            },
                    "LatLon": {"dims": ["Lat","Lon"],
                               "type": "double",
                               "long_name": "2D variable originally stored on -180 to 180 longitude grid",
                               "units": "fraction",
                               "purpose": "Demonstrate remapping of [-180,180) to [0,360) longitude-grid data",
                               "data": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
                               }
                    ...
                    },
    }
    

    For multi-dimension arrays, I'd like to see nested arrays, rather than the flattened version:

    "LatLon": {"dims": ["Lat","Lon"],
               "type": "double",
               "long_name": "2D variable originally stored on -180 to 180 longitude grid",
               "units": "fraction",
               "purpose": "Demonstrate remapping of [-180,180) to [0,360) longitude-grid data",
               "data": [[0.0, 1.0, 2.0, 3.0],
                        [4.0, 5.0, 6.0, 7.0]
                        ]
               }
    

    I think it makes it easier to parse into proper n-d arrays in clients. Also easier to read by
    hand, though if they are non-trivially small, then that may not matter much.
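
    For instance, with numpy (a sketch, assuming the nested "variables" layout proposed above; the file and variable names are just the samples from this thread):

    import json
    import numpy as np

    doc = json.load(open("in_grp.json"))
    var = doc["variables"]["LatLon"]

    # Nested "data" converts directly:
    arr = np.array(var["data"], dtype=float)   # shape (2, 4)

    # Flattened "data" needs the dimension sizes to reshape:
    sizes = [doc["dimensions"][d] for d in var["dims"]]
    arr = np.array(var["data"], dtype=float).reshape(sizes)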

    And you can do three or more dimensions, too:

    "LatLon": {"dims": ["time", "Lat", "Lon"],
               "type": "double",
               "long_name": "2D variable originally stored on -180 to 180 longitude grid",
               "units": "fraction",
               "purpose": "Demonstrate remapping of [-180,180) to [0,360) longitude-grid data",
               "data": [[[ 0.0,  1.0,  2.0,  3.0],
                         [ 4.0,  5.0,  6.0,  7.0],
                         [ 8.0,  9.0, 10.0, 11.0]],
    
                        [[12.0, 13.0, 14.0, 15.0],
                         [16.0, 17.0, 18.0, 19.0],
                         [20.0, 21.0, 22.0, 23.0]]]
               }
    
     
  • Christopher Barker

    Looking at the group example now:

    Same thing, of course, for the variables object and nested n-d arrays.

    I'd also rather see a "groups" at the top level, and then a "root" group in groups -- but that may differ too much from the current NCL, etc. conventions.

    Otherwise, it's all good.

    BTW -- it would be nice to have a smaller example -- I expect you're testing against a lot of things, so you need all this, but it would be a lot easier to see the structure if it were smaller.

    Also -- I can't see where dataset (and group) level attributes go -- I'd like to see them in an object as well:

    {"dimensions": {bunch of dimensions here},
      "variables": {bunch of variables here},
      "attributes": {"attribute1": "value of attr one",
                                     "attribute2": "value of attr two",
                                     ...
                                     }
       "groups": {bunch of groups here}
     }
    

    and group attributes similarly.

    alternatively, they could just be top-level objects, but I like this better -- feels cleaner, and less chance of name clashes (is anyone going to name an attribute "dimensions"? -- probably not, but still).

     
  • Charlie Zender

    Charlie Zender - 2016-11-03

    Thank you for your comments, Chris. Some quick responses:
    1. We agree that bracketing rather than flattening arrays would be a nice feature. This is on the list for 4.6.3; it's probably too much code to change for 4.6.2. Brackets do affect readability. Whether multidimensional arrays are bracketed by default or unrolled by default is up for discussion. It will probably be a user-controlled switch. I take it you want the default to be full brackets.
    2. Thanks for pointing out the naming inconsistency. We will change "attrs" to "attributes" and "group" to "groups" before 4.6.2.
    3. Whether to make a "variables" section parallel to "dimensions" and "attributes" and "groups" is up for discussion. Right now if it isn't in dimensions or attributes or groups then it's a variable. We could go either way. What do other people think? Chris's suggestion would be more like CDL, which would make it perhaps more intuitive to some people. Do we want JSON to look like that or should it be its own thing?
    4. Personally, I do not want to see a group explicitly labeled "root". If those who do feel strongly, please add some justification.

     
  • Charlie Zender

    Charlie Zender - 2016-11-04

    Changed "attrs" to "attributes" for group and global attributes in latest snapshot.

    Regarding "groups", I was mistaken above in saying its spelling needed modification. Spelling was already plural. Would need more arm-twisting to change how groups are currently done. Indentation of hierarchical braces is currently imperfect though not a release-blocker since whitespace is ignored.

    The major remaining design question on which I would like feedback is whether variables should be nested in a "variables" object as Chris suggests, or left at the top-level as currently implemented. What are the implications for downstream users/programs if variables are nested in an object or left as is?

     
  • Pedro Vicente

    Pedro Vicente - 2016-11-04

    My comments, separated by posts:

    Currently, the JSON output is formatted "by hand", by inserting print statements in the code. This is error-prone and time-consuming. A much better way is to use a JSON C library to do this. If done once, there is no need to manually tweak the format every time.

    I did an evaluation of some JSON C libraries and Jansson looked the best to me: simple, no dependencies.

    http://www.json.org/
    http://www.digip.org/jansson/

    The JSON-HDF5 format I wrote uses Jansson:

    https://github.com/pedro-vicente/json-hdf5

    For NCO, this would be done in the traversal functions: each time a group or variable is added, a new json_t item is created (a JSON object, JSON array, etc.). Then, to output, it's just a matter of calling the Jansson print function.

    Here's the API ref

    https://jansson.readthedocs.io/en/2.9/apiref.html
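
    Jansson is C, of course, but the principle is build-then-serialize rather than print-as-you-go. The same pattern in Python, purely as an illustration (none of this is NCO code):

    import json

    # Build the tree as native objects while traversing the file...
    root = {"dimensions": {}, "variables": {}, "attributes": {}}
    root["dimensions"]["lat"] = 2
    root["variables"]["lat"] = {"dims": ["lat"], "type": "double",
                                "data": [-45.0, 45.0]}

    # ...then serialize once at the end; the library guarantees valid JSON.
    print(json.dumps(root, indent=2))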

     
    • Charlie Zender

      Charlie Zender - 2016-11-04

      Thank you for your comments, Pedro.
      Jansson is overkill for our needs. While it looks well-crafted and documented, we need to dump JSON flexibly and robustly now. Should resources or volunteers become available, we could change our backend from hand-coded to Jansson-based. Right now our scope is limited to producing JSON that won't embarrass us in the future, so getting consensus on the format is most important. Optimized implementations can (should?) always come later.

       
      • Christopher Barker

        Is NCO going to read JSON, too? That's where you'd get a real benefit from a library. But the implementation is up to you, of course :-)

         
        • Charlie Zender

          Charlie Zender - 2016-11-04

          We have no plans to read JSON in NCO. Primary use now is to convert netCDF metadata (NB: not data) to JSON format to feed databases that will parse it using standard Python libraries.
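
          A downstream consumer might then be as simple as this sketch (illustrative only, not a supported recipe; -m dumps metadata, as noted above):

          import json
          import subprocess

          # Dump the metadata as JSON and parse it with the standard library.
          out = subprocess.run(["ncks", "-m", "--json", "in.nc"],
                               capture_output=True, text=True, check=True)
          metadata = json.loads(out.stdout)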

           
          • Christopher Barker

            Funny -- if I needed to get netCDF metadata into a DB with Python, I'd just use the Python netCDF lib....

            But those darn Web developers don't want to deal with installing complex scientific dependencies...

            But in the long run, it would be nice if nc_JSON were a two-way street!

             
  • Pedro Vicente

    Pedro Vicente - 2016-11-04

    For dimensions

    {
      "dimensions": {
        "gds_crd": 8,
        "lat": 2,
    

    This is how it should be: netCDF dimensions is a JSON object with key "dimensions" and value a list of JSON objects.

    The JSON format should naturally follow the HDF5/netCDF hierarchy.

    the root group is the main JSON object
    {
    then at the root we can have 4 things: another group, a list of dimensions for root, a list of variables for root, a list of attributes for root. The same goes for all other subgroups.

    This could be like, for each group

    {
     "dimensions": {bunch of dimensions here},
     "variables": {bunch of variables here},
     "attributes": {bunch of attributes here},
     "groups": {bunch of groups here}
    }
    

    What are the implications for downstream users/programs if variables are nested in an object or left as is?

    It's much easier for a program to obtain the JSON key called "variables" than it is right now, with everything mixed: as it is now, the program would have to parse all the objects and treat anything that is not "dimensions" (the netCDF dimensions) as a variable, which is not really a good way to do this.
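
    In Python terms (a sketch, assuming the nested layout):

    import json

    doc = json.load(open("in_grp.json"))

    # With a "variables" object, access is direct:
    variables = doc["variables"]

    # Without it, a parser must infer variables by exclusion:
    reserved = {"dimensions", "attributes", "groups"}
    variables = {k: v for k, v in doc.items() if k not in reserved}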

     

    Last edit: Pedro Vicente 2016-11-04
    • Charlie Zender

      Charlie Zender - 2016-11-04

      OK, I am swayed by the arguments you and Chris make that variables are better nested in an object. We will switch to this method before 4.6.2 is finalized.

       
    • Christopher Barker

      netCDF dimensions is a JSON object with key "dimensions" and value a list of JSON objects

      I'm not sure if this is a terminology thing or a disagreement, but the value should be an object, with the keys being the names of the dimensions, not a list (which is how it is in Charlie's prototype, I think).

      same for variables.

       

      Last edit: Christopher Barker 2016-11-04
      • Pedro Vicente

        Pedro Vicente - 2016-11-04

        @Chris
        yes, you are right, and that's what I meant too

        {
        "dimensions": {
        "gds_crd": 8,
        "lat": 2,

        the key is "dimensions", the value is a JSON object. This JSON object has several members: each key is the name of a dimension (in JSON the key must be a string), and each value is a JSON number.

         
  • Pedro Vicente

    Pedro Vicente - 2016-11-04

    The format for attributes should stay as it is

    "attributes": {
          "Conventions": "CF-1.0",
          "history": "History global attribute.\n"
        },
    

    and the same for variables

    "lon": {
          "dims": ["lon"],
          "type": "float",
          "data": [0.0, 90.0, 180.0, 270.0]
        }
    

    except that the attributes of the variable should be in their own JSON object, like this

    "lon": {
          "dims": ["lon"],
          "type": "float",
          "attributes": {
               "Conventions": "CF-1.0",
               "history": "History global attribute.\n"
          },
          "data": [0.0, 90.0, 180.0, 270.0]
        },
    
     
    • Christopher Barker

      except that the attributes of the variable should be in their own JSON object:

      I agree.

       
      • Charlie Zender

        Charlie Zender - 2016-11-04

        I am undecided about requiring variable attributes to be
        placed in an "attributes" object.

        Currently we emit non-pedantic (untyped) attributes that are not
        objects. They are implicitly known to be attributes because their
        keys are not "type", "data", or "dims". A scalar variable with
        two attributes looks like this:

        "var_nm": {
           "type": "double",
           "some_string": "CF-1.0",
           "some_number": "73",
           "data": "3.141"
        }
        

        To dump information losslessly would require dumping attribute types
        (like we do with variable types), and thus attributes must be objects.
        I do think adding types to attributes should be optional (if un-typed
        then use JSON rules to classify as double, int, string).
        This is what I have in mind for a loss-less aka "pedantic" dump:

        "var_nm": {
           "type": "double",
           "some_string": {
            "type": "string",
            "data": "CF-1.0"
            }, 
           "some_number": {
            "type": "ushort",
            "data": "73"
            },
           "data": "3.141"
        }
        

        The suggestion that variable attributes (not just group/global
        attributes) be in an "attributes" object implies that
        lossless/pedantic dumps would grow in length to:

        "var_nm": {
           "type": "double",
           "attributes" : {
            "some_string": {
                     "type": "string",
                 "data": "CF-1.0"
            },
            "some_number": {
                 "type": "ushort",
                 "data": "73"
            },
           "data": "3.141"
        }
        

        Is that verbosity worth the price in readability?

         
        • Christopher Barker

          I think the separate object for attributes is orthogonal to the typing of attributes.

          They are implicitly known to be attributes because their
          keys are not "type", "data", or "dims".

          which means that you can't name an attribute any of those -- is that already defined by netCDF as illegal? If not, then we shouldn't make that requirement.

          also, similarly to the "variables" object -- it's just easier and cleaner to put all the attributes together.

          Example: in the Python netCDF4 package -- the Variable objects expose all the netCDF variable attributes as Python object attributes -- this is nifty, but ends up being a pain -- now attributes need to be valid Python identifiers, and there are potential clashes with other Python attributes of the object. This works only because internally the netCDF attributes are stored separately, and there is an API for accessing them directly if you need to. But it makes for more complicated client code, 'cause how you deal with an attribute depends on what it is. And fragile client code, because everything can work fine until a user passes a weird attribute name in.

          the goal should be to support as much of netCDF as possible, and to make things as clear and well-defined as possible.

           
        • Christopher Barker

          Interesting -- I honestly had no idea attributes were typed -- I don't think I've ever seen one that was anything other than string. Nevertheless, they must exist, so they should be supported.

          I do like the idea of the default type of an attribute being whatever the JSON type is. That gets us:

          string
          number
          boolean

          unfortunately, JSON makes no distinction between ints and floats -- they are all doubles (in JavaScript, anyway)

          For an attribute, it wouldn't be hard for a client to check if the value happened to be integral and make it an int if so.
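
          For example (a sketch of that client-side check):

          def coerce(value):
              """Return an int when a JSON number happens to be integral."""
              if isinstance(value, float) and value.is_integer():
                  return int(value)
              return value

          coerce(73.0)   # -> 73
          coerce(3.145)  # -> 3.145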

          Richer typing would require a type key, yes, which would require another level of nesting.

          I'm inclined to go with:

          "var_nm": { "type": "double", "attributes" : { "some_string": "CF-1.0", "some_int": 73, "some_float": 3.145, "some_bool": true }, "dims": [time], "data": [3.141, 4.32, 7.65, ...]

          So you'd lose specific types in a round-trip. Is that important? Does it matter much if you start with a short int and get back a long int in the end?

          Are there any netCDF types that don't reasonably map to a JSON type?

          If so, I suppose we could optionally have an attribute value be an object with a type and data field -- though "more than one way to do it" is less than ideal for a spec.

          On the fence here

           
          • Christopher Barker

            BTW: how does CDL deal with typed attributes? If CDL doesn't fully handle it, then we have a precedent.

            I see in here:

            http://www.unidata.ucar.edu/software/netcdf/workshops/2011/utilities/CDL.html

            that "Attribute types may be indicated implicitly"

            so I think we are on solid ground -- and it sure does make it more compact and readable.

             

            Last edit: Christopher Barker 2016-11-04
            • Charlie Zender

              Charlie Zender - 2016-11-04

              You misconstrue the (admittedly vague) meaning of "indicated implicitly". There are CDL suffixes for each atomic data type. For double, int, and string the suffix is empty, and the type is determined by quotes or the presence of a decimal point. Thus we could omit the "type" field for any double, int, or string, say, and print attributes as objects only if they were not double, int, or string. That would be perfectly consistent with CDL. See the CDL dumps at http://dust.ess.uci.edu/tmp/in.cdl and in_grp.cdl

               
          • Charlie Zender

            Charlie Zender - 2016-11-04

            Thanks for clarifying what native JSON types are.
            netCDF does not have an atomic boolean type.
            All netCDF types can be mapped to string or number (with a penalty in size).

            As I said (or implied), I prefer un-typed attributes (let JSON map them to what it will) and typed variables (the netCDF type is supplied by default, though JSON is free to ignore it). A short is four times smaller than a double. It would be crazy to, by default, pass four times too much data in a scientific setting where variables are often GB in size. The plan is for NCO to have a normal mode that supplies type for variables not attributes, and to have an optional pedantic mode that supplies type information for variables and all attributes.
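
            For example, a client could honor the supplied variable type when materializing the data. A hypothetical numpy sketch (the type-name map here is illustrative, and it assumes numeric "data" arrays):

            import numpy as np

            # Partial map from netCDF atomic type names to numpy dtypes (illustrative).
            NC_TO_NP = {"double": "f8", "float": "f4", "int": "i4",
                        "short": "i2", "ushort": "u2", "byte": "i1"}

            def materialize(var):
                """Build an array of the declared type from a variable object."""
                return np.array(var["data"], dtype=NC_TO_NP[var["type"]])

            A "short" variable then costs two bytes per element instead of eight.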

             
            • Christopher Barker

              near real-time conversation! Maybe not needed, but here are some more thoughts I already wrote:

              Taking more of a look at CDL spec:

              http://www.unidata.ucar.edu/software/netcdf/netcdf/CDL-Syntax.html

              "Attribute information is represented by single values or arrays of values. For example, units is an attribute represented by a character array such as celsius. An attribute has an associated variable, a name, a data type, a length, and a value."

              so we need to support:

              {
                  "a string": "this is somestring",
                  "an_int": 5,
                  "a_float": 5.3,
                  "an array of ints": [1, 2, 3, 4],
                  "an array of floats": [1.1, 2.1, 3.4]
              }
              

              Should we make it a bit simpler by requiring that a value is always a list:

              "a_float": [5.3],
              
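
              Normalizing on the client side is cheap either way (sketch):

              def as_list(value):
                  """Accept both forms; always hand back a list."""
                  return value if isinstance(value, list) else [value]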

              Note that there is a complication in implicitly interpreting numbers that happen to be integers as int type:

              2.0 would become 2

              It's lossless, so I think OK, but may be surprising to some.

              Interestingly, the Python json lib DOES parse 2 as an integer, and 2.0 as a float -- am I reading the JSON spec wrong? Or is Python's lib being a little "smarter" than the spec? Maybe other parsers will be similarly helpful.
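
              A quick check (Python 3):

              import json

              type(json.loads("2"))    # <class 'int'>
              type(json.loads("2.0"))  # <class 'float'>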

              Then there is:

              "The data type of an attribute in CDL is derived from the type of the value assigned to it."

              Which makes no sense -- CDL is a text format -- there is no type. So it must mean implied by the literal. I expect CDL makes a distinction between "2" and "2.0" just like most languages, but alas, JSON does not.

              "The netCDF library does not enforce any restrictions on netCDF names, so it is possible (though unwise) to define variables with names that are not valid CDL names. The names for the primitive data types are reserved words in CDL, so the names of variables, dimensions, and attributes must not be type names."

              But it doesn't say anything about using "variables" or "dimensions" for names. I say we keep netcdf_JSON as unrestrictive as possible, and certainly no more restrictive than CDL.

               
            • Christopher Barker

              As I said (or implied), I prefer un-typed attributes (let JSON map them to what it will) and typed variables (the netCDF type is supplied by default, though JSON is free to ignore it).

              I agree -- types are critical for variables -- and relatively low overhead compared to the size of variables in general.

              The plan is for NCO to have a normal mode that supplies type for variables not attributes,

              I agree -- and I doubt anyone is going to miss the full-on typing of variable attributes.

              and to have an optional pedantic mode that supplies type information for variables and all attributes.
              

              I'm wary of that -- probably good to have, but it makes writing parsers harder. Maybe wait until there is demand?

               
