
Thread for JSON backend discussions

2016-11-03
2017-03-29
  • Charlie Zender

    Charlie Zender - 2016-11-03

    Creating this thread for comments and suggestions on the JSON backend. The original message posted to netCDF follows below.

     
  • Charlie Zender

    Charlie Zender - 2016-11-03

    Greetings All,

    A few weeks ago I requested recommendations to convert netCDF->JSON.
    Thank you for your suggestions. Unfortunately none worked for us.
    So we added a JSON backend to NCO's ncks which already had CDL and XML.

    Exporting JSON required more design decisions than CDL and XML.
    Right now we output the NC_TYPEs of variables, not attributes.
    We'll add a switch to output attributes as objects with types.
    That's the only unfinished feature currently on our list.

    We would like to receive feedback on the JSON output of the current NCO
    snapshot (4.6.2-beta01 and counting).
    We can make simple changes before finalizing 4.6.2, while larger changes
    will be made during development of 4.6.3.
    The JSON backend accepts the same switches (-m -M -v -g --hdn) as CDL/XML.
    Sample output from ncks --json in.nc and in_grp.nc is viewable at:
    http://dust.ess.uci.edu/tmp/in.json
    http://dust.ess.uci.edu/tmp/in_grp.json

    Please post specific suggestions and comments to
    https://sourceforge.net/p/nco/discussion/9829/thread/8c4d7e72
    to avoid using the netCDF email list.

    Charlie

     
  • Christopher Barker

    Taking a quick look, and a few thoughts:

    I'd like to see it nested a bit more -- i.e. putting variables in an object:

    { "dimensions": {"Lat": 2,
                     "Lon": 4,
                     ....
                     },
      "variables": {"Lat": {"dims": ["Lat"],
                            "type": "double",
                            "long_name": "Latitude",
                            "units": "degrees_north",
                            "purpose": "Latitude paired with Longitude coordinate originally stored as -180 to 180.",
                            "data": [-45.0, 45.0]
                            },
                    "LatLon": {"dims": ["Lat","Lon"],
                               "type": "double",
                               "long_name": "2D variable originally stored on -180 to 180 longitude grid",
                               "units": "fraction",
                               "purpose": "Demonstrate remapping of [-180,180) to [0,360) longitude-grid data",
                               "data": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
                               }
                    ...
                    },
    }
    

    For multi-dimension arrays, I'd like to see nested arrays, rather than the flattened version:

    "LatLon": {"dims": ["Lat","Lon"],
               "type": "double",
               "long_name": "2D variable originally stored on -180 to 180 longitude grid",
               "units": "fraction",
               "purpose": "Demonstrate remapping of [-180,180) to [0,360) longitude-grid data",
               "data": [[0.0, 1.0, 2.0, 3.0],
                        [4.0, 5.0, 6.0, 7.0]
                        ]
               }
    

    I think it makes it easier to parse into proper n-d arrays in clients. Also easier to read by
    hand, though if they are non-trivially small, then that may not matter much.
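
    For instance, with numpy (a sketch, assuming the nested "variables" layout proposed above; the file and variable names are just the samples from this thread):

    import json
    import numpy as np

    doc = json.load(open("in_grp.json"))
    var = doc["variables"]["LatLon"]

    # Nested "data" converts directly:
    arr = np.array(var["data"], dtype=float)   # shape (2, 4)

    # Flattened "data" needs the dimension sizes to reshape:
    sizes = [doc["dimensions"][d] for d in var["dims"]]
    arr = np.array(var["data"], dtype=float).reshape(sizes)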

    And you can do three or more dimensions, too:

    "LatLon": {"dims": ["time", "Lat", "Lon"],
               "type": "double",
               "long_name": "2D variable originally stored on -180 to 180 longitude grid",
               "units": "fraction",
               "purpose": "Demonstrate remapping of [-180,180) to [0,360) longitude-grid data",
               "data": [[[ 0.0,  1.0,  2.0,  3.0],
                         [ 4.0,  5.0,  6.0,  7.0],
                         [ 8.0,  9.0, 10.0, 11.0]],
    
                        [[12.0, 13.0, 14.0, 15.0],
                         [16.0, 17.0, 18.0, 19.0],
                         [20.0, 21.0, 22.0, 23.0]]]
               }
    
     
  • Christopher Barker

    Looking at the group example now:

    Same thing, of course, for the variables object and nested n-d arrays.

    I'd also rather see a "groups" at the top level, and then a "root" group in groups -- but that may differ too much from the current NCL, etc. conventions.

    Otherwise, it's all good.

    BTW -- it would be nice to have a smaller example -- I expect you're testing against a lot of things, so you need all this, but it would be a lot easier to see the structure if it were smaller.

    Also -- I can't see where dataset (and group) level attributes go -- I'd like to see them in an object as well:

    {"dimensions": {bunch of dimensions here},
      "variables": {bunch of variables here},
      "attributes": {"attribute1": "value of attr one",
                                     "attribute2": "value of attr two",
                                     ...
                                     }
       "groups": {bunch of groups here}
     }
    

    and group attributes similarly.

    alternatively, they could just be top-level objects, but I like this better -- feels cleaner, and less chance of name clashes (is anyone going to name an attribute "dimensions"? -- probably not, but still).

     
  • Charlie Zender

    Charlie Zender - 2016-11-03

    Thank you for your comments, Chris. Some quick responses:
    1. We agree that bracketing rather than flattening arrays would be a nice feature. This is on the list for 4.6.3; it's probably too much code to change for 4.6.2. Brackets do affect readability. Whether multidimensional arrays are bracketed by default or unrolled by default is up for discussion. It will probably be a user-controlled switch. I take it you want the default to be full brackets.
    2. Thanks for pointing out the naming inconsistency. We will change "attrs" to "attributes" and "group" to "groups" before 4.6.2.
    3. Whether to make a "variables" section parallel to "dimensions" and "attributes" and "groups" is up for discussion. Right now if it isn't in dimensions or attributes or groups then it's a variable. We could go either way. What do other people think? Chris's suggestion would be more like CDL, which would make it perhaps more intuitive to some people. Do we want JSON to look like that or should it be its own thing?
    4. Personally, I do not want to see a group explicitly labeled "root". If those who do feel strongly, please add some justification.

     
  • Charlie Zender

    Charlie Zender - 2016-11-04

    Changed "attrs" to "attributes" for group and global attributes in latest snapshot.

    Regarding "groups", I was mistaken above in saying its spelling needed modification. Spelling was already plural. Would need more arm-twisting to change how groups are currently done. Indentation of hierarchical braces is currently imperfect though not a release-blocker since whitespace is ignored.

    The major remaining design question on which I would like feedback is whether variables should be nested in a "variables" object as Chris suggests, or left at the top-level as currently implemented. What are the implications for downstream users/programs if variables are nested in an object or left as is?

     
  • Pedro Vicente

    Pedro Vicente - 2016-11-04

    My comments, separated by posts:

    Currently, the JSON output is formatted "by hand", by inserting print statements in the code. This is error-prone and time-consuming. A much better way is to use a JSON C library to do this. If done once, there is no need to manually tweak the format every time.

    I did an evaluation of some JSON C libraries and Jansson looked the best to me: simple, no dependencies.

    http://www.json.org/
    http://www.digip.org/jansson/

    The JSON-HDF5 format I wrote uses Jansson:

    https://github.com/pedro-vicente/json-hdf5

    For NCO, this would be done in the traversal functions: each time a group or variable is added, a new json_t item is created (a JSON object, JSON array, etc.). Then, to output, it's just a matter of calling the Jansson print function.

    Here's the API ref

    https://jansson.readthedocs.io/en/2.9/apiref.html
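
    Jansson is C, of course, but the principle is build-then-serialize rather than print-as-you-go. The same pattern in Python, purely as an illustration (none of this is NCO code):

    import json

    # Build the tree as native objects while traversing the file...
    root = {"dimensions": {}, "variables": {}, "attributes": {}}
    root["dimensions"]["lat"] = 2
    root["variables"]["lat"] = {"dims": ["lat"], "type": "double",
                                "data": [-45.0, 45.0]}

    # ...then serialize once at the end; the library guarantees valid JSON.
    print(json.dumps(root, indent=2))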

     
    • Charlie Zender

      Charlie Zender - 2016-11-04

      Thank you for your comments, Pedro.
      Jansson is overkill for our needs. While it looks well-crafted and documented, we need to dump JSON flexibly and robustly now. Should resources or volunteers become available, we could change our backend from hand-coded to Jansson-based. Right now our scope is limited to producing JSON that won't embarrass us in the future, so getting consensus on the format is most important. Optimized implementations can (should?) always come later.

       
      • Christopher Barker

        Is NCO going to read JSON, too? That's where you'd get a real benefit from a library. But the implementation is up to you, of course :-)

         
        • Charlie Zender

          Charlie Zender - 2016-11-04

          We have no plans to read JSON in NCO. Primary use now is to convert netCDF metadata (NB: not data) to JSON format to feed databases that will parse it using standard Python libraries.
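
          A downstream consumer might then be as simple as this sketch (illustrative only, not a supported recipe; -m dumps metadata, as noted above):

          import json
          import subprocess

          # Dump the metadata as JSON and parse it with the standard library.
          out = subprocess.run(["ncks", "-m", "--json", "in.nc"],
                               capture_output=True, text=True, check=True)
          metadata = json.loads(out.stdout)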

           
          • Christopher Barker

            Funny -- if I needed to get netCDF metadata into a DB with Python, I'd just use the Python netCDF lib....

            But those darn Web developers don't want to deal with installing complex scientific dependencies...

            But in the long run, it would be nice if nc_JSON were a two-way street!

             
  • Pedro Vicente

    Pedro Vicente - 2016-11-04

    For dimensions

    {
      "dimensions": {
        "gds_crd": 8,
        "lat": 2,
    

    This is how it should be: netCDF dimensions is a JSON object with key "dimensions" and value a list of JSON objects.

    The JSON format should naturally follow the HDF5/netCDF hierarchy.

    the root group is the main JSON object
    {
    then at the root we can have 4 things: another group, a list of dimensions for root, a list of variables for root, a list of attributes for root. The same goes for all other subgroups.

    This could be like, for each group

    {
     "dimensions": {bunch of dimensions here},
     "variables": {bunch of variables here},
     "attributes": {bunch of attributes here},
     "groups": {bunch of groups here}
    }
    

    What are the implications for downstream users/programs if variables are nested in an object or left as is?

    It's much easier for a program to obtain the JSON key called "variables" than it is right now, with everything mixed: as it is now, the program would have to parse all the objects and treat anything that is not "dimensions" (the netCDF dimensions) as a variable, which is not really a good way to do this.
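
    In Python terms (a sketch, assuming the nested layout):

    import json

    doc = json.load(open("in_grp.json"))

    # With a "variables" object, access is direct:
    variables = doc["variables"]

    # Without it, a parser must infer variables by exclusion:
    reserved = {"dimensions", "attributes", "groups"}
    variables = {k: v for k, v in doc.items() if k not in reserved}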

     

    Last edit: Pedro Vicente 2016-11-04
    • Charlie Zender

      Charlie Zender - 2016-11-04

      OK, I am swayed by the arguments you and Chris make that variables are better nested in an object. We will switch to this method before 4.6.2 is finalized.

       
    • Christopher Barker

      netCDF dimensions is a JSON object with key "dimensions" and value a list of JSON objects

      I'm not sure if this is a terminology thing or a disagreement, but the value should be an object, with the keys being the names of the dimensions, not a list (which is how it is in Charlie's prototype, I think).

      same for variables.

       

      Last edit: Christopher Barker 2016-11-04
      • Pedro Vicente

        Pedro Vicente - 2016-11-04

        @Chris
        yes, you are right, and that's what I meant too

        {
        "dimensions": {
        "gds_crd": 8,
        "lat": 2,

        the key is "dimensions", the value is a JSON object. This JSON object has several members: each key is the name of a dimension (in JSON the key must be a string), and each value is a JSON number.

         
  • Pedro Vicente

    Pedro Vicente - 2016-11-04

    The format for attributes should stay as it is

    "attributes": {
          "Conventions": "CF-1.0",
          "history": "History global attribute.\n"
        },
    

    and the same for variables

    "lon": {
          "dims": ["lon"],
          "type": "float",
          "data": [0.0, 90.0, 180.0, 270.0]
        }
    

    except that the attributes of the variable should be in their own JSON object, like this

    "lon": {
          "dims": ["lon"],
          "type": "float",
          "attributes": {
               "Conventions": "CF-1.0",
               "history": "History global attribute.\n"
          },
          "data": [0.0, 90.0, 180.0, 270.0]
        },
    
     
    • Christopher Barker

      except that the attributes of the variable should be in their own JSON object:

      I agree.

       
      • Charlie Zender

        Charlie Zender - 2016-11-04

        I am undecided about requiring variable attributes to be
        placed in an "attributes" object.

        Currently we emit non-pedantic (untyped) attributes that are not
        objects. They are implicitly known to be attributes because their
        keys are not "type", "data", or "dims". A scalar variable with
        two attributes looks like this:

        "var_nm": {
           "type": "double",
           "some_string": "CF-1.0",
           "some_number": "73",
           "data": "3.141"
        }
        

        To dump information losslessly would require dumping attribute types
        (like we do with variable types), and thus attributes must be objects.
        I do think adding types to attributes should be optional (if un-typed
        then use JSON rules to classify as double, int, string).
        This is what I have in mind for a loss-less aka "pedantic" dump:

        "var_nm": {
           "type": "double",
           "some_string": {
            "type": "string",
            "data": "CF-1.0"
            }, 
           "some_number": {
            "type": "ushort",
            "data": "73"
            },
           "data": "3.141"
        }
        

        The suggestion that variable attributes (not just group/global
        attributes) be in an "attributes" object implies that
        lossless/pedantic dumps would grow in length to:

        "var_nm": {
           "type": "double",
           "attributes" : {
            "some_string": {
                     "type": "string",
                 "data": "CF-1.0"
            },
            "some_number": {
                 "type": "ushort",
                 "data": "73"
            },
           "data": "3.141"
        }
        

        Is that verbosity worth the price in readability?

         
        • Christopher Barker

          I think the separate object for attributes is orthogonal to the typing of attributes.

          They are implicitly known to be attributes because their
          keys are not "type", "data", or "dims".

          which means that you can't name an attribute any of those -- is that already defined by netCDF as illegal? If not, then we shouldn't make that requirement.

          also, similarly to the "variables" object -- it's just easier and cleaner to put all the attributes together.

          Example: in the Python netCDF4 package -- the Variable objects expose all the netCDF variable attributes as Python object attributes -- this is nifty, but ends up being a pain -- now attributes need to be valid Python identifiers, and there are potential clashes with other Python attributes of the object. This works only because internally the netCDF attributes are stored separately, and there is an API for accessing them directly if you need to. But it makes for more complicated client code, 'cause how you deal with an attribute depends on what it is. And fragile client code, because everything can work fine until a user passes a weird attribute name in.

          the goal should be to support as much of netCDF as possible, and to make things as clear and well-defined as possible.

           
        • Christopher Barker

          Interesting -- I honestly had no idea attributes were typed -- I don't think I've ever seen one that was anything other than string. Nevertheless, they must exist, so they should be supported.

          I do like the idea of the default type of an attribute being whatever the JSON type is. That gets us:

          string
          number
          boolean

          unfortunately, JSON makes no distinction between ints and floats -- they are all doubles (in JavaScript, anyway)

          For an attribute, it wouldn't be hard for a client to check if the value happened to be integral and make it an int if so.
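
          For example (a sketch of that client-side check):

          def coerce(value):
              """Return an int when a JSON number happens to be integral."""
              if isinstance(value, float) and value.is_integer():
                  return int(value)
              return value

          coerce(73.0)   # -> 73
          coerce(3.145)  # -> 3.145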

          Richer typing would require a type key, yes, which would require another level of nesting.

          I'm inclined to go with:

          "var_nm": { "type": "double", "attributes" : { "some_string": "CF-1.0", "some_int": 73, "some_float": 3.145, "some_bool": true }, "dims": [time], "data": [3.141, 4.32, 7.65, ...]

          So you'd lose specific types in a round-trip. Is that important? Does it matter much if you start with a short int and get back a long int in the end?

          Are there any netCDF types that don't reasonably map to a JSON type?

          If so, I suppose we could optionally have an attribute value be an object with a type and data field -- though "more than one way to do it" is less than ideal for a spec.

          On the fence here

           
          • Christopher Barker

            BTW: how does CDL deal with typed attributes? If CDL doesn't fully handle it, then we have a precedent.

            I see in here:

            http://www.unidata.ucar.edu/software/netcdf/workshops/2011/utilities/CDL.html

            that "Attribute types may be indicated implicitly"

            so I think we are on solid ground -- and it sure does make it more compact and readable.

             

            Last edit: Christopher Barker 2016-11-04
            • Charlie Zender

              Charlie Zender - 2016-11-04

              You misconstrue the (admittedly vague) meaning of "indicated implicitly". There are CDL suffixes for each atomic data type. For double, int, and string the suffix is empty, and the type is determined by quotes or the presence of a decimal point. Thus we could omit the "type" field for any double, int, or string, say, and print attributes as objects only if they were not double, int, or string. That would be perfectly consistent with CDL. See the CDL dumps at http://dust.ess.uci.edu/tmp/in.cdl and in_grp.cdl

               
          • Charlie Zender

            Charlie Zender - 2016-11-04

            Thanks for clarifying what native JSON types are.
            netCDF does not have an atomic boolean type.
            All netCDF types can be mapped to string or number (with a penalty in size).

            As I said (or implied), I prefer un-typed attributes (let JSON map them to what it will) and typed variables (the netCDF type is supplied by default, though JSON is free to ignore it). A short is four times smaller than a double. It would be crazy to, by default, pass four times too much data in a scientific setting where variables are often GB in size. The plan is for NCO to have a normal mode that supplies type for variables not attributes, and to have an optional pedantic mode that supplies type information for variables and all attributes.
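
            For example, a client could honor the supplied variable type when materializing the data. A hypothetical numpy sketch (the type-name map here is illustrative, and it assumes numeric "data" arrays):

            import numpy as np

            # Partial map from netCDF atomic type names to numpy dtypes (illustrative).
            NC_TO_NP = {"double": "f8", "float": "f4", "int": "i4",
                        "short": "i2", "ushort": "u2", "byte": "i1"}

            def materialize(var):
                """Build an array of the declared type from a variable object."""
                return np.array(var["data"], dtype=NC_TO_NP[var["type"]])

            A "short" variable then costs two bytes per element instead of eight.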

             
            • Christopher Barker

              near real-time conversation! Maybe not needed, but here are some more thoughts I already wrote:

              Taking more of a look at CDL spec:

              http://www.unidata.ucar.edu/software/netcdf/netcdf/CDL-Syntax.html

              "Attribute information is represented by single values or arrays of values. For example, units is an attribute represented by a character array such as celsius. An attribute has an associated variable, a name, a data type, a length, and a value."

              so we need to support:

              {
                  "a string": "this is somestring",
                  "an_int": 5,
                  "a_float": 5.3,
                  "an array of ints": [1, 2, 3, 4],
                  "an array of floats": [1.1, 2.1, 3.4]
              }
              

              Should we make it a bit simpler by requiring that a value is always a list:

              "a_float": [5.3],
              
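
              Normalizing on the client side is cheap either way (sketch):

              def as_list(value):
                  """Accept both forms; always hand back a list."""
                  return value if isinstance(value, list) else [value]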

              Note that there is a complication in implicitly interpreting numbers that happen to be integers as int type:

              2.0 would become 2

              It's lossless, so I think OK, but may be surprising to some.

              Interestingly, the Python json lib DOES parse 2 as an integer, and 2.0 as a float -- am I reading the JSON spec wrong? Or is Python's lib being a little "smarter" than the spec? Maybe other parsers will be similarly helpful.
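
              A quick check (Python 3):

              import json

              type(json.loads("2"))    # <class 'int'>
              type(json.loads("2.0"))  # <class 'float'>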

              Then there is:

              "The data type of an attribute in CDL is derived from the type of the value assigned to it."

              Which makes no sense -- CDL is a text format -- there is no type. So it must mean implied by the literal. I expect CDL makes a distinction between "2" and "2.0" just like most languages, but alas, JSON does not.

              "The netCDF library does not enforce any restrictions on netCDF names, so it is possible (though unwise) to define variables with names that are not valid CDL names. The names for the primitive data types are reserved words in CDL, so the names of variables, dimensions, and attributes must not be type names."

              But it doesn't say anything about using "variables" or "dimensions" for names. I say we keep netcdf_JSON as unrestrictive as possible, and certainly no more restrictive than CDL.

               
            • Christopher Barker

              As I said (or implied), I prefer un-typed attributes (let JSON map them to what it will) and typed variables (the netCDF type is supplied by default, though JSON is free to ignore it).

              I agree -- types are critical for variables -- and relatively low overhead compared to the size of variables in general.

              The plan is for NCO to have a normal mode that supplies type for variables not attributes,

              I agree -- and I doubt anyone is going to miss the full-on typing of variable attributes.

              and to have an optional pedantic mode that supplies type information for variables and all attributes.
              

              I'm wary of that -- probably good to have, but it makes writing parsers harder. Maybe wait until there is demand?

               
