Thread for JSON backend discussions

2016-11-03
2017-03-29
(Page 2 of 3)
  • Charlie Zender

    Charlie Zender - 2016-11-04

    Unless I hear otherwise, we have reached consensus on whether/when to supply attribute types. The pedantic attribute types option will not be in 4.6.2. 4.6.2-beta already implements the "normal mode" described above.

    I am against bracketing scalar variable (or attribute) values. I see no point in making a list out of a scalar. Yes, the lack of brackets may make the parsing a little harder. Values corresponding to a dimension size of 1 would be (and are) bracketed, however.

    This shows what the CDL literals look like for a variety of attribute types:

    zender@firn:~$ ncks -C -v att_var --cdl ~/nco/data/in.nc 
    netcdf in {
      dimensions:
        time = UNLIMITED ; // (10 currently)
    
      variables:
        float att_var(time) ;
          att_var:byte_att = 0b, 1b, 2b, 127b, -128b, -127b, -2b, -1b ;
          att_var:char_att = "Sentence one.\n",
            "Sentence two.\n" ;
          att_var:short_att = 37s ;
          att_var:int_att = 73 ;
          att_var:long_att = 73 ;
          att_var:float_att = 73.f, 72.f, 71.f, 70.01f, 69.001f, 68.01f, 67.01f ;
          att_var:double_att = 73., 72., 71., 70.01, 69.001, 68.01, 67.010001 ;
    
      data:
        att_var = 10, 10.1, 10.2, 10.3, 10.40101, 10.5, 10.6, 10.7, 10.8, 10.99 ;
    
    } // group /
    
     
  • Christopher Barker

    @Charlie Zender wrote:
    """
    1. We agree that bracketing rather than flattening arrays would be a nice feature. This is on the list for 4.6.3. It's probably too much code to change for 4.6.2. Brackets do affect readability. Whether brackets are used in multidimensional arrays by default, or arrays are unrolled by default, is up for discussion. It will probably be a user-controlled switch. I take it you want the default to be full brackets.
    """

    Please no! There should be one way to represent n-dimensional arrays in JSON. I would strongly suggest that we decide what that way should be based on consensus about the best design, and that we don't use the NCO release schedule to decide what's best.

    I would suggest that you don't release anything that does it another way, but if you want to call it a prototype that may change -- I suppose it's good to get tools out there so people can try them out.
    (I was planning on writing a Python netcdf <=> json converter to accomplish that, but you're on a roll :-) )

    And yes, I vote for nested, but could live with flat -- there are arguments either way. In practice it's probably easier to write flat, but easier to parse nested.

    Also -- while I think it's a great idea to have the JSON nicely indented and laid out for human readability, that is a secondary concern -- you can improve that later without breaking anything.

    One thing to keep in mind -- JSON is an established standard, so when deciding whether to nest brackets in n-d arrays, you are not deciding how the data gets formatted, but how it is structured.

    If we go with flat, then clients are going to need to do the stride arithmetic to get an n-d array that can be indexed. If you use nested, then you have nested JSON lists (which map to nested arrays, lists, etc. in other languages), so you can do:

    val = variable['data'][i][j][k]

    Directly. In practice, I suppose clients will probably do the nesting for you when reading. In anything other than Javascript, it will have to be converted to native data types somehow anyway (though that is likely to happen automagically in a JSON-parsing lib).
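
    For instance, with a flat "data" list the reader has to turn the dimension sizes into strides itself. A quick Python sketch (hypothetical names, assuming row-major order and that the dimension sizes are known):

    # hypothetical flat variable of shape (2, 3, 4)
    dims = [2, 3, 4]
    data = list(range(24))

    def flat_index(idx, dims):
        # row-major (C-order) stride arithmetic: (i, j, k) -> flat offset
        offset = 0
        for i, n in zip(idx, dims):
            offset = offset * n + i
        return offset

    val = data[flat_index((1, 2, 3), dims)]  # flat: the client does the math
    # nested JSON would instead allow: variable['data'][1][2][3]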

    As I write this, I'm getting a stronger opinion -- nested is better!

     
    • Charlie Zender

      Charlie Zender - 2016-11-04

      I agree that nested brackets are better for multiple reasons, and should be the default. However, I think that writing unrolled (flat) arrays is fine, too. The "dims" list makes parsing unambiguous and future-proof. Moreover, degenerate (size 1) dimensions can inflate "data" sections with too many brackets, and so un-rolled syntax may be preferable in some cases. Why not require well-behaved readers to support both since both have ample reasons to be used?
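
      For what it's worth, a reader that accepts both forms needs only a few lines. A Python sketch (not NCO code; it assumes "dims" has already been resolved to a list of dimension sizes):

      def to_nested(data, dims):
          # Return data as a nested list of shape dims, whether it arrived
          # flat or already nested (nested input starts with a list).
          if data and isinstance(data[0], list):
              return data
          for n in reversed(dims[1:]):  # un-flatten, last dimension fastest
              data = [data[i:i + n] for i in range(0, len(data), n)]
          return data

      print(to_nested([1, 2, 3, 4, 5, 6], [2, 3]))  # [[1, 2, 3], [4, 5, 6]]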

       
      • Christopher Barker

        every "option" makes more work for the reader -- probably not that much work in this case, (at least in python :-) ), but if you have both opotions, then the reading code HAS to check the dimensions and do the right thing, so you've lost the benifit.

        But most importantly, when people write readers, they are often lazy and/or poorly informed (laziness is a virtue in programming :-) ) -- so it is very likely that rather than studying any kind of spec, they will grab a sample file or two, and make their reader work with those examples.

        And then it will barf somewhere down the road on an optional feature.

        I think the issue with specifying the data type in attributes is probably worth the trade-off -- though I expect many readers won't read pedantic mode at all (at least at first) -- because adding the type really does uglify the JSON a lot.

        But in this case, the only downside I see for nested lists is:

        degenerate (size 1) dimensions can inflate "data" sections with too many brackets,

        Sure -- [ [ [1.0] ] ] is pretty ugly, but not that big a deal -- and JSON is mostly for machine reading anyway. And file bloat is no big deal: with all the overhead of JSON, it's not going to make or break anything.

        Anyway -- we seem to be converging on a complete spec -- nice work!

         
  • Christopher Barker

    The major remaining design question on which I would like feedback is whether variables should be nested in a "variables" object as Chris suggests, or left at the top-level as currently implemented. What are the implications for downstream users/programs if variables are nested in an object or left as is?

    I'm going with Pedro here (and it was my idea :-) ) -- having a "variables" object makes it much easier for a client to figure out what the variables are. And then someone could name their variable "dimensions" if they wanted (is that already forbidden by the netCDF spec?)

    Put it this way -- I am likely to write a reader for this in Python -- if the variables aren't in their own object, then I'm going to go through the file and PUT them in their own object -- to make it easier on users, and to make it more consistent with the Python netCDF4 lib.

    And yes, I think we should be consistent with CDL wherever it makes sense.

     
  • Christopher Barker

    1. Personally, I do not want to see a group explicitly labeled "root". If those who do feel strongly, please add some justification.

    Well, the justification is that it provides a more consistent interface.

    But making the root group the "root" of the JSON is more consistent with current netcdf practice, so we might as well stick with that.

     
  • Pedro Vicente

    Pedro Vicente - 2016-11-04

    @Charlie

    Should resources or volunteers become available, we could change our backend from hand-coded to Jansson-based. Right now our scope is limited to producing JSON that won't embarrass us in the future, so getting a consensus on the format is most important.

    Yes, I agree that the most important thing now is to define the format. I meant to use Jansson in the future, in 4.6.3 perhaps.

    Since it seems we have an agreement on the format, I'll try to code a little Jansson prototype in C that reads this format and outputs a netCDF file. Could 4.6.2 wait a week or so? Unrelated, I also need to port the Windows code to Visual Studio 2015.

     

    Last edit: Pedro Vicente 2016-11-04
    • Charlie Zender

      Charlie Zender - 2016-11-04

      Yes, 4.6.2 is at least a week away. And we have no consensus on two issues, so perhaps longer. Having a reader would be nice but not necessary before having a writer.

       
  • Charlie Zender

    Charlie Zender - 2016-11-04

    NB: just fixed values of "type" key to be consistent with NC_TYPE tokens in all cases. Now "ushort", "int64", etc. Updated in.json and in_grp.json are uploaded after each change to http://dust.ess.uci.edu/tmp

     
  • Pedro Vicente

    Pedro Vicente - 2016-11-04

    @Chris

    Interesting -- I honestly had no idea attributes were typed

    yes, my bad on this, the attributes format needs a review.
    netCDF/HDF5 attributes are almost exactly like variables/datasets: they have a type, and they are arrays of data.

     
  • Charlie Zender

    Charlie Zender - 2016-11-04

    I concur that variable attributes should be placed in an "attributes" object, much like group/global attributes already are. Most convincing to me, perhaps, is that otherwise attributes named "data", "dims", or "type" would be precluded. Thanks to Chris, Pedro, and Henry for hashing this through with me.
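
    To make the collision concrete, a contrived sketch (not actual NCO output, with made-up attribute values): with attributes nested in their own object, an attribute that happens to be named "data" cannot clash with the variable's own "data" key.

    "att_var": {
      "dims": ["time"],
      "type": "float",
      "data": [1.0, 2.0],
      "attributes": {
        "data": "an attribute legitimately named data",
        "units": "kelvin"
      }
    }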

     
  • Pedro Vicente

    Pedro Vicente - 2016-11-04

    hmm, I need some time to digest all the previous comments, but what do you think about this?

    I have to double-check the netCDF spec about attributes, but I think it defines an attribute as an array of data that has
    1) a type
    2) a size

    so, this would be a netCDF file with

    2 dimensions at root,
    a variable named "var_1" that has those 2 dimensions and a float type,
    and 2 attributes:

    "attr_1" is an array of floats with size 3, and value a JSON array [1,2,3]
    "attr_2" is an array of char (a C string) with size 3 and value "foo", a JSON string

    {
       "dimensions":{
          "lat":2,
          "lon":3
       },
       "variables":{
          "var_1":{
             "dimensions": ["lat","lon"],
             "type":"float",
             "data":[1,2,3,4,5,6],
             "attributes":{
                "attr_1":{
                   "size":3,
                   "type":"float",
                   "data":[1,2,3]
                },
                "attr_2":{
                   "size":3,
                   "type":"string",
                   "data":"foo"
                }
             }
          }
       }
    }
    
     

    Last edit: Pedro Vicente 2016-11-04
    • Charlie Zender

      Charlie Zender - 2016-11-04

      This is what I envisage in pedantic mode, with one exception: the "size" field is unnecessary because it can be obtained by counting the elements in the "data" field. In default (non-pedantic) mode the attributes would be un-typed, and represented as simple key-value pairs, not as objects.
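
      To illustrate with made-up attribute values, the same two attributes would render in the two modes as follows. Default (non-pedantic) mode, simple key-value pairs:

      "attributes": {
        "long_name": "temperature",
        "scale_factor": 0.01
      }

      Pedantic mode, typed objects:

      "attributes": {
        "long_name": { "type": "char", "data": "temperature" },
        "scale_factor": { "type": "double", "data": [0.01] }
      }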

       
  • Pedro Vicente

    Pedro Vicente - 2016-11-04

    here's a useful JSON validator

    https://jsonformatter.curiousconcept.com/

     
  • Pedro Vicente

    Pedro Vicente - 2016-11-04

    @Charlie

    The "size" field is unnecessary because it can be obtained by counting the elements in the "data" field.

    That's true for a computer program. But for a poor human, what if the array is enormous :-)
    No way to get that from the JSON output!

    One of the goals of JSON is that it should also be human friendly and convey the information immediately, even if some redundancy is added. And if the netCDF spec says that an attribute is something with a size and a type, then maybe we should copy the spec as much as possible. The API even has the size as an argument to the functions (true that in C we have to pass both the size and the array anyway).

    In default (non-pedantic) mode the attributes would be un-typed,

    I have to go back and understand this pedantic / non-pedantic part. There are no un-typed attributes in netCDF, so why make them up?
    What would be the goal or advantage of having the non-pedantic mode, and therefore 2 variations of the spec?
    For a client program that just means more complexity and more cases to deal with.

    but if we could agree at least on the pedantic version that would be great :-)

     

    Last edit: Pedro Vicente 2016-11-04
    • Charlie Zender

      Charlie Zender - 2016-11-04

      Please read the previous discussion on pedantic and non-pedantic. To that I will add:
      1. Attributes are not designed to be huge arrays. Using them to store data is a misuse.
      2. "size" is redundant. Omitting it makes no difference to compliance with the spec. It would be possible, though still redundant, to include "size" in pedantic mode without mandating it in the NCO default mode, which will be non-pedantic.
      3. Omitting the netCDF attribute type allows attributes to be key-value pairs, and lets the parser and JSON syntax decide what type to use. This is what I mean by "non-pedantic": exact reproducibility of the original file is not guaranteed, so a round-trip transformation would not necessarily be a clone.
      4. Non-pedantic is more aesthetic and human-readable.

       
      • Pedro Vicente

        Pedro Vicente - 2016-11-04

        I see, thanks for clarifying that.
        I forgot one thing: one of the requirements of the HDF5 spec I wrote was to be able to inspect the size of each dataset and the size of the file remotely; these files are transmitted over TCP sockets in a client/server framework.

        In this case the need to specify "size" (and "rank" for datasets) is a must-have, otherwise we would have to transmit the data to get the size.
        In my spec, the "data" part is optional; the goal is to be able to transmit the file metadata only, and then retrieve only the parts that are needed, as sketched below.
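
        For example, a metadata-only exchange might transmit a variable with the "data" key simply omitted (a sketch for illustration, not the actual spec):

        "var_1": {
          "dimensions": ["lat", "lon"],
          "type": "NC_FLOAT",
          "attributes": {
            "attr_1": { "type": "NC_FLOAT", "data": [1, 2, 3] }
          }
        }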

        So maybe we can continue next week to tune this; maybe there could be a "pedantic" or "strict" version and then the "non-pedantic" one.
        I am going to rewrite the HDF5 spec I did to the format we discussed here; having the metadata for the HDF5 dataset as a collection of objects (as defined above) is a much better way.

         
        • Charlie Zender

          Charlie Zender - 2016-11-08

          My understanding of the netCDF library is that it pre-loads all metadata into RAM, so that subsequent metadata access is fast. In other words, memory requirements for metadata are assumed to be minimal, so the dimensions of a variable and the size of an attribute array are not operationally equivalent. Am I mistaken about this? I see that knowing "size" would simplify parsing JSON->netCDF in a streaming environment, because one could define attributes directly without opening a temporary buffer to count the elements. How important is supporting streaming?

          Chris Barker, what is your opinion of adding a "size" element to attributes? It would require every attribute be an object. At what level of verbosity/pedanticness, if any, does it make the most sense?

           
          • Charlie Zender

            Charlie Zender - 2016-11-08

            Let me comment on my own comment above: AFAICT, netCDF is designed so that it's faster to always just send the metadata than to inquire about how much metadata there is. Attributes are usually size=1 for numeric types. For NC_CHAR size is C-string length. For NC_STRING, size is the number of strings (not the string length). The point being that it is likely faster to just send the strings without a size, than to add size (which is misleading anyway for NC_STRING) to every attribute. Attributes are designed to hold minimal data, so always knowing their size in advance seems like it would slow down rather than speed up any network protocol.

            Note that in thinking this through, it's clear to me that an NC_STRING type will need to be treated in one of two ways in pedantic mode:
            1. size is the number of strings in the NC_STRING array (equivalent to netCDF usage). A parser must use a strlen() equivalent on the actual strings to determine memory/storage requirements. NB: this defeats the purpose of "size" as I understand it.
            2. size is the strlen() of the string. But this differs from how the netCDF API treats "size", and raises the question of how to pass arrays of NC_STRING, which are perfectly legal netCDF.

            Hence I am not convinced of the utility of "size" in pedantic mode, and the arguments above show it creates issues with NC_STRING.
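
            To see the ambiguity concretely (a Python sketch with a made-up attribute, not NCO code):

            att = ["first string", "second, longer string"]  # hypothetical NC_STRING attribute

            size_netcdf = len(att)                   # 1. netCDF usage: number of strings -> 2
            size_storage = sum(len(s) for s in att)  # 2. strlen() sense: total chars -> 33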

             
            • Pedro Vicente

              Pedro Vicente - 2016-11-08

              @Charlie

              For NC_CHAR size is C-string length

              yes, in the above example, it should be

              "attr_2":{
                             "type":"NC_CHAR",
                             "data":"foo"
                          }
              

              I agree with dropping the "size" key. The data size can be obtained by inspecting the JSON "data", or a call can be made to the netCDF API to get the attribute size.

              I propose that "type" matches the netCDF name (e.g "NC_CHAR" instead of "char")

               

              Last edit: Pedro Vicente 2016-11-08
              • Pedro Vicente

                Pedro Vicente - 2016-11-08

                the Jansson API also has functions to get the JSON array size and the JSON string size, so the call to netCDF is not needed, and "size" is really not needed

                kudos for Jansson, well done

                // inside a loop over the (json_key, json_value) pairs of an attribute object
                else if (std::string(json_key).compare("data") == 0)
                {
                  // "data" can be a JSON array for netCDF numeric types,
                  // or a JSON string for netCDF char
                  assert(json_is_array(json_value) || json_is_string(json_value));
                  if (json_is_array(json_value))
                  {
                    size_t size_arr = json_array_size(json_value);
                    std::cout << json_key << ": has " << size_arr << " elements" << std::endl;
                  }
                  else if (json_is_string(json_value))
                  {
                    size_t size_arr = json_string_length(json_value);
                    std::cout << json_key << ": has " << size_arr << " elements" << std::endl;
                  }
                }
                
                 
  • Pedro Vicente

    Pedro Vicente - 2016-11-08

    I added a prototype implementation at

    https://github.com/pedro-vicente/json-netcdf

    the "size" and "data" keys can be made optional, with that, I think it matches the pedantic version.
    The specification is json_netcdf.html

     

    Last edit: Pedro Vicente 2016-11-08
    • Pedro Vicente

      Pedro Vicente - 2016-11-08

      To use

      git clone https://github.com/pedro-vicente/json-netcdf.git
      cd json-netcdf/build
      cmake ..
      make
      ./netcdf_json ../data/netcdf_04.json

      all 4 main objects in the spec ("groups", "variables", "attributes", "dimensions") are parsed and further processed. At the moment this is printed:

      /: has dimensions,groups,variables,attributes,
      /:dimension:lat:2
      /:dimension:lon:3
      /:group:g1
      g1: has groups,
      g1:group:g11
      g11: has 
      /:group:g2
      g2: has dimensions,variables,
      g2:dimension:lat:2
      g2:dimension:lon:3
      g2:variable:var_1
      dimensions:lat:lon
      type: NC_FLOAT
      data:  has 6 elements
      var_1:attribute:attr_1
      type: NC_FLOAT
      data:  has 3 elements
      /:variable:var_1
      dimensions:lat:lon
      type: NC_FLOAT
      data:  has 6 elements
      var_1:attribute:attr_1
      type: NC_FLOAT
      data:  has 3 elements
      var_1:attribute:attr_2
      type: NC_CHAR
      data:  has 3 elements
      /:attribute:attr_1
      type: NC_FLOAT
      data:  has 3 elements
      
       
  • Charlie Zender

    Charlie Zender - 2017-03-22

    Just noticed an opportunity to simplify the JSON format, not sure what to do:

    zender@aerosol:~/nco$ ncks --jsn_fmt=0 -v att_var ~/nco/data/in.nc
    {
      "dimensions": {
        "vrt_nbr": 2,
        "time": 10
      },
      "variables": {
        "att_var": {
          "dims": ["time"],
          "type": "float",
          "attributes": {
            "byte_att": [0, 1, 2, 127, -128, -127, -2, -1],
    

    The issue is that "dimensions" is the keyword for dimensions in a group object, while we shorten that to "dims" for the keyword in a variable object. Is there any reason not to use "dimensions" in both the group and variable objects? Or is there a reason to keep them distinct? Any parser will know (or could be written to know) whether the immediate context of "dimensions" is in a variable object or a group object. Opinions?

     
  • Pedro Vicente

    Pedro Vicente - 2017-03-24

    I vote for using "dimensions" in both cases.
    Indeed, the specification prototype that I implemented uses "dimensions" for both groups and variables:

    https://github.com/pedro-vicente/json-netcdf/blob/master/netcdf_json.html

     

    Last edit: Pedro Vicente 2017-03-24
(Page 2 of 3)
