
ncap is nearly ready for primetime

Developers
2001-12-03
2013-10-17
  • Charlie Zender

    Charlie Zender - 2001-12-03

    Hi Henry,

    ncap is looking great...I'm almost speechless because it's so
    powerful. The size of the contribution is so immense it's going
    to take me some time just to evaluate it all.
    You are now quite the Bison, flex, and NCO expert!
    It's amazing to see arbitrary algebraic expressions computed
    and stored in netCDF format without re-compilation.
    Sure the code needs cleaning and there are a few glitches, but all the
    important stuff is there or is now much easier to add.

    I'm very tempted to start "cleaning up" the new code (that's as
    good a way to learn what you've done as any) but I want to
    give you a chance to finish any loose ends. Let me know when
    you're happy with the basic code and would not mind me doing
    some (non-answer changing) "cleanup".

    Here are some points, in no particular order, from my initial run
    through your test script and code examination:

    1. There are very few comments explaining what the new functions do.
    A "Purpose: " line at the beginning of each function would be good.
    In my opinion there is no such thing as too much documentation.

    2. The ability to mix variables and attributes is very powerful
    and very useful. I did not expect you to attempt this and it is
    mostly working. One thing I could not figure out, however, was how
    to specify global attributes rather than variable attributes.
    ncatted allows the name "global", when used in place of a variable
    name, to indicate the attribute is a global attribute and not
    a variable attribute. I've patched ncap.l to accept "global" to
    mean a global attribute. ":att_nm" could be implemented the same way.
    This is the only ncap-related patch that I have checked in.

    3. Attributes can be zero- or one-dimensional.
    The ncatted infrastructure supports both fully, but ncap has
    the problem that one-dimensional attributes cannot be converted
    to variables. Presumably this is difficult because there is no
    existing dimension corresponding to the number of elements
    in the attribute, so ncap would have to generate (and name)
    the new dimensions on the fly. It seems like it should be
    easier, however, to allow one-dimensional attributes to be
    defined from one-dimensional variables, since all the
    required sizes exist. Do you think this is relatively straightforward?
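
    If it helps, here is a rough sketch of the netCDF calls I imagine
    being involved (the helper name and arguments are mine, not ncap
    internals): read the 1-D variable in data mode, then write its
    values as a same-sized attribute.
    #include <stdlib.h>
    #include <netcdf.h>
    /* Sketch only: copy a 1-D double variable into an attribute of the
       same variable; error checking and type generality omitted */
    static void var_1D_to_att(const int nc_id,const int var_id,const char *att_nm)
    {
      int dmn_id;
      size_t dmn_sz;
      double *val;
      nc_inq_vardimid(nc_id,var_id,&dmn_id); /* lone dimension of the 1-D variable */
      nc_inq_dimlen(nc_id,dmn_id,&dmn_sz); /* attribute length = dimension size */
      val=(double *)malloc(dmn_sz*sizeof(double));
      nc_get_var_double(nc_id,var_id,val); /* read values in data mode */
      nc_redef(nc_id); /* enter define mode to add the attribute */
      nc_put_att_double(nc_id,var_id,att_nm,NC_DOUBLE,dmn_sz,val);
      nc_enddef(nc_id);
      free(val);
    }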

    4. One of the most desirable new features would be the ability
    NOT to copy non-derived variables into the output file. In other
    words, to make it possible for the output file to be composed
    entirely of variables that occur on the LHS of statements in
    the input script (plus any associated coordinates, of course).
    Unless I am mistaken this capability does not currently exist.
    Do you feel like taking on this task or should I?

    5. I recommend you change from the ":" symbol (which CDL uses)
    to the "@" symbol to separate attribute names from variable
    names. The reason is that no one uses CDL, but many people use
    fortran9x, which sets the de facto standard for array notation.
    Looking to the future, we will want to be able to dynamically
    hyperslab arrays in ncap using fortran array notation, for
    which ":" is the delimiter. Examples of fortran9x-like statements
    that could potentially be supported in future versions of ncap are
    three_dmn_var=four_dmn_var(:,3,:,:)*9.0*foo
    two_dmn_var=three_dmn_var(1:lat_nbr,4,2:lon_nbr)*9.0*foo
    Also we should reserve the "->" combination to eventually allow
    file->variable@attribute notation in the scripts (another high
    level language named NCL uses -> for this purpose), and reserve
    "#" for matrix multiplication, e.g., c=a#b.

    6. Hardcoded dimension sizes that are exposed to the run-time
    environment should be avoided because they are a security risk.
    Lines like
    aed_sct *att_lst[500];
    seem to be vulnerable to malicious attacks by users who implement
    501 attributes and then use the memory overrun to gain shell access.
    The first step toward eliminating these is to name the bounds, e.g.,
    const int att_nbr_max=500;
    aed_sct *att_lst[att_nbr_max];
    so there are no magic numbers (500) running around.
    Even better is to figure out what the number should be at runtime.
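
    For instance, something like this (a sketch only, with illustrative
    names; it sizes a per-variable attribute list at runtime):
    int att_nbr; /* actual attribute count, known only at runtime */
    aed_sct **att_lst;
    (void)nc_inq_varnatts(nc_id,var_id,&att_nbr);
    att_lst=(aed_sct **)malloc(att_nbr*sizeof(aed_sct *));
    /* ...fill and use att_lst... */
    free(att_lst);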

    7. Probably the most challenging and useful feature that
    is not currently implemented is the ability of the LHS to
    track the size of the expression resulting from the RHS.
    The LHS appears to receive the rank of the first variable
    on the RHS, rather than the greatest-rank variable on the RHS.
    Thus foo=var_1D*var_2D causes foo to be a 1D var, while
    foo=var_2D*var_1D causes foo to be a 2D var (which happens to be
    correct in this example). Do you have plans to fix this?
    Until it is fixed, ncap should probably die with a warning
    that the arguments on the RHS must have the same rank or else
    the LHS might have an incorrect rank.
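
    To be concrete, here is a rough sketch of the kind of check I mean
    (the helper and its argument list are made up; the real operands
    would of course be var_sct pointers): pick the greatest rank on the
    RHS and die if the operand ranks do not agree.
    #include <stdio.h>
    #include <stdlib.h>
    /* Sketch only: given the ranks of the RHS operands, return the rank
       the LHS should receive, aborting when the operands disagree */
    static int ncap_lhs_rank(const int *rnk,const int opr_nbr)
    {
      int idx;
      int rnk_max=0;
      for(idx=0;idx<opr_nbr;idx++)
        if(rnk[idx]>rnk_max) rnk_max=rnk[idx];
      for(idx=0;idx<opr_nbr;idx++)
        if(rnk[idx]!=rnk_max){
          (void)fprintf(stderr,"ncap: ERROR RHS operands must have the same rank\n");
          exit(EXIT_FAILURE);
        } /* end if */
      return rnk_max;
    }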

    8. Thanks for using CVS. It really improves collaboration.
    But please take the additional step of adding entries to
    nco/doc/ChangeLog so that I can read this to keep track
    of what you are checking in. FYI, C-x 4 a in emacs automatically
    formats ChangeLog messages.

    It seems like the fix to #7 will have to involve keeping track of the RHS
    rank before defining (on disk, anyway) the LHS variable. The fix
    should also be compatible with future syntax/operator improvements to
    ncap. In particular, I am thinking that we will want to add
    ncwa-like dimension altering operations to ncap syntax. For example,
    foo=mean(var_3D) could define foo as a scalar. The possibilities
    are endless, and probably difficult to implement, but the solution
    to the problem of getting the correct rank on the LHS should be
    easily generalizable so that some functions (e.g., mean, sum)
    are allowed to change the rank of their operands.

    Well, I ended up writing more than I probably should have.
    But you've implemented a very rich numerical environment and
    there are undoubtedly important issues I've neglected.
    Please let me know your thoughts on the above issues.
    I think ncap will be ready to release/announce as alpha-state
    software once #7 is addressed.

    I've tagged the current version with your ncap changes as nco-2_1_0.
    Please feel free to make new tags once you have implemented new
    features that work. I'll post this to SourceForge as well.
    Please follow up the discussion there.

    Thanks,
    Charlie

    • henry Butowsky

      henry Butowsky - 2001-12-03

      Hi Charlie,

      You have raised some valid and challenging points.

      3) Early on in the project I made a fundamental mistake and decided to use the val_unn structure rather than ptr_unn in parse_sct. This means that "attributes" in ncap.y can only hold a single value. It is now too late to go back and change things. However, it will be relatively easy to save one-dimensional variables into att_lst, as this doesn't involve using parse_sct. Remember, though, that if these 1-D attributes are subsequently read then only the first value will be used.

      4) If you specify in the extraction list one of the variables on the LHS (assuming this variable is also defined in the input file), then ncap should extract only those variables on the LHS of the command script and the associated dimensions.
      e.g., ncap -v one -S try10.m in.nc foo.nc
      Note also that the variables associated with attributes on the LHS are also extracted.
      If you want to implement this feature with a separate flag, sure, go ahead.

      5) I shall change ncap.l so that var_nm@att_nm works.

      7) It is non-trivial to work out the final rank of an expression on the RHS. I propose a compromise. What if, for the var +,-,* operations, I check the rank of both operands and put the result into the higher-ranking of the two? The program will still crash out if users try to operate on variables with no common dimensions ...I shall think a bit more on this problem...

      TODO
      ----
      I shall do a code clean-up this week and implement 1),3),5),6)

      Other issues to ponder.
      9) In ncap.c, the non-processed variables are copied into the output file prior to the parser being called. This makes performance very sluggish because ncap_var_init() looks in the output file before the input file for variables, so all the I/O is on the output file.
      If we copy the non-processed variables after the parser, then performance is much enhanced, but the file is disordered, with new variables first, co-ordinate variables, and then fixed variables...
      Also, in ncap.c I don't think we need var_lst_divide().

      10) Variable operations in ncap.y

      At the moment, when an operation is performed on a variable, the resultant is allocated new space and the input variables are discarded. To speed things up I could pass on to the resultant variable the pointers in the appropriate input variable (rather than using var_dpl()). This would avoid unnecessary use of malloc and free, but would complicate the later freeing of memory...
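
      Roughly what I have in mind, as a toy sketch with a made-up struct
      (not the real var_sct): the result takes over the operand's value
      buffer and the donor pointer is nulled, so the buffer is freed
      exactly once later on.
      #include <stddef.h>
      /* Toy sketch: shallow transfer of the value buffer instead of var_dpl() */
      typedef struct{long sz; double *val;} tmp_var_sct;
      static tmp_var_sct var_val_steal(tmp_var_sct *donor)
      {
        tmp_var_sct rslt;
        rslt.sz=donor->sz;
        rslt.val=donor->val; /* no malloc(), no copy */
        donor->val=NULL; /* donor no longer owns the buffer */
        donor->sz=0L;
        return rslt;
      } /* caller eventually frees rslt.val exactly once */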

      Regards Henry

    • Charlie Zender

      Charlie Zender - 2001-12-04

      > 3) Early on in the project I made a fundamental mistake and decided to use the
      > val_unn structure rather than ptr_unn in parse_sct. This means that "attributes"
      > in ncap.y can only hold a single value. It is now too late to go back and change
      > things. However, it will be relatively easy to save one-dimensional variables
      > into att_lst, as this doesn't involve using parse_sct. Remember, though, that if
      > these 1-D attributes are subsequently read then only the first value will be used.

      OK

      > 4) If you specify in the extraction list one of the variables on the LHS (assuming
      > this variable is also defined in the input file), then ncap should extract only
      > those variables on the LHS of the command script and the associated dimensions.
      > e.g., ncap -v one -S try10.m in.nc foo.nc
      > Note also that the variables associated with attributes on the LHS are also
      > extracted.
      > If you want to implement this feature with a separate flag, sure, go ahead.

      Let me see if I have this straight:
      It seems like if the extraction list contains a variable already in the input file
      then it and all variables on the LHS of the script will be extracted.
      Variables defined solely in the script are not allowed in the extraction list,
      but are automatically placed in the output file.
      Thus all variables defined in the extraction list will always
      be in the output file. Is this correct?

      > 5) I shall change ncap.l so that var_nm@att_nm works.

      Good. Note that the mnemonic is better since "@" = at = attribute.

      > 7) It is non-trivial to work out the final rank of an expression on the RHS.
      > I propose a compromise. What if, for the var +,-,* operations, I check the rank
      > of both operands and put the result into the higher-ranking of the two?
      > The program will still crash out if users try to operate on variables with
      > no common dimensions ...I shall think a bit more on this problem...

      Yes it is non-trivial. Right now there are no rank-reducing operations
      supported by ncap so the LHS should have the same rank as the largest
      rank variable on the RHS. Using var_conform_dim to convert everything
      to this rank may be overkill but it should solve the problem and
      improvements, such as "broadcasting" variables only as needed, can
      be done later. No legal expression should contain variables with
      non-conforming dimensions so ncap should exit with an error in such
      cases. Consider the extreme expression
      x=1+1+1+1+four_dmn_var
      It would be nice if the first three additions were accomplished as
      scalar addition (i.e., added as single-element arrays) and only in
      the fourth addition was the scalar (now 3) broadcast into a rank
      four array and added element by element to four_dmn_var. I believe
      the parser receives tokens from the inside out and from left to
      right so achieving this kind of performance is certainly feasible.
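
      To illustrate the idea with a toy sketch (not ncap code; the struct
      and addition routine here are invented): each addition works at the
      size of its larger operand, so a running scalar result is expanded
      only when a higher-rank operand finally appears.
      #include <stdlib.h>
      /* Toy sketch: element-wise addition that broadcasts a single-element
         (scalar) operand against a larger one; assumes operands either
         conform in size or the smaller one is a scalar */
      typedef struct{int rnk; long sz; double *val;} toy_var_sct;
      static toy_var_sct toy_add(const toy_var_sct a,const toy_var_sct b)
      {
        const toy_var_sct big=(a.sz>=b.sz) ? a : b;
        const toy_var_sct sml=(a.sz>=b.sz) ? b : a;
        toy_var_sct rslt;
        long idx;
        rslt.rnk=big.rnk;
        rslt.sz=big.sz;
        rslt.val=(double *)malloc(big.sz*sizeof(double));
        for(idx=0;idx<big.sz;idx++)
          rslt.val[idx]=big.val[idx]+sml.val[sml.sz==1L ? 0L : idx];
        return rslt;
      }
      With left-to-right evaluation, the three leading additions in
      x=1+1+1+1+four_dmn_var stay single-element, and only the final
      addition allocates a rank-four result.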

      It is still true that if the LHS is defined in the output file
      before it is computed, the RHS must at least be scanned once
      to determine the rank of the LHS. The only way that I can see
      getting around this is not defining the LHS in the output file
      until the RHS is evaluated and its final rank is known. At that
      point the LHS can be defined and written to disk (if necessary)
      in one fell swoop. I know this is not the way things are done
      now because it's easier to define the LHS variable on disk
      before computing it. But delaying the definition until the RHS
      is evaluated is probably smarter in the long run. When ncap
      begins to support dimension reducing operations (such as mean())
      then it will become even more non-trivial to determine the LHS
      rank by a simple scan of the RHS variables.

      > TODO
      > ----
      > I shall do a code clean-up this week and implement 1),3),5),6)

      Cool.

      > Other issues to ponder.
      > 9) In ncap.c, the non-processed variables are copied into the output file prior
      > to the parser being called. This makes performance very sluggish because
      > ncap_var_init() looks in the output file before the input file for variables,
      > so all the I/O is on the output file.

      I do not understand why this would cause sluggish performance.
      Why would reading from the output file be slower than reading from the
      input file? I can see that it results in a delay of the first computation,
      but not why it slows down the entire script since the reads and writes
      must all be performed eventually anyway.

      > If we copy the non-processed variables after the parser, then performance is
      > much enhanced, but the file is disordered, with new variables first, co-ordinate
      > variables, and then fixed variables...

      Please explain why you think performance is much enhanced.
      I believe you, but I do not understand why.
      I do not worry about file "order" anymore except to try to
      make sure that all the non-record variables are defined (but not
      necessarily written) before the first record variable is written.
      Adding new non-record variables to a file already containing
      record variables will cause mucho disk activity as all record
      variables must then be moved.

      As far as I can tell, you are doing conservative memory management:
      Newly defined variables are written to disk immediately and then
      re-read from disk as needed in future statements, rather than being
      kept in memory for the whole program. I think this is wise.
      However, it does hog-tie you into having to write all LHS variables
      to disk, even if the user intends some of those to be only
      "intermediate" variables which he does not want to keep on disk.
      So we should think about alternatives for this, too.
      It may eventually make good sense to scan the whole script
      and keep in memory those variables that will be used in future
      statements, but write to disk (if requested) and then delete
      those variables that are not needed anymore. I know this is
      non-trivial to implement but I'm only thinking out loud about
      how the ideal solution would look, not necessarily expecting
      anyone to implement it.

      > Also, in ncap.c I don't think we need var_lst_divide().

      Maybe not. But there are problems with the extraction list
      that I've already mentioned (LHS is always output to disk,
      LHS-defined variable cannot be specified in extraction list).
      Probably an ncap-specific method is required to address
      these problems.

      > 10) Variable operations in ncap.y
      >
      > At the moment, when an operation is performed on a variable, the resultant is
      > allocated new space and the input variables are discarded. To speed things up I
      > could pass on to the resultant variable the pointers in the appropriate input
      > variable (rather than using var_dpl()). This would avoid unnecessary use of
      > malloc and free, but would complicate the later freeing of memory...

      Yes, this could open up a Pandora's box. I would say this type
      of optimization should wait until ncap is a bit more mature and
      things are not changing so rapidly.

      • henry Butowsky

        henry Butowsky - 2001-12-04

        Hi Charlie,
        I agree with you that the current extraction list method is over-engineered. I shall simplify it so that if the -v option is used then only the variables on the LHS of the command script are output. Hope this is OK.

        • henry Butowsky

          henry Butowsky - 2001-12-11

          Hi Charlie,
          I think we are ready for a code clean-up. I tried to hack the rank problem - see ncap_var_conform_dim in ncap_utl.c. It's not working properly at the mo.
          If you want to get an insight into how the parser is working, set debug=1 in ncap.y. You get loads of output, so try it with the -s switch of ncap, e.g., ncap -s one=1+1+1+three_dmn_var_int in.nc foo.nc

          Regards Henry

