
ncap is nearly ready for primetime

Developers
2001-12-03
2013-10-17
  • Charlie Zender

    Charlie Zender - 2001-12-03

    Hi Henry,

    ncap is looking great...I'm almost speechless because it's so
    powerful. The size of the contribution is so immense it's going
    to take me some time just to evaluate it all.
    You are now quite the Bison, flex, and NCO expert!
    It's amazing to see arbitrary algebraic expressions computed
    and stored in netCDF format without re-compilation.
    Sure the code needs cleaning and there are a few glitches, but all the
    important stuff is there or is now much easier to add.

    I'm very tempted to start "cleaning up" the new code (that's as
    good a way to learn what you've done as any) but I want to
    give you a chance to finish any loose ends. Let me know when
    you're happy with the basic code and would not mind me doing
    some (non-answer changing) "cleanup".

    Here are some points, in no particular order, from my initial run
    through your test script and code examination:

    1. There are very few comments explaining what the new functions do.
    A "Purpose: " line at the beginning of each function would be good.
    In my opinion there is no such thing as too much documentation.

    2. The ability to mix variables and attributes is very powerful
    and very useful. I did not expect you to attempt this and it is
    mostly working. One thing I could not figure out, however, was how
    to specify global attributes rather than variable attributes.
    ncatted allows the name "global", when used in place of a variable
    name, to indicate the attribute is a global attribute and not
    a variable attribute. I've patched ncap.l to accept "global" to
    mean a global attribute. ":att_nm" could be implemented the same way.
    This is the only ncap-related patch that I have checked in.

    3. Attributes can be zero- or one-dimensional.
    The ncatted infrastructure supports both fully, but ncap has
    the problem that one-dimensional attributes cannot be converted
    to variables. Presumably this is difficult because there is no
    existing dimension corresponding to the number of elements
    in the attribute, so ncap would have to generate (and name)
    the new dimensions on the fly. It seems like it should be
    easier, however, to allow one-dimensional attributes to be
    defined from one-dimensional variables, since all the
    required sizes exist. Do you think this is relatively straightforward?
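
    If it helps, here is a rough sketch of the netCDF calls I imagine
    being involved (the helper name and arguments are mine, not ncap
    internals): read the 1-D variable in data mode, then write its
    values as a same-sized attribute.
    #include <stdlib.h>
    #include <netcdf.h>
    /* Sketch only: copy a 1-D double variable into an attribute of the
       same variable; error checking and type generality omitted */
    static void var_1D_to_att(const int nc_id,const int var_id,const char *att_nm)
    {
      int dmn_id;
      size_t dmn_sz;
      double *val;
      nc_inq_vardimid(nc_id,var_id,&dmn_id); /* lone dimension of the 1-D variable */
      nc_inq_dimlen(nc_id,dmn_id,&dmn_sz); /* attribute length = dimension size */
      val=(double *)malloc(dmn_sz*sizeof(double));
      nc_get_var_double(nc_id,var_id,val); /* read values in data mode */
      nc_redef(nc_id); /* enter define mode to add the attribute */
      nc_put_att_double(nc_id,var_id,att_nm,NC_DOUBLE,dmn_sz,val);
      nc_enddef(nc_id);
      free(val);
    }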

    4. One of the most desirable new features would be the ability
    NOT to copy non-derived variables into the output file. In other
    words, to make it possible for the output file to be composed
    entirely of variables that occur on the LHS of statements in
    the input script (plus any associated coordinates, of course).
    Unless I am mistaken this capability does not currently exist.
    Do you feel like taking on this task or should I?

    5. I recommend you change from the ":" symbol (which CDL uses)
    to the "@" symbol to separate attribute names from variable
    names. The reason is that no one uses CDL, but many people use
    fortran9x, which sets the de facto standard for array notation.
    Looking to the future, we will want to be able to dynamically
    hyperslab arrays in ncap using fortran array notation, for
    which ":" is the delimiter. Examples of fortran9x-like statements
    that could potentially be supported in future versions of ncap are
    three_dmn_var=four_dmn_var(:,3,:,:)*9.0*foo
    two_dmn_var=three_dmn_var(1:lat_nbr,4,2:lon_nbr)*9.0*foo
    Also we should reserve the "->" combination to eventually allow
    file->variable@attribute notation in the scripts (another high
    level language named NCL uses -> for this purpose), and reserve
    "#" for matrix multiplication, e.g., c=a#b.

    6. Hardcoded dimension sizes that are exposed to the run-time
    environment should be avoided because they are a security risk.
    Lines like
    aed_sct *att_lst[500];
    seem to be vulnerable to malicious attacks by users who implement
    501 attributes and then use the memory overrun to gain shell access.
    The first step toward eliminating these is to name the bounds, e.g.,
    const int att_nbr_max=500;
    aed_sct *att_lst[att_nbr_max];
    so there are no magic numbers (500) running around.
    Even better is to figure out what the number should be at runtime.
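
    For instance, something like this (a sketch only, with illustrative
    names; it sizes a per-variable attribute list at runtime):
    int att_nbr; /* actual attribute count, known only at runtime */
    aed_sct **att_lst;
    (void)nc_inq_varnatts(nc_id,var_id,&att_nbr);
    att_lst=(aed_sct **)malloc(att_nbr*sizeof(aed_sct *));
    /* ...fill and use att_lst... */
    free(att_lst);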

    7. Probably the most challenging and useful feature that
    is not currently implemented is the ability of the LHS to
    track the size of the expression resulting from the RHS.
    The LHS appears to receive the rank of the first variable
    on the RHS, rather than the greatest-rank variable on the RHS.
    Thus foo=var_1D*var_2D causes foo to be a 1D var, while
    foo=var_2D*var_1D causes foo to be a 2D var (which happens to be
    correct in this example). Do you have plans to fix this?
    Until it is fixed, ncap should probably die with a warning
    that the arguments on the RHS must have the same rank or else
    the LHS might have an incorrect rank.
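
    To be concrete, here is a rough sketch of the kind of check I mean
    (the helper and its argument list are made up; the real operands
    would of course be var_sct pointers): pick the greatest rank on the
    RHS and die if the operand ranks do not agree.
    #include <stdio.h>
    #include <stdlib.h>
    /* Sketch only: given the ranks of the RHS operands, return the rank
       the LHS should receive, aborting when the operands disagree */
    static int ncap_lhs_rank(const int *rnk,const int opr_nbr)
    {
      int idx;
      int rnk_max=0;
      for(idx=0;idx<opr_nbr;idx++)
        if(rnk[idx]>rnk_max) rnk_max=rnk[idx];
      for(idx=0;idx<opr_nbr;idx++)
        if(rnk[idx]!=rnk_max){
          (void)fprintf(stderr,"ncap: ERROR RHS operands must have the same rank\n");
          exit(EXIT_FAILURE);
        } /* end if */
      return rnk_max;
    }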

    8. Thanks for using CVS. It really improves collaboration.
    But please take the additional step of adding entries to
    nco/doc/ChangeLog so that I can read this to keep track
    of what you are checking in. FYI, C-x 4 a in emacs automatically
    formats ChangeLog messages.

    It seems like the fix to #7 will have to involve keeping track of the RHS
    rank before defining (on disk, anyway) the LHS variable. The fix
    should also be compatible with future syntax/operator improvements to
    ncap. In particular, I am thinking that we will want to add
    ncwa-like dimension altering operations to ncap syntax. For example,
    foo=mean(var_3D) could define foo as a scalar. The possibilities
    are endless, and probably difficult to implement, but the solution
    to the problem of getting the correct rank on the LHS should be
    easily generalizable so that some functions (e.g., mean, sum)
    are allowed to change the rank of their operands.

    Well, I ended up writing more than I probably should have.
    But you've implemented a very rich numerical environment and
    there are undoubtedly important issues I've neglected.
    Please let me know your thoughts on the above issues.
    I think ncap will be ready to release/announce as alpha-state
    software once #7 is addressed.

    I've tagged the current version with your ncap changes as nco-2_1_0.
    Please feel free to make new tags once you have implemented new
    features that work. I'll post this to SourceForge as well.
    Please follow up the discussion there.

    Thanks,
    Charlie

    • henry Butowsky

      henry Butowsky - 2001-12-03

      Hi Charlie,

      You have raised some valid and challenging points.

      3) Early on in the project I made a fundamental mistake and decided to use the val_unn structure rather than ptr_unn in parse_sct. This means that "attributes" in ncap.y can only hold a single value. It is now too late to go back and change things. However, it will be relatively easy to save one-dimensional variables into att_lst, as this doesn't involve using parse_sct. Remember, though, that if these 1-D attributes are subsequently read then only the first value will be used.

      4) If you specify in the extraction list one of the variables on the LHS (assuming this variable is also defined in the input file), then ncap should extract only those variables on the LHS of the command script and the associated dimensions.
      e.g., ncap -v one -S try10.m in.nc foo.nc
      Note also that the variables associated with attributes on the LHS are also extracted.
      If you want to implement this feature with a separate flag, sure, go ahead.

      5) I shall change ncap.l so that var_nm@att_nm works.

      7) It is non-trivial to work out the final rank of an expression on the RHS. I propose a compromise. What if, for the var +,-,* operations, I check the rank of both operands and put the result into the higher-ranking of the two? The program will still crash out if users try to operate on variables with no common dimensions ...I shall think a bit more on this problem...

      TODO
      ----
      I shall do a code clean-up this week and implement 1),3),5),6)

      Other issues to ponder.
      9) In ncap.c, the non-processed variables are copied into the output file prior to the parser being called. This makes performance very sluggish because ncap_var_init() looks in the output file before the input file for variables, so all the I/O is on the output file.
      If we copy the non-processed variables after the parser, then performance is much enhanced, but the file is disordered, with new variables first, co-ordinate variables, and then fixed variables...
      Also, in ncap.c I don't think we need var_lst_divide().

      10) Variable operations in ncap.y

      At the moment, when an operation is performed on a variable, the resultant is allocated new space and the input variables are discarded. To speed things up I could pass on to the resultant variable the pointers in the appropriate input variable (rather than using var_dpl()). This would avoid unnecessary use of malloc and free, but would complicate the later freeing of memory...
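
      Roughly what I have in mind, as a toy sketch with a made-up struct
      (not the real var_sct): the result takes over the operand's value
      buffer and the donor pointer is nulled, so the buffer is freed
      exactly once later on.
      #include <stddef.h>
      /* Toy sketch: shallow transfer of the value buffer instead of var_dpl() */
      typedef struct{long sz; double *val;} tmp_var_sct;
      static tmp_var_sct var_val_steal(tmp_var_sct *donor)
      {
        tmp_var_sct rslt;
        rslt.sz=donor->sz;
        rslt.val=donor->val; /* no malloc(), no copy */
        donor->val=NULL; /* donor no longer owns the buffer */
        donor->sz=0L;
        return rslt;
      } /* caller eventually frees rslt.val exactly once */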

      Regards Henry

    • Charlie Zender

      Charlie Zender - 2001-12-04

      > 3) Early on in the project I made a fundamental mistake and decided to use the
      > val_unn structure rather than ptr_unn in parse_sct. This means that "attributes"
      > in ncap.y can only hold a single value. It is now too late to go back and change
      > things. However, it will be relatively easy to save one-dimensional variables
      > into att_lst, as this doesn't involve using parse_sct. Remember, though, that if
      > these 1-D attributes are subsequently read then only the first value will be used.

      OK

      > 4) If you specify in the extraction list one of the variables on the LHS (assuming
      > this variable is also defined in the input file), then ncap should extract only
      > those variables on the LHS of the command script and the associated dimensions.
      > e.g., ncap -v one -S try10.m in.nc foo.nc
      > Note also that the variables associated with attributes on the LHS are also
      > extracted.
      > If you want to implement this feature with a separate flag, sure, go ahead.

      Let me see if I have this straight:
      It seems like if the extraction list contains a variable already in the input file
      then it and all variables on the LHS of the script will be extracted.
      Variables defined solely in the script are not allowed in the extraction list,
      but are automatically placed in the output file.
      Thus all variables defined in the extraction list will always
      be in the output file. Is this correct?

      > 5) I shall change ncap.l so that var_nm@att_nm works.

      Good. Note that the mnemonic is better since "@" = at = attribute.

      > 7) It is non-trivial to work out the final rank of an expression on the RHS.
      > I propose a compromise. What if, for the var +,-,* operations, I check the rank
      > of both operands and put the result into the higher-ranking of the two?
      > The program will still crash out if users try to operate on variables with
      > no common dimensions ...I shall think a bit more on this problem...

      Yes it is non-trivial. Right now there are no rank-reducing operations
      supported by ncap so the LHS should have the same rank as the largest
      rank variable on the RHS. Using var_conform_dim to convert everything
      to this rank may be overkill but it should solve the problem and
      improvements, such as "broadcasting" variables only as needed, can
      be done later. No legal expression should contain variables with
      non-conforming dimensions so ncap should exit with an error in such
      cases. Consider the extreme expression
      x=1+1+1+1+four_dmn_var
      It would be nice if the first three additions were accomplished as
      scalar addition (i.e., added as single-element arrays) and only in
      the fourth addition was the scalar (now 3) broadcast into a rank
      four array and added element by element to four_dmn_var. I believe
      the parser receives tokens from the inside out and from left to
      right so achieving this kind of performance is certainly feasible.
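
      To illustrate the idea with a toy sketch (not ncap code; the struct
      and addition routine here are invented): each addition works at the
      size of its larger operand, so a running scalar result is expanded
      only when a higher-rank operand finally appears.
      #include <stdlib.h>
      /* Toy sketch: element-wise addition that broadcasts a single-element
         (scalar) operand against a larger one; assumes operands either
         conform in size or the smaller one is a scalar */
      typedef struct{int rnk; long sz; double *val;} toy_var_sct;
      static toy_var_sct toy_add(const toy_var_sct a,const toy_var_sct b)
      {
        const toy_var_sct big=(a.sz>=b.sz) ? a : b;
        const toy_var_sct sml=(a.sz>=b.sz) ? b : a;
        toy_var_sct rslt;
        long idx;
        rslt.rnk=big.rnk;
        rslt.sz=big.sz;
        rslt.val=(double *)malloc(big.sz*sizeof(double));
        for(idx=0;idx<big.sz;idx++)
          rslt.val[idx]=big.val[idx]+sml.val[sml.sz==1L ? 0L : idx];
        return rslt;
      }
      With left-to-right evaluation, the three leading additions in
      x=1+1+1+1+four_dmn_var stay single-element, and only the final
      addition allocates a rank-four result.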

      It is still true that if the LHS is defined in the output file
      before it is computed, the RHS must at least be scanned once
      to determine the rank of the LHS. The only way that I can see
      getting around this is not defining the LHS in the output file
      until the RHS is evaluated and its final rank is known. At that
      point the LHS can be defined and written to disk (if necessary)
      in one fell swoop. I know this is not the way things are done
      now because it's easier to define the LHS variable on disk
      before computing it. But delaying the definition until the RHS
      is evaluated is probably smarter in the long run. When ncap
      begins to support dimension reducing operations (such as mean())
      then it will become even more non-trivial to determine the LHS
      rank by a simple scan of the RHS variables.

      > TODO
      > ----
      > I shall do a code clean-up this week and implement 1),3),5),6)

      Cool.

      > Other issues to ponder.
      > 9) In ncap.c, the non-processed variables are copied into the output file prior
      > to the parser being called. This makes performance very sluggish because
      > ncap_var_init() looks in the output file before the input file for variables,
      > so all the I/O is on the output file.

      I do not understand why this would cause sluggish performance.
      Why would reading from the output file be slower than reading from the
      input file? I can see that it results in a delay of the first computation,
      but not why it slows down the entire script since the reads and writes
      must all be performed eventually anyway.

      > If we copy the non-processed variables after the parser, then performance is
      > much enhanced, but the file is disordered, with new variables first, co-ordinate
      > variables, and then fixed variables...

      Please explain why you think performance is much enhanced.
      I believe you, but I do not understand why.
      I do not worry about file "order" anymore except to try to
      make sure that all the non-record variables are defined (but not
      necessarily written) before the first record variable is written.
      Adding new non-record variables to a file already containing
      record variables will cause mucho disk activity as all record
      variables must then be moved.

      As far as I can tell, you are doing conservative memory management:
      Newly defined variables are written to disk immediately and then
      re-read from disk as needed in future statements, rather than being
      kept in memory for the whole program. I think this is wise.
      However, it does hog-tie you into having to write all LHS variables
      to disk, even if the user intends some of those to be only
      "intermediate" variables which he does not want to keep on disk.
      So we should think about alternatives for this, too.
      It may eventually make good sense to scan the whole script
      and keep in memory those variables that will be used in future
      statements, but write to disk (if requested) and then delete
      those variables that are not needed anymore. I know this is
      non-trivial to implement but I'm only thinking out loud about
      how the ideal solution would look, not necessarily expecting
      anyone to implement it.

      > Also, in ncap.c I don't think we need var_lst_divide().

      Maybe not. But there are problems with the extraction list
      that I've already mentioned (LHS is always output to disk,
      LHS-defined variable cannot be specified in extraction list).
      Probably an ncap-specific method is required to address
      these problems.

      > 10) Variable operations in ncap.y
      >
      > At the moment, when an operation is performed on a variable, the resultant is
      > allocated new space and the input variables are discarded. To speed things up I
      > could pass on to the resultant variable the pointers in the appropriate input
      > variable (rather than using var_dpl()). This would avoid unnecessary use of
      > malloc and free, but would complicate the later freeing of memory...

      Yes, this could open up a Pandora's box. I would say this type
      of optimization should wait until ncap is a bit more mature and
      things are not changing so rapidly.

      • henry Butowsky

        henry Butowsky - 2001-12-04

        Hi Charlie,
        I agree with you that the current extraction list method is over-engineered. I shall simplify it so that if the -v option is used then only the variables on the LHS of the command script are output. Hope this is OK.

        • henry Butowsky

          henry Butowsky - 2001-12-11

          Hi Charlie,
          I think we are ready for a code clean-up. I tried to hack the rank problem - see ncap_var_conform_dim in ncap_utl.c. It's not working properly at the mo.
          If you want to get an insight into how the parser is working, set debug=1 in ncap.y. You get loads of output, so try it with the -s switch of ncap, e.g., ncap -s one=1+1+1+three_dmn_var_int in.nc foo.nc

          Regards Henry

