
MPI Optimization

Developers
2007-01-23
2013-10-17
  • henry Butowsky

    henry Butowsky - 2007-01-23

    Hi Guys,
       I've had a crack at coming up with an algorithm for sorting a list of "expressions" for execution by a multi-processor system(s).
       Before you ask -- yes, I've had the scissors and paper out and have been scheming and sorting to not much effect!
       The bottom line is that whatever method you use, all the Lvalues from all the expressions must be extracted and then searched for in each expression in the list. See below.

    Consider the following code fragment

    1)    A=(B, C )    0
    2)    C= (D, E)    1
    3)    F=(A,C)        1,2
    4)    G=(C,F)        2,3
    5)    H=(F,G)        3,4
    6)    L=(M1,M2,M3)    0
    7)    M=(M2,M3,M4)    0
    9)    N=(H,L)    5,6
    10)    N=(L,M,A)    1,6,7,9
    11)    O=(N1,N2,N3)    0
    12)    P=(L,M,O)    6,7,11
    13)    Q=(A,C,F,O)    1,2,3,11
    14)    R=(O,O1,O2,O3)    11
    15)    S=(L,M)    6,7

    Using the "pointers" on the  right we get the following list

    1)    A=(B, C )    0
    6)    L=(M1,M2,M3)    0
    7)    M=(M2,M3,M4)    0
    11)    O=(N1,N2,N3)    0
    --------------------------------------------
    2)    C=(D, E)    1
    12)    P=(L,M,O)    6,7,11
    14)    R=(O,O1,O2,O3)    11
    15)    S=(L,M)    6,7
    -------------------------------------------
    3)    F=(A,C)    1,2
    ------------------------------------------
    4)    G=(C,F)    2,3
    13)    Q=(A,C,F,O)    1,2,3,11
    -----------------------------------------
    5)    H=(F,G)    3,4
    -----------------------------------------
    9)    N=(H,L)    5,6
    -----------------------------------------
    10)    N=(L,M,A)    1,6,7,9
    ------------------------------------------
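    For what it's worth, the blocks above fall out of a simple level computation: statements with no dependencies go in block 0, and a statement whose deepest dependency sits in block k goes in block k+1. A minimal C++ sketch (the helper name stmt_lvl is invented; this is not actual ncap2 code):

```cpp
#include <algorithm>
#include <map>
#include <vector>

// Assign each statement to a basic block: a statement's block index is
// one more than the largest block index among its dependencies (0 when
// it has none). Assumes the dependency graph is acyclic and that every
// dependency listed is itself a key of stmt_dpn.
std::map<int,int> stmt_lvl(const std::map<int,std::vector<int>> &stmt_dpn)
{
  std::map<int,int> lvl; // statement number -> block index
  bool done=false;
  while(!done){
    done=true;
    for(const auto &stmt : stmt_dpn){
      if(lvl.count(stmt.first)) continue; // already placed
      int max_dpn=-1;
      bool ready=true;
      for(int dpn : stmt.second){
        if(!lvl.count(dpn)){ready=false;break;}
        max_dpn=std::max(max_dpn,lvl[dpn]);
      }
      if(ready) lvl[stmt.first]=max_dpn+1;
      else done=false; // revisit on the next pass
    }
  }
  return lvl;
}
```

    Feeding it the statement numbers and dependency lists from the fragment reproduces the seven blocks above exactly.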

    It feels like there should be some elegant sort-like algorithm, but I've yet to find it.
    With ncap2 note the following:
    A) Variables in the above fragment can be nco vars or attributes.

    B) A variable can appear more than once on the LHS -- e.g. a variable initialization followed later by a LHS hyperslab -- so
    the order of execution of these two statements in the final script has to be preserved.

    C) An expression can have more than one Lvalue, e.g.
        time_new= ++time;

    D) A variable can appear first on the RHS, then later on the LHS. This occurs in the above script (statements 1, 2).

    Regards Henry

     

     
    • Daniel L. Wang

      Daniel L. Wang - 2007-01-24

      Hi Henry,

      This is a fascinating and deep topic you're exploring.  I think you'll find that looking for 'instruction scheduling' for compilers will produce a lot of relevant information.  Instead of machine instructions, you're scheduling ncap2 statements, and instead of memory fetches, you're reading from netcdf files.

      There is definitely a need to extract all lvalues, and you are definitely correct in your observations.  The RHS of a statement in the ncap2 grammar can contain statements, just like in C, as you pointed out, and is typically split out, unless you target an architecture that has instructions that have multiple lvalues.  Also, case B is called a 'write-after-write (WAW) hazard' by computer architects and an 'output dependency' by computer scientists.  It sometimes helps to model the statement as having an extra implied dependency when scheduling, i.e. a=b+c is dependent on a, b, and c.  This will give you correct results, but if you're fancy, you can rename variables and fix them at the end to allow more parallelism (see 'register renaming' for more info).
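      As a toy illustration of the renaming idea (all names invented; this is not ncap2 internals): each rewrite of a name gets a fresh versioned name, and later reads are redirected to the latest version, so the WAW dependency disappears:

```cpp
#include <map>
#include <string>
#include <vector>

// Toy statement: one LHS name, RHS reduced to the list of names it reads.
struct Stmt {
  std::string lhs;
  std::vector<std::string> rhs;
};

// Return the current versioned name for n, e.g. "a" -> "a_1" after one rewrite.
static std::string cur_nm(std::map<std::string,int> &ver,const std::string &n){
  auto it=ver.find(n);
  return (it==ver.end()||it->second==0) ? n : n+"_"+std::to_string(it->second);
}

// Register-renaming sketch: each rewrite of a name gets a fresh version so
// the WAW hazard disappears; RHS reads are redirected to the latest version.
void rnm_stmts(std::vector<Stmt> &stmts){
  std::map<std::string,int> ver;
  for(Stmt &s : stmts){
    for(std::string &r : s.rhs) r=cur_nm(ver,r); // reads see latest version
    int k = ver.count(s.lhs) ? ++ver[s.lhs] : (ver[s.lhs]=0);
    if(k>0) s.lhs += "_"+std::to_string(k);      // fresh name for the rewrite
  }
}
```

      After renaming, a=b+c; a=d+e becomes a=b+c; a_1=d+e, and the two statements may be scheduled in parallel; a final pass would copy a_1 back to a.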

      Anyway, the approach you're using (sorting the statements into blocks) is sound, and should produce a speedup in *most* cases.  Unfortunately, in some cases, you reduce locality enough so that you lose performance by cache misses or memory pollution.  But don't worry about those cases.  Here's my advice:
      1) stay simple-- like you're doing things now
      2) go dynamic-- don't try to statically optimize a schedule at parse time.
      3) have a finite lookahead('issue window')-- don't look too far ahead for work to do.  Unless you're gonna exploit gobs of memory on big machines, this will save memory and locality.  I get away with my infinite lookahead because my machines have tons of memory, but this will bite me later.

      There really isn't any elegant algorithm, at least in practice.  There are elegant algorithms for optimizing finish times by favoring the deepest root nodes, or ones that pick instructions with lots of branches (or leaves), but they end up not being that practical because they don't account for execution time or I/O time.  This is why I favor the dynamic scheduling techniques implemented on modern CPUs.  They're simple and they work.
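      The issue-window idea in point 3 can be sketched like this (all names invented; a crude model, not any real CPU's scheduler): each cycle issues every statement among the first W unissued ones whose dependencies have already been issued. Because dependencies only point backwards in program order, the oldest pending statement is always ready, so the schedule never stalls:

```cpp
#include <algorithm>
#include <vector>

// Group statements into issue cycles with a lookahead window of W statements.
// dpn[i] lists the (earlier) statement indices that statement i depends on.
std::vector<std::vector<int>> issue_cycles(const std::vector<std::vector<int>> &dpn,size_t W){
  std::vector<bool> issued(dpn.size(),false);
  std::vector<int> pending;
  for(size_t i=0;i<dpn.size();i++) pending.push_back((int)i);
  std::vector<std::vector<int>> cycles;
  while(!pending.empty()){
    std::vector<int> now;
    size_t lim=std::min(W,pending.size());
    for(size_t j=0;j<lim;j++){ // look only W statements ahead for work
      int s=pending[j];
      bool ready=true;
      for(int d : dpn[s]) if(!issued[d]){ready=false;break;}
      if(ready) now.push_back(s);
    }
    for(int s : now){ // issue the whole group as one cycle
      issued[s]=true;
      pending.erase(std::find(pending.begin(),pending.end(),s));
    }
    cycles.push_back(now);
  }
  return cycles;
}
```

      A larger W exposes more parallelism per cycle but costs memory and locality, which is exactly the tradeoff above.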

      Please let me know if this makes sense or if you disagree or if you have a better idea.

      Cheers,
      -Daniel

       
    • henry Butowsky

      henry Butowsky - 2007-02-12

      Hi Charlie,
      Have committed the first cut of ncap_mpi_srt().
      Have so far checked it with ncap2.in & ncap12.in.
      You can enable the mpi sort with -x on the command line.
      If you switch on debug you can see the sorted basic blocks
      (look for debug output starting with fnc=ncap_mpi_srt()).

      e.g
      ncap -O -x -v -S ncap2.in in.nc foo.nc
      ncap -O -v -S ncap2.in in.nc foo1.nc
      ncbo foo.nc foo1.nc rslt.nc

      I need to spend a little bit of time re-writing my NcapVector object.
      Looking back, I used composition and function overloading of std::vector.
      A much better solution is to inherit from std::vector. It won't take long to do,
      and downstream there will be many performance benefits.

      Regards Henry

       
    • henry Butowsky

      henry Butowsky - 2007-02-19

      Hi Charlie, any comments on the mpi_sort ?
      Regards Henry

       
    • henry Butowsky

      henry Butowsky - 2007-03-13

      Hi Charlie,
      Been thinking ahead about threading and MPI.
      I think a lot of the *action* is going to revolve around prs_arg --
      the structure that maintains the vectors (symbol tables) which define the parser state.
      Currently prs_arg is defined as follows

      typedef struct{ /* prs_sct */
      public:
        char *fl_in; /* [sng] Input data file */
        int in_id; /* [id] Input data file ID */
        char *fl_out; /* [sng] Output data file */
        int out_id; /* [id] Output data file ID */

        NcapVector<dmn_sct*> *ptr_dmn_in_vtr;  //Vector of dimensions in input file nb doesn't change
        NcapVector<dmn_sct*> *ptr_dmn_out_vtr; //Vector of dimensions in output file
        NcapVector<sym_sct*> *ptr_sym_vtr;     //Vector of functions nb doesn't change
        NcapVarVector *ptr_var_vtr;            // list of attributes & variables
        NcapVarVector *ptr_int_vtr;            // stores vars/atts in FIRST PARSE
        bool ntl_scn;                          // [flg] Initial scan of script
        bool FORTRAN_IDX_CNV;                  // Use Fortran convention with hyperslab indices
        bool ATT_PROPAGATE;                    // Var on LHS gets attributes from the leftmost var on the RHS
        bool ATT_INHERIT;                      // Var on LHS inherits attributes from var of the same name
                                               // in the input file
        bool NCAP_MPI_SORT;                    // sort expressions after second parse for MPI optimization

      } prs_sct;

      The two problem vectors will be
      NcapVector<dmn_sct*> *ptr_dmn_out_vtr; //Vector of dimensions in output file
      NcapVarVector *ptr_var_vtr;            // list of attributes & variables

      and the two routines to be threaded are "surprise-surprise"
      ncap_var_init()
      ncap_var_write()
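      A hedged sketch of how the shared vector might be protected (using std::vector<std::string> as a stand-in for NcapVector<dmn_sct*>, since the real types aren't reproduced here): the check-then-insert that ncap_var_init() performs has to be atomic, otherwise two threads can both decide a dimension is missing and define it twice:

```cpp
#include <algorithm>
#include <mutex>
#include <string>
#include <vector>

// Stand-in for the shared output-dimension vector; the mutex makes the
// "is it in output? if not, add it" step atomic across threads.
struct DmnOutVtr {
  std::vector<std::string> dmn;
  std::mutex mtx;
  // Add the dimension iff it is not already defined in output.
  // Returns true when this call actually added it.
  bool add_if_absent(const std::string &nm){
    std::lock_guard<std::mutex> lck(mtx);
    if(std::find(dmn.begin(),dmn.end(),nm)!=dmn.end()) return false;
    dmn.push_back(nm);
    return true;
  }
};
```

      With netCDF3 the actual define-mode call would still have to be funneled through a single thread; the lock only protects the in-memory bookkeeping.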

      With my limited knowledge of MPI I can see the following
      ncap_var_init() checks that all the dims of the var to be initialized are in output. If not, it adds them to output (in define mode). The vector of output dimensions is ptr_dmn_out_vtr.
      If we have the following line in a script

      time(0:3)=lon*10.0   (lon is a co-ordinate var defined ONLY in input!! dim lon=4)

      Then on the first parse time is defined in output and the RHS of the expression is ignored. On the second parse the RHS is evaluated and $lon is defined in output. A similar situation could occur with the ternary operator -

      time2(0:3) =  time > 0 ?  lon : nlon ;

      ncap_var_write()
      This will only be called in MPI mode when the var to write has already been defined in output.
      To stop the routine going into define mode we could save the attributes --
      missing value, scale factor, offset value -- in the int_vtr. Once the parser has finished we could write out
      these values along with the regular attributes?

      Regards Henry

       
      • Charlie Zender

        Charlie Zender - 2007-03-15

        Hi Henry,

        > With my limited knowledge of MPI I can see the following
        > ncap_var_init() checks that all the dims of the var to be initialized are in
        > output. if not it adds them to output ( in define mode).

        Yes, this must be done in single-threaded mode with netCDF3, i.e.,
        by the master process before spawning additional threads/processes.

        > The list of output
        > vectors is  ptr_dmn_out_vtr.
        > If we have the following line in a script
        >
        > time(0:3)=lon*10.0   ( lon is co-ordinate var defined ONLY in input!!
        > dim lon=4)
        >
        > Then on the first parse time is defined in output and the RHS of the expression
        > is ignored. On the second parse -the RHS side is evaluated and $lon is defined
        > in output. A similar situation could occur with the ternary operator -
        >
        > time2(0:3) =  time > 0 ?  lon : nlon ;
        >
        > ncap_var_write()
        > This will only be called in MPI mode when the var to write has already been
        > defined in output.
        > To stop the routine going into define mode we could save the attributes --
        > missing value, scale factor, offset value -- in the int_vtr. Once the parser has finished
        > we could write out
        > these values along with the regular attributes?

        A couple of things come to mind:
        1. Read mpncecat.c. I think it is the simplest example of the token
        passing technique which is implemented in all the operators.
        The master process manages ownership of the write-access token.
        When a worker finishes its job, it requests the token from the
        master, writes the variable to the file, and gives the token back.
        While any worker has the token, the other workers are prevented from
        writing (but not from working). The same thing could happen with
        ncap2 eventually.
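        A toy single-process analogue of the token (not the MPI code in mpncecat.c; all names invented): a mutex stands in for the master-owned write token, so compute steps may overlap while writes serialize:

```cpp
#include <mutex>
#include <vector>

// Toy analogue of the write-token protocol: the mutex stands in for the
// master-owned token; a worker must hold it to write, then hands it back.
struct TokenLog {
  std::mutex token;        // the "write token"
  std::vector<int> writes; // which worker wrote, in token-grant order
  void worker(int id,int val,std::vector<int> &out){
    int rslt=val*val;                       // compute step: may overlap
    std::lock_guard<std::mutex> lck(token); // request the token; blocks if held
    out[id]=rslt;                           // write while holding the token
    writes.push_back(id);
  }                                         // token handed back on scope exit
};
```

        With std::thread driving the workers the compute steps genuinely overlap; the lock_guard acquires the token before the write and returns it on scope exit.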

        2. We've talked about the notion of intermediate variables (RAM only,
        never get written to disk). Please propose a syntax/sentinel for them.

        3. We've talked about only writing to disk when X or more bytes are
        ready. Maybe time to add that infrastructure soon.

        4. The easiest form of threading is OpenMP which I envision threading
        the loop of statements within a basic block. If we can isolate the
        ncap_var_write() statement in an OpenMP critical section (see, e.g.,
        ncecat.c) then this should be possible soon (now?), without all the
        complexity MPI adds.
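        A sketch of point 4 with toy work and invented names (the real pattern lives in ncecat.c): the statements of one basic block are independent, so the loop over them parallelizes, with the write kept in a critical section. Without an OpenMP compiler the pragmas are ignored and the loop simply runs serially:

```cpp
#include <vector>

// Evaluate the statements of one basic block in parallel with OpenMP;
// the "write" is wrapped in a critical section so only one thread at a
// time touches the shared output, mirroring the ncap_var_write() case.
std::vector<double> eval_block(const std::vector<double> &stmt_in){
  std::vector<double> out(stmt_in.size());
  #pragma omp parallel for
  for(int idx=0;idx<(int)stmt_in.size();idx++){
    double rslt=stmt_in[idx]*2.0; // independent work: threads share nothing
    #pragma omp critical
    { out[idx]=rslt; }            // serialized stand-in for the nc write
  }
  return out;
}
```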

        Charlie

         
    • henry Butowsky

      henry Butowsky - 2007-03-16

      Hi Charlie,

      1) Still trying to understand openMP & MPI in ncecat & mpncecat

      2) RAM variables - I think we're also going to need them for loops as well.
         How about declaring a RAM variable with a '#' prefix? Unless you have '#' in mind for matrix multiplication?
         e.g. #large_var[$time,$lat,$lon]=10.0;
         We will also need some kind of erase/delete var function so that users can manage their memory. I would like to implement this function like the other functions - so that I don't have to do anything special when I do the mpi sort. (This is why I have not yet implemented a print function.) (We did talk a while back about a general framework for functions/methods.) It would make sense for RAM vars to behave like regular vars, i.e. the first declaration defines the shape & type of the var. If they want to use one again for something else they will
      have to erase then redefine.
      e.g

      #fl[$time]=10.0;

      #fl(0:3)=1;   // integer is converted to type double
      erase(#fl);
      #fl=2000L;
        
      At the moment you can store a var in an attribute, which is kind of like a RAM var. The attribute size is only checked in ncap2.cc when it gets written to disk (NC_MAX_ATTRS).

      3) I'm not sure what you mean by intermediate variables. Was it the idea that you have a min size for a variable, so all vars less than the min size are stored in memory and then written to disk when the parse is complete?

      4) Both ncap_var_init() & ncap_var_write() will need rethinking for MPI. Perhaps it would be best to leave them as they are and add ncap_var_init_mpi()  ncap_var_write_mpi() ?

      Regards Henry

        

       
      • Charlie Zender

        Charlie Zender - 2007-03-16

        > 1) Still trying to understand openMP & MPI in ncecat & mpncecat

        OpenMP is much easier to understand and could be implemented in
        ncap2 very quickly. The OpenMP code is all in the regular ncecat.c (and in
        mpncecat.c). The only system on which I know you have a working
        OpenMP-enabled compiler is the ESMF. See below for example code.

        2) Let's use the function name "delete()" to remove RAM variables.
        Everything else here is fine, except I don't like the # sentinel.
        I'm thinking leading underscores, e.g., "_fred".

        3) For me, RAM variables = intermediate variables = variables you don't
        care to save in the output file. The storage thing was that we only
        write non-RAM variables to disk when a preset memory threshold is reached.
        This reduces # of times we need to enter critical regions (for writing)
        and enter/exit define mode. This is lower priority for now.
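        Point 3 might look something like this buffered-writer sketch (the names and byte accounting are invented placeholders): variables accumulate in RAM and are flushed as one batch once a byte threshold is crossed, so define mode and the write critical region are entered once per batch instead of once per variable:

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Buffer (name,bytes) pairs and flush in batches once the threshold is hit,
// so define-mode entry / the write critical region happens once per batch.
struct WrtBfr {
  size_t thr_byt;                 // flush threshold in bytes
  size_t pnd_byt=0;               // bytes currently buffered
  std::vector<std::pair<std::string,size_t>> pnd; // pending variables
  int flsh_nbr=0;                 // how many batch flushes occurred

  explicit WrtBfr(size_t thr):thr_byt(thr){}
  void add(const std::string &nm,size_t byt){
    pnd.push_back({nm,byt});
    pnd_byt+=byt;
    if(pnd_byt>=thr_byt) flsh();  // threshold crossed: write the batch
  }
  void flsh(){                    // stand-in for the real batched nc write
    if(pnd.empty()) return;
    pnd.clear();
    pnd_byt=0;
    flsh_nbr++;
  }
};
```

        A final flsh() when the parse completes catches whatever is left below the threshold.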

        4) Like the other NCOs, we can implement OpenMP code directly in the
        existing functions. But the MPI code is complex so yes, let's create
        *_mpi() versions for that.

        There is an example of using the C++ OpenMP
        specification in ~zender/c++/ccc.cc which you can see work
        by executing

        /u/zender/bin/AIX/ccc --dbg=1 --tst="omp"

        zender@esmf04m:~/c++$ /u/zender/bin/AIX/ccc --dbg=1 --tst="omp"
                Start = Fri Mar 16 09:23:37 2007
        ccc version 20070316 built Mar 16 2007 on esmf04m by zender
        ccc: INFO Command line = /u/zender/bin/AIX/ccc --dbg=1 --tst=omp
        Number of wavelengths in /u/zender/nco/data/in.nc is 2
        Value of wvl[0] in /u/zender/nco/data/in.nc is 5e-07
        ccc: INFO Ingested /u/zender/nco/data/in.nc
        libcsz_c++ reports dbg_lvl_get() returns 1
        Testing OpenMP implementation...
        OpenMP Activation token _OPENMP is true
        ccc: INFO Value of _OPENMP token is 199810
        ccc: INFO Environment variable OMP_NUM_THREADS = 8
        ccc: INFO Maximum number of threads system allows = omp_get_max_threads() = 8
        ccc: INFO Number of processors available = omp_get_num_procs() = 8
        ccc: INFO Reducing number of threads from 8 to 4 since ccc hits I/O bottleneck above 4 threads
        ccc: INFO Allowing OS to utilize dynamic threading
        ccc: INFO System will utilize dynamic threading
        ccc: INFO OpenMP threading with omp_get_num_threads() = 4 threads
        Values != -1 prove that parallel loop was executed:
        thr_idx[0]=0
        thr_idx[1]=1
        thr_idx[2]=2
        thr_idx[3]=3

        Charlie

         
    • henry Butowsky

      henry Butowsky - 2007-03-19

      Hi Charlie,
      Implementing RAM vars isn't going to be as clean as I first thought.
      I don't think it's a good idea to use '_' to indicate a RAM var, as it's one of the few metacharacters that ncgen allows as the first char in a variable name. Also, I recently changed the prefix for parser-generated variables from '_' to '~' (see ncoTree()). Also, people coming from a C programming environment will naturally assume they can use '_' wherever they like.

      The sentinel pool is quite small -- another option is '&'.
      Or we could use '*' as prefix in the initial declaration to indicate RAM var.
      e.g
      *new1=time;
      new1+=10; 
      *new2[$time,$lon]=20.0;

      Or we could declare the var a RAM type before we use it. e.g
      memory("var1");
      var1=three_dmn_var_dbl;
      new_avg=var1.avg($0);
      new_max=var1.max();
      new_min=var1.min();
      delete(var1);

      Another option is to use the pascal ':=' rather than '=' to indicate that the LHS is RAM var;
      var1:=three_dmn_var_dbl;
      var2[$lon,$lat]:=666;

      For LHS hyperslabs we need an in-memory nco_put_var() (there must be one about somewhere !!)
      For RHS hyperslabs we need an in-memory nco_get_var()
      Also, we may have to implement "delete()" in its own basic block, because subsequent redefinition of a RAM variable may mess up the first parse. e.g. consider

      &ram1=three_dmn_var_dbl-1.0;
      var_avg=&ram1.avg($0);
      delete(&ram1);
      &ram1=time;
      r_max=&ram1.max();

      I favour '&'

      Regards Henry

       
      • Charlie Zender

        Charlie Zender - 2007-03-19

        > Or we could use '*' as prefix in the initial declaration to indicate RAM var.

        Of the options you presented, I like this one best.
        The star conveys the notion of volatility.

        > For LHS hyperslabs need an in memory nco_put_var() (there must be one about
        > somewhere !!)
        > For RHS hyperslabs need an in memory nco_get_var()

        Can/should these be implemented as

        if(var_is_volatile){getfromRAM}else{getfromdisklikenormal};

        blocks in the routines one level up from the nco_put/get_var() calls?
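        Concretely, the branch one level up might look like this toy dispatch (the RAM store and names are invented; only the fall-through path would call the real nco_get_var()):

```cpp
#include <map>
#include <string>
#include <vector>

// Toy dispatch one level up from the get call: volatile (RAM) variables are
// served from an in-memory store; everything else falls through to "disk".
struct VarStore {
  std::map<std::string,std::vector<double>> ram; // RAM-variable storage
  std::vector<double> get(const std::string &nm){
    auto it=ram.find(nm);
    if(it!=ram.end()) return it->second; // get from RAM
    return dsk_get(nm);                  // get from disk like normal
  }
  // placeholder for the real disk read (nco_get_var() in NCO proper)
  std::vector<double> dsk_get(const std::string &nm){
    (void)nm;
    return {0.0};
  }
};
```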

        > Also may have to implement "delete()" in its own basic block- because subsequent
        > redefinition of a RAM variable may mess up the first parse. e.g consider

        OK but no need to account for all the corner cases now.
        Let's walk before we dance.

        Charlie

         
