
MPI Optimization

Developers
2007-01-23
2013-10-17
  • henry Butowsky

    henry Butowsky - 2007-01-23

    Hi Guys,
       I've had a crack at coming up with an algorithm for sorting a list of "expressions" for execution by a multi-processor system(s).
       Before you ask -- yes, I've had the scissors and paper out and have been scheming and sorting to not much effect!
       The bottom line is that whatever method you use, all the Lvalues from all the expressions must be extracted and then searched for in each expression in the list. See below.

    Consider the following code fragment

    1)    A=(B, C )    0
    2)    C= (D, E)    1
    3)    F=(A,C)        1,2
    4)    G=(C,F)        2,3
    5)    H=(F,G)        3,4
    6)    L=(M1,M2,M3)    0
    7)    M=(M2,M3,M4)    0
    9)    N=(H,L)    5,6
    10)    N=(L,M,A)    1,6,7,9
    11)    O=(N1,N2,N3)    0
    12)    P=(L,M,O)    6,7,11
    13)    Q=(A,C,F,O)    1,2,3,11
    14)    R=(O,O1,O2,O3)    11
    15)    S=(L,M)    6,7

    Using the "pointers" on the  right we get the following list

    1)    A=(B, C )    0
    6)    L=(M1,M2,M3)    0
    7)    M=(M2,M3,M4)    0
    11)    O=(N1,N2,N3)    0
    --------------------------------------------
    2)    C=(D, E)    1
    12)    P=(L,M,O)    6,7,11
    14)    R=(O,O1,O2,O3)    11
    15)    S=(L,M)    6,7
    -------------------------------------------
    3)    F=(A,C)    1,2
    ------------------------------------------
    4)    G=(C,F)    2,3
    13)    Q=(A,C,F,O)    1,2,3,11
    -----------------------------------------
    5)    H=(F,G)    3,4
    -----------------------------------------
    9)    N=(H,L)    5,6
    -----------------------------------------
    10)    N=(L,M,A)    1,6,7,9
    ------------------------------------------
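    For what it's worth, the blocks above fall out of a simple level computation: statements with no dependencies go in block 0, and a statement whose deepest dependency sits in block k goes in block k+1. A minimal C++ sketch (the helper name stmt_lvl is invented; this is not actual ncap2 code):

```cpp
#include <algorithm>
#include <map>
#include <vector>

// Assign each statement to a basic block: a statement's block index is
// one more than the largest block index among its dependencies (0 when
// it has none). Assumes the dependency graph is acyclic and that every
// dependency listed is itself a key of stmt_dpn.
std::map<int,int> stmt_lvl(const std::map<int,std::vector<int>> &stmt_dpn)
{
  std::map<int,int> lvl; // statement number -> block index
  bool done=false;
  while(!done){
    done=true;
    for(const auto &stmt : stmt_dpn){
      if(lvl.count(stmt.first)) continue; // already placed
      int max_dpn=-1;
      bool ready=true;
      for(int dpn : stmt.second){
        if(!lvl.count(dpn)){ready=false;break;}
        max_dpn=std::max(max_dpn,lvl[dpn]);
      }
      if(ready) lvl[stmt.first]=max_dpn+1;
      else done=false; // revisit on the next pass
    }
  }
  return lvl;
}
```

    Feeding it the statement numbers and dependency lists from the fragment reproduces the seven blocks above exactly.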

    It feels like there should be some elegant sort-like algorithm, but I've yet to find it.
    With ncap2 note the following:
    A) Variables in the above fragment can be nco vars or attributes.

    B) A variable can appear more than once on the LHS -- e.g. a variable initialization followed later by a LHS hyperslab -- so
    the order of execution of these two statements in the final script has to be preserved.

    C) An expression can have more than one Lvalue, e.g.
        time_new= ++time;

    D) A variable can appear first on the RHS, then later on the LHS. This occurs in the above script (statements 1, 2).

    Regards Henry

     

     
    • Daniel L. Wang

      Daniel L. Wang - 2007-01-24

      Hi Henry,

      This is a fascinating and deep topic you're exploring.  I think you'll find that looking for 'instruction scheduling' for compilers will produce a lot of relevant information.  Instead of machine instructions, you're scheduling ncap2 statements, and instead of memory fetches, you're reading from netcdf files.

      There is definitely a need to extract all lvalues, and you are definitely correct in your observations.  The RHS of a statement in the ncap2 grammar can contain statements, just like in C, as you pointed out, and is typically split out, unless you target an architecture that has instructions that have multiple lvalues.  Also, case B is called a 'write-after-write (WAW) hazard' by computer architects and an 'output dependency' by computer scientists.  It sometimes helps to model the statement as having an extra implied dependency when scheduling, i.e. a=b+c is dependent on a, b, and c.  This will give you correct results, but if you're fancy, you can rename variables and fix them at the end to allow more parallelism (see 'register renaming' for more info).
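      As a toy illustration of the renaming idea (all names invented; this is not ncap2 internals): each rewrite of a name gets a fresh versioned name, and later reads are redirected to the latest version, so the WAW dependency disappears:

```cpp
#include <map>
#include <string>
#include <vector>

// Toy statement: one LHS name, RHS reduced to the list of names it reads.
struct Stmt {
  std::string lhs;
  std::vector<std::string> rhs;
};

// Return the current versioned name for n, e.g. "a" -> "a_1" after one rewrite.
static std::string cur_nm(std::map<std::string,int> &ver,const std::string &n){
  auto it=ver.find(n);
  return (it==ver.end()||it->second==0) ? n : n+"_"+std::to_string(it->second);
}

// Register-renaming sketch: each rewrite of a name gets a fresh version so
// the WAW hazard disappears; RHS reads are redirected to the latest version.
void rnm_stmts(std::vector<Stmt> &stmts){
  std::map<std::string,int> ver;
  for(Stmt &s : stmts){
    for(std::string &r : s.rhs) r=cur_nm(ver,r); // reads see latest version
    int k = ver.count(s.lhs) ? ++ver[s.lhs] : (ver[s.lhs]=0);
    if(k>0) s.lhs += "_"+std::to_string(k);      // fresh name for the rewrite
  }
}
```

      After renaming, a=b+c; a=d+e becomes a=b+c; a_1=d+e, and the two statements may be scheduled in parallel; a final pass would copy a_1 back to a.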

      Anyway, the approach you're using (sorting the statements into blocks) is sound, and should produce a speedup in *most* cases.  Unfortunately, in some cases, you reduce locality enough so that you lose performance by cache misses or memory pollution.  But don't worry about those cases.  Here's my advice:
      1) stay simple-- like you're doing things now
      2) go dynamic-- don't try to statically optimize a schedule at parse time.
      3) have a finite lookahead('issue window')-- don't look too far ahead for work to do.  Unless you're gonna exploit gobs of memory on big machines, this will save memory and locality.  I get away with my infinite lookahead because my machines have tons of memory, but this will bite me later.

      There really isn't any elegant algorithm, at least in practice.  There are elegant algorithms for optimizing finish times by favoring the deepest root nodes, or ones that pick instructions with lots of branches (or leaves), but they end up not being that practical because they don't account for execution time or I/O time.  This is why I favor the dynamic scheduling techniques implemented on modern CPUs.  They're simple and they work.
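      The issue-window idea in point 3 can be sketched like this (all names invented; a crude model, not any real CPU's scheduler): each cycle issues every statement among the first W unissued ones whose dependencies have already been issued. Because dependencies only point backwards in program order, the oldest pending statement is always ready, so the schedule never stalls:

```cpp
#include <algorithm>
#include <vector>

// Group statements into issue cycles with a lookahead window of W statements.
// dpn[i] lists the (earlier) statement indices that statement i depends on.
std::vector<std::vector<int>> issue_cycles(const std::vector<std::vector<int>> &dpn,size_t W){
  std::vector<bool> issued(dpn.size(),false);
  std::vector<int> pending;
  for(size_t i=0;i<dpn.size();i++) pending.push_back((int)i);
  std::vector<std::vector<int>> cycles;
  while(!pending.empty()){
    std::vector<int> now;
    size_t lim=std::min(W,pending.size());
    for(size_t j=0;j<lim;j++){ // look only W statements ahead for work
      int s=pending[j];
      bool ready=true;
      for(int d : dpn[s]) if(!issued[d]){ready=false;break;}
      if(ready) now.push_back(s);
    }
    for(int s : now){ // issue the whole group as one cycle
      issued[s]=true;
      pending.erase(std::find(pending.begin(),pending.end(),s));
    }
    cycles.push_back(now);
  }
  return cycles;
}
```

      A larger W exposes more parallelism per cycle but costs memory and locality, which is exactly the tradeoff above.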

      Please let me know if this makes sense or if you disagree or if you have a better idea.

      Cheers,
      -Daniel

       
    • henry Butowsky

      henry Butowsky - 2007-02-12

      Hi Charlie,
      Have committed the first cut of ncap_mpi_srt().
      Have so far checked it with ncap2.in & ncap12.in.
      You can enable the mpi sort with -x on the command line.
      If you switch on debug you can see the sorted basic blocks
      (look for debug output starting with fnc=ncap_mpi_srt()).

      e.g
      ncap -O -x -v -S ncap2.in in.nc foo.nc
      ncap -O -v -S ncap2.in in.nc foo1.nc
      ncbo foo.nc foo1.nc rslt.nc

      I need to spend a little bit of time re-writing my NcapVector object.
      Looking back, I used composition and function overloading of std::vector.
      A much better solution is to inherit from std::vector. It won't take long to do,
      and downstream there will be many performance benefits.

      Regards Henry

       
    • henry Butowsky

      henry Butowsky - 2007-02-19

      Hi Charlie, any comments on the mpi_sort ?
      Regards Henry

       
    • henry Butowsky

      henry Butowsky - 2007-03-13

      Hi Charlie,
      Been thinking ahead about threading and MPI.
      I think a lot of the *action* is going to revolve around prs_arg --
      the structure that maintains the vectors (symbol tables) which define the parser state.
      Currently prs_arg is defined as follows

      typedef struct{ /* prs_sct */
      public:
        char *fl_in; /* [sng] Input data file */
        int in_id; /* [id] Input data file ID */
        char *fl_out; /* [sng] Output data file */
        int out_id; /* [id] Output data file ID */

        NcapVector<dmn_sct*> *ptr_dmn_in_vtr;  //Vector of dimensions in input file nb doesn't change
        NcapVector<dmn_sct*> *ptr_dmn_out_vtr; //Vector of dimensions in output file
        NcapVector<sym_sct*> *ptr_sym_vtr;     //Vector of functions nb doesn't change
        NcapVarVector *ptr_var_vtr;            // list of attributes & variables
        NcapVarVector *ptr_int_vtr;            // stores vars/atts in FIRST PARSE
        bool ntl_scn;                          // [flg] Initial scan of script
        bool FORTRAN_IDX_CNV;                  // Use Fortran convention with hyperslab indices
        bool ATT_PROPAGATE;                    // Var on LHS gets attributes from the leftmost var on the RHS
        bool ATT_INHERIT;                      // Var on LHS inherits attributes from var of the same name
                                               // in the input file
        bool NCAP_MPI_SORT;                    // sort expressions after second parse for MPI optimization

      } prs_sct;

      The two problem vectors will be
      NcapVector<dmn_sct*> *ptr_dmn_out_vtr; //Vector of dimensions in output file
      NcapVarVector *ptr_var_vtr;            // list of attributes & variables

      and the two routines to be threaded are "surprise-surprise"
      ncap_var_init()
      ncap_var_write()
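      A hedged sketch of how the shared vector might be protected (using std::vector<std::string> as a stand-in for NcapVector<dmn_sct*>, since the real types aren't reproduced here): the check-then-insert that ncap_var_init() performs has to be atomic, otherwise two threads can both decide a dimension is missing and define it twice:

```cpp
#include <algorithm>
#include <mutex>
#include <string>
#include <vector>

// Stand-in for the shared output-dimension vector; the mutex makes the
// "is it in output? if not, add it" step atomic across threads.
struct DmnOutVtr {
  std::vector<std::string> dmn;
  std::mutex mtx;
  // Add the dimension iff it is not already defined in output.
  // Returns true when this call actually added it.
  bool add_if_absent(const std::string &nm){
    std::lock_guard<std::mutex> lck(mtx);
    if(std::find(dmn.begin(),dmn.end(),nm)!=dmn.end()) return false;
    dmn.push_back(nm);
    return true;
  }
};
```

      With netCDF3 the actual define-mode call would still have to be funneled through a single thread; the lock only protects the in-memory bookkeeping.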

      With my limited knowledge of MPI I can see the following
      ncap_var_init() checks that all the dims of the var to be initialized are in output. If not, it adds them to output (in define mode). The vector of output dimensions is ptr_dmn_out_vtr.
      If we have the following line in a script

      time(0:3)=lon*10.0   (lon is a co-ordinate var defined ONLY in input!! dim lon=4)

      Then on the first parse time is defined in output and the RHS of the expression is ignored. On the second parse the RHS is evaluated and $lon is defined in output. A similar situation could occur with the ternary operator -

      time2(0:3) =  time > 0 ?  lon : nlon ;

      ncap_var_write()
      This will only be called in MPI mode when the var to write has already been defined in output.
      To stop the routine going into define mode we could save the attributes --
      missing value, scale factor, offset value -- in the int_vtr. Once the parser has finished we could write out
      these values along with the regular attributes?

      Regards Henry

       
      • Charlie Zender

        Charlie Zender - 2007-03-15

        Hi Henry,

        > With my limited knowledge of MPI I can see the following
        > ncap_var_init() checks that all the dims of the var to be initialized are in
        > output. if not it adds them to output ( in define mode).

        Yes, this must be done in single-threaded mode with netCDF3, i.e.,
        by the master process before spawning additional threads/processes.

        > The list of output
        > vectors is  ptr_dmn_out_vtr.
        > If we have the following line in a script
        >
        > time(0:3)=lon*10.0   ( lon is co-ordinate var defined ONLY in input!!
        > dim lon=4)
        >
        > Then on the first parse time is defined in output and the RHS of the expression
        > is ignored. On the second parse -the RHS side is evaluated and $lon is defined
        > in output. A similar situation could occur with the ternary operator -
        >
        > time2(0:3) =  time > 0 ?  lon : nlon ;
        >
        > ncap_var_write()
        > This will only be called in MPI mode when the var to write has already been
        > defined in output.
        > To stop the routine going into define mode we could save the attributes --
        > missing value, scale factor, offset value -- in the int_vtr. Once the parser has finished
        > we could write out
        > these values along with the regular attributes?

        A couple of things come to mind:
        1. Read mpncecat.c. I think it is the simplest example of the token
        passing technique which is implemented in all the operators.
        The master process manages ownership of the write-access token.
        When a worker finishes its job, it requests the token from the
        master, writes the variable to the file, and gives the token back.
        While any worker has the token, the other workers are prevented from
        writing (but not from working). The same thing could happen with
        ncap2 eventually.
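        A toy single-process analogue of the token (not the MPI code in mpncecat.c; all names invented): a mutex stands in for the master-owned write token, so compute steps may overlap while writes serialize:

```cpp
#include <mutex>
#include <vector>

// Toy analogue of the write-token protocol: the mutex stands in for the
// master-owned token; a worker must hold it to write, then hands it back.
struct TokenLog {
  std::mutex token;        // the "write token"
  std::vector<int> writes; // which worker wrote, in token-grant order
  void worker(int id,int val,std::vector<int> &out){
    int rslt=val*val;                       // compute step: may overlap
    std::lock_guard<std::mutex> lck(token); // request the token; blocks if held
    out[id]=rslt;                           // write while holding the token
    writes.push_back(id);
  }                                         // token handed back on scope exit
};
```

        With std::thread driving the workers the compute steps genuinely overlap; the lock_guard acquires the token before the write and returns it on scope exit.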

        2. We've talked about the notion of intermediate variables (RAM only,
        never get written to disk). Please propose a syntax/sentinel for them.

        3. We've talked about only writing to disk when X or more bytes are
        ready. Maybe time to add that infrastructure soon.

        4. The easiest form of threading is OpenMP which I envision threading
        the loop of statements within a basic block. If we can isolate the
        ncap_var_write() statement in an OpenMP critical section (see, e.g.,
        ncecat.c) then this should be possible soon (now?), without all the
        complexity MPI adds.
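        A sketch of point 4 with toy work and invented names (the real pattern lives in ncecat.c): the statements of one basic block are independent, so the loop over them parallelizes, with the write kept in a critical section. Without an OpenMP compiler the pragmas are ignored and the loop simply runs serially:

```cpp
#include <vector>

// Evaluate the statements of one basic block in parallel with OpenMP;
// the "write" is wrapped in a critical section so only one thread at a
// time touches the shared output, mirroring the ncap_var_write() case.
std::vector<double> eval_block(const std::vector<double> &stmt_in){
  std::vector<double> out(stmt_in.size());
  #pragma omp parallel for
  for(int idx=0;idx<(int)stmt_in.size();idx++){
    double rslt=stmt_in[idx]*2.0; // independent work: threads share nothing
    #pragma omp critical
    { out[idx]=rslt; }            // serialized stand-in for the nc write
  }
  return out;
}
```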

        Charlie

         
    • henry Butowsky

      henry Butowsky - 2007-03-16

      Hi Charlie,

      1) Still trying to understand openMP & MPI in ncecat & mpncecat

      2) RAM variables - I think we're also going to need them for loops as well.
         How about declaring a RAM variable with a '#' prefix? Unless you have '#' in mind for matrix multiplication?
         e.g. #large_var[$time,$lat,$lon]=10.0;
         We will also need some kind of erase/delete var function so that users can manage their memory. I would like to implement this function like the other functions - so that I don't have to do anything special when I do the mpi sort. (This is why I have not yet implemented a print function.) (We did talk a while back about a general framework for functions/methods.) It would make sense for RAM vars to behave like regular vars, i.e. the first declaration defines the shape & type of the var. If they want to use one again for something else they will
      have to erase then redefine.
      e.g

      #fl[$time]=10.0;

      #fl(0:3)=1;   // integer is converted to type double
      erase(#fl);
      #fl=2000L;
        
      At the moment you can store a var in an attribute, which is kind of like a RAM var. The attribute size is only checked in ncap2.cc when it gets written to disk (NC_MAX_ATTRS).

      3) I'm not sure what you mean by intermediate variables. Was it the idea that you have a min size for a variable, so all vars less than the min size are stored in memory and then written to disk when the parse is complete?

      4) Both ncap_var_init() & ncap_var_write() will need rethinking for MPI. Perhaps it would be best to leave them as they are and add ncap_var_init_mpi()  ncap_var_write_mpi() ?

      Regards Henry

        

       
      • Charlie Zender

        Charlie Zender - 2007-03-16

        > 1) Still trying to understand openMP & MPI in ncecat & mpncecat

        OpenMP is much easier to understand and could be implemented in
        ncap2 very quickly. The OpenMP code is all in the regular ncecat.c (and in
        mpncecat.c). The only system on which I know you have a working
        OpenMP-enabled compiler is the ESMF. See below for example code.

        2) Let's use the function name "delete()" to remove RAM variables.
        Everything else here is fine, except I don't like the # sentinel.
        I'm thinking leading underscores, e.g., "_fred".

        3) For me, RAM variables = intermediate variables = variables you don't
        care to save in the output file. The storage thing was that we only
        write non-RAM variables to disk when a preset memory threshold is reached.
        This reduces # of times we need to enter critical regions (for writing)
        and enter/exit define mode. This is lower priority for now.
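        Point 3 might look something like this buffered-writer sketch (the names and byte accounting are invented placeholders): variables accumulate in RAM and are flushed as one batch once a byte threshold is crossed, so define mode and the write critical region are entered once per batch instead of once per variable:

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Buffer (name,bytes) pairs and flush in batches once the threshold is hit,
// so define-mode entry / the write critical region happens once per batch.
struct WrtBfr {
  size_t thr_byt;                 // flush threshold in bytes
  size_t pnd_byt=0;               // bytes currently buffered
  std::vector<std::pair<std::string,size_t>> pnd; // pending variables
  int flsh_nbr=0;                 // how many batch flushes occurred

  explicit WrtBfr(size_t thr):thr_byt(thr){}
  void add(const std::string &nm,size_t byt){
    pnd.push_back({nm,byt});
    pnd_byt+=byt;
    if(pnd_byt>=thr_byt) flsh();  // threshold crossed: write the batch
  }
  void flsh(){                    // stand-in for the real batched nc write
    if(pnd.empty()) return;
    pnd.clear();
    pnd_byt=0;
    flsh_nbr++;
  }
};
```

        A final flsh() when the parse completes catches whatever is left below the threshold.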

        4) Like the other NCOs, we can implement OpenMP code directly in the
        existing functions. But the MPI code is complex so yes, let's create
        *_mpi() versions for that.

        There is an example of using the C++ OpenMP
        specification in ~zender/c++/ccc.cc which you can see work
        by executing

        /u/zender/bin/AIX/ccc --dbg=1 --tst="omp"

        zender@esmf04m:~/c++$ /u/zender/bin/AIX/ccc --dbg=1 --tst="omp"
                Start = Fri Mar 16 09:23:37 2007
        ccc version 20070316 built Mar 16 2007 on esmf04m by zender
        ccc: INFO Command line = /u/zender/bin/AIX/ccc --dbg=1 --tst=omp
        Number of wavelengths in /u/zender/nco/data/in.nc is 2
        Value of wvl[0] in /u/zender/nco/data/in.nc is 5e-07
        ccc: INFO Ingested /u/zender/nco/data/in.nc
        libcsz_c++ reports dbg_lvl_get() returns 1
        Testing OpenMP implementation...
        OpenMP Activation token _OPENMP is true
        ccc: INFO Value of _OPENMP token is 199810
        ccc: INFO Environment variable OMP_NUM_THREADS = 8
        ccc: INFO Maximum number of threads system allows = omp_get_max_threads() = 8
        ccc: INFO Number of processors available = omp_get_num_procs() = 8
        ccc: INFO Reducing number of threads from 8 to 4 since ccc hits I/O bottleneck above 4 threads
        ccc: INFO Allowing OS to utilize dynamic threading
        ccc: INFO System will utilize dynamic threading
        ccc: INFO OpenMP threading with omp_get_num_threads() = 4 threads
        Values != -1 prove that parallel loop was executed:
        thr_idx[0]=0
        thr_idx[1]=1
        thr_idx[2]=2
        thr_idx[3]=3

        Charlie

         
    • henry Butowsky

      henry Butowsky - 2007-03-19

      Hi Charlie,
      Implementing RAM vars isn't going to be as clean as I first thought.
      I don't think it's a good idea to use '_' to indicate a RAM var, as it's one of the few metacharacters that ncgen allows as the first char in a variable name. Also, I recently changed the prefix for parser-generated variables from '_' to '~' (see ncoTree()). Also, people coming from a C programming environment will naturally assume they can use '_' wherever they like.

      The sentinel pool is quite small -- another option is '&'.
      Or we could use '*' as prefix in the initial declaration to indicate RAM var.
      e.g
      *new1=time;
      new1+=10; 
      *new2[$time,$lon]=20.0;

      Or we could declare the var a RAM type before we use it. e.g
      memory("var1");
      var1=three_dmn_var_dbl;
      new_avg=var1.avg($0);
      new_max=var1.max();
      new_min=var1.min();
      delete(var1);

      Another option is to use the pascal ':=' rather than '=' to indicate that the LHS is RAM var;
      var1:=three_dmn_var_dbl;
      var2[$lon,$lat]:=666;

      For LHS hyperslabs we need an in-memory nco_put_var() (there must be one about somewhere !!)
      For RHS hyperslabs we need an in-memory nco_get_var()
      Also, we may have to implement "delete()" in its own basic block, because subsequent redefinition of a RAM variable may mess up the first parse. e.g. consider

      &ram1=three_dmn_var_dbl-1.0;
      var_avg=&ram1.avg($0);
      delete(&ram1);
      &ram1=time;
      r_max=&ram1.max();

      I favour '&'

      Regards Henry

       
      • Charlie Zender

        Charlie Zender - 2007-03-19

        > Or we could use '*' as prefix in the initial declaration to indicate RAM var.

        Of the options you presented, I like this one best.
        The star conveys the notion of volatility.

        > For LHS hyperslabs need an in memory nco_put_var() (there must be one about
        > somewhere !!)
        > For RHS hyperslabs need an in memory nco_get_var()

        Can/should these be implemented as

        if(var_is_volatile){getfromRAM}else{getfromdisklikenormal};

        blocks in the routines one level up from the nco_put/get_var() calls?
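        Concretely, the branch one level up might look like this toy dispatch (the RAM store and names are invented; only the fall-through path would call the real nco_get_var()):

```cpp
#include <map>
#include <string>
#include <vector>

// Toy dispatch one level up from the get call: volatile (RAM) variables are
// served from an in-memory store; everything else falls through to "disk".
struct VarStore {
  std::map<std::string,std::vector<double>> ram; // RAM-variable storage
  std::vector<double> get(const std::string &nm){
    auto it=ram.find(nm);
    if(it!=ram.end()) return it->second; // get from RAM
    return dsk_get(nm);                  // get from disk like normal
  }
  // placeholder for the real disk read (nco_get_var() in NCO proper)
  std::vector<double> dsk_get(const std::string &nm){
    (void)nm;
    return {0.0};
  }
};
```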

        > Also may have to implement "delete()" in its own basic block- because subsequent
        > redefinition of a RAM variable may mess up the first parse. e.g consider

        OK but no need to account for all the corner cases now.
        Let's walk before we dance.

        Charlie

         
