
Benchmarking thread

Developers
2005-03-11
2013-10-17
  • Charlie Zender

    Charlie Zender - 2005-03-11

    Hi All,

    We're beginning to work on benchmarking and profiling NCO.
    I thought it would be useful to create a thread on that here so
    everyone can see what we're doing and contribute ideas.
    This message describes the draft design of a benchmark suite
    to measure arithmetic operator throughput with standard data
    processing. Harry is learning to use profiler tools in case we want
    more precise timing breakdowns. Hopefully the operator-level
    benchmarks described below can be refined in the future to include
    some of these file/function/algorithm-specific benchmarks.

    One design goal is to keep benchmarking portable so users at remote
    sites can run the benchmarking scripts on their own machines.
    That is why I prefer that benchmark input data files be created
    synthetically ('make benchmark'?) rather than, say, copying a bunch
    of GCM output files. Distributing large binary files to users
    (e.g., in debs and RPMs) is kludgy (and we wouldn't want to put such
    files in CVS).

    The first step is synthetic creation of benchmark input files.
    A short script to create, say, 1--10 GB of files in ${DATA_NCO}
    is needed. This script could be invoked with 'make benchmark', much
    like nco_tst.pl is invoked with 'make test'. It would manipulate
    a CDL file with ncgen and ncap to create the input files.

    We will benchmark many types of operations, so we will need
    different scales of input files to fully exercise the code.
    The CDL file ~/nco/data/big.cdl describes how to synthetically
    (i.e., with ncgen and ncap) create large files.
    One uncomments the appropriate dimension size in big.cdl, runs
    ncgen to create big.nc, then runs ncap.
    (in.cdl contains many small variables and is not well-suited
    for benchmarking arithmetic, since real data files almost never have
    such small dimensions.)

    Next is the specification for two different input file shapes,
    one for climate model data and one for satellite data.
    big.cdl can serve as a template for the CDL of these files,
    and ncap can generate the corresponding fully populated netCDF files.
    We can add other files if desired in the future.
    For multi-file operators, it will suffice to benchmark operators
    with the same input file repeated N times, e.g.,
    ncra ipcc_dly_T85.nc ipcc_dly_T85.nc ipcc_dly_T85.nc ... out.nc

    The first benchmark file is a synthetic proxy for CCSM IPCC daily
    climate datasets.
    We can call these "T85 IPCC daily files", e.g., ipcc_dly_T85.nc.
    Typical dimensionality is

    Total file size: ~1.1 GB
    # variables per file: 128 (all NC_FLOAT for simplicity)
    # 4-D variables per file: 32
    # 3-D variables per file: 64
    # 2-D variables per file: 16
    # 1-D variables per file: 8
    # 0-D variables per file: 8

    4-D dimensionality (C ordering): [time,lev,lat,lon]
    3-D dimensionality (C ordering): [time,lat,lon]
    2-D dimensionality (C ordering): [lat,lon]
    1-D dimensionality (C ordering): [time]
    0-D dimensionality (C ordering): N/A
    Dimension sizes: time=8, lev=32, lat=128, lon=256

    I obtained this information from T42 monthly files in
    esmf.ess.uci.edu:/ptmp/zender/archive/T42x1_40/atm
    and scaled and rounded metadata to represent T85 daily resolution.

    Let's make a second synthetic file typical of high resolution
    satellite imagery.
    We can call these "Satellite 5km files", e.g., stl_5km.nc.
    Typical dimensionality of these files is:

    Total file size: ~300 MB
    # variables per file: 8 (all NC_FLOAT for simplicity)
    # 4-D variables per file: 0
    # 3-D variables per file: 0
    # 2-D variables per file: 8
    # 1-D variables per file: 0
    # 0-D variables per file: 0

    2-D dimensionality (C ordering): [lat,lon]
    Dimension sizes: lat=180*12=2160, lon=360*12=4320
    This is based on multi-spectral imaging data (e.g., MODIS visible
    channels) at 5 km ~ 5 arcminute resolution globally.

    Let's do the same benchmarks with ipcc_dly_T85.nc and stl_5km.nc.
    Here's the operator-level benchmark wishlist:
    1. ncwa averaging all variables to scalars
    2. ncra time-averaging 2^5 (i.e., one month) files
    3. ncea averaging 2^5 files
    4. ncbo differencing two files
    5. ncks multi-slabbing many separate regions (goal is to re-create the
       Martin Schultz scenario in TODO nco258)
    6. ncap long algebraic operation
    7. ncpdq packing the file
    8. ncpdq dimension-order reversal the file

    We'll need some higher-level controls which allow us to explore
    parameter space with the benchmark suite.
    The parameter space we'd like to explore includes:
    1. Files of different geometries of interest. Initial geometries
       of interest are represented by ipcc_dly_T85.nc and stl_5km.nc.
       Default will be ipcc_dly_T85.nc.
    2. Shared memory performance, i.e., changing OMP_NUM_THREADS.
       Default will be OMP_NUM_THREADS=0 (i.e., no SMP).
    3. Performance through OPeNDAP servers.
       Input files should be allowed to be stored locally or on specified
       OPeNDAP server. Default is files stored locally.
       Default OPeNDAP server is initially soot.ess.uci.edu.
       Plan to add sand.ess.uci.edu and OptIPuter in a few weeks.

    Your comments, suggestions, extensions to this are welcome.

    Thanks,
    Charlie

     
    • Charlie Zender

      Charlie Zender - 2005-03-20

      Hi All,

      Let's conduct the benchmarking dialogues in the discussion forum
      so we create a self-maintaining archive of the problems and solutions.

      >      I'm running RedHat 9 on a 1700XP Athlon with 512MB memory, so the
      > tests I can undertake are far less intensive than the ones you can run.

      This is OK for now.
      Henry, please get an account on the ESMF by following the directions
      at http://www.ess.uci.edu/esmf/accounts.html
      You'll need a UCInet ID first:
      http://www.nacs.uci.edu/ucinetid/

      Let's target the benchmarks for beefy 64 bit machines with > 1 GB RAM.
      In other words, let's attack the bottlenecks which slow down beefy
      machines rather than regular desktops.

      >    I tried to create .nc files from ipcc_dly_T85.cdl & stl_5km.cdl - but
      > ncgen kept barfing (don't think it was a memory problem, I'm using
      > netcdf-3.6.0-pl) - so I created a huge variable (268M) using
      > left-hand casting in ncap.

      This may be a RAM problem where RAM+SWAP < "enough".
      Unfortunately, ncgen does not produce user-friendly error messages
      when malloc() fails...

      >    The "bottleneck" in casting is the action in ncap_yacc.y for
      > *out_var_exp '=' scv_exp*

      >    More specifically  ncap_var_stretch()  &  ncap_var_write().
      >      ncap_var_stretch does its stuff and so the memory used is at least
      > 2*(sizeof  var) --
      >    We have two options here -- optimize ncap_var_stretch() or replace
      > it in the action with a simple loop. This is up to Charlie. (scv_exp is
      > a single value).

      Good idea. I will create a special code branch to handle this simple case.

      >    ncap_var_write() is also sluggish with huge variables -- it's either
      > our code or the netcdf lib? Will try to find out which it is

      Thanks,
      Charlie

       
    • Charlie Zender

      Charlie Zender - 2005-03-21

      Notes on NCO benchmark performance on ipcc_dly_T85:

      Benchmark performance on my laptop:
      My 1 GB laptop does not run the ipcc_dly_T85 create script.
      It bombs about 30 minutes into creating the 4D variables.
      It emits only a malloc() error message; not sure whether
      this indicates a memory leak or just a RAM shortage.
      Harry, could you add memory usage diagnostics to make sure
      ncap does not leak during this long ncap command?
      Peak memory usage should not increase significantly within
      the 4D var creation commands.
      Is everything free'd() after each variable is created and before the
      next variable is created?
      We do want the benchmark script to run on small machines like our
      laptops, even though we will not optimize for small machines.

      Benchmark performance on ESMF:
      1. Create and store large data files in ${DATA} rather than ${HOME}.
         ${HOME} on many machines has quotas < 1 GB, and we don't want
         to inadvertently exceed usage, or cause test files to be backed up. 
      2. Creating *-out.nc from *.nc followed by mv *-out.nc *.nc is
         redundant because NCO always writes to a temporary file anyway.
      3. Try ncapping with -h so that history attribute does not grow.
         Growing the history attribute may impose a performance penalty by
         increasing file metadata size.
         netCDF stores metadata first in the file, so it has to move
         all subsequent data to make room for more metadata.
         That is why opening a file and replacing values in a variable
         is much faster than opening a file and adding some metadata.

      I modified Harry's ipcc_dly_T85.create.sh for #1 and #2 but
      otherwise left it unchanged (I did not try #3).
      Resulting script is in ~zender/nco/data on ESMF.
      Following are two results, before and after implementation
      of the ncap_var_stretch() patch Henry suggested.

      Disclaimer: I'm not really sure how to interpret 'time' output.
      Would someone explain to me why user+sys < 0.5*real?
      The patch appears to have improved user time performance by almost
      300% for the 4D variables :) Good call Henry!
      I've added the same optimization to nco_cnf_dmn(), which broadcasts
      variables in all non-ncap operators.
      I use a simple loop around memcpy() to fill a large buffer with
      the _same value_. This must be a very frequent task.
      Is there a faster way to fill a buffer with a single value?
      A different system call?

      A variation of this idea could be applied to the rank reducing
      operator functions, and would, e.g., speed up averaging (or
      minimization, maximization) of large 1-D variables by removing
      the expensive "reverse mapping" calculations from the inner loop over
      variable values. Same would apply to dimension permutation in ncpdq.
      Applying this speed-up in these cases would be a great way to learn
      the code and maybe find other optimizations in these algorithms which
      make heavy use of "reverse mapping".

      This illustrates the point that NCO is currently un-optimized for
      many arithmetic operations. We coded nearly everything to apply to the
      most complex case so that NCO would work in all situations.
      There are a lot of cherry picking optimizations that could speed up
      bottlenecks in special cases, e.g., one dimensional variables.
      Just designing the benchmarks has been very useful in highlighting the
      bottlenecks. All we've really done so far is try to create synthetic
      data for the benchmarks; imagine how much we'll learn when the
      benchmarks actually run....

      Thanks,
      Charlie

      **************************************************
      1. Before ncap_var_stretch() patch:
      **************************************************
      zender@esmf04m:~/nco/data$ time ./ipcc_dly_T85.create.sh

      Creating /ptmp/zender/nco/ipcc_dly_T85.nc file from template

      real    0m17.92s
      user    0m1.68s
      sys     0m4.89s

      Populating /ptmp/zender/nco/ipcc_dly_T85.nc file with floats
      This takes ~ on a 2.8GHz Xeon - YMMV

      Starting to populate variables
      ncapping 32 4D vars takes:

      real    9m32.72s
      user    1m56.45s
      sys     1m0.90s
      ncapping 64 3D vars takes:

      real    1m46.02s
      user    0m9.47s
      sys     0m14.46s
      ncapping 16 2D vars takes:

      real    0m39.26s
      user    0m3.26s
      sys     0m8.98s
      ncapping 8 1D vars takes

      real    0m27.86s
      user    0m3.18s
      sys     0m8.54s
      ncapping 8 scalar vars takes:

      real    0m58.97s
      user    0m3.92s
      sys     0m8.70s

      **************************************************
      2. After ncap_var_stretch() patch:
      **************************************************
      Creating /ptmp/zender/nco/ipcc_dly_T85.nc file from template

      real    0m6.87s
      user    0m1.60s
      sys     0m5.25s

      Populating /ptmp/zender/nco/ipcc_dly_T85.nc file with floats
      This takes ~ on a 2.8GHz Xeon - YMMV

      Starting to populate variables
      ncapping 32 4D vars takes:

      real    8m18.77s
      user    0m41.51s
      sys     1m1.26s
      ncapping 64 3D vars takes:

      real    1m39.99s
      user    0m6.40s
      sys     0m14.68s
      ncapping 16 2D vars takes:

      real    0m43.92s
      user    0m3.15s
      sys     0m8.70s
      ncapping 8 1D vars takes

      real    0m31.04s
      user    0m3.15s
      sys     0m8.60s
      ncapping 8 scalar vars takes:

      real    0m50.15s
      user    0m4.10s
      sys     0m9.03s

       
    • Charlie Zender

      Charlie Zender - 2005-03-22

      Hi All,

      I'm reviewing the ncap control structures to see how to speed it up.
      There are some straightforward changes that I'm making as I go along.
      I'll post these as I make them.

      1. Turns out the only NC_NOFILL in ncap.c was done in data mode
      after the variables were defined.
      Not sure why this did not cause an error.
      ncap was probably doing redundant data filling for the non-processed
      variables it defined.
      Fixed that.

      2. The call to nco_aed_prc() (a metadata writing call) was
      unnecessarily late in the program. I moved it to just after
      nco_var_dfn(). This takes advantage of the netCDF feature that
      metadata (not data) can be manipulated before it is flushed to
      disk (with nc_enddef()). You can learn/use a variable's ID
      before populating the variable's data section.
      Now ncap ends just after writing the non-processed variables.
      This avoids the heavy disk penalty of moving variables to accommodate
      the increased metadata from the nco_aed_prc() call.

      Benchmark results after these two changes follow below.
      The results show no improvement over the previous benchmark.
      Oh well. Worth a try and the code is a little cleaner.
      My initial impression is that ncap with many dozens of
      statements to initialize large files is going to be slow as
      long as we do all the nc_var_def()'s within the statement loop
      rather than all at once, after an initial pass to compute
      the variable dimensionality but before populating the variables.

      Thanks,
      Charlie

      **************************************************
      3. After ncap.c NC_NOFILL and nco_aed_prc() moves:
      **************************************************
      zender@esmf04m:~/nco/data$  time ./ipcc_dly_T85.create.sh
      Creating /ptmp/zender/nco/ipcc_dly_T85.nc file from template

      real    0m21.32s
      user    0m1.75s
      sys     0m4.64s

      Populating /ptmp/zender/nco/ipcc_dly_T85.nc file with floats
      This takes ~ on a 2.8GHz Xeon - YMMV

      Starting to populate variables
      ncapping 32 4D vars takes:

      real    20m41.78s
      user    0m41.54s
      sys     1m5.73s
      ncapping 64 3D vars takes:

      real    1m23.29s
      user    0m5.70s
      sys     0m12.11s
      ncapping 16 2D vars takes:

      real    0m9.75s
      user    0m2.43s
      sys     0m6.02s
      ncapping 8 1D vars takes

      real    0m17.95s
      user    0m2.37s
      sys     0m5.74s
      ncapping 8 scalar vars takes:

      real    0m33.59s
      user    0m2.83s
      sys     0m5.72s

       
    • Charlie Zender

      Charlie Zender - 2005-03-22

      Subject: Assessing and addressing ncap slowness, con't

      Hi All,

      Continuing to look for opportunities to speed up ncap:

      4. ncap_var_write() was not setting NC_NOFILL. Fixed that.

      5. Delayed creation of large variables until the end, i.e.,
         defined the variables in reverse order of their size.
         This reduces the number of bytes that have to be moved each
         statement to make room for the new variable.
         This is not an ncap patch, just a re-ordering of
         ~/nco/data/ipcc_dly_T85.create.sh.

      Re-ran benchmark after ncap patches #4 and #5, results below.
      There is little noticeable change in user or sys time due to #4.
      Not too surprising after results from #3 earlier.
      The script change #5 did not change the user or sys times
      noticeably, although real time did seem to decrease by about three
      minutes or ~20%. Whether this was due to natural variability
      on a multi-user system or to having less data to copy is
      unproven, but I'd say it's reason enough to order the script
      so that small variables are defined first.

      Thanks,
      Charlie

      **************************************************
      4. After adding NC_NOFILL to ncap_utl.c ncap_var_write()
      **************************************************
      zender@esmf04m:~/nco/data$ time ./ipcc_dly_T85.create.sh
      Creating /ptmp/zender/nco/ipcc_dly_T85.nc file from template

      real    0m6.92s
      user    0m1.94s
      sys     0m4.95s

      Populating /ptmp/zender/nco/ipcc_dly_T85.nc file with floats
      This takes ~ on a 2.8GHz Xeon - YMMV

      Starting to populate variables
      ncapping 32 4D vars takes:

      real    11m49.22s
      user    0m40.98s
      sys     0m59.94s
      ncapping 64 3D vars takes:

      real    1m36.79s
      user    0m5.80s
      sys     0m12.45s
      ncapping 16 2D vars takes:

      real    0m9.64s
      user    0m2.43s
      sys     0m6.30s
      ncapping 8 1D vars takes

      real    0m40.31s
      user    0m2.44s
      sys     0m5.94s
      ncapping 8 scalar vars takes:

      real    0m8.44s
      user    0m2.37s
      sys     0m6.05s

      **************************************************
      5. After reversing order of variable creation in ipcc_dly_T85.create.sh
      **************************************************

      zender@esmf04m:~/nco/data$ time ./ipcc_dly_T85.create.sh
      Creating /ptmp/zender/nco/ipcc_dly_T85.nc file from template

      real    0m7.09s
      user    0m2.01s
      sys     0m5.07s

      Populating /ptmp/zender/nco/ipcc_dly_T85.nc file with floats
      This takes ~ on a 2.8GHz Xeon - YMMV

      Starting to populate variables
      ncapping 8 scalar vars takes:

      real    0m44.10s
      user    0m2.60s
      sys     0m6.21s
      ncapping 8 1D vars takes

      real    0m8.91s
      user    0m2.56s
      sys     0m6.34s
      ncapping 16 2D vars takes:

      real    0m49.67s
      user    0m2.74s
      sys     0m6.20s
      ncapping 64 3D vars takes:

      real    1m25.92s
      user    0m5.96s
      sys     0m12.49s
      ncapping 32 4D vars takes:

      real    8m6.45s
      user    0m40.19s
      sys     1m0.07s

       
      • Harry Mangalam

        Harry Mangalam - 2005-03-22

        Hi Charlie,
        I think it might be b/c of variability. I meant to talk to you about
        this on the esmf. I've recorded times from 11 min to 23 min doing the
        ipcc test, and you posted one time of >20 min as well, I think. This
        seems quite variable, even considering it's a multiuser system with
        several jobs going. The 'real' time should be quite variable, but the
        user and sys times should be pretty constant, I thought.
        However, I did not do a check before and after of concurrent jobs.

         
        • Charlie Zender

          Charlie Zender - 2005-03-22

          Yes, it could be variability.
          One way to rule that out is to gather large statistics.
          Another way, on the ESMF anyway, would be to run the benchmark
          script on a dedicated CPU or node in the loadleveler queue and
          write to a node-local file system (/tmp) rather than (/ptmp).
          Real-time performance under those conditions should be
          extremely stable.

          Charlie

           
          • Charlie Zender

            Charlie Zender - 2005-03-23

            Subject: getrusage() and valgrind to ncap's rescue

            Hi,

            There is new code in ncap_var_write() which activates a hook
            to print getrusage() resource usage reports when NCO is
            compiled with

            cd ~/nco/bld;make 'USR_TKN=-DNCO_RUSAGE_DBG';cd -

            I ran

            zender@esmf04m:~/nco/data$ time ./ipcc_dly_T85.create.sh > foo 2>&1

            with this hook activated and appended the output below.
            You see a lot of gibberish and some memory and timing info
            every time a variable is written. I think the key diagnostic is

            rusage.ru_maxrss = maximum resident set size = ... [sz]

            maxrss appears to snapshot the memory devoted to the process
            (not its children, although that can also be got at).

            FYI there is an excellent quick tutorial on AIX memory profiling at
            http://www.arsc.edu/support/news/HPCnews/HPCnews293.shtml
            It shows that loadleveler is our friend, and also a tool called
            hpmcount which I have not used.
            Ergo we should make the benchmarks runnable under loadleveler
            eventually.

            For now, we see from the rusage diagnostics that
            ncap's maxrss monotonically increases with each variable written.
            Unless I am mistaken about maxrss, this is a memory leak.

            ulimit can be used to increase the amount of maxrss a machine
            gives a process, but ulimits are usually set to unlimited on Linux
            and the ESMF, so in practice fiddling with this won't get the
            benchmarks to run unless your PC has lots of RAM.
            According to the maxrss figures, completing the benchmark above
            requires 2 GB RAM just to get to d4_25! No wonder my lappie fails.

            The interesting question is: Where's the memory leak?
            I think valgrind should be able to find it:

            valgrind --leak-check=yes --suppressions=${HOME}/c++/valgrind.txt \
              ncap -O -v -s 'd3_01[lat,lon,lev]=1.0;' -s 'd3_02[lat,lon,lev]=1.0;' \
              ~/nco/data/in.nc ~/foo.nc > ~/foo 2>&1

            After wading through the chaff, valgrind says that

            ==15529== 384 bytes in 2 blocks are definitely lost in loss record 8 of 11
            ==15529==    at 0x1B90659D: malloc (vg_replace_malloc.c:130)
            ==15529==    by 0x805E7DB: nco_malloc_dbg (nco_mmr.c:149)
            ==15529==    by 0x806BD1F: nco_var_dpl (nco_var_utl.c:548)
            ==15529==    by 0x804D417: ncap_var_stretch (ncap_utl.c:942)
            ==15529==    by 0x804E5FE: yyparse (ncap_yacc.y:305)
            ==15529==    by 0x804B11E: main (ncap.c:535)

            It appears that LHS casting, at least, leaves an extra copy
            of the variable hanging around.
            Henry, would you please try finding the bug from here?

            Thanks,
            Charlie
            Creating /ptmp/zender/nco/ipcc_dly_T85.nc file from template

            real    0m27.26s
            user    0m1.81s
            sys    0m4.89s

            Populating /ptmp/zender/nco/ipcc_dly_T85.nc file with floats
            This takes ~ on a 2.8GHz Xeon - YMMV

            Starting to populate variables
            ncapping 8 scalar vars takes:
            ncap: INFO ncap_var_write() writing variable weepy
            ncap: INFO nco_mmr_rusage_prn() reports system type is AIX so rusage uses kilobytes [kB] for size and seconds [s] for time. Page size is 4096 B.
            ncap: INFO nco_mmr_rusage_prn() reports: rusage.ru_utime.tv_sec = user time used = 0 s, rusage.ru_utime.tv_usec = user time used = 0 us, rusage.ru_stime.tv_sec = system time used = 0 s, rusage.ru_stime.tv_usec = system time used = 10000 us, rusage.ru_maxrss = maximum resident set size = 3172 [sz], rusage.ru_ixrss = integral shared memory size =  13 [sz tm], rusage.ru_idrss = integral unshared data size = 2 [sz], rusage.ru_isrss = integral unshared stack size = 0 [sz], rusage.ru_minflt = page reclaims = 557, rusage.ru_majflt = page faults = 8, rusage.ru_nswap = swaps = 0
            ncap: INFO ncap_var_write() writing variable dopey
            ncap: INFO nco_mmr_rusage_prn() reports system type is AIX so rusage uses kilobytes [kB] for size and seconds [s] for time. Page size is 4096 B.
            ncap: INFO nco_mmr_rusage_prn() reports: rusage.ru_utime.tv_sec = user time used = 0 s, rusage.ru_utime.tv_usec = user time used = 0 us, rusage.ru_stime.tv_sec = system time used = 0 s, rusage.ru_stime.tv_usec = system time used = 10000 us, rusage.ru_maxrss = maximum resident set size = 3172 [sz], rusage.ru_ixrss = integral shared memory size =  13 [sz tm], rusage.ru_idrss = integral unshared data size = 2 [sz], rusage.ru_isrss = integral unshared stack size = 0 [sz], rusage.ru_minflt = page reclaims = 557, rusage.ru_majflt = page faults = 8, rusage.ru_nswap = swaps = 0
            ncap: INFO ncap_var_write() writing variable sleepy
            ncap: INFO nco_mmr_rusage_prn() reports system type is AIX so rusage uses kilobytes [kB] for size and seconds [s] for time. Page size is 4096 B.
            ncap: INFO nco_mmr_rusage_prn() reports: rusage.ru_utime.tv_sec = user time used = 0 s, rusage.ru_utime.tv_usec = user time used = 0 us, rusage.ru_stime.tv_sec = system time used = 0 s, rusage.ru_stime.tv_usec = system time used = 10000 us, rusage.ru_maxrss = maximum resident set size = 3172 [sz], rusage.ru_ixrss = integral shared memory size =  13 [sz tm], rusage.ru_idrss = integral unshared data size = 2 [sz], rusage.ru_isrss = integral unshared stack size = 0 [sz], rusage.ru_minflt = page reclaims = 557, rusage.ru_majflt = page faults = 8, rusage.ru_nswap = swaps = 0
            ncap: INFO ncap_var_write() writing variable grouchy
            ncap: INFO nco_mmr_rusage_prn() reports system type is AIX so rusage uses kilobytes [kB] for size and seconds [s] for time. Page size is 4096 B.
            ncap: INFO nco_mmr_rusage_prn() reports: rusage.ru_utime.tv_sec = user time used = 0 s, rusage.ru_utime.tv_usec = user time used = 0 us, rusage.ru_stime.tv_sec = system time used = 0 s, rusage.ru_stime.tv_usec = system time used = 10000 us, rusage.ru_maxrss = maximum resident set size = 3172 [sz], rusage.ru_ixrss = integral shared memory size =  13 [sz tm], rusage.ru_idrss = integral unshared data size = 2 [sz], rusage.ru_isrss = integral unshared stack size = 0 [sz], rusage.ru_minflt = page reclaims = 557, rusage.ru_majflt = page faults = 8, rusage.ru_nswap = swaps = 0
            ncap: INFO ncap_var_write() writing variable sneezy
            ncap: INFO nco_mmr_rusage_prn() reports system type is AIX so rusage uses kilobytes [kB] for size and seconds [s] for time. Page size is 4096 B.
            ncap: INFO nco_mmr_rusage_prn() reports: rusage.ru_utime.tv_sec = user time used = 0 s, rusage.ru_utime.tv_usec = user time used = 0 us, rusage.ru_stime.tv_sec = system time used = 0 s, rusage.ru_stime.tv_usec = system time used = 10000 us, rusage.ru_maxrss = maximum resident set size = 3172 [sz], rusage.ru_ixrss = integral shared memory size =  13 [sz tm], rusage.ru_idrss = integral unshared data size = 2 [sz], rusage.ru_isrss = integral unshared stack size = 0 [sz], rusage.ru_minflt = page reclaims = 557, rusage.ru_majflt = page faults = 8, rusage.ru_nswap = swaps = 0
            .
            .
            .
            ncap: INFO ncap_var_write() writing variable d4_22
            ncap: INFO nco_mmr_rusage_prn() reports system type is AIX so rusage uses kilobytes [kB] for size and seconds [s] for time. Page size is 4096 B.
            ncap: INFO nco_mmr_rusage_prn() reports: rusage.ru_utime.tv_sec = user time used = 23 s, rusage.ru_utime.tv_usec = user time used = 950000 us, rusage.ru_stime.tv_sec = system time used = 31 s, rusage.ru_stime.tv_usec = system time used = 180000 us, rusage.ru_maxrss = maximum resident set size = 1674384 [sz], rusage.ru_ixrss = integral shared memory size =  34850 [sz tm], rusage.ru_idrss = integral unshared data size = 26793945 [sz], rusage.ru_isrss = integral unshared stack size = 0 [sz], rusage.ru_minflt = page reclaims = 469667, rusage.ru_majflt = page faults = 65686, rusage.ru_nswap = swaps = 0
            ncap: INFO ncap_var_write() writing variable d4_23
            ncap: INFO nco_mmr_rusage_prn() reports system type is AIX so rusage uses kilobytes [kB] for size and seconds [s] for time. Page size is 4096 B.
            ncap: INFO nco_mmr_rusage_prn() reports: rusage.ru_utime.tv_sec = user time used = 25 s, rusage.ru_utime.tv_usec = user time used = 580000 us, rusage.ru_stime.tv_sec = system time used = 33 s, rusage.ru_stime.tv_usec = system time used = 930000 us, rusage.ru_maxrss = maximum resident set size = 1739920 [sz], rusage.ru_ixrss = integral shared memory size =  37120 [sz tm], rusage.ru_idrss = integral unshared data size = 29763122 [sz], rusage.ru_isrss = integral unshared stack size = 0 [sz], rusage.ru_minflt = page reclaims = 492729, rusage.ru_majflt = page faults = 74581, rusage.ru_nswap = swaps = 0
            ncap: INFO ncap_var_write() writing variable d4_24
            ncap: INFO nco_mmr_rusage_prn() reports system type is AIX so rusage uses kilobytes [kB] for size and seconds [s] for time. Page size is 4096 B.
            ncap: INFO nco_mmr_rusage_prn() reports: rusage.ru_utime.tv_sec = user time used = 27 s, rusage.ru_utime.tv_usec = user time used = 260000 us, rusage.ru_stime.tv_sec = system time used = 36 s, rusage.ru_stime.tv_usec = system time used = 520000 us, rusage.ru_maxrss = maximum resident set size = 1805456 [sz], rusage.ru_ixrss = integral shared memory size =  39456 [sz tm], rusage.ru_idrss = integral unshared data size = 32935037 [sz], rusage.ru_isrss = integral unshared stack size = 0 [sz], rusage.ru_minflt = page reclaims = 515384, rusage.ru_majflt = page faults = 82745, rusage.ru_nswap = swaps = 0
            ncap: INFO ncap_var_write() writing variable d4_25
            ncap: INFO nco_mmr_rusage_prn() reports system type is AIX so rusage uses kilobytes [kB] for size and seconds [s] for time. Page size is 4096 B.
            ncap: INFO nco_mmr_rusage_prn() reports: rusage.ru_utime.tv_sec = user time used = 28 s, rusage.ru_utime.tv_usec = user time used = 830000 us, rusage.ru_stime.tv_sec = system time used = 39 s, rusage.ru_stime.tv_usec = system time used = 340000 us, rusage.ru_maxrss = maximum resident set size = 1870992 [sz], rusage.ru_ixrss = integral shared memory size =  41661 [sz tm], rusage.ru_idrs

             
            • Charlie Zender

              Charlie Zender - 2005-03-24

              Hi All,

              I think I plugged the memory leak in ncap_var_stretch().
              This leak was definitely exercised in LHS-casting.
              I re-ran the benchmark, results attached below.
              The maxrss size held steady at 232 MB.
              This is much less than the 2+ GB from before the patch.
              The benchmark may now run on laptops.
              There's no significant change in user or sys times,
              which is a bit disappointing.
              I'd hoped that old memory requirements were causing data
              to be swapped so often that reducing the requirements (by
              fixing the memory leak) would speed-up the benchmark, at
              least in wall-clock time. Alack and alas, 'tis not so.

              Each 4D variable in this benchmark is about 33 MB.
              The peak memory requirement (232 MB) seems large compared to 33 MB.
              We expect ncap to require at least 2*sizeof(var), yet ncap wants
              ~8*sizeof(var). Anyone have any ideas as to why?

              Thanks,
              Charlie

              Creating /ptmp/zender/nco/ipcc_dly_T85.nc file from template

              real    0m6.33s
              user    0m1.59s
              sys    0m4.53s

              Populating /ptmp/zender/nco/ipcc_dly_T85.nc file with floats
              This takes ~ on a 2.8GHz Xeon - YMMV

              Starting to populate variables
              ncapping 8 scalar vars takes:

              real    0m41.68s
              user    0m2.41s
              sys    0m6.63s
              ncapping 8 1D vars takes:

              real    0m27.12s
              user    0m2.52s
              sys    0m6.21s
              ncapping 16 2D vars takes:

              real    0m25.09s
              user    0m2.72s
              sys    0m6.35s
              ncapping 64 3D vars takes:

              real    1m26.77s
              user    0m5.99s
              sys    0m11.72s
              ncapping 32 4D vars takes:

              real    7m15.07s
              user    0m41.58s
              sys     0m53.46s

               
      • Charlie Zender

        Charlie Zender - 2005-03-22

        Subject: Assessing and addressing ncap slowness, summary

        Hi All,

        Well, that's all the ncap patches that cried out for fixing.

        Here's my evaluation:
        1. The ncap script which creates the synthetic datasets for
           benchmarking stresses the system with disk activity.
        2. Further significant improvements may require ncap redesign.
           Entering and exiting metadata mode for a variable definition within
           each statement is, I think, what kills ncap on long scripts.
           A memory leak may also play a role, and is worth pursuing.
        3. Alternative methods of benchmark file creation may be much quicker,
           e.g., ftp'ing them over the network, but we learn more from trying
           to create these files synthetically!
        4. Threading ncap (it is currently not threaded) is unlikely to fix
           the problem, since much of the slowness appears to be due to I/O
           contention, which threading would worsen, not improve.
           Threading ncap is non-trivial anyway: because of potential
           dependencies between statements, it is not simple to thread over
           the statement loop.

        Harry is evaluating the memory usage of the benchmark script to
        verify whether there are any leaks with long ncap scripts.
        A leak is, I think, the likeliest explanation of why the benchmark
        data creation produces malloc() errors on my 1 GB laptop.
        BTW, the developer's forum thread of 20031121 has a nice discussion
        of ncap memory usage.
        If we're lucky we may get some memory profiling options built into
        the configure script.

        Finally, I've implemented one more enhancement suggested by Harry:

        6. Use memset(vp,0,sz) to zero float and integer arrays in nco_var_zero() and nco_zero_long()

        Benchmark after patching for #6 follows.
        No noticeable improvement in user or sys times.

        Thanks,
        Charlie

        **************************************************
        6. After using memset() in nco_var_zero() and nco_zero_long()
        **************************************************
        zender@esmf04m:~/nco/data$ time ./ipcc_dly_T85.create.sh
        Creating /ptmp/zender/nco/ipcc_dly_T85.nc file from template

        real    0m7.17s
        user    0m1.66s
        sys     0m5.49s

        Populating /ptmp/zender/nco/ipcc_dly_T85.nc file with floats
        This takes ~ on a 2.8GHz Xeon - YMMV

        Starting to populate variables
        ncapping 8 scalar vars takes:

        real    0m46.23s
        user    0m2.71s
        sys     0m6.03s
        ncapping 8 1D vars takes:

        real    0m27.40s
        user    0m2.80s
        sys     0m5.89s
        ncapping 16 2D vars takes:

        real    0m10.46s
        user    0m2.74s
        sys     0m6.23s
        ncapping 64 3D vars takes:

        real    1m42.11s
        user    0m5.71s
        sys     0m12.26s
        ncapping 32 4D vars takes:

        real    8m47.94s
        user    0m40.49s
        sys     0m59.98s

         
    • Harry Mangalam

      Harry Mangalam - 2005-04-13

      On Tuesday 12 April 2005 6:01 pm, Charlie Zender wrote:(quoting Martin Schultz)

      > if I am counting correctly, this will be item 120 on the wish list,
      > but perhaps there is a chance to move it up if other people also
      > think this would be useful:
      >
      > in analyzing model output, I often want to extract time series at
      > specific locations (stations) from the huge 3D fields. I have
      > therefore written a couple of scripts, which basically call ncrcat
      > as follows:
      >
      > ncrcat -h -d lon,13.42 -d lat,52.53 hist*.nc berlin.nc
      > ncrcat -h -d lon,10.97 -d lat,47.4 hist*.nc zugspitze.nc
      >
      > etc.
      >
      > As there are about 50-60 stations which I am interested in, this
      > mutates into a hard disk test program (in fact, we have had to
      > replace about 4 hard drives after I had run this script for a while
      > ;-)

      So if I'm reading this right, you want the benchmarks to reflect this
      disk thrashing test, so that further improvements in the code will
      decrease this effect. However, until we're actually addressing this
      problem, do you want to thrash disks at every iteration of the
      benchmark test?  He was probably kidding about the killed disks, but
      it's a consideration with large, poorly tuned ops.

      > Of course, this kind of extraction could be done much more
      > efficiently if all stations would be extracted while the file is
      > opened only once. What would be needed is some sort of "list input"
      > option, specifying all the slabs to be processed and the respective
      > output names. To take things one step further, this could even
      > evolve into a mini "scripting" language, so that e.g. ncra could
      > compute averages and standard deviations in one pass through the
      > data.

      the above sounds like what you've implemented in ncap - is the
      scripting stuff modular enough that it can be called in all of the
      operators or would it be easier to have this functionality
      implemented in ncap? 

      > The calling format could be something like
      > nccommand -list_input listfile in.nc
      > and the listfile would list the other options and the output files:
      > -d lon,13.42 -d lat,52.53 berlin.nc
      > -d lon,10.97 -d lat,47.4 zugspitze.nc
      >
      > Implicitly, this would probably require -A to be set automatically,
      > so that the output files are automatically continued (at least for
      > ncrcat; ncks might want to ask the user instead).

       
      • Charlie Zender

        Charlie Zender - 2005-04-13

        Hi Harry,

        >>in analyzing model output, I often want to extract time series at
        >>specific locations (stations) from the huge 3D fields. I have
        >>therefore written a couple of scripts, which basically call ncrcat
        >>as follows:
        >>
        >>ncrcat -h -d lon,13.42 -d lat,52.53 hist*.nc berlin.nc
        >>ncrcat -h -d lon,10.97 -d lat,47.4 hist*.nc zugspitze.nc

        FYI, there is a large and growing community of people who compare
        global models (or measurements) to station data, so this is a common
        situation. I have similar scripts that extract data from ~200 AERONET
        stations to compare models against measurements...

        >>As there are about 50-60 stations which I am interested in, this
        >>mutates into a hard disk test program (in fact, we have had to
        >>replace about 4 hard drives after I had run this script for a while
        >>;-)

        > So if I'm reading this right, you want the benchmarks to reflect this
        > disk thrashing test, so that further improvements in the code will
        > decrease this effect. However, until we're actually addressing this
        > problem, do you want to thrash disks at every iteration of the
        > benchmark test?  He was probably kidding about the killed disks, but
        > it's a consideration with large, poorly tuned ops.

        Yes, that's a good point. Let's keep this out of the benchmarks for now.
        Let's remember that this is something NCO is bad at, and thus an
        opportunity for future improvements.

        >>Of course, this kind of extraction could be done much more
        >>efficiently if all stations would be extracted while the file is
        >>opened only once. What would be needed is some sort of "list input"
        >>option, specifying all the slabs to be processed and the respective
        >>output names. To take things one step further, this could even
        >>evolve into a mini "scripting" language, so that e.g. ncra could
        >>compute averages and standard deviations in one pass through the
        >>data.

        > the above sounds like what you've implemented in ncap - is the
        > scripting stuff modular enough that it can be called in all of the
        > operators or would it be easier to have this functionality
        > implemented in ncap? 

        The needs of the Original Poster (Martin Schultz) cannot be met by
        ncap in anything like its present form, because ncap lacks two
        crucial features: it cannot handle multiple input files (it is not
        an MFO = multi-file operator), and it does not allow hyperslabbing.
        After further consideration, the OP's suggestion of a new "list
        input" syntax for hyperslabbing makes good sense as an addition to,
        not a replacement for, the current multi-slabbing algorithm (MSA).
        Such syntax would be very useful in the MFOs ncrcat and ncecat.

        It is debatable whether creating multiple output files in one
        invocation would be preferable to storing all the specified hyperslabs
        in one output file with a new dimension to glue together hyperslabs.
        ncecat creates a new dimension to glue together its input files, so
        the precedent is clear. The user could then hyperslab that single
        output file along the region dimension to create the final set of
        files.

        People should read my response to the OP on the Open Discussion forum
        on 2003-05-06. It's important to understand that the current MSA does
        allow multiple hyperslabs, but it has no way of linking a specific
        hyperslab in one dimension, e.g., lat, to a specific hyperslab along
        another dimension, e.g., lon, so it collects hyperslabs for each
        lat,lon combination instead.

        Charlie

         
    • Nobody/Anonymous

      Hi Charlie & Harry,
      Have coded some stuff that optimizes ncap -- it's a bit rough at the moment. Would it be possible to put it into a separate CVS branch until it's fully tested -- and if so, how?
      Files changed
      ncap.h
      ncap.c
      ncap_utl.c
      ncap_yacc.y
      ncap_lex.l
      Regards Henry

       
      • Harry Mangalam

        Harry Mangalam - 2005-04-14

        Hi Henry,
        That's great!  I can create a branch for you or you can - the process is pretty simple:
        - check out the base current code (you should give it a different name though - maybe nco_ncap-X with the -d opt)

        - cd into the newly checked out tree and tag it with a new name - maybe '2005_04_14_ncap-X_branch'

        The free RedBean CVS book (in pdf) describes this in detail, starting at p73.
        http://cvsbook.red-bean.com/

        You can do it or I can do it and then send you the checkout info. Lemme know.

         
    • Nobody/Anonymous

      Hi Harry,
      My optimized code has a bug - I will try to fix & release it this weekend
      Regards Henry

       
    • Nobody/Anonymous

      Hi Harry & Charlie,
      What I've done to optimize ncap is to parse the script twice.
      In the first parse, variables are "written to memory" rather than to disk. After this pass, the list of variables is defined in the output file.
      In the second parse these variables are populated with real data. Realizing this in practice means writing code around ncap_var_init() and ncap_var_write(). I'm so close to completion it hurts -- I just need to solve some memory problems -- ncap.in almost runs.

      The advantage of my approach is that if/when we add dimension-altering functions the code will still be able to calculate the dims/size of the final vars. One problem is double the memory leakage.

      (This is different from the old code, where the script was scanned once then parsed once.)

      Regards Henry

       
    • Charlie Zender

      Charlie Zender - 2005-04-18

      Hi Henry,

      The algorithm you describe sounds distinctly different from the one
      I proposed in the "ncap top level loop re-design" thread
      http://sourceforge.net/forum/message.php?msg_id=3073430
      which is where I will follow up with some questions....

      Charlie

       
    • Harry Mangalam

      Harry Mangalam - 2005-04-19

      Hi All,

      After much more time than I would have thunk, I've committed an updated script (<nco_root>/bm/nco_bm.pl) that combines the regression testing, file creationism & ncapping, and benchmarking.  It's not yet integrated into the Makefiles, but it will be very shortly (really).  Until then it can be used to manually run and generate regression results and benchmarks.  Call it by itself to get a usage description.
      NB: as of this time, on Linux x86, the regression testing passes everything but ncap, and completes the benchmark.  On Linux AMD64, it fails another regression test (which it should) that mysteriously passes on Linux x86, and it also completes the benchmark.
      On AIX, it fails just about everything miserably, so there's some code examination that needs to be done.

       
    • Harry Mangalam

      Harry Mangalam - 2005-04-22

      This may be known well by everyone else, but I just found it.
      Valgrind is a very useful tool for memory checking, but the underlying structure (a synthetic x86 CPU) allows it to also act as a profiler.  It comes with 'callgrind', a calltree/profiler that can extract a surprising amount of info that you can then visualize with the kcachegrind app.  Using this approach on an app that I (thought I) knew well was quite revealing: I could see where the real time was actually going.  One thing that I had spent a huge amount of time optimizing, because I thought it would be the core CPU sink, used less than 6% of the time (though maybe that's because I HAD spent so much time optimizing it).  Calls to memory allocation were much more expensive.

      Additionally, the callgrind approach is very nice because you don't have to instrument the code in any way, and it runs fairly quickly compared to using the compiler's -pg flag (and the code doesn't have to be re-compiled to UNinstrument it).

      Granted, this only describes what happens on an x86, but I'd imagine that the overview is probably quite useful in general.  Note that valgrind itself can run on any architecture, but it emulates an x86.

      harry

       
      • Charlie Zender

        Charlie Zender - 2005-04-23

        I am very curious what any profiler says about the benchmarks.
        Is valgrind fast-enough to permit profiling the benchmark script?
        Where does ncra spend its time inside the benchmark script?

        Thanks,
        Charlie

         
        • Harry Mangalam

          Harry Mangalam - 2005-04-23

           I think so - I did the set-up and testing work on my laptop - I'll run the ncra bench tomorrow on soot or sand (when it'll be raining...).  I thought I saw that it exists for amd64.

           It looks like it takes about 5 times longer for the callgrind runs, so that's easily do-able.

          hjm

           
    • Nobody/Anonymous

      Hi Charlie & Harry,
      Have tried to optimize LHS casting.
      It works OK with a scalar on the RHS but is temperamental with vars on the RHS, especially zero-dimensional vars.
      Will keep working away.

      Regards Henry

       
      • Charlie Zender

        Charlie Zender - 2005-05-04

        Hi All,

        FYI, I changed the prototype of ncap_var_init() to

        var_sct *ncap_var_init(const char * const,prs_sct *);

        to avoid breakage on Henry's most recent ncap commit.

        Charlie

        gcc-3.4 -std=c99 -pedantic -D_BSD_SOURCE  -DLINUX -DNO_NETCDF_2 -DVERSION='20050504' -DHOSTNAME='ashes' -DUSER='zender' -DNCO_ABORT_ON_ERROR -DHAVE_REGEX_H -DNCO_HAVE_REGEX_FUNCTIONALITY -DHAVE_MKSTEMP -DHAVE_GETOPT_H -DHAVE_GETOPT_LONG -DENABLE_UDUNITS -DHAVE_UDUNITS_H  -Di386 -I./ -I../src/nco -I/usr/local/include -I/usr/local/include  -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -Wall -O -g -march=pentium4  -c ../src/nco/ncap_utl.c -o /home/zender/obj/LINUX/ncap_utl.o
        ../src/nco/ncap_utl.c:15: error: conflicting types for 'ncap_var_init'
        ../src/nco/ncap.h:208: error: previous declaration of 'ncap_var_init' was here
        ../src/nco/ncap_utl.c:15: error: conflicting types for 'ncap_var_init'
        ../src/nco/ncap.h:208: error: previous declaration of 'ncap_var_init' was here
        ../src/nco/ncap_utl.c: In function `ncap_var_write':
        ../src/nco/ncap_utl.c:162: warning: unused variable `fll_md_old'

         
    • Nobody/Anonymous

      Hi All,
      Have committed corrections to ncap_*.[ch] & nco_cnf_typ.c, so double parsing now works.
      LHS casting should be even faster now (removed initialization of var data in var_LHS on the first pass).
      Please benchmark away.
      N.B. this is not the finished article.
      Regards Henry

       
    • Charlie Zender

      Charlie Zender - 2005-05-07

      Hi Henry,

      FYI, your modifications to ncap sometime in the last two weeks appear to have caused a regression with LHS casting. You may
      already know this, but in case you don't, the regression actually
      shows up in one of the ncks tests:

      /home/zender/bin/LINUX/ncks -O -v lat_T42,lon_T42,gw_T42 in.nc foo_T42.nc
      /home/zender/bin/LINUX/ncrename -O -d lat_T42,lat -d lon_T42,lon -v lat_T42,lat -v gw_T42,gw -v lon_T42,lon foo_T42.nc
      /home/zender/bin/LINUX/ncap -O -s "one[lat,lon]=lat*lon*0.0+1.0" -s "zero[lat,lon]=lat*lon*0.0" foo_T42.nc foo_T42.nc
      /home/zender/bin/LINUX/ncks -C -H -s "%g" -v one -F -d lon,128 -d lat,64 foo_T42.nc

      The LHS casting of lat*lon fails where it used to succeed.

      Thanks,
      Charlie

       