
Benchmarking thread

Developers
2005-03-11
2013-10-17
  • Charlie Zender

    Charlie Zender - 2005-03-11

    Hi All,

    We're beginning to work on benchmarking and profiling NCO.
    I thought it would be useful to create a thread on that here so
    everyone can see what we're doing and contribute ideas.
    This message describes the draft design of a benchmark suite
    to measure arithmetic operator throughput with standard data
    processing. Harry is learning to use profiler tools in case we want
    more precise timing breakdowns. Hopefully the operator-level
    benchmarks described below can be refined in the future to include
    some of these file/function/algorithm-specific benchmarks.

    One design goal is to keep benchmarking portable so users at remote
    sites can run the benchmarking scripts on their own machines.
    That is why I prefer that benchmark input data files be created
    synthetically ('make benchmark'?) rather than, say, copying a bunch
    of GCM output files. Distributing large binary files to users
    (e.g., in debs and RPMs) is kludgy (and we wouldn't want to put such
    files in CVS).

    The first step is synthetic creation of benchmark input files.
    A short script to create, say, 1--10 GB of files in ${DATA_NCO}
    is needed. This script could be invoked with 'make benchmark', much
    like nco_tst.pl is invoked with 'make test'. It would manipulate
    a CDL file with ncgen and ncap to create the input files.

    We will benchmark many types of operations, so we will need
    different scales of input files to fully exercise the code.
    The CDL file ~/nco/data/big.cdl describes how to synthetically
    (i.e., with ncgen and ncap) create large files.
    One uncomments the appropriate dimension size in big.cdl, runs
    ncgen to create big.nc, then runs ncap.
    (in.cdl contains many small variables and is not well-suited
    for benchmarking arithmetic, since real data files almost never have
    such small dimensions.)

    Next is the specification for two different input file shapes,
    one for climate model data and one for satellite data.
    big.cdl can serve as a template for the CDL of these files,
    and ncap can generate the corresponding fully populated netCDF files.
    We can add other files if desired in the future.
    For multi-file operators, it will suffice to benchmark operators
    with the same input file repeated N times, e.g.,
    ncra ipcc_dly_T85.nc ipcc_dly_T85.nc ipcc_dly_T85.nc ... out.nc

    The first benchmark file is a synthetic proxy for CCSM IPCC daily
    climate datasets.
    We can call these "T85 IPCC daily files", e.g., ipcc_dly_T85.nc.
    Typical dimensionality is

    Total file size: ~1.1 GB
    # variables per file: 128 (all NC_FLOAT for simplicity)
    # 4-D variables per file: 32
    # 3-D variables per file: 64
    # 2-D variables per file: 16
    # 1-D variables per file: 8
    # 0-D variables per file: 8

    4-D dimensionality (C ordering): [time,lev,lat,lon]
    3-D dimensionality (C ordering): [time,lat,lon]
    2-D dimensionality (C ordering): [lat,lon]
    1-D dimensionality (C ordering): [time]
    0-D dimensionality (C ordering): N/A
    Dimension sizes: time=8, lev=32, lat=128, lon=256

    I obtained this information from T42 monthly files in
    esmf.ess.uci.edu:/ptmp/zender/archive/T42x1_40/atm
    and scaled and rounded metadata to represent T85 daily resolution.

    Let's make a second synthetic file typical of high resolution
    satellite imagery.
    We can call these "Satellite 5km files", e.g., stl_5km.nc.
    Typical dimensionality of these files is:

    Total file size: ~300 MB
    # variables per file: 8 (all NC_FLOAT for simplicity)
    # 4-D variables per file: 0
    # 3-D variables per file: 0
    # 2-D variables per file: 8
    # 1-D variables per file: 0
    # 0-D variables per file: 0

    2-D dimensionality (C ordering): [lat,lon]
    Dimension sizes: lat=180*12=2160, lon=360*12=4320
    This is based on multi-spectral imaging data (e.g., MODIS visible
    channels) at 5 km ~ 5 arcminute resolution globally.

    Let's do the same benchmarks with ipcc_dly_T85.nc and stl_5km.nc.
    Here's the operator-level benchmark wishlist:
    1. ncwa averaging all variables to scalars
    2. ncra time-averaging 2^5 (i.e., one month) files
    3. ncea averaging 2^5 files
    4. ncbo differencing two files
    5. ncks multi-slabbing many separate regions (goal is to re-create the
       Martin Schultz scenario in TODO nco258)
    6. ncap long algebraic operation
    7. ncpdq packing the file
    8. ncpdq dimension-order reversal the file

    We'll need some higher-level controls which allow us to explore
    parameter space with the benchmark suite.
    The parameter space we'd like to explore includes:
    1. Files of different geometries of interest. Initial geometries
       of interest are represented by ipcc_dly_T85.nc and stl_5km.nc.
       Default will be ipcc_dly_T85.nc.
    2. Shared memory performance, i.e., changing OMP_NUM_THREADS.
       Default will be OMP_NUM_THREADS=0 (i.e., no SMP).
    3. Performance through OPeNDAP servers.
       Input files should be allowed to be stored locally or on specified
       OPeNDAP server. Default is files stored locally.
       Default OPeNDAP server is initially soot.ess.uci.edu.
       Plan to add sand.ess.uci.edu and OptIPuter in a few weeks.

    Your comments, suggestions, extensions to this are welcome.

    Thanks,
    Charlie

     
    • Charlie Zender

      Charlie Zender - 2005-03-20

      Hi All,

      Let's conduct the benchmarking dialogues in the discussion forum
      so we create a self-maintaining archive of the problems and solutions.

      >      I'm running RedHat 9 on a 1700XP Athlon with 512MB memory, so the
      > tests I can undertake are far less intensive than the ones you can run.

      This is OK for now.
      Henry, please get an account on the ESMF by following the directions
      at http://www.ess.uci.edu/esmf/accounts.html
      You'll need a UCInet ID first:
      http://www.nacs.uci.edu/ucinetid/

      Let's target the benchmarks for beefy 64 bit machines with > 1 GB RAM.
      In other words, let's attack the bottlenecks which slow down beefy
      machines rather than regular desktops.

      >    I tried to create .nc files from ipcc_dly_T85.cdl & stl_5km.cdl - but
      > ncgen kept barfing (don't think it was a memory problem, I'm using
      > netcdf-3.6.0-pl) - so I created a huge variable (268M) using
      > left-hand casting in ncap.

      This may be a RAM problem where RAM+SWAP < "enough".
      Unfortunately, ncgen does not produce user-friendly error messages
      when malloc() fails...

      >    The "bottleneck" in casting is the action in ncap_yacc.y for
      > *out_var_exp '=' scv_exp*

      >    More specifically  ncap_var_stretch()  &  ncap_var_write().
      >      ncap_var_stretch does its stuff and so the memory used is at least
      > 2*(sizeof  var) --
      >    We have two options here -- optimize ncap_var_stretch() or replace
      > it in the action with a simple loop. This is up to Charlie. (scv_exp is
      > a single value).

      Good idea. I will create a special code branch to handle this simple case.

      >    ncap_var_write() is also sluggish with huge variables -- it's either
      > our code or the netcdf lib? Will try to find out which it is

      Thanks,
      Charlie

       
    • Charlie Zender

      Charlie Zender - 2005-03-21

      Notes on NCO benchmark performance on ipcc_dly_T85:

      Benchmark performance on my laptop:
      My 1 GB laptop does not run the ipcc_dly_T85 create script.
      It bombs about 30 minutes into creating the 4D variables.
      It emits only a malloc() error message; not sure whether
      this indicates a memory leak or just a RAM shortage.
      Harry, could you add memory usage diagnostics to make sure
      ncap does not leak during this long ncap command?
      Peak memory usage should not increase significantly within
      the 4D var creation commands.
      Is everything free'd() after each variable is created and before the
      next variable is created?
      We do want the benchmark script to run on small machines like our
      laptops, even though we will not optimize for small machines.

      Benchmark performance on ESMF:
      1. Create and store large data files in ${DATA} rather than ${HOME}.
         ${HOME} on many machines has quotas < 1 GB, and we don't want
         to inadvertently exceed usage, or cause test files to be backed up. 
      2. Creating *-out.nc from *.nc followed by mv *-out.nc *.nc is
         redundant because NCO always writes to a temporary file anyway.
      3. Try ncapping with -h so that history attribute does not grow.
         Growing the history attribute may impose a performance penalty by
         increasing file metadata size.
         netCDF stores metadata first in the file, so it has to move
         all subsequent data to make room for more metadata.
         That is why opening a file and replacing values in a variable
         is much faster than opening a file and adding some metadata.

      I modified Harry's ipcc_dly_T85.create.sh for #1 and #2 but
      otherwise left it unchanged (I did not try #3).
      Resulting script is in ~zender/nco/data on ESMF.
      Following are two results, before and after implementation
      of the ncap_var_stretch() patch Henry suggested.

      Disclaimer: I'm not really sure how to interpret 'time' output.
      Would someone explain to me why user+sys < 0.5*real?
      The patch appears to have improved user time performance by almost
      300% for the 4D variables :) Good call Henry!
      I've added the same optimization to nco_cnf_dmn(), which broadcasts
      variables in all non-ncap operators.
      I use a simple loop around memcpy() to fill a large buffer with
      the _same value_. This must be a very frequent task.
      Is there a faster way to fill a buffer with a single value?
      A different system call?

      A variation of this idea could be applied to the rank reducing
      operator functions, and would, e.g., speed up averaging (or
      minimization, maximization) of large 1-D variables by removing
      the expensive "reverse mapping" calculations from the inner loop over
      variable values. Same would apply to dimension permutation in ncpdq.
      Applying this speed-up in these cases would be a great way to learn
      the code and maybe find other optimizations in these algorithms which
      make heavy use of "reverse mapping".

      This illustrates the point that NCO is currently un-optimized for
      many arithmetic operations. We coded nearly everything to apply to the
      most complex case so that NCO would work in all situations.
      There are a lot of cherry picking optimizations that could speed up
      bottlenecks in special cases, e.g., one dimensional variables.
      Just designing the benchmarks has been very useful in highlighting the
      bottlenecks. All we've really done so far is try to create synthetic
      data for the benchmarks; imagine how much we'll learn when the
      benchmarks actually run....

      Thanks,
      Charlie

      **************************************************
      1. Before ncap_var_stretch() patch:
      **************************************************
      zender@esmf04m:~/nco/data$ time ./ipcc_dly_T85.create.sh

      Creating /ptmp/zender/nco/ipcc_dly_T85.nc file from template

      real    0m17.92s
      user    0m1.68s
      sys     0m4.89s

      Populating /ptmp/zender/nco/ipcc_dly_T85.nc file with floats
      This takes ~ on a 2.8GHz Xeon - YMMV

      Starting to populate variables
      ncapping 32 4D vars takes:

      real    9m32.72s
      user    1m56.45s
      sys     1m0.90s
      ncapping 64 3D vars takes:

      real    1m46.02s
      user    0m9.47s
      sys     0m14.46s
      ncapping 16 2D vars takes:

      real    0m39.26s
      user    0m3.26s
      sys     0m8.98s
      ncapping 8 1D vars takes

      real    0m27.86s
      user    0m3.18s
      sys     0m8.54s
      ncapping 8 scalar vars takes:

      real    0m58.97s
      user    0m3.92s
      sys     0m8.70s

      **************************************************
      2. After ncap_var_stretch() patch:
      **************************************************
      Creating /ptmp/zender/nco/ipcc_dly_T85.nc file from template

      real    0m6.87s
      user    0m1.60s
      sys     0m5.25s

      Populating /ptmp/zender/nco/ipcc_dly_T85.nc file with floats
      This takes ~ on a 2.8GHz Xeon - YMMV

      Starting to populate variables
      ncapping 32 4D vars takes:

      real    8m18.77s
      user    0m41.51s
      sys     1m1.26s
      ncapping 64 3D vars takes:

      real    1m39.99s
      user    0m6.40s
      sys     0m14.68s
      ncapping 16 2D vars takes:

      real    0m43.92s
      user    0m3.15s
      sys     0m8.70s
      ncapping 8 1D vars takes

      real    0m31.04s
      user    0m3.15s
      sys     0m8.60s
      ncapping 8 scalar vars takes:

      real    0m50.15s
      user    0m4.10s
      sys     0m9.03s

       
    • Charlie Zender

      Charlie Zender - 2005-03-22

      Hi All,

      I'm reviewing the ncap control structures to see how to speed it up.
      There are some straightforward changes that I'm making as I go along.
      I'll post these as I make them.

      1. Turns out the only NC_NOFILL in ncap.c was done in data mode
      after the variables were defined.
      Not sure why this did not cause an error.
      ncap was probably doing redundant data filling for the non-processed
      variables it defined.
      Fixed that.

      2. The call to nco_aed_prc() (a metadata writing call) was
      unnecessarily late in the program. I moved it to just after
      nco_var_dfn(). This takes advantage of the netCDF feature that
      metadata (not data) can be manipulated before it is flushed to
      disk (with nc_enddef()). You can learn/use a variable's ID
      before populating the variable's data section.
      Now ncap ends just after writing the non-processed variables.
      This avoids the heavy disk penalty of moving variables to accommodate
      the increased metadata from the nco_aed_prc() call.

      Benchmark results after these two changes follow below.
      The results show no improvement over the previous benchmark.
      Oh well. Worth a try and the code is a little cleaner.
      My initial impression is that ncap with many dozens of
      statements to initialize large files is going to be slow as
      long as we do all the nc_var_def()'s within the statement loop
      rather than all at once, after an initial pass to compute
      the variable dimensionality but before populating the variables.

      Thanks,
      Charlie

      **************************************************
      3. After ncap.c NC_NOFILL and nco_aed_prc() moves:
      **************************************************
      zender@esmf04m:~/nco/data$  time ./ipcc_dly_T85.create.sh
      Creating /ptmp/zender/nco/ipcc_dly_T85.nc file from template

      real    0m21.32s
      user    0m1.75s
      sys     0m4.64s

      Populating /ptmp/zender/nco/ipcc_dly_T85.nc file with floats
      This takes ~ on a 2.8GHz Xeon - YMMV

      Starting to populate variables
      ncapping 32 4D vars takes:

      real    20m41.78s
      user    0m41.54s
      sys     1m5.73s
      ncapping 64 3D vars takes:

      real    1m23.29s
      user    0m5.70s
      sys     0m12.11s
      ncapping 16 2D vars takes:

      real    0m9.75s
      user    0m2.43s
      sys     0m6.02s
      ncapping 8 1D vars takes

      real    0m17.95s
      user    0m2.37s
      sys     0m5.74s
      ncapping 8 scalar vars takes:

      real    0m33.59s
      user    0m2.83s
      sys     0m5.72s

       
    • Charlie Zender

      Charlie Zender - 2005-03-22

      Subject: Assessing and addressing ncap slowness, con't

      Hi All,

      Continuing to look for opportunities to speed up ncap:

      4. ncap_var_write() was not setting NC_NOFILL. Fixed that.

      5. Delayed creation of large variables until the end, i.e.,
         defined the variables in reverse order of their size.
         This reduces the number of bytes that have to be moved each
         statement to make room for the new variable.
         This is not an ncap patch, just a re-ordering of
         ~/nco/data/ipcc_dly_T85.create.sh.

      Re-ran benchmark after ncap patches #4 and #5, results below.
      There is little noticeable change in user or sys time due to #4.
      Not too surprising after results from #3 earlier.
      The script change #5 did not change the user or sys times
      noticeably, although real time did seem to decrease by about three
      minutes or ~20%. Whether this was due to natural variability
      on a multi-user system or to having less data to copy is
      unproven, but I'd say it's reason enough to order the script
      so that small variables are defined first.

      Thanks,
      Charlie

      **************************************************
      4. After adding NC_NOFILL to ncap_utl.c ncap_var_write()
      **************************************************
      zender@esmf04m:~/nco/data$ time ./ipcc_dly_T85.create.sh
      Creating /ptmp/zender/nco/ipcc_dly_T85.nc file from template

      real    0m6.92s
      user    0m1.94s
      sys     0m4.95s

      Populating /ptmp/zender/nco/ipcc_dly_T85.nc file with floats
      This takes ~ on a 2.8GHz Xeon - YMMV

      Starting to populate variables
      ncapping 32 4D vars takes:

      real    11m49.22s
      user    0m40.98s
      sys     0m59.94s
      ncapping 64 3D vars takes:

      real    1m36.79s
      user    0m5.80s
      sys     0m12.45s
      ncapping 16 2D vars takes:

      real    0m9.64s
      user    0m2.43s
      sys     0m6.30s
      ncapping 8 1D vars takes

      real    0m40.31s
      user    0m2.44s
      sys     0m5.94s
      ncapping 8 scalar vars takes:

      real    0m8.44s
      user    0m2.37s
      sys     0m6.05s

      **************************************************
      5. After reversing order of variable creation in ipcc_dly_T85.create.sh
      **************************************************

      zender@esmf04m:~/nco/data$ time ./ipcc_dly_T85.create.sh
      Creating /ptmp/zender/nco/ipcc_dly_T85.nc file from template

      real    0m7.09s
      user    0m2.01s
      sys     0m5.07s

      Populating /ptmp/zender/nco/ipcc_dly_T85.nc file with floats
      This takes ~ on a 2.8GHz Xeon - YMMV

      Starting to populate variables
      ncapping 8 scalar vars takes:

      real    0m44.10s
      user    0m2.60s
      sys     0m6.21s
      ncapping 8 1D vars takes

      real    0m8.91s
      user    0m2.56s
      sys     0m6.34s
      ncapping 16 2D vars takes:

      real    0m49.67s
      user    0m2.74s
      sys     0m6.20s
      ncapping 64 3D vars takes:

      real    1m25.92s
      user    0m5.96s
      sys     0m12.49s
      ncapping 32 4D vars takes:

      real    8m6.45s
      user    0m40.19s
      sys     1m0.07s

       
      • Harry Mangalam

        Harry Mangalam - 2005-03-22

        Hi Charlie,
        I think it might be b/c of variability. I meant to talk to you about
        this on the esmf. I've recorded times from 11 min to 23 min doing the
        ipcc test, and you posted one time of >20 min as well, I think. This
        seems quite variable, even considering it's a multiuser system with
        several jobs going. The 'real' time should be quite variable, but the
        user and sys times should be pretty constant, I thought.
        However, I did not do a check before and after of concurrent jobs.

         
        • Charlie Zender

          Charlie Zender - 2005-03-22

          Yes, it could be variability.
          One way to rule that out is to gather large statistics.
          Another way, on the ESMF anyway, would be to run the benchmark
          script on a dedicated CPU or node in the loadleveler queue and
          write to a node-local file system (/tmp) rather than (/ptmp).
          Real-time performance under those conditions should be
          extremely stable.

          Charlie

           
          • Charlie Zender

            Charlie Zender - 2005-03-23

            Subject: getrusage() and valgrind to ncap's rescue

            Hi,

            There is new code in ncap_var_write() which activates a hook
            to print getrusage() resource usage reports when NCO is
            compiled with

            cd ~/nco/bld;make 'USR_TKN=-DNCO_RUSAGE_DBG';cd -

            I ran

            zender@esmf04m:~/nco/data$ time ./ipcc_dly_T85.create.sh > foo 2>&1

            with this hook activated and appended the output below.
            You see a lot of gibberish and some memory and timing info
            every time a variable is written. I think the key diagnostic is

            rusage.ru_maxrss = maximum resident set size = ... [sz]

            maxrss appears to snapshot the memory devoted to the process
            (not its children, although that can also be got at).

            FYI there is an excellent quick tutorial on AIX memory profiling at
            http://www.arsc.edu/support/news/HPCnews/HPCnews293.shtml
            It shows that loadleveler is our friend, and also a tool called
            hpmcount which I have not used.
            Ergo we should make the benchmarks runnable under loadleveler
            eventually.

            For now, we see from the rusage diagnostics that
            ncap's maxrss monotonically increases with each variable written.
            Unless I am mistaken about maxrss, this is a memory leak.

            ulimit can be used to increase the amount of maxrss a machine
            gives a process, but ulimits are usually set to unlimited on Linux
            and the ESMF, so in practice fiddling with this won't get the
            benchmarks to run unless your PC has lots of RAM.
            According to the maxrss figures, completing the benchmark above
            requires 2 GB RAM just to get to d4_25! No wonder my lappie fails.

            The interesting question is: Where's the memory leak?
            I think valgrind should be able to find it:

            valgrind --leak-check=yes --suppressions=${HOME}/c++/valgrind.txt \
              ncap -O -v -s 'd3_01[lat,lon,lev]=1.0;' -s 'd3_02[lat,lon,lev]=1.0;' \
              ~/nco/data/in.nc ~/foo.nc > ~/foo 2>&1

            After wading through the chaff, valgrind says that

            ==15529== 384 bytes in 2 blocks are definitely lost in loss record 8 of 11
            ==15529==    at 0x1B90659D: malloc (vg_replace_malloc.c:130)
            ==15529==    by 0x805E7DB: nco_malloc_dbg (nco_mmr.c:149)
            ==15529==    by 0x806BD1F: nco_var_dpl (nco_var_utl.c:548)
            ==15529==    by 0x804D417: ncap_var_stretch (ncap_utl.c:942)
            ==15529==    by 0x804E5FE: yyparse (ncap_yacc.y:305)
            ==15529==    by 0x804B11E: main (ncap.c:535)

            It appears that LHS casting, at least, leaves an extra copy
            of the variable hanging around.
            Henry, would you please try finding the bug from here?

            Thanks,
            Charlie
            Creating /ptmp/zender/nco/ipcc_dly_T85.nc file from template

            real    0m27.26s
            user    0m1.81s
            sys    0m4.89s

            Populating /ptmp/zender/nco/ipcc_dly_T85.nc file with floats
            This takes ~ on a 2.8GHz Xeon - YMMV

            Starting to populate variables
            ncapping 8 scalar vars takes:
            ncap: INFO ncap_var_write() writing variable weepy
            ncap: INFO nco_mmr_rusage_prn() reports system type is AIX so rusage uses kilobytes [kB] for size and seconds [s] for time. Page size is 4096 B.
            ncap: INFO nco_mmr_rusage_prn() reports: rusage.ru_utime.tv_sec = user time used = 0 s, rusage.ru_utime.tv_usec = user time used = 0 us, rusage.ru_stime.tv_sec = system time used = 0 s, rusage.ru_stime.tv_usec = system time used = 10000 us, rusage.ru_maxrss = maximum resident set size = 3172 [sz], rusage.ru_ixrss = integral shared memory size =  13 [sz tm], rusage.ru_idrss = integral unshared data size = 2 [sz], rusage.ru_isrss = integral unshared stack size = 0 [sz], rusage.ru_minflt = page reclaims = 557, rusage.ru_majflt = page faults = 8, rusage.ru_nswap = swaps = 0
            ncap: INFO ncap_var_write() writing variable dopey
            ncap: INFO nco_mmr_rusage_prn() reports system type is AIX so rusage uses kilobytes [kB] for size and seconds [s] for time. Page size is 4096 B.
            ncap: INFO nco_mmr_rusage_prn() reports: rusage.ru_utime.tv_sec = user time used = 0 s, rusage.ru_utime.tv_usec = user time used = 0 us, rusage.ru_stime.tv_sec = system time used = 0 s, rusage.ru_stime.tv_usec = system time used = 10000 us, rusage.ru_maxrss = maximum resident set size = 3172 [sz], rusage.ru_ixrss = integral shared memory size =  13 [sz tm], rusage.ru_idrss = integral unshared data size = 2 [sz], rusage.ru_isrss = integral unshared stack size = 0 [sz], rusage.ru_minflt = page reclaims = 557, rusage.ru_majflt = page faults = 8, rusage.ru_nswap = swaps = 0
            ncap: INFO ncap_var_write() writing variable sleepy
            ncap: INFO nco_mmr_rusage_prn() reports system type is AIX so rusage uses kilobytes [kB] for size and seconds [s] for time. Page size is 4096 B.
            ncap: INFO nco_mmr_rusage_prn() reports: rusage.ru_utime.tv_sec = user time used = 0 s, rusage.ru_utime.tv_usec = user time used = 0 us, rusage.ru_stime.tv_sec = system time used = 0 s, rusage.ru_stime.tv_usec = system time used = 10000 us, rusage.ru_maxrss = maximum resident set size = 3172 [sz], rusage.ru_ixrss = integral shared memory size =  13 [sz tm], rusage.ru_idrss = integral unshared data size = 2 [sz], rusage.ru_isrss = integral unshared stack size = 0 [sz], rusage.ru_minflt = page reclaims = 557, rusage.ru_majflt = page faults = 8, rusage.ru_nswap = swaps = 0
            ncap: INFO ncap_var_write() writing variable grouchy
            ncap: INFO nco_mmr_rusage_prn() reports system type is AIX so rusage uses kilobytes [kB] for size and seconds [s] for time. Page size is 4096 B.
            ncap: INFO nco_mmr_rusage_prn() reports: rusage.ru_utime.tv_sec = user time used = 0 s, rusage.ru_utime.tv_usec = user time used = 0 us, rusage.ru_stime.tv_sec = system time used = 0 s, rusage.ru_stime.tv_usec = system time used = 10000 us, rusage.ru_maxrss = maximum resident set size = 3172 [sz], rusage.ru_ixrss = integral shared memory size =  13 [sz tm], rusage.ru_idrss = integral unshared data size = 2 [sz], rusage.ru_isrss = integral unshared stack size = 0 [sz], rusage.ru_minflt = page reclaims = 557, rusage.ru_majflt = page faults = 8, rusage.ru_nswap = swaps = 0
            ncap: INFO ncap_var_write() writing variable sneezy
            ncap: INFO nco_mmr_rusage_prn() reports system type is AIX so rusage uses kilobytes [kB] for size and seconds [s] for time. Page size is 4096 B.
            ncap: INFO nco_mmr_rusage_prn() reports: rusage.ru_utime.tv_sec = user time used = 0 s, rusage.ru_utime.tv_usec = user time used = 0 us, rusage.ru_stime.tv_sec = system time used = 0 s, rusage.ru_stime.tv_usec = system time used = 10000 us, rusage.ru_maxrss = maximum resident set size = 3172 [sz], rusage.ru_ixrss = integral shared memory size =  13 [sz tm], rusage.ru_idrss = integral unshared data size = 2 [sz], rusage.ru_isrss = integral unshared stack size = 0 [sz], rusage.ru_minflt = page reclaims = 557, rusage.ru_majflt = page faults = 8, rusage.ru_nswap = swaps = 0
            .
            .
            .
            ncap: INFO ncap_var_write() writing variable d4_22
            ncap: INFO nco_mmr_rusage_prn() reports system type is AIX so rusage uses kilobytes [kB] for size and seconds [s] for time. Page size is 4096 B.
            ncap: INFO nco_mmr_rusage_prn() reports: rusage.ru_utime.tv_sec = user time used = 23 s, rusage.ru_utime.tv_usec = user time used = 950000 us, rusage.ru_stime.tv_sec = system time used = 31 s, rusage.ru_stime.tv_usec = system time used = 180000 us, rusage.ru_maxrss = maximum resident set size = 1674384 [sz], rusage.ru_ixrss = integral shared memory size =  34850 [sz tm], rusage.ru_idrss = integral unshared data size = 26793945 [sz], rusage.ru_isrss = integral unshared stack size = 0 [sz], rusage.ru_minflt = page reclaims = 469667, rusage.ru_majflt = page faults = 65686, rusage.ru_nswap = swaps = 0
            ncap: INFO ncap_var_write() writing variable d4_23
            ncap: INFO nco_mmr_rusage_prn() reports system type is AIX so rusage uses kilobytes [kB] for size and seconds [s] for time. Page size is 4096 B.
            ncap: INFO nco_mmr_rusage_prn() reports: rusage.ru_utime.tv_sec = user time used = 25 s, rusage.ru_utime.tv_usec = user time used = 580000 us, rusage.ru_stime.tv_sec = system time used = 33 s, rusage.ru_stime.tv_usec = system time used = 930000 us, rusage.ru_maxrss = maximum resident set size = 1739920 [sz], rusage.ru_ixrss = integral shared memory size =  37120 [sz tm], rusage.ru_idrss = integral unshared data size = 29763122 [sz], rusage.ru_isrss = integral unshared stack size = 0 [sz], rusage.ru_minflt = page reclaims = 492729, rusage.ru_majflt = page faults = 74581, rusage.ru_nswap = swaps = 0
            ncap: INFO ncap_var_write() writing variable d4_24
            ncap: INFO nco_mmr_rusage_prn() reports system type is AIX so rusage uses kilobytes [kB] for size and seconds [s] for time. Page size is 4096 B.
            ncap: INFO nco_mmr_rusage_prn() reports: rusage.ru_utime.tv_sec = user time used = 27 s, rusage.ru_utime.tv_usec = user time used = 260000 us, rusage.ru_stime.tv_sec = system time used = 36 s, rusage.ru_stime.tv_usec = system time used = 520000 us, rusage.ru_maxrss = maximum resident set size = 1805456 [sz], rusage.ru_ixrss = integral shared memory size =  39456 [sz tm], rusage.ru_idrss = integral unshared data size = 32935037 [sz], rusage.ru_isrss = integral unshared stack size = 0 [sz], rusage.ru_minflt = page reclaims = 515384, rusage.ru_majflt = page faults = 82745, rusage.ru_nswap = swaps = 0
            ncap: INFO ncap_var_write() writing variable d4_25
            ncap: INFO nco_mmr_rusage_prn() reports system type is AIX so rusage uses kilobytes [kB] for size and seconds [s] for time. Page size is 4096 B.
            ncap: INFO nco_mmr_rusage_prn() reports: rusage.ru_utime.tv_sec = user time used = 28 s, rusage.ru_utime.tv_usec = user time used = 830000 us, rusage.ru_stime.tv_sec = system time used = 39 s, rusage.ru_stime.tv_usec = system time used = 340000 us, rusage.ru_maxrss = maximum resident set size = 1870992 [sz], rusage.ru_ixrss = integral shared memory size =  41661 [sz tm], rusage.ru_idrs

             
            • Charlie Zender

              Charlie Zender - 2005-03-24

              Hi All,

              I think I plugged the memory leak in ncap_var_stretch().
              This leak was definitely exercised in LHS-casting.
              I re-ran the benchmark, results attached below.
              The maxrss size held steady at 232 MB.
              This is much less than the 2+ GB from before the patch.
              The benchmark may now run on laptops.
              There's no significant change in user or sys times,
              which is a bit disappointing.
              I'd hoped that old memory requirements were causing data
              to be swapped so often that reducing the requirements (by
              fixing the memory leak) would speed-up the benchmark, at
              least in wall-clock time. Alack and alas, 'tis not so.

              Each 4D variable in this benchmark is about 33 MB.
              The peak memory requirement (232 MB) seems large compared to 33 MB.
              We expect ncap to require at least 2*sizeof(var), yet ncap wants
              ~8*sizeof(var). Anyone have any ideas as to why?

              Thanks,
              Charlie

              Creating /ptmp/zender/nco/ipcc_dly_T85.nc file from template

              real    0m6.33s
              user    0m1.59s
              sys    0m4.53s

              Populating /ptmp/zender/nco/ipcc_dly_T85.nc file with floats
              This takes ~ on a 2.8GHz Xeon - YMMV

              Starting to populate variables
              ncapping 8 scalar vars takes:

              real    0m41.68s
              user    0m2.41s
              sys    0m6.63s
              ncapping 8 1D vars takes:

              real    0m27.12s
              user    0m2.52s
              sys    0m6.21s
              ncapping 16 2D vars takes:

              real    0m25.09s
              user    0m2.72s
              sys    0m6.35s
              ncapping 64 3D vars takes:

              real    1m26.77s
              user    0m5.99s
              sys    0m11.72s
              ncapping 32 4D vars takes:

              real    7m15.07s
              user    0m41.58s
              sys     0m53.46s

               
      • Charlie Zender

        Charlie Zender - 2005-03-22

        Subject: Assessing and addressing ncap slowness, summary

        Hi All,

        Well, that's all the ncap patches that cried out for fixing.

        Here's my evaluation:
        1. The ncap script which creates the synthetic datasets for
           benchmarking stresses the system with disk activity.
        2. Further significant improvements may require ncap redesign.
           Entering and exiting metadata mode for a variable definition within
           each statement is, I think, what kills ncap on long scripts.
           A memory leak may also play a role, and is worth pursuing.
        3. Alternative methods of benchmark file creation may be much quicker,
           e.g., ftp'ing them over the network, but we learn more from trying
           to create these files synthetically!
        4. Threading ncap (it is currently not threaded) is unlikely to fix
           the problem, since much of the slowness appears to be due to I/O
           contention, which threading would worsen, not improve.
           Threading ncap is non-trivial anyway: because of potential
           dependencies between statements, it is not simple to thread over
           the statement loop.

        Harry is evaluating the memory usage of the benchmark script to
        verify whether there are any leaks with long ncap scripts.
        A leak is, I think, the likeliest explanation of why the benchmark
        data creation produces malloc() errors on my 1 GB laptop.
        BTW, the developer's forum thread of 20031121 has a nice discussion
        of ncap memory usage.
        If we're lucky we may get some memory profiling options built into
        the configure script.

        Finally, I've implemented one more enhancement suggested by Harry:

        6. Use memset(vp,0,sz) to zero float and integer arrays in nco_var_zero() and nco_zero_long()

        Benchmark after patching for #6 follows.
        No noticeable improvement in user or sys times.

        Thanks,
        Charlie

        **************************************************
        6. After using memset() in nco_var_zero() and nco_zero_long()
        **************************************************
        zender@esmf04m:~/nco/data$ time ./ipcc_dly_T85.create.sh
        Creating /ptmp/zender/nco/ipcc_dly_T85.nc file from template

        real    0m7.17s
        user    0m1.66s
        sys     0m5.49s

        Populating /ptmp/zender/nco/ipcc_dly_T85.nc file with floats
        This takes ~ on a 2.8GHz Xeon - YMMV

        Starting to populate variables
        ncapping 8 scalar vars takes:

        real    0m46.23s
        user    0m2.71s
        sys     0m6.03s
        ncapping 8 1D vars takes:

        real    0m27.40s
        user    0m2.80s
        sys     0m5.89s
        ncapping 16 2D vars takes:

        real    0m10.46s
        user    0m2.74s
        sys     0m6.23s
        ncapping 64 3D vars takes:

        real    1m42.11s
        user    0m5.71s
        sys     0m12.26s
        ncapping 32 4D vars takes:

        real    8m47.94s
        user    0m40.49s
        sys     0m59.98s

         
    • Harry Mangalam

      Harry Mangalam - 2005-04-13

      On Tuesday 12 April 2005 6:01 pm, Charlie Zender wrote:(quoting Martin Schultz)

      > if I am counting correctly, this will be item 120 on the wish list,
      > but perhaps there is a chance to move it up if other people also
      > think this would be useful:
      >
      > in analyzing model output, I often want to extract time series at
      > specific locations (stations) from the huge 3D fields. I have
      > therefore written a couple of scripts, which basically call ncrcat
      > as follows:
      >
      > ncrcat -h -d lon,13.42 -d lat,52.53 hist*.nc berlin.nc
      > ncrcat -h -d lon,10.97 -d lat,47.4 hist*.nc zugspitze.nc
      >
      > etc.
      >
      > As there are about 50-60 stations which I am interested in, this
      > mutates into a hard disk test program (in fact, we have had to
      > replace about 4 hard drives after I had run this script for a while
      > ;-)

      So if I'm reading this right, you want the benchmarks to reflect this
      disk thrashing test, so that further improvements in the code will
      decrease this effect. However, until we're actually addressing this
      problem, do you want to thrash disks at every iteration of the
      benchmark test?  He was probably kidding about the killed disks, but
      it's a consideration with large, poorly tuned ops.

      > Of course, this kind of extraction could be done much more
      > efficiently if all stations would be extracted while the file is
      > opened only once. What would be needed is some sort of "list input"
      > option, specifying all the slabs to be processed and the respective
      > output names. To take things one step further, this could even
      > evolve into a mini "scripting" language, so that e.g. ncra could
      > compute averages and standard deviations in one pass through the
      > data.

      the above sounds like what you've implemented in ncap - is the
      scripting stuff modular enough that it can be called in all of the
      operators or would it be easier to have this functionality
      implemented in ncap? 

      > The calling format could be something like
      > nccommand -list_input listfile in.nc
      > and the listfile would list the other options and the output files:
      > -d lon,13.42 -d lat,52.53 berlin.nc
      > -d lon,10.97 -d lat,47.4 zugspitze.nc
      >
      > Implicitly, this would probably require -A to be set automatically,
      > so that the output files are automatically continued (at least for
      > ncrcat; ncks might want to ask the user instead).

       
      • Charlie Zender

        Charlie Zender - 2005-04-13

        Hi Harry,

        >>in analyzing model output, I often want to extract time series at
        >>specific locations (stations) from the huge 3D fields. I have
        >>therefore written a couple of scripts, which basically call ncrcat
        >>as follows:
        >>
        >>ncrcat -h -d lon,13.42 -d lat,52.53 hist*.nc berlin.nc
        >>ncrcat -h -d lon,10.97 -d lat,47.4 hist*.nc zugspitze.nc

        FYI, there is a large and growing community of people who compare
        global models (or measurements) to station data, so this is a common
        situation. I have similar scripts that extract data from ~200 AERONET
        stations to compare models against measurements...

        >>As there are about 50-60 stations which I am interested in, this
        >>mutates into a hard disk test program (in fact, we have had to
        >>replace about 4 hard drives after I had run this script for a while
        >>;-)

        > So if I'm reading this right, you want the benchmarks to reflect this
        > disk thrashing test, so that further improvements in the code will
        > decrease this effect. However, until we're actually addressing this
        > problem, do you want to thrash disks at every iteration of the
        > benchmark test?  He was probably kidding about the killed disks, but
        > it's a consideration with large, poorly tuned ops.

        Yes, that's a good point. Let's keep this out of the benchmarks for now.
        Let's remember that this is something NCO is bad at, and thus an
        opportunity for future improvements.

        >>Of course, this kind of extraction could be done much more
        >>efficiently if all stations would be extracted while the file is
        >>opened only once. What would be needed is some sort of "list input"
        >>option, specifying all the slabs to be processed and the respective
        >>output names. To take things one step further, this could even
        >>evolve into a mini "scripting" language, so that e.g. ncra could
        >>compute averages and standard deviations in one pass through the
        >>data.

        > the above sounds like what you've implemented in ncap - is the
        > scripting stuff modular enough that it can be called in all of the
        > operators or would it be easier to have this functionality
        > implemented in ncap? 

        The needs of the Original Poster (Martin Schultz) cannot be met by
        ncap in anything like its present form, because ncap lacks two
        crucial features: it cannot handle multiple input files (it is not
        an MFO = multi-file operator), and it does not allow hyperslabbing.
        After further consideration, the OP's suggestion of a new "list
        input" syntax for hyperslabbing makes good sense as an addition to,
        not a replacement for, the current multi-slabbing algorithm (MSA).
        Such syntax would be very useful in the MFOs ncrcat and ncecat.

        It is debatable whether creating multiple output files in one
        invocation would be preferable to storing all the specified hyperslabs
        in one output file with a new dimension to glue together hyperslabs.
        ncecat creates a new dimension to glue together its input files, so
        the precedent is clear. The user could then hyperslab that single
        output file along the region dimension to create the final set of
        files.

        People should read my response to the OP on the Open Discussion forum
        on 2003-05-06. It's important to understand that the current MSA does
        allow multiple hyperslabs, but it has no way of linking a specific
        hyperslab in one dimension, e.g., lat, to a specific hyperslab along
        another dimension, e.g., lon, so it collects hyperslabs for each
        lat,lon combination instead.

        Charlie

         
    • Nobody/Anonymous

      Hi Charlie & Harry,
      Have coded some stuff that optimizes ncap -- it's a bit rough at the moment. Would it be possible to put it into a separate CVS branch until it's fully tested -- and if so, how?
      Files changed
      ncap.h
      ncap.c
      ncap_utl.c
      ncap_yacc.y
      ncap_lex.l
      Regards Henry

       
      • Harry Mangalam

        Harry Mangalam - 2005-04-14

        Hi Henry,
        That's great!  I can create a branch for you or you can - the process is pretty simple:
        - check out the base current code (you should give it a different name though - maybe nco_ncap-X with the -d opt)

        - cd into the newly checked out tree and tag it with a new name - maybe '2005_04_14_ncap-X_branch'

        The free RedBean CVS book (in pdf) describes this in detail, starting at p73.
        http://cvsbook.red-bean.com/

        You can do it or I can do it and then send you the checkout info. Lemme know.

         
    • Nobody/Anonymous

      Hi Harry,
      My optimized code has a bug - I will try to fix & release it this weekend
      Regards Henry

       
    • Nobody/Anonymous

      Hi Harry & Charlie,
      What I've done to optimize ncap is to parse the script twice.
      In the first parse, variables are "written to memory" rather than to disk. After this pass, the list of variables is defined in the output file.
      In the second parse these variables are populated with real data. Realizing this in practice means writing code around ncap_var_init() and ncap_var_write(). I'm so close to completion it hurts -- I just need to solve some memory problems -- ncap.in almost runs.

      The advantage of my approach is that if/when we add dimension-altering functions the code will still be able to calculate the dims/size of the final vars. One problem is double the memory leakage.

      (This is different from the old code, where the script was scanned once then parsed once.)

      Regards Henry

       
    • Charlie Zender

      Charlie Zender - 2005-04-18

      Hi Henry,

      The algorithm you describe sounds distinctly different from the one
      I proposed in the "ncap top level loop re-design" thread
      http://sourceforge.net/forum/message.php?msg_id=3073430
      which is where I will follow up with some questions....

      Charlie

       
    • Harry Mangalam

      Harry Mangalam - 2005-04-19

      Hi All,

      After much more time than I would have thunk, I've committed an updated script (<nco_root>/bm/nco_bm.pl) that combines the regression testing, file creationism & ncapping, and benchmarking.  It's not yet integrated into the Makefiles, but it will be very shortly (really).  Until then it can be used to manually run and generate regression results and benchmarks.  Call it by itself to get a usage description.
      NB: as of this time, on Linux x86, the regression testing passes everything but ncap, and completes the benchmark.  On Linux AMD64, it fails another regression test (which it should) that mysteriously passes on Linux x86, and it also completes the benchmark.
      On AIX, it fails just about everything miserably, so there's some code examination that needs to be done.

       
    • Harry Mangalam

      Harry Mangalam - 2005-04-22

      This may be known well by everyone else, but I just found it.
      Valgrind is a very useful tool for memory checking, but the underlying structure (a synthetic x86 CPU) allows it to also act as a profiler.  It comes with 'callgrind', a calltree/profiler that can extract a surprising amount of info that you can then visualize with the kcachegrind app.  Using this approach on an app that I (thought I) knew well was quite revealing: I could see where the real time was actually going.  One thing that I had spent a huge amount of time optimizing, because I thought it would be the core CPU sink, used less than 6% of the time (though maybe that's because I HAD spent so much time optimizing it).  Calls to memory allocation were much more expensive.

      Additionally, the callgrind approach is very nice because you don't have to instrument the code in any way, and it runs fairly quickly compared to using the compiler's -pg flag (and the code doesn't have to be re-compiled to UNinstrument it).

      Granted, this only describes what happens on an x86, but I'd imagine that the overview is probably quite useful in general.  Note that valgrind itself can run on any architecture, but it emulates an x86.

      harry

       
      • Charlie Zender

        Charlie Zender - 2005-04-23

        I am very curious what any profiler says about the benchmarks.
        Is valgrind fast-enough to permit profiling the benchmark script?
        Where does ncra spend its time inside the benchmark script?

        Thanks,
        Charlie

         
        • Harry Mangalam

          Harry Mangalam - 2005-04-23

           I think so - I did the set-up and testing work on my laptop - I'll run the ncra bench tomorrow on soot or sand (when it'll be raining...).  I thought I saw that it exists for amd64.

           It looks like it takes about 5 times longer for the callgrind runs, so that's easily do-able.

          hjm

           
    • Nobody/Anonymous

      Hi Charlie & Harry,
      Have tried to optimize LHS casting.
      It works OK with a scalar on the RHS but is temperamental with vars on the RHS, especially zero-dimensional vars.
      Will keep working away.

      Regards Henry

       
      • Charlie Zender

        Charlie Zender - 2005-05-04

        Hi All,

        FYI, I changed the prototype of ncap_var_init() to

        var_sct *ncap_var_init(const char * const,prs_sct *);

        to avoid breakage on Henry's most recent ncap commit.

        Charlie

        gcc-3.4 -std=c99 -pedantic -D_BSD_SOURCE  -DLINUX -DNO_NETCDF_2 -DVERSION='20050504' -DHOSTNAME='ashes' -DUSER='zender' -DNCO_ABORT_ON_ERROR -DHAVE_REGEX_H -DNCO_HAVE_REGEX_FUNCTIONALITY -DHAVE_MKSTEMP -DHAVE_GETOPT_H -DHAVE_GETOPT_LONG -DENABLE_UDUNITS -DHAVE_UDUNITS_H  -Di386 -I./ -I../src/nco -I/usr/local/include -I/usr/local/include  -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -Wall -O -g -march=pentium4  -c ../src/nco/ncap_utl.c -o /home/zender/obj/LINUX/ncap_utl.o
        ../src/nco/ncap_utl.c:15: error: conflicting types for 'ncap_var_init'
        ../src/nco/ncap.h:208: error: previous declaration of 'ncap_var_init' was here
        ../src/nco/ncap_utl.c:15: error: conflicting types for 'ncap_var_init'
        ../src/nco/ncap.h:208: error: previous declaration of 'ncap_var_init' was here
        ../src/nco/ncap_utl.c: In function `ncap_var_write':
        ../src/nco/ncap_utl.c:162: warning: unused variable `fll_md_old'

         
    • Nobody/Anonymous

      Hi All,
      Have committed corrections to ncap_*.[ch] & nco_cnf_typ.c, so double parsing now works.
      LHS casting should be even faster now (removed initialization of var data in var_LHS on the first pass).
      Please benchmark away.
      N.B. this is not the finished article.
      Regards Henry

       
    • Charlie Zender

      Charlie Zender - 2005-05-07

      Hi Henry,

      FYI, your modifications to ncap sometime in the last two weeks appear to have caused a regression with LHS casting. You may
      already know this, but in case you don't, the regression actually
      shows up in one of the ncks tests:

      /home/zender/bin/LINUX/ncks -O -v lat_T42,lon_T42,gw_T42 in.nc foo_T42.nc
      /home/zender/bin/LINUX/ncrename -O -d lat_T42,lat -d lon_T42,lon -v lat_T42,lat -v gw_T42,gw -v lon_T42,lon foo_T42.nc
      /home/zender/bin/LINUX/ncap -O -s "one[lat,lon]=lat*lon*0.0+1.0" -s "zero[lat,lon]=lat*lon*0.0" foo_T42.nc foo_T42.nc
      /home/zender/bin/LINUX/ncks -C -H -s "%g" -v one -F -d lon,128 -d lat,64 foo_T42.nc

      The LHS casting of lat*lon fails where it used to succeed.

      Thanks,
      Charlie

       