
Very slow I/O on large blocksize filesystems

2011-12-22
2013-10-17
  • Gary Strand

    Gary Strand - 2011-12-22

    (Preface: I've been a very happy user of nco for many years)

    Technical details:
    NCO netCDF Operators version "4.0.8" last modified 2011/04/26 built Oct 18 2011 on mirage4 by jam
    ncks version 4.0.8
    Linked to netCDF library version 4.1.3, compiled Jul 26 2011 15:05:13
    Copyright (C) 1995-2011 Charlie Zender

    Problem: This issue may be related to the NOFILL issue with netCDF 4.1.2; in any case, on filesystems with large blocksizes (2 MB, for example, as on Lustre and NCAR's GLADE system) the I/O performance of even simple 'ncks' operations is horrible - time-to-completion ratios (compared to smaller-blocksize filesystems) of 300:1 or even 1500:1 are not uncommon.

    Investigation with NCAR CISL staff showed that a simple variable extraction that takes about 20 seconds on a small-blocksize filesystem takes about 40 minutes on the GLADE filesystem (a 120:1 ratio). The following was found:
                                                     
    12/20/11 3:57 PM JAM
    I should add that the actual performance for the first 39 minute test was around 30MB/sec for reads and 12MB/sec for writes.  So nco may be doing something else inefficiently in addition to reading/writing extra data.
                                                     
    12/20/11 3:52 PM JAM
    Hi, we've done some testing since first getting this ticket and have found that the performance of ncks on filesystems with large block sizes (most of Glade is at a 2MB block size) is VERY bad and it seems to be reading/writing much more data than necessary.

    The test we used was: "ncks -x -v TH b40.1850.track1.1deg.006a.cam2.h1.1012-01-01-00000.nc out.nc"

    - The input file is 3.3GB and the output file is 1.1GB.  On an idle system (storm4) this command took around 39 minutes to complete when either input or output file is on a Glade filesystem.  During this time 60GB was read from glade and 26GB was written.

    - Adding the -4 option to ncks resulted in the test taking 10 minutes to complete with 30GB read and 1.2GB written. 

    - We also ran the same tests on a Lustre filesystem with a 1MB block size and saw similar bad performance. 

    - Finally, running the same test with both input/output files in /tmp (local drive, 4k block size) finished in 17 seconds.

    I don't know exactly what ncks does or how it does it, but there seems to be an issue with large-block filesystems possibly causing it to read and write overlapping blocks of data, resulting in the very large numbers of extra bytes read/written listed above.  Large-block filesystems also caused the silent data corruption issue with nco a few months back, which could be related.

    Given this information and your plots, the effect of system load and number of users is not as significant as we originally thought, and the bad performance on GLADE is most likely related to the actual amount of data being transferred (86GB in the worst case).

    Is this an NCO problem or a netCDF-4 problem?
                                              

     
  • Charlie Zender

    Charlie Zender - 2011-12-22

    Hey Gary,

    That's crazy slow.
    Thanks for reporting this. Have not tested NCO w/ netCDF 4.1.3 on GLADE myself.
    Will need to do so before having any substantive comments.
    But, am leaving tomorrow on internet-less trip until 1/1.
    So, two suggestions that may tide you over until 2012:
    Try ~zender/bin/*/ncks, see if that works better.
    Try older versions of /usr/local/bin/ncks, which CISL usually puts in other directories.
    These could be affected by NOFILL bug, though, so best to try -4 on output files.
    c

     
  • Gary Strand

    Gary Strand - 2012-01-03

    Thanks, Charlie.

    I've played a bit with your suggestions, and haven't seen any improvement. I'd really prefer to keep the files I'm creating as netCDF-3, so the '-4' option isn't viable.

    Thanks for looking into this.

     
  • Charlie Zender

    Charlie Zender - 2012-01-03

    I am back in the office after two weeks. I found my cryptocard. If my GLADE login still works then I will look at this today. Please post the location of your test case file so I can duplicate it exactly, i.e., what is the path to

    b40.1850.track1.1deg.006a.cam2.h1.1012-01-01-00000.nc

     
  • Gary Strand

    Gary Strand - 2012-01-03

    See

    /glade/user/strandwg/NCO/b40.1850.track1.1deg.006a.cam2.h1.1012-01-01-00000.nc

    Thanks!

     
  • Charlie Zender

    Charlie Zender - 2012-01-03

    Hi Gary,

    I have reproduced the problem you are experiencing using NCO on the
    large block filesystem (LBF) named GLADE. The binaries in
    ~zender/bin/ improve the performance by about a factor
    of two relative to NCO 4.0.8, but something lower in the software
    stack than NCO, e.g., the netCDF library or the filesystem itself,
    seems to cause the gross degradation in performance relative to
    NCO on smaller block filesystems.

    Without going into too much detail, and for the benefit and comment of
    others following this issue, my conclusions about the slow performance
    of NCO on LBFs (i.e., GLADE) on both AIX and Linux are:

    0. NCO (and ncks in particular) doesn't use any fancy algorithms.
    NCO uses only official, documented netCDF API calls to do its work.
    NCO does not pay attention to block-sizes. Unless hyperslabbing is
    requested, NCO transfers entire variables with _one call_ (rather than
    with multiple consecutive calls) to the nc_get_var_*/nc_put_var_*
    functions (a sketch of that access pattern follows this list).
    1. Slow performance on LBFs is experienced when any version of NCO is
    linked to any netCDF version including 4.1.3. I tested this with NCO
    3.9.6 on AIX (using the bluefire default, i.e., /usr/local/bin/ncks
    which is linked to netCDF 3.6.2), and Gary or CISL tested this with
    NCO 4.0.8 on Linux (unsure what library they used).
    2. NCO version 4.0.8 worsens the performance relative to other
    versions of NCO, but changes in 4.0.8 do not cause the underlying
    problem. 4.0.8 uses netCDF fill-mode to work around the netCDF 4.1.2
    (and all preceding versions) "NOFILL" bug. This causes 4.0.8 to write
    (at least) twice as much data as other versions of NCO.
    3. NCO version 4.0.9, which is in beta and not yet released, improves
    the performance by about a factor of two relative to 4.0.8. This is
    consistent with the reversion of 4.0.9 to previous NCO behavior which
    utilizes the netCDF NOFILL feature to reduce writes by (at least) a
    factor of two. It is only safe to use NCO 4.0.9+ with netCDF
    4.1.3+. Otherwise the netCDF NOFILL bug may be triggered.
    4. NCO operations on LBFs are twice as fast on Linux as on AIX.
    Extracting large datasets to netCDF3 files rather than netCDF4 files
    takes ~2.5 times as long. These factors are independent, so the best
    performance on large block filesystems is obtained with NCO 4.0.9 (or
    any NCO except 4.0.8) under Linux writing netCDF4 files. The worst
    performance will be with NCO 4.0.8 under AIX writing netCDF3 files.
    5. Improving NCO performance on LBFs may require more detailed
    performance analysis and algorithms for sub-setting. An obvious place
    to start is to use a blocksize-sensitive copy size. Recent versions of
    nccopy use such an algorithm, I believe. However, this would require a
    significant code refactoring of NCO, which is not currently funded.
    NASA may, however, fund implementation of groups in NCO. More on that
    in coming weeks. Maybe those funds can leverage some of this work.
    6. Having written this much I'd like to hear from others before
    blabbing-on. I wasn't aware there was any penalty for LBFs, so credit
    goes to Gary for reporting the dramatic slow-downs on GLADE.
    Any good ideas for methods to speed up netCDF3 writes on LBFs?
    Are these performance penalties for LBFs better understood by others?
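
    To make point 0 concrete, here is a minimal sketch of the whole-variable
    access pattern using only the public netCDF C API (illustrative only, not
    NCO source; the file name is a placeholder and error handling is omitted):

    /* Sketch: read an entire variable with a single netCDF call, the
     * access pattern ncks uses when no hyperslab is requested.  On a
     * classic file with many record variables, this one call touches
     * every record, so every large disk block holding any part of the
     * variable gets read. */
    #include <netcdf.h>
    #include <stdlib.h>

    int main(void)
    {
        int ncid, varid, ndims, dimids[NC_MAX_VAR_DIMS];
        size_t len, nelem = 1;
        float *buf;

        nc_open("in.nc", NC_NOWRITE, &ncid);      /* "in.nc" is a placeholder */
        nc_inq_varid(ncid, "TH", &varid);         /* variable from the test case */
        nc_inq_varndims(ncid, varid, &ndims);
        nc_inq_vardimid(ncid, varid, dimids);
        for (int i = 0; i < ndims; i++) {         /* total number of elements */
            nc_inq_dimlen(ncid, dimids[i], &len);
            nelem *= len;
        }
        buf = malloc(nelem * sizeof(float));
        nc_get_var_float(ncid, varid, buf);       /* one call, whole variable */
        free(buf);
        nc_close(ncid);
        return 0;
    }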

    Charlie

    Output of selected commands (extraneous stuff deleted):

    # Copying 3 GB takes ~1 minute with AIX on GLADE
    zender@be1005en:~$ time /bin/cp /glade/user/strandwg/NCO/b40.1850.track1.1deg.006a.cam2.h1.1012-01-01-00000.nc ~/gary.nc
    real    1m3.219s

    # Copying 3 GB takes ~30 seconds with Linux on GLADE, twice as fast as AIX
    zender@mirage0:~$ time /bin/cp /glade/user/strandwg/NCO/b40.1850.track1.1deg.006a.cam2.h1.1012-01-01-00000.nc ~/gary.nc
    real    0m30.812s

    # Test case takes ~8 minutes with ncks 3.9.6 on AIX
    zender@be1005en:~$ /usr/local/bin/ncks -lbr
    Linked to netCDF library version "3.6.2", compiled Apr  3 2007 14:19:36
    zender@be1005en:~$ /usr/local/bin/ncks -vrs
    NCO netCDF Operators version "3.9.6" last modified 2009/01/21 built Jan 28 2009 on be1105en by ddvento
    ncks version 3.9.6
    zender@be1005en:~$ time /usr/local/bin/ncks -O -D 3 -x -v TH ~/gary.nc ~/out3_blf_3.9.6.nc
    real    8m9.658s

    # Test case takes ~8 minutes with ncks 4.0.9 on AIX
    zender@be1005en:~$ /glade/home/zender/bin/AIX/ncks -lbr
    Linked to netCDF library version 4.1.3, compiled Aug 25 2011 08:32:40
    zender@be1005en:~$ /glade/home/zender/bin/AIX/ncks -vrs
    NCO netCDF Operators version 20120103 built Jan  3 2012 on be1005en.ucar.edu by zender
    zender@be1005en:~$ time /glade/home/zender/bin/AIX/ncks -O -D 3 -x -v TH ~/gary.nc ~/out3_blf_4.0.9.nc
    real    7m48.197s

    # netCDF4 improves speed relative to netCDF3 by factor of ~2.5 on AIX
    zender@be1005en:~$ time /glade/home/zender/bin/AIX/ncks -O -4 -D 3 -x -v TH ~/gary.nc ~/out4_blf_4.0.9.nc
    real    2m42.123s

    # Test case takes ~4 minutes with ncks 4.0.9 on Linux
    zender@mirage0:~$ /glade/home/zender/bin/LINUXAMD64/ncks -lbr
    Linked to netCDF library version 4.1.3, compiled Jul 26 2011 15:05:13
    zender@mirage0:~$ /glade/home/zender/bin/LINUXAMD64/ncks -vrs
    NCO netCDF Operators version 20120103 built Jan  3 2012 on mirage0 by zender
    zender@mirage0:~$ time /glade/home/zender/bin/LINUXAMD64/ncks -O -D 3 -x -v TH ~/gary.nc ~/out3_mrg_4.0.9.nc
    real    4m15.493s

    # netCDF4 improves speed relative to netCDF3 by factor of ~2.5 on Linux
    zender@mirage0:~/nco$ time /glade/home/zender/bin/LINUXAMD64/ncks -O -4 -D 3 -x -v TH ~/gary.nc ~/out4_mrg_4.0.9.nc
    real    1m44.345s

     
  • Russell K. Rew

    Russell K. Rew - 2012-03-06

    It's taken a while, but I think I have a fairly complete explanation
    for this slowdown, when it occurs, and how to deal with it.

    The slowdown occurs when all these conditions are met:

      1.  You're dealing with netCDF classic or 64-bit offset format
          files, not netCDF-4 or netCDF-4 classic model files.

      2.  You have an unlimited dimension and many record variables that
          use it.

      3.  The file system has a large block size, the atomic size for
          disk access.

    In this case, doing things a variable at a time instead of a record at
    a time can be very slow, because accessing all the data in a variable
    (or some part of each record for a variable) typically reads each
    record multiple times, once for each record variable you're dealing
    with.  That's because the block size is larger than it needs to be to
    hold a record's worth of data for each variable, so accessing the nth
    record's data for a variable typically reads in more data than is
    needed.

    Consider a case that's not too atypical: a block size of 2 MiBytes,
    365 records, and 100 float record variables, each dimensioned
    (time=365, lat=73, lon=145) where time is the record dimension.  A
    record's worth of data for each variable is only 73*145*4 = 42340
    bytes, and each variable has 365 records.  So reading one variable of
    size 365*73*145*4 bytes, about 15.5 Mbytes, actually reads 365 disk blocks,
    which is 365*2097152 bytes or about 765 Mbytes.  That's about 50 times more
    bytes read than needed.  If you operate on every variable in the file,
    one at a time, the result is 50 times more I/O than necessary, which
    explains why it might be 50 times slower than it would be if you used
    fixed size variables, stored contiguously, rather than record
    variables, stored in pieces scattered throughout the records of the
    file.

    How can you deal with this to get efficient processing of such files?
    Here are some workarounds and solutions:

    1.  Don't use the unlimited dimension if you don't really need it.

    2.  Make sure the record size of each variable is at least as big as
        the disk block size.

    3.  Convert your record-oriented file to a file with only fixed size
        dimensions before using it in processing.  There's an nco operator
        for this, or you can use "nccopy -u infile outfile" to make the
        unlimited dimension a fixed size.

    4.  Change the processing algorithms to read input a record at a time
        instead of a variable at a time, processing all the record
        variables after each record variable has been read.

    5.  Use netCDF-4 classic model files or regular netCDF-4 files.  With
        the netCDF-4/HDF5 format, data is accessed in disk blocks, if
        stored contiguously, or by chunks for chunked data.  A chunk only
        contains data from a single variable.  Making chunks larger than
        disk blocks ensures that I/O will be efficient.  If data is
        compressed, each chunk is compressed separately, so if compressed
        chunks are much smaller than the disk block size, inefficiencies
        may still occur.

    I will be using approach number 4 to detect and deal with this
    situation on systems with large block size.  If anyone has useful
    heuristics for this, I'd like to hear about them.
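
    For anyone who wants to experiment with approach 4 above, here is a rough
    sketch of the record-at-a-time access pattern using the public netCDF C
    API (illustrative code, not from nccopy or NCO; names and buffers are
    made up and error handling is omitted):

    /* Sketch of approach 4: loop over records in the outer loop and over
     * record variables in the inner loop, so each record's disk blocks
     * are read only once per pass instead of once per record variable. */
    #include <netcdf.h>

    void process_record_at_a_time(int ncid, int nrec_vars, const int *rec_varids,
                                  size_t nrec, void **bufs)
    {
        size_t start[NC_MAX_VAR_DIMS] = {0};
        size_t count[NC_MAX_VAR_DIMS];

        for (size_t rec = 0; rec < nrec; rec++) {       /* outer loop: records   */
            for (int v = 0; v < nrec_vars; v++) {       /* inner loop: variables */
                int varid = rec_varids[v], ndims, dimids[NC_MAX_VAR_DIMS];
                nc_inq_varndims(ncid, varid, &ndims);
                nc_inq_vardimid(ncid, varid, dimids);
                start[0] = rec;
                count[0] = 1;                           /* one record ...        */
                for (int d = 1; d < ndims; d++)         /* ... full extent of the other dims */
                    nc_inq_dimlen(ncid, dimids[d], &count[d]);
                /* bufs[v] must hold one record's worth of data for variable v */
                nc_get_vara(ncid, varid, start, count, bufs[v]);
                /* process this record's worth of data for variable v here */
            }
        }
    }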

     
  • Russell K. Rew

    Russell K. Rew - 2012-03-06

    A couple of typos/brainos slipped through that may have made that last post unclear:

    > 4.  Change the processing algorithms to read input a record at a time
    >     instead of a variable at a time, processing all the record
    >     variables after each record variable has been read.

    should have been

      4.  Change the processing algorithms to read input a record at a time
          instead of a variable at a time, processing all the record
          variables after each record has been read.

    and

    > I will be using approach number 4 to detect and deal with this
    > situation on systems with large block size.

    should have been

      I will be using approach number 4 in nccopy to detect and deal with this
      situation on systems with large block size.

    -Russ

     
  • Charlie Zender

    Charlie Zender - 2012-03-06

    Thanks for figuring this out, Russ.
    It makes sense and I learned more about what blocksize really means.

    The relevant NCO command is

    ncks -fix_rec_dmn in.nc out.nc

     
  • Russell K. Rew

    Russell K. Rew - 2012-03-07

    I've added a couple of hundred lines of C code to nccopy to implement this new
    rule of thumb:

      When accessing data from netCDF classic or 64-bit offset format files
      that have multiple record variables and a lot of records on a file
      system with large disk block size relative to a record's worth of data
      for one or more record variables, access the data a record at a time
      instead of a variable at a time.

    The new version of nccopy will be in the upcoming 4.2 release.

    The improvement this makes in a typical case of using nccopy to convert or copy
    a 1.13 GB sample file that has lots of record variables, on a system with a 512KB
    disk block size, is significant: a factor of 24 in elapsed time.  I just tried it on the
    same sample file and the /glade/scratch file system on one of NCAR CISL's platforms,
    and there the result was a factor-of-60 speedup.

    It's unfortunate that this is not a problem that can be fixed in the library; it's a matter of
    using the above rule of thumb in applications that make use of the library, which requires
    a special case.  However, it turns out that I didn't have to bother with testing the disk
    block size at all in the new nccopy code.  The special case code works fine and is no
    slower even when the disk block size is small, like 4096 bytes.  It just speeds things up
    more for a larger disk block size.

    Actually we may eventually be able to speed things up in the netCDF library as well for
    classic-format files by replacing all the (over-)optimized code that currently uses read(2)
    and write(2) system calls with ordinary stdio fread(3) and fwrite(3) calls, but that may have
    to wait for version 4.3 …

    -Russ

     
  • Charlie Zender

    Charlie Zender - 2012-03-08

    Interesting. It should be possible to implement a similar patch to ncks et al. before the next release. This is now TODO nco1036.
    c

     
  • Si Liu

    Si Liu - 2012-03-08

    Hi Russ and Charlie,

    Thank you for all your kind help. I have re-compiled Russ's modified version of netcdf-4.2 on our machine (NCAR mirage/storm)
    and have seen the huge performance improvement of nccopy on netCDF-3 files.

    We found that the old nccopy accessed the data a variable at a time instead of a record at a time.
    Besides nccopy, are there any other netCDF utilities that access the data a variable at a time and need to be modified?

    I had assumed that NCO operators, e.g. ncks, call nccopy and other netCDF utilities directly, so that fixing nccopy and the other utilities would fix the NCO problems without touching NCO source code. It looks like I am wrong about this: NCO operators read the data themselves and decide whether to read a variable or a record at a time. Could you confirm this?
    If that is the case, how many NCO operators need to be modified? Is that a lot of extra work?

    For Gary's ncks problem, I have a temporary but ugly solution. I can create a wrapper to:
    1) convert the netCDF-3 file to netCDF-4 format using the new nccopy
    2) run ncks on the netCDF-4 file
    3) convert the netCDF-4 output back to netCDF-3 format if necessary
    With the new nccopy, all steps will now be efficient on large-block filesystems.

    Best wishes,
    Sincerely Si Liu
    NCAR CISL Consulting

     
  • Charlie Zender

    Charlie Zender - 2012-03-08

    Hi Si,

    Your understanding is correct.

    The most important place NCO needs to be patched is nco_cpy_var_val() in file nco_var_utl.c.
    That would solve Gary's problem and is a relatively straightforward change to make.
    There is one other routine used for copying when limits are specified, nco_cpy_var_val_mlt_lmt() in file nco_msa.c, but that is lower priority because it is much more complex and it is invoked less often.

    Although it is high on our TODO list, we will gladly accept patches!
    cz

     
  • Charlie Zender

    Charlie Zender - 2012-03-08

    Hi Russ,

    I can't find your nccopy patch in the daily snapshot of netCDF 4.2.
    Is it in nccopy.c? If so, which lines approximately contain the core of the patch? If not, where is it?

    Thanks,
    c

     
  • Russell K. Rew

    Russell K. Rew - 2012-03-09

    Hi Charlie,

    It's apparently not in the daily snapshot yet, just the svn trunk for 4.2.  You can access that from http://svn.unidata.ucar.edu/repos/netcdf/trunk/.  The new code is mainly in ncdump/nccopy.c, with a function prototype declaration and code moved from dumplib.h/dumplib.c to utils.h/utils.c in the same directory.  The code is isolated in the functions called from the copy() function in nccopy.c, in this block:

        /* For performance, special case netCDF-3 input file with record
         * variables, to copy a record-at-a-time instead of a
         * variable-at-a-time. We should also eventually do something
         * similar for netCDF-3 output file, but converting netCDF-4 files
         * to netCDF-3 files is less common … */
        if(nc3_special_case(igrp, inkind)) {
            size_t nfixed_vars, nrec_vars;
            int *fixed_varids;
            int *rec_varids;
            NC_CHECK(classify_vars(igrp, &nfixed_vars, &fixed_varids, &nrec_vars, &rec_varids));
            NC_CHECK(copy_fixed_size_data(igrp, ogrp, nfixed_vars, fixed_varids));
            NC_CHECK(copy_record_data(igrp, ogrp, nrec_vars, rec_varids));
        } else {
            NC_CHECK(copy_data(igrp, ogrp)); /* recursive, to handle nested groups */
        }
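
    For readers following along, here is a rough sketch of the kind of test
    such a special case needs, written with only the public netCDF C API
    (this is not the actual nc3_special_case() code, just an illustration of
    the checks involved; the function name is made up):

    /* Sketch only: does this open file look like a classic or 64-bit
     * offset file with an unlimited dimension and more than one record
     * variable?  Those are the files that benefit from record-at-a-time
     * copying. */
    #include <netcdf.h>

    static int looks_like_multirecord_classic(int ncid)
    {
        int format, unlimdimid, nvars, nrec_vars = 0;

        nc_inq_format(ncid, &format);
        if (format != NC_FORMAT_CLASSIC && format != NC_FORMAT_64BIT)
            return 0;                   /* netCDF-4 data is chunked per variable */
        nc_inq_unlimdim(ncid, &unlimdimid);
        if (unlimdimid == -1)
            return 0;                   /* no record dimension, no interleaving */
        nc_inq_nvars(ncid, &nvars);
        for (int varid = 0; varid < nvars; varid++) {
            int ndims, dimids[NC_MAX_VAR_DIMS];
            nc_inq_varndims(ncid, varid, &ndims);
            nc_inq_vardimid(ncid, varid, dimids);
            if (ndims > 0 && dimids[0] == unlimdimid)
                nrec_vars++;
        }
        return nrec_vars > 1;           /* interleaving hurts only with several record variables */
    }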

    I'm open to suggestions for making the code clearer or any other improvements.  I'm also happy to answer questions about it.

    -Russ

     
  • Charlie Zender

    Charlie Zender - 2012-03-09

    Thanks. OK. This makes the desired algorithm clear. I see you do not actually check the blocksize. Is this because there is
    no penalty for adopting this method on regular/small-blocksize filesystems? Also, the algorithm copies data for the same record for all variables, then moves to the next record until there are no more records. Would it also work (i.e., be fast) to copy all of one variable, one record at a time, and then move to the next variable? Or have you ruled that out as being slow?
    Thanks,
    c

     
  • Russell K. Rew

    Russell K. Rew - 2012-03-09

    In short, yes, yes, no, and no.

    The old nccopy (before implementing the special case for classic format files with multiple record variables) copied all of each variable whether a record variable or a fixed-size variable, and was too slow when there were lots of record variables.

    I've generated a perverse netCDF example file that demonstrates this performance problem even on file systems with disk block sizes of only 4096 bytes.  The file, which I've christened perverse.nc, has 2000 record variables, 4000 records, and is only 16 MB as a classic format file.  If you want to use it for testing, it's here:

      http://www.unidata.ucar.edu/staff/russ/public/perverse.nc

    Using the old nccopy on this relatively small netCDF file, using variable-at-a-time processing, takes about 30 seconds on my desktop Linux platform.  With the improved nccopy that implements the special case code, it's nearly an order of magnitude faster:

    work/ncdump$ clear_cache && /usr/bin/time nccopy perverse.nc tmp.nc
    + sync
    + sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
    2.97user 24.74system 0:28.79elapsed 96%CPU (0avgtext+0avgdata 7740maxresident)k
    37744inputs+31448outputs (25major+1974minor)pagefaults 0swaps
    work/ncdump$ clear_cache && /usr/bin/time ./nccopy perverse.nc tmp.nc
    + sync
    + sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
    1.90user 0.07system 0:03.14elapsed 62%CPU (0avgtext+0avgdata 3632maxresident)k
    44072inputs+31448outputs (42major+4481minor)pagefaults 0swaps

    To make measurements like this valid, it's necessary to clear the I/O buffer cache to make sure reads are really coming from disk rather than from memory, and for that I use this little shell script I inherited from Ed Hartnett, named "clear_cache.sh", that works on Linux systems:

    #!/bin/bash -x
    # Clear the disk caches.
    sync
    sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"

    -Russ

     
     

     
  • Russell K. Rew

    Russell K. Rew - 2012-03-09

    Oops, correct that first line:
    > In short, yes, yes, no, and no.
       In short, yes, yes, no, and yes.

    That is, yes, I've ruled out variable-at-a-time processing as being too slow for record variables when there are lots of them and lots of records.

    -Russ

     
  • Charlie Zender

    Charlie Zender - 2012-03-13

    Gary + Russ,

    I just committed ncks code which implements Russ' algorithm to hasten
    copying of record variables under circumstances that could cause
    slowness on Large Blocksize Filesystems or perverse files.
    Then I re-compiled and re-ran the commands shown in message #7 here

    https://sourceforge.net/projects/nco/forums/forum/9829/topic/4898620

    This tests Gary's test file.

    With unpatched ncks code this took 4m15s on Linux, 7m48s on AIX.
    The new ncks results are           0m29s on Linux, 1m04s on AIX.
    The tests in message #7 show that the new times are very close to the
    time for a raw file copy using /bin/cp.
    So the factor of eight speed-up in each case is as good as can be.

    Below are the commands used to generate the tests.
    The code is now on the NCO trunk, tagged as nco-4_1_0.
    I plan to release 4.1.0 soon after Unidata releases netCDF 4.2.0.
    I don't want to scoop Russ, since it's his algorithm I implemented.
    So right now what I'm calling NCO 4.1.0 is really beta code.
    NCO users can always install the latest code or use the executables
    in ~zender/bin/AIX if they wish. These are also stored here:

    http://nco.sf.net/src/nco-4.1.0.aix53.tar.gz

    time /glade/home/zender/bin/AIX/ncks -O -D 3 -x -v TH ~/gary.nc ~/out3_blf_4.1.0.nc

    time /glade/home/zender/bin/LINUXAMD64/ncks -O -D 3 -x -v TH ~/gary.nc ~/out3_mrg_4.1.0.nc

    Feedback welcome, hope this helps,
    Charlie

     
  • Si Liu

    Si Liu - 2012-03-13

    Thank you, Charlie.
    I will recompile it and double-check the results on the GLADE filesystem soon.

    Si

     
  • Charlie Zender

    Charlie Zender - 2012-03-15

    Hi Russ,

    I'm tidying up that patch in NCO, and have some further questions.
    In order to prioritize patching more of NCO, I want to know how the
    slowdown reading compares to the slowdown writing.
    To simplify my questions, let's use the abbreviations MM3 and
    MM4 for netCDF3 and netCDF4 Multi-record Multi-variable files,
    respectively. My understanding is that MM3s are susceptible to the 
    slowdown, while MM4s are not, and that writing MM3s without the
    patch incurs more of a penalty than reading MM3s.
    So this is how I prioritize implementing the MM3 patch:

    1. When copying MM3 to MM3. Done in ncks, TBD in others.
    2. When copying MM4 to MM3. Done in ncks, TBD in others.
    3. When copying MM3 to MM4. Not done anywhere.
    4. When reading MM3 and not writing anything. Not done anywhere.

    Currently ncks always uses the algorithm for cases 1 and 2 (i.e.,
    whenever writing to an MM3), but not for cases 3 and 4.

    The rest of NCO does not yet use the MM3 algorithm, yet there are many
    places where it would potentially benefit. I've heard through the
    years that sometimes ncecat slows to a crawl. Perhaps the MM3 slowdown
    is responsible. On the bright side, ncra and ncrcat are immune from
    the slowdown because they already read/write all record variables
    record-by-record.

    Does the prioritization above make sense? If so I will next patch
    the rest of NCO to do cases 1 and 2, before patching anything to do
    cases 3 and 4. 

    c

     
  • Russell K. Rew

    Russell K. Rew - 2012-03-15

    Hi Charlie,

    That sounds right to me, because when you're just reading a small
    portion of a disk block, you only incur the extra time for reading
    data you won't use, but when you're writing, you have to read it all
    in and rewrite the part you're changing as well as the data you didn't
    change.  So writing with large disk blocks would seem to require twice
    the I/O of reading.

    In nccopy, I just implemented the algorithm in cases 1 and 3; case 4
    doesn't occur.  I had thought case 2 currently wasn't very common, so
    it could wait, but your question has led me to rethink this.  A fairly
    common use of case 2 is converting a compressed netCDF-4 classic model
    file to an uncompressed classic file, for use with applications that
    haven't been linked to a netCDF-4 library or in archives that will
    continue to use classic format.

    I've been trying to figure out whether implementing case 2 for
    compressed input could require significantly more chunk cache than not
    using the MM3 algorithm, if you want to avoid uncompressing the same
    data over and over again.  But I think just having enough chunk cache
    to hold the largest compressed chunk for any record variable would be
    sufficient, so I've tentatively concluded that it's not an issue.
    (Where things get complicated is copying MM4 to MM4 while rechunking,
    to improve access times for read access patterns that don't match the
    way the data was written.)
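
    For reference, here is a minimal sketch of how an application can size the
    chunk cache for one chunked variable with the netCDF-4 C API so that at
    least one chunk fits; the slot count and preemption value below are only
    example settings, not tuned values:

    /* Sketch: size the chunk cache for one chunked variable so that at
     * least one (uncompressed) chunk fits, avoiding repeated
     * decompression when the same chunk is revisited record by record. */
    #include <netcdf.h>

    int size_cache_for_var(int ncid, int varid)
    {
        size_t chunksizes[NC_MAX_VAR_DIMS], chunkbytes = 1, typesize;
        int storage, ndims;
        nc_type xtype;

        nc_inq_varndims(ncid, varid, &ndims);
        nc_inq_vartype(ncid, varid, &xtype);
        nc_inq_type(ncid, xtype, NULL, &typesize);
        nc_inq_var_chunking(ncid, varid, &storage, chunksizes);
        if (storage != NC_CHUNKED)
            return NC_NOERR;            /* contiguous storage needs no chunk cache */
        for (int d = 0; d < ndims; d++)
            chunkbytes *= chunksizes[d];
        chunkbytes *= typesize;
        /* room for one chunk, a few slots, default-ish preemption */
        return nc_set_var_chunk_cache(ncid, varid, chunkbytes, 10, 0.75f);
    }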

    Thanks for presenting your prioritization.  It looks like I've got
    some more work to do, implementing case 2 in nccopy.

    -Russ

     
  • Russell K. Rew

    Russell K. Rew - 2012-03-15

    Actually the fix to add case 2 to nccopy was only about 15 minutes of work, so it will be in release 4.2 …

    -Russ

     
  • Si Liu

    Si Liu - 2012-03-26

    Hi Charlie,

    It took me a while to do more research on our WRF and CESM models and talk to some of our users.
    I believe your priority order is great. Actually, most of our big model runs only focus on #1.
    That is our main concern at this time.

    Thank you for all your help.
    Sincerely Si

    1. When copying MM3 to MM3. Done in ncks, TBD in others.
    2. When copying MM4 to MM3. Done in ncks, TBD in others.
    3. When copying MM3 to MM4. Not done anywhere.
    4. When reading MM3 and not writing anything. Not done anywhere.

     
  • Si Liu

    Si Liu - 2012-04-02

    Hi Charlie and Russ,

    One of our NCO users, Andy Mai, recently mentioned the nc__open solution to me for our slow performance problem.
    His email is attached below. From my understanding, that solution will not solve all our problems. Please confirm that.

    Charlie, what is the current status of your NCO fix? Will it take a long time to fix the rest of the operators?
    If it still needs a long time to finish, can I have your current/latest version of NCO?
    I can make it an unofficial version on our machine and our users can use it instead of the old slow version.
    We believe a partly fixed version is also going to help many of our NCO users.

    Thanks a lot.
    Sincerely Si

    Email from Andy:
    I went to Argonne for the semi-annual ParVis meeting. I talked to Rob Latham, a high-performance I/O expert. He told me to use nc__open rather than nc_open in the source code for NCO as a solution to our problem of needing 32 minutes to read a 3.2GB NetCDF file. He is certain that this will fix our problem. See documentation here:
    http://www.unidata.ucar.edu/software/netcdf/docs/netcdf-c/nc_005f_005fopen.html#nc_005f_005fopen
    I could try to get a version of the source code, make the appropriate changes (there are only two calls to nc_open in the whole NCO source) and then build the library and test it. However, the last time I tried to build NCO from source, I ended up wasting a lot of time and ultimately failed. So I might need some help with this.
    Andy
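
    For reference, nc__open() (note the double underscore) is the documented
    variant of nc_open() that lets the caller pass an I/O buffer size hint. A
    minimal sketch of the call follows; the 4 MiB hint is only an example
    value, and the file name is a placeholder. By itself this does not change
    the variable-at-a-time access pattern discussed earlier in this thread:

    /* Sketch: open a classic-format file with a larger I/O buffer hint.
     * The library may adjust the hint and returns the value it used. */
    #include <netcdf.h>
    #include <stdio.h>

    int main(void)
    {
        int ncid, status;
        size_t bufrsizehint = 4 * 1024 * 1024;   /* desired buffer size in bytes (example) */

        status = nc__open("in.nc", NC_NOWRITE, &bufrsizehint, &ncid);
        if (status != NC_NOERR) {
            fprintf(stderr, "%s\n", nc_strerror(status));
            return 1;
        }
        printf("I/O buffer size actually used: %zu bytes\n", bufrsizehint);
        /* A bigger buffer alone does not change an application that still
         * reads a variable at a time, which is why this may not remove
         * the record-interleaving penalty on large-block filesystems. */
        nc_close(ncid);
        return 0;
    }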

     
