
Very slow I/O on large blocksize filesystems

2011-12-22
2013-10-17
  • Gary Strand

    Gary Strand - 2011-12-22

    (Preface: I've been a very happy user of nco for many years)

    Technical details:
    NCO netCDF Operators version "4.0.8" last modified 2011/04/26 built Oct 18 2011 on mirage4 by jam
    ncks version 4.0.8
    Linked to netCDF library version 4.1.3, compiled Jul 26 2011 15:05:13
    Copyright (C) 1995-2011 Charlie Zender

    Problem: This issue may be related to the NOFILL issue with netCDF 4.1.2; in any case, on filesystems with large blocksizes (2 MB, for example, as on Lustre and NCAR's GLADE system) the I/O performance of even simple 'ncks' operations is horrible - time-to-completion ratios (compared to smaller-blocksize filesystems) of 300:1 or even 1500:1 are not uncommon.

    Investigation with NCAR CISL staff showed that a simple variable extraction that takes about 20 seconds on a small-blocksize filesystem takes about 40 minutes on the GLADE filesystem (a 120:1 ratio). The following was found:
                                                     
    12/20/11 3:57 PM JAM
    I should add that the actual performance for the first 39 minute test was around 30MB/sec for reads and 12MB/sec for writes.  So nco may be doing something else inefficiently in addition to reading/writing extra data.
                                                     
    12/20/11 3:52 PM JAM
    Hi, we've done some testing since first getting this ticket and have found that the performance of ncks on filesystems with large block sizes (most of Glade is at a 2MB block size) is VERY bad and it seems to be reading/writing much more data than necessary.

    The test we used was: "ncks -x -v TH b40.1850.track1.1deg.006a.cam2.h1.1012-01-01-00000.nc out.nc"

    - The input file is 3.3GB and the output file is 1.1GB.  On an idle system (storm4) this command took around 39 minutes to complete when either input or output file is on a Glade filesystem.  During this time 60GB was read from glade and 26GB was written.

    - Adding the -4 option to ncks resulted in the test taking 10 minutes to complete with 30GB read and 1.2GB written. 

    - We also ran the same tests on a Lustre filesystem with a 1MB block size and saw similar bad performance. 

    - Finally, running the same test with both input/output files in /tmp (local drive, 4k block size) finished in 17 seconds.

    I don't know exactly what ncks does or how it does it, but there seems to be an issue with large-block filesystems possibly causing it to read and write overlapping blocks of data, resulting in the very large numbers of extra bytes read/written listed above.  Large-block filesystems also caused the silent data corruption issue with nco a few months back, which could be related.

    Given this information and your plots, the effect of system load and number of users is not as significant as we originally thought, and the bad performance on GLADE is most likely related to the actual amount of data being transferred (86GB in the worst case).

    Is this an NCO problem or a netCDF-4 problem?
                                              

     
  • Charlie Zender

    Charlie Zender - 2011-12-22

    Hey Gary,

    That's crazy slow.
    Thanks for reporting this. Have not tested NCO w/ netCDF 4.1.3 on GLADE myself.
    Will need to do so before having any substantive comments.
    But, am leaving tomorrow on internet-less trip until 1/1.
    So, two suggestions that may tide you over until 2012:
    Try ~zender/bin/*/ncks, see if that works better.
    Try older versions of /usr/local/bin/ncks, which CISL usually puts in other directories.
    These could be affected by NOFILL bug, though, so best to try -4 on output files.
    c

     
  • Gary Strand

    Gary Strand - 2012-01-03

    Thanks, Charlie.

    I've played a bit with your suggestions, and haven't seen any improvement. I'd really prefer to keep the files I'm creating as netCDF-3, so the '-4' option isn't viable.

    Thanks for looking into this.

     
  • Charlie Zender

    Charlie Zender - 2012-01-03

    I am back in the office after two weeks. I found my cryptocard. If my GLADE login still works then I will look at this today. Please post the location of your test case file so I can duplicate it exactly, i.e., what is the path to

    b40.1850.track1.1deg.006a.cam2.h1.1012-01-01-00000.nc

     
  • Gary Strand

    Gary Strand - 2012-01-03

    See

    /glade/user/strandwg/NCO/b40.1850.track1.1deg.006a.cam2.h1.1012-01-01-00000.nc

    Thanks!

     
  • Charlie Zender

    Charlie Zender - 2012-01-03

    Hi Gary,

    I have reproduced the problem you are experiencing using NCO on the
    large block filesystem (LBF) named GLADE. The binaries in
    ~zender/bin/ improve the performance by about a factor
    of two relative to NCO 4.0.8, but something lower in the software
    stack than NCO, e.g., the netCDF library or the filesystem itself,
    seems to cause the gross degradation in performance relative to
    NCO on smaller block filesystems.

    Without going into too much detail, and for the benefit and comment of
    others following this issue, my conclusions about the slow performance
    of NCO on LBFs (i.e., GLADE) on both AIX and Linux are:

    0. NCO (and ncks in particular) doesn't use any fancy algorithms.
    NCO uses only official, documented netCDF API calls to do its work.
    NCO does not pay attention to block-sizes. Unless hyperslabbing is
    requested, NCO transfers entire variables with _one call_ (rather than
    with multiple consecutive calls) to the nc_get_var_*/nc_put_var_*
    functions (a sketch of that access pattern follows this list).
    1. Slow performance on LBFs is experienced when any version of NCO is
    linked to any netCDF version including 4.1.3. I tested this with NCO
    3.9.6 on AIX (using the bluefire default, i.e., /usr/local/bin/ncks
    which is linked to netCDF 3.6.2), and Gary or CISL tested this with
    NCO 4.0.8 on Linux (unsure what library they used).
    2. NCO version 4.0.8 worsens the performance relative to other
    versions of NCO, but changes in 4.0.8 do not cause the underlying
    problem. 4.0.8 uses netCDF fill-mode to work around the netCDF 4.1.2
    (and all preceding versions) "NOFILL" bug. This causes 4.0.8 to write
    (at least) twice as much data as other versions of NCO.
    3. NCO version 4.0.9, which is in beta and not yet released, improves
    the performance by about a factor of two relative to 4.0.8. This is
    consistent with the reversion of 4.0.9 to previous NCO behavior which
    utilizes the netCDF NOFILL feature to reduce writes by (at least) a
    factor of two. It is only safe to use NCO 4.0.9+ with netCDF
    4.1.3+. Otherwise the netCDF NOFILL bug may be triggered.
    4. NCO operations on LBFs are twice as fast on Linux as on AIX.
    Extracting large datasets to netCDF3 files rather than netCDF4 files
    takes ~2.5 times as long. These factors are independent, so the best
    performance on large block filesystems is obtained with NCO 4.0.9 (or
    any NCO except 4.0.8) under Linux writing netCDF4 files. The worst
    performance will be with NCO 4.0.8 under AIX writing netCDF3 files.
    5. Improving NCO performance on LBFs may require more detailed
    performance analysis and algorithms for sub-setting. An obvious place
    to start is to use a blocksize-sensitive copy size. Recent versions of
    nccopy use such an algorithm, I believe. However, this would require a
    significant code refactoring of NCO, which is not currently funded.
    NASA may, however, fund implementation of groups in NCO. More on that
    in coming weeks. Maybe those funds can leverage some of this work.
    6. Having written this much I'd like to hear from others before
    blabbing-on. I wasn't aware there was any penalty for LBFs, so credit
    goes to Gary for reporting the dramatic slow-downs on GLADE.
    Any good ideas for methods to speed up netCDF3 writes on LBFs?
    Are these performance penalties for LBFs better understood by others?
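
    To make point 0 concrete, here is a minimal sketch of the whole-variable
    access pattern using only the public netCDF C API (illustrative only, not
    NCO source; the file name is a placeholder and error handling is omitted):

    /* Sketch: read an entire variable with a single netCDF call, the
     * access pattern ncks uses when no hyperslab is requested.  On a
     * classic file with many record variables, this one call touches
     * every record, so every large disk block holding any part of the
     * variable gets read. */
    #include <netcdf.h>
    #include <stdlib.h>

    int main(void)
    {
        int ncid, varid, ndims, dimids[NC_MAX_VAR_DIMS];
        size_t len, nelem = 1;
        float *buf;

        nc_open("in.nc", NC_NOWRITE, &ncid);      /* "in.nc" is a placeholder */
        nc_inq_varid(ncid, "TH", &varid);         /* variable from the test case */
        nc_inq_varndims(ncid, varid, &ndims);
        nc_inq_vardimid(ncid, varid, dimids);
        for (int i = 0; i < ndims; i++) {         /* total number of elements */
            nc_inq_dimlen(ncid, dimids[i], &len);
            nelem *= len;
        }
        buf = malloc(nelem * sizeof(float));
        nc_get_var_float(ncid, varid, buf);       /* one call, whole variable */
        free(buf);
        nc_close(ncid);
        return 0;
    }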

    Charlie

    Output of selected commands (extraneous stuff deleted):

    # Copying 3 GB takes ~1 minute with AIX on GLADE
    zender@be1005en:~$ time /bin/cp /glade/user/strandwg/NCO/b40.1850.track1.1deg.006a.cam2.h1.1012-01-01-00000.nc ~/gary.nc
    real    1m3.219s

    # Copying 3 GB takes ~30 seconds with Linux on GLADE, twice as fast as AIX
    zender@mirage0:~$ time /bin/cp /glade/user/strandwg/NCO/b40.1850.track1.1deg.006a.cam2.h1.1012-01-01-00000.nc ~/gary.nc
    real    0m30.812s

    # Test case takes ~8 minutes with ncks 3.9.6 on AIX
    zender@be1005en:~$ /usr/local/bin/ncks -lbr
    Linked to netCDF library version "3.6.2", compiled Apr  3 2007 14:19:36
    zender@be1005en:~$ /usr/local/bin/ncks -vrs
    NCO netCDF Operators version "3.9.6" last modified 2009/01/21 built Jan 28 2009 on be1105en by ddvento
    ncks version 3.9.6
    zender@be1005en:~$ time /usr/local/bin/ncks -O -D 3 -x -v TH ~/gary.nc ~/out3_blf_3.9.6.nc
    real    8m9.658s

    # Test case takes ~8 minutes with ncks 4.0.9 on AIX
    zender@be1005en:~$ /glade/home/zender/bin/AIX/ncks -lbr
    Linked to netCDF library version 4.1.3, compiled Aug 25 2011 08:32:40
    zender@be1005en:~$ /glade/home/zender/bin/AIX/ncks -vrs
    NCO netCDF Operators version 20120103 built Jan  3 2012 on be1005en.ucar.edu by zender
    zender@be1005en:~$ time /glade/home/zender/bin/AIX/ncks -O -D 3 -x -v TH ~/gary.nc ~/out3_blf_4.0.9.nc
    real    7m48.197s

    # netCDF4 improves speed relative to netCDF3 by factor of ~2.5 on AIX
    zender@be1005en:~$ time /glade/home/zender/bin/AIX/ncks -O -4 -D 3 -x -v TH ~/gary.nc ~/out4_blf_4.0.9.nc
    real    2m42.123s

    # Test case takes ~4 minutes with ncks 4.0.9 on Linux
    zender@mirage0:~$ /glade/home/zender/bin/LINUXAMD64/ncks -lbr
    Linked to netCDF library version 4.1.3, compiled Jul 26 2011 15:05:13
    zender@mirage0:~$ /glade/home/zender/bin/LINUXAMD64/ncks -vrs
    NCO netCDF Operators version 20120103 built Jan  3 2012 on mirage0 by zender
    zender@mirage0:~$ time /glade/home/zender/bin/LINUXAMD64/ncks -O -D 3 -x -v TH ~/gary.nc ~/out3_mrg_4.0.9.nc
    real    4m15.493s

    # netCDF4 improves speed relative to netCDF3 by factor of ~2.5 on Linux
    zender@mirage0:~/nco$ time /glade/home/zender/bin/LINUXAMD64/ncks -O -4 -D 3 -x -v TH ~/gary.nc ~/out4_mrg_4.0.9.nc
    real    1m44.345s

     
  • Russell K. Rew

    Russell K. Rew - 2012-03-06

    It's taken a while, but I think I have a fairly complete explanation
    for this slowdown, when it occurs, and how to deal with it.

    The slowdown occurs when all these conditions are met:

      1.  You're dealing with netCDF classic or 64-bit offset format
          files, not netCDF-4 or netCDF-4 classic model files.

      2.  You have an unlimited dimension and many record variables that
          use it.

      3.  The file system has a large block size, the atomic size for
          disk access.

    In this case, doing things a variable at a time instead of a record at
    a time can be very slow, because accessing all the data in a variable
    (or some part of each record for a variable) typically reads each
    record multiple times, once for each record variable you're dealing
    with.  That's because the block size is larger than it needs to be to
    hold a record's worth of data for each variable, so accessing the nth
    record's data for a variable typically reads in more data than is
    needed.

    Consider a case that's not too atypical: a block size of 2 MiBytes,
    365 records, and 100 float record variables, each dimensioned
    (time=365, lat=73, lon=145) where time is the record dimension.  A
    record's worth of data for each variable is only 73*145*4 = 42340
    bytes, and each variable has 365 records.  So reading one variable of
    size 365*73*145*4 bytes, about 15.5 Mbytes, actually reads 365 disk blocks,
    which is 365*2097152 bytes or about 765 Mbytes.  That's about 50 times more
    bytes read than needed.  If you operate on every variable in the file,
    one at a time, the result is 50 times more I/O than necessary, which
    explains why it might be 50 times slower than it would be if you used
    fixed size variables, stored contiguously, rather than record
    variables, stored in pieces scattered throughout the records of the
    file.

    How can you deal with this to get efficient processing of such files?
    Here are some workarounds and solutions:

    1.  Don't use the unlimited dimension if you don't really need it.

    2.  Make sure the record size of each variable is at least as big as
        the disk block size.

    3.  Convert your record-oriented file to a file with only fixed size
        dimensions before using it in processing.  There's an nco operator
        for this, or you can use "nccopy -u infile outfile" to make the
        unlimited dimension a fixed size.

    4.  Change the processing algorithms to read input a record at a time
        instead of a variable at a time, processing all the record
        variables after each record variable has been read.

    5.  Use netCDF-4 classic model files or regular netCDF-4 files.  With
        the netCDF-4/HDF5 format, data is accessed in disk blocks, if
        stored contiguously, or by chunks for chunked data.  A chunk only
        contains data from a single variable.  Making chunks larger than
        disk blocks ensures that I/O will be efficient.  If data is
        compressed, each chunk is compressed separately, so if compressed
        chunks are much smaller than the disk block size, inefficiencies
        may still occur.

    I will be using approach number 4 to detect and deal with this
    situation on systems with large block size.  If anyone has useful
    heuristics for this, I'd like to hear about them.
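
    For anyone who wants to experiment with approach 4 above, here is a rough
    sketch of the record-at-a-time access pattern using the public netCDF C
    API (illustrative code, not from nccopy or NCO; names and buffers are
    made up and error handling is omitted):

    /* Sketch of approach 4: loop over records in the outer loop and over
     * record variables in the inner loop, so each record's disk blocks
     * are read only once per pass instead of once per record variable. */
    #include <netcdf.h>

    void process_record_at_a_time(int ncid, int nrec_vars, const int *rec_varids,
                                  size_t nrec, void **bufs)
    {
        size_t start[NC_MAX_VAR_DIMS] = {0};
        size_t count[NC_MAX_VAR_DIMS];

        for (size_t rec = 0; rec < nrec; rec++) {       /* outer loop: records   */
            for (int v = 0; v < nrec_vars; v++) {       /* inner loop: variables */
                int varid = rec_varids[v], ndims, dimids[NC_MAX_VAR_DIMS];
                nc_inq_varndims(ncid, varid, &ndims);
                nc_inq_vardimid(ncid, varid, dimids);
                start[0] = rec;
                count[0] = 1;                           /* one record ...        */
                for (int d = 1; d < ndims; d++)         /* ... full extent of the other dims */
                    nc_inq_dimlen(ncid, dimids[d], &count[d]);
                /* bufs[v] must hold one record's worth of data for variable v */
                nc_get_vara(ncid, varid, start, count, bufs[v]);
                /* process this record's worth of data for variable v here */
            }
        }
    }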

     
  • Russell K. Rew

    Russell K. Rew - 2012-03-06

    A couple of typos/brainos slipped through that may have made that last post unclear:

    > 4.  Change the processing algorithms to read input a record at a time
    >     instead of a variable at a time, processing all the record
    >     variables after each record variable has been read.

    should have been

      4.  Change the processing algorithms to read input a record at a time
          instead of a variable at a time, processing all the record
          variables after each record has been read.

    and

    > I will be using approach number 4 to detect and deal with this
    > situation on systems with large block size.

    should have been

      I will be using approach number 4 in nccopy to detect and deal with this
      situation on systems with large block size.

    -Russ

     
  • Charlie Zender

    Charlie Zender - 2012-03-06

    Thanks for figuring this out, Russ.
    It makes sense and I learned more about what blocksize really means.

    The relevant NCO command is

    ncks -fix_rec_dmn in.nc out.nc

     
  • Russell K. Rew

    Russell K. Rew - 2012-03-07

    I've added a couple of hundred lines of C code to nccopy to implement this new
    rule of thumb:

      When accessing data from netCDF classic or 64-bit offset format files
      that have multiple record variables and a lot of records on a file
      system with large disk block size relative to a record's worth of data
      for one or more record variables, access the data a record at a time
      instead of a variable at a time.

    The new version of nccopy will be in the upcoming 4.2 release.

    The improvement this makes in a typical case of using nccopy to convert or copy
    a 1.13 GB sample file that has lots of record variables, on a system with a 512KB
    disk block size, is significant: a factor of 24 in elapsed time.  I just tried it on the
    same sample file and the /glade/scratch file system on one of NCAR CISL's platforms,
    and there the result was a factor-of-60 speedup.

    It's unfortunate that this is not a problem that can be fixed in the library; it's a matter of
    using the above rule of thumb in applications that make use of the library, which requires
    a special case.  However, it turns out that I didn't have to bother with testing the disk
    block size at all in the new nccopy code.  The special case code works fine and is no
    slower even when the disk block size is small, like 4096 bytes.  It just speeds things up
    more for a larger disk block size.

    Actually we may eventually be able to speed things up in the netCDF library as well for
    classic-format files by replacing all the (over-)optimized code that currently uses read(2)
    and write(2) system calls with ordinary stdio fread(3) and fwrite(3) calls, but that may have
    to wait for version 4.3 …

    -Russ

     
  • Charlie Zender

    Charlie Zender - 2012-03-08

    Interesting. It should be possible to implement a similar patch to ncks et al. before the next release. This is now TODO nco1036.
    c

     
  • Si Liu

    Si Liu - 2012-03-08

    Hi Russ and Charlie,

    Thank you for all your kind help. I have re-compiled Russ's modified version of netcdf-4.2 on our machine (NCAR mirage/storm)
    and have seen the huge performance improvement of nccopy on netCDF-3 files.

    We found that the old nccopy accessed the data a variable at a time instead of a record at a time.
    Besides nccopy, are there any other netCDF utilities that access the data a variable at a time and need to be modified?

    I had assumed that NCO operators, e.g. ncks, call nccopy and other netCDF utilities directly, so that fixing nccopy and the other utilities would fix the NCO problems without touching NCO source code. It looks like I am wrong about this: NCO operators read the data themselves and decide whether to read a variable or a record at a time. Could you confirm this?
    If that is the case, how many NCO operators need to be modified? Is that a lot of extra work?

    For Gary's ncks problem, I have a temporary but ugly solution. I can create a wrapper to:
    1) convert the netCDF-3 file to netCDF-4 format using the new nccopy
    2) run ncks on the netCDF-4 file
    3) convert the netCDF-4 output back to netCDF-3 format if necessary
    With the new nccopy, all steps will now be efficient on large-block filesystems.

    Best wishes,
    Sincerely Si Liu
    NCAR CISL Consulting

     
  • Charlie Zender

    Charlie Zender - 2012-03-08

    Hi Si,

    Your understanding is correct.

    The most important place NCO needs to be patched is nco_cpy_var_val() in file nco_var_utl.c.
    That would solve Gary's problem and is a relatively straightforward change to make.
    There is one other routine used for copying when limits are specified, nco_cpy_var_val_mlt_lmt() in file nco_msa.c, but that is lower priority because it is much more complex and it is invoked less often.

    Although it is high on our TODO list, we will gladly accept patches!
    cz

     
  • Charlie Zender

    Charlie Zender - 2012-03-08

    Hi Russ,

    I can't find your nccopy patch in the daily snapshot of netCDF 4.2.
    Is it in nccopy.c? If so, which lines approximately contain the core of the patch? If not, where is it?

    Thanks,
    c

     
  • Russell K. Rew

    Russell K. Rew - 2012-03-09

    Hi Charlie,

    It's apparently not in the daily snapshot yet, just the svn trunk for 4.2.  You can access that from http://svn.unidata.ucar.edu/repos/netcdf/trunk/.  The new code is mainly in ncdump/nccopy.c, with a function prototype declaration and code moved from dumplib.h/dumplib.c to utils.h/utils.c in the same directory.  The code is isolated in the functions called from the copy() function in nccopy.c, in this block:

        /* For performance, special case netCDF-3 input file with record
         * variables, to copy a record-at-a-time instead of a
         * variable-at-a-time. We should also eventually do something
         * similar for netCDF-3 output file, but converting netCDF-4 files
         * to netCDF-3 files is less common … */
        if(nc3_special_case(igrp, inkind)) {
            size_t nfixed_vars, nrec_vars;
            int *fixed_varids;
            int *rec_varids;
            NC_CHECK(classify_vars(igrp, &nfixed_vars, &fixed_varids, &nrec_vars, &rec_varids));
            NC_CHECK(copy_fixed_size_data(igrp, ogrp, nfixed_vars, fixed_varids));
            NC_CHECK(copy_record_data(igrp, ogrp, nrec_vars, rec_varids));
        } else {
            NC_CHECK(copy_data(igrp, ogrp)); /* recursive, to handle nested groups */
        }
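
    For readers following along, here is a rough sketch of the kind of test
    such a special case needs, written with only the public netCDF C API
    (this is not the actual nc3_special_case() code, just an illustration of
    the checks involved; the function name is made up):

    /* Sketch only: does this open file look like a classic or 64-bit
     * offset file with an unlimited dimension and more than one record
     * variable?  Those are the files that benefit from record-at-a-time
     * copying. */
    #include <netcdf.h>

    static int looks_like_multirecord_classic(int ncid)
    {
        int format, unlimdimid, nvars, nrec_vars = 0;

        nc_inq_format(ncid, &format);
        if (format != NC_FORMAT_CLASSIC && format != NC_FORMAT_64BIT)
            return 0;                   /* netCDF-4 data is chunked per variable */
        nc_inq_unlimdim(ncid, &unlimdimid);
        if (unlimdimid == -1)
            return 0;                   /* no record dimension, no interleaving */
        nc_inq_nvars(ncid, &nvars);
        for (int varid = 0; varid < nvars; varid++) {
            int ndims, dimids[NC_MAX_VAR_DIMS];
            nc_inq_varndims(ncid, varid, &ndims);
            nc_inq_vardimid(ncid, varid, dimids);
            if (ndims > 0 && dimids[0] == unlimdimid)
                nrec_vars++;
        }
        return nrec_vars > 1;           /* interleaving hurts only with several record variables */
    }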

    I'm open to suggestions for making the code clearer or any other improvements.  I'm also happy to answer questions about it.

    -Russ

     
  • Charlie Zender

    Charlie Zender - 2012-03-09

    Thanks. OK. This makes the desired algorithm clear. I see you do not actually check the blocksize. Is this because there is
    no penalty for adopting this method on regular/small-blocksize filesystems? Also, the algorithm copies data for the same record for all variables, then moves to the next record until there are no more records. Would it also work (i.e., be fast) to copy all of one variable, one record at a time, and then move to the next variable? Or have you ruled that out as being slow?
    Thanks,
    c

     
  • Russell K. Rew

    Russell K. Rew - 2012-03-09

    In short, yes, yes, no, and no.

    The old nccopy (before implementing the special case for classic format files with multiple record variables) copied all of each variable whether a record variable or a fixed-size variable, and was too slow when there were lots of record variables.

    I've generated a perverse netCDF example file that demonstrates this performance problem even on file systems with disk block sizes of only 4096 bytes.  The file, which I've christened perverse.nc, has 2000 record variables, 4000 records, and is only 16 MB as a classic format file.  If you want to use it for testing, it's here:

      http://www.unidata.ucar.edu/staff/russ/public/perverse.nc

    Using the old nccopy on this relatively small netCDF file, using variable-at-a-time processing, takes about 30 seconds on my desktop Linux platform.  With the improved nccopy that implements the special case code, it's nearly an order of magnitude faster:

    work/ncdump$ clear_cache && /usr/bin/time nccopy perverse.nc tmp.nc
    + sync
    + sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
    2.97user 24.74system 0:28.79elapsed 96%CPU (0avgtext+0avgdata 7740maxresident)k
    37744inputs+31448outputs (25major+1974minor)pagefaults 0swaps
    work/ncdump$ clear_cache && /usr/bin/time ./nccopy perverse.nc tmp.nc
    + sync
    + sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
    1.90user 0.07system 0:03.14elapsed 62%CPU (0avgtext+0avgdata 3632maxresident)k
    44072inputs+31448outputs (42major+4481minor)pagefaults 0swaps

    To make measurements like this valid, it's necessary to clear the I/O buffer cache to make sure reads are really coming from disk rather than from memory, and for that I use this little shell script I inherited from Ed Hartnett, named "clear_cache.sh", that works on Linux systems:

    #!/bin/bash -x
    # Clear the disk caches.
    sync
    sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"

    -Russ

     
     

     
  • Russell K. Rew

    Russell K. Rew - 2012-03-09

    Oops, correct that first line:
    > In short, yes, yes, no, and no.
       In short, yes, yes, no, and yes.

    That is, yes, I've ruled out variable-at-a-time processing as being too slow for record variables when there are lots of them and lots of records.

    -Russ

     
  • Charlie Zender

    Charlie Zender - 2012-03-13

    Gary + Russ,

    I just committed ncks code which implements Russ' algorithm to hasten
    copying of record variables under circumstances that could cause
    slowness on Large Blocksize Filesystems or perverse files.
    Then I re-compiled and re-ran the commands shown in message #7 here

    https://sourceforge.net/projects/nco/forums/forum/9829/topic/4898620

    This tests Gary's test file.

    With unpatched ncks code this took 4m15s on Linux, 7m48s on AIX.
    The new ncks results are           0m29s on Linux, 1m04s on AIX.
    The tests in message #7 show that the new times are very close to the
    time for a raw file copy using /bin/cp.
    So the factor of eight speed-up in each case is as good as can be.

    Below are the commands used to generate the tests.
    The code is now on the NCO trunk, tagged as nco-4_1_0.
    I plan to release 4.1.0 soon after Unidata releases netCDF 4.2.0.
    I don't want to scoop Russ, since it's his algorithm I implemented.
    So right now what I'm calling NCO 4.1.0 is really beta code.
    NCO users can always install the latest code or use the executables
    in ~zender/bin/AIX if they wish. These are also stored here:

    http://nco.sf.net/src/nco-4.1.0.aix53.tar.gz

    time /glade/home/zender/bin/AIX/ncks -O -D 3 -x -v TH ~/gary.nc ~/out3_blf_4.1.0.nc

    time /glade/home/zender/bin/LINUXAMD64/ncks -O -D 3 -x -v TH ~/gary.nc ~/out3_mrg_4.1.0.nc

    Feedback welcome, hope this helps,
    Charlie

     
  • Si Liu

    Si Liu - 2012-03-13

    Thank you, Charlie.
    I will recompile it and double-check the results on the GLADE filesystem soon.

    Si

     
  • Charlie Zender

    Charlie Zender - 2012-03-15

    Hi Russ,

    I'm tidying up that patch in NCO, and have some further questions.
    In order to prioritize patching more of NCO, I want to know how the
    slowdown reading compares to the slowdown writing.
    To simplify my questions, let's use the abbreviations MM3 and
    MM4 for netCDF3 and netCDF4 Multi-record Multi-variable files,
    respectively. My understanding is that MM3s are susceptible to the 
    slowdown, while MM4s are not, and that writing MM3s without the
    patch incurs more of a penalty than reading MM3s.
    So this is how I prioritize implementing the MM3 patch:

    1. When copying MM3 to MM3. Done in ncks, TBD in others.
    2. When copying MM4 to MM3. Done in ncks, TBD in others.
    3. When copying MM3 to MM4. Not done anywhere.
    4. When reading MM3 and not writing anything. Not done anywhere.

    Currently ncks always uses the algorithm for cases 1 and 2 (i.e.,
    whenever writing to an MM3), but not for cases 3 and 4.

    The rest of NCO does not yet use the MM3 algorithm, yet there are many
    places where it would potentially benefit. I've heard through the
    years that sometimes ncecat slows to a crawl. Perhaps the MM3 slowdown
    is responsible. On the bright side, ncra and ncrcat are immune from
    the slowdown because they already read/write all record variables
    record-by-record.

    Does the prioritization above make sense? If so I will next patch
    the rest of NCO to do cases 1 and 2, before patching anything to do
    cases 3 and 4. 

    c

     
  • Russell K. Rew

    Russell K. Rew - 2012-03-15

    Hi Charlie,

    That sounds right to me, because when you're just reading a small
    portion of a disk block, you only incur the extra time for reading
    data you won't use, but when you're writing, you have to read it all
    in and rewrite the part you're changing as well as the data you didn't
    change.  So writing with large disk blocks would seem to require twice
    the I/O of reading.

    In nccopy, I just implemented the algorithm in cases 1 and 3; case 4
    doesn't occur.  I had thought case 2 currently wasn't very common, so
    it could wait, but your question has led me to rethink this.  A fairly
    common use of case 2 is converting a compressed netCDF-4 classic model
    file to an uncompressed classic file, for use with applications that
    haven't been linked to a netCDF-4 library or in archives that will
    continue to use classic format.

    I've been trying to figure out whether implementing case 2 for
    compressed input could require significantly more chunk cache than not
    using the MM3 algorithm, if you want to avoid uncompressing the same
    data over and over again.  But I think just having enough chunk cache
    to hold the largest compressed chunk for any record variable would be
    sufficient, so I've tentatively concluded that it's not an issue.
    (Where things get complicated is copying MM4 to MM4 while rechunking,
    to improve access times for read access patterns that don't match the
    way the data was written.)
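
    For reference, here is a minimal sketch of how an application can size the
    chunk cache for one chunked variable with the netCDF-4 C API so that at
    least one chunk fits; the slot count and preemption value below are only
    example settings, not tuned values:

    /* Sketch: size the chunk cache for one chunked variable so that at
     * least one (uncompressed) chunk fits, avoiding repeated
     * decompression when the same chunk is revisited record by record. */
    #include <netcdf.h>

    int size_cache_for_var(int ncid, int varid)
    {
        size_t chunksizes[NC_MAX_VAR_DIMS], chunkbytes = 1, typesize;
        int storage, ndims;
        nc_type xtype;

        nc_inq_varndims(ncid, varid, &ndims);
        nc_inq_vartype(ncid, varid, &xtype);
        nc_inq_type(ncid, xtype, NULL, &typesize);
        nc_inq_var_chunking(ncid, varid, &storage, chunksizes);
        if (storage != NC_CHUNKED)
            return NC_NOERR;            /* contiguous storage needs no chunk cache */
        for (int d = 0; d < ndims; d++)
            chunkbytes *= chunksizes[d];
        chunkbytes *= typesize;
        /* room for one chunk, a few slots, default-ish preemption */
        return nc_set_var_chunk_cache(ncid, varid, chunkbytes, 10, 0.75f);
    }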

    Thanks for presenting your prioritization.  It looks like I've got
    some more work to do, implementing case 2 in nccopy.

    -Russ

     
  • Russell K. Rew

    Russell K. Rew - 2012-03-15

    Actually the fix to add case 2 to nccopy was only about 15 minutes of work, so it will be in release 4.2 …

    -Russ

     
  • Si Liu

    Si Liu - 2012-03-26

    Hi Charlie,

    It took me a while to do more research on our WRF and CESM models and talk to some of our users.
    I believe your priority order is great. Actually, most of our big model runs only focus on #1.
    That is our main concern at this time.

    Thank you for all your help.
    Sincerely Si

    1. When copying MM3 to MM3. Done in ncks, TBD in others.
    2. When copying MM4 to MM3. Done in ncks, TBD in others.
    3. When copying MM3 to MM4. Not done anywhere.
    4. When reading MM3 and not writing anything. Not done anywhere.

     
  • Si Liu

    Si Liu - 2012-04-02

    Hi Charlie and Russ,

    One of our NCO users, Andy Mai, recently mentioned the nc__open solution to me for our slow performance problem.
    His email is attached below. From my understanding, that solution will not solve all our problems. Please confirm that.

    Charlie, what is the current status of your NCO fix? Will it take a long time to fix the rest of the operators?
    If it still needs a long time to finish, can I have your current/latest version of NCO?
    I can make it an unofficial version on our machine and our users can use it instead of the old slow version.
    We believe a partly fixed version is also going to help many of our NCO users.

    Thanks a lot.
    Sincerely Si

    Email from Andy:
    I went to Argonne for the semi-annual ParVis meeting. I talked to Rob Latham, a high-performance I/O expert. He told me to use nc__open rather than nc_open in the source code for NCO as a solution to our problem of needing 32 minutes to read a 3.2GB NetCDF file. He is certain that this will fix our problem. See documentation here:
    http://www.unidata.ucar.edu/software/netcdf/docs/netcdf-c/nc_005f_005fopen.html#nc_005f_005fopen
    I could try to get a version of the source code, make the appropriate changes (there are only two calls to nc_open in the whole NCO source) and then build the library and test it. However, the last time I tried to build NCO from source, I ended up wasting a lot of time and ultimately failed. So I might need some help with this.
    Andy
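
    For reference, nc__open() (note the double underscore) is the documented
    variant of nc_open() that lets the caller pass an I/O buffer size hint. A
    minimal sketch of the call follows; the 4 MiB hint is only an example
    value, and the file name is a placeholder. By itself this does not change
    the variable-at-a-time access pattern discussed earlier in this thread:

    /* Sketch: open a classic-format file with a larger I/O buffer hint.
     * The library may adjust the hint and returns the value it used. */
    #include <netcdf.h>
    #include <stdio.h>

    int main(void)
    {
        int ncid, status;
        size_t bufrsizehint = 4 * 1024 * 1024;   /* desired buffer size in bytes (example) */

        status = nc__open("in.nc", NC_NOWRITE, &bufrsizehint, &ncid);
        if (status != NC_NOERR) {
            fprintf(stderr, "%s\n", nc_strerror(status));
            return 1;
        }
        printf("I/O buffer size actually used: %zu bytes\n", bufrsizehint);
        /* A bigger buffer alone does not change an application that still
         * reads a variable at a time, which is why this may not remove
         * the record-interleaving penalty on large-block filesystems. */
        nc_close(ncid);
        return 0;
    }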

     
