From: Francesc A. <fa...@py...> - 2004-11-03 18:08:15
On Tuesday 02 November 2004 23:56, Jeff Whitaker wrote:
> Francesc: There was no good reason for using Array instead of EArray
> for rank-1 variables, other than I wasn't sure EArrays were appropriate
> for variables that were not going to be appended to.
Yes, they are. The only reason for using an Array instead of an EArray is
a matter of simplicity. For me, creating Array objects from the interactive
console is far easier than creating EArrays, but after creation both
objects work very similarly. However, EArray does support filters (apart
from being extensible), so for programs, and whenever compression is
desirable, I recommend using EArray objects for everything, even when you
don't want to enlarge them.
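For instance, here is a minimal sketch (made-up file and dataset names,
but the same PyTables 0.9 calls as in the attached script) of an EArray
that is compressed with zlib, filled once and never enlarged afterwards:

import tables, numarray

h5file = tables.openFile("/tmp/compressed.h5", mode="w")
# A 0 in the shape marks the enlargeable dimension; 'd' is the Numeric
# typecode for Float64, the same kind of value var.typecode() returns
atom = tables.Atom(dtype='d', shape=(0, 100))
filters = tables.Filters(complevel=5, complib="zlib", shuffle=1)
earray = h5file.createEArray(h5file.root, "data", atom, "Test array",
                             filters=filters, expectedrows=1000)
# Fill it once; nothing forces you to ever append to it again
earray.append(numarray.zeros((1000, 100), type=numarray.Float64))
h5file.close()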
> If I replace
>
>     vardata.append(var[n:n+1])
>
> with
>
>     if dtype == 'c':
>         chararr = numarray.strings.array(var[n].tolist())
>         newshape = list(chararr.shape)
>         newshape.insert(0, 1)
>         chararr.setshape(tuple(newshape))
>         vardata.append(chararr)
>     else:
>         vardata.append(var[n:n+1])
>
>
> it seems to work. Note that I have to reshape the chararray to have an
> extra singleton dimension or pytables complains that the data being
> appended has the wrong shape. This is also the reason I had to use
> var[n:n+1] instead of var[n] in the append. Is there a better way to do
> this?
Well, this is a subtle problem, as append(array) expects array to have the
same shape as the atom, with the extensible dimension counting the rows to
be added. When the array has one dimension less, I guess it would be safe
to assume that what the user wants is to add a single row along the
extensible dimension. Frankly, I don't know whether implementing this kind
of behaviour would help the user to understand how append() works.
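In other words, to append a single record right now you must give it an
explicit length-1 extensible dimension yourself. A minimal sketch, assuming
var[n] is a single record of shape (10,) and the EArray atom shape is
(0, 10):

import numarray
row = numarray.array(var[n].tolist())  # one record, shape (10,)
row.setshape((1,) + row.getshape())    # prepend the singleton dim: (1, 10)
vardata.append(row)                    # same as vardata.append(var[n:n+1])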
By the way, I've solved the problem in PyTables that appeared when the
object to append is a Numeric object of Char type ('c' typecode). I'm
attaching my new version for your inspection (beware, this will run only
with PyTables 0.9!).
You have surely noticed that my code may convert NetCDF files with
enlargeable dimensions other than the first one. I don't know whether this
is supported in NetCDF or not; I only know that Scientific Python does not
seem to support it.
Finally, the new (attached) version does the copy in buckets of records
instead of just one single record at a time. That improves the conversion
speed considerably for large variables.
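The core of the change is a loop like this one (simplified from the
attached script; PyTables itself suggests the buffer size through the
_v_maxTuples attribute):

nrowsinbuf = vardata._v_maxTuples   # records that fit in one buffer
slices = [slice(0, dim, 1) for dim in var.shape]
for start in range(0, var.shape[extdim], nrowsinbuf):
    stop = min(start + nrowsinbuf, var.shape[extdim])
    # select just a bucket of records along the extensible dimension
    slices[extdim] = slice(start, stop, 1)
    vardata.append(var[tuple(slices)])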
Original code:
$ ./nctoh5.bck -vo /tmp/test.nc /tmp/test-3-old.h5
+=++=++=++=++=++=++=++=++=++=++=++=++=++=++=++=++=++=++=++=+
Starting conversion from /tmp/test.nc to /tmp/test-3-old.h5
Applying filters: None
+=++=++=++=++=++=++=++=++=++=++=++=++=++=++=++=++=++=++=++=+
Number of variables copied: 2
KBytes copied: 7812.969
Time copying: 38.142 s (real) 36.88 s (cpu) 97%
Copied variable/sec: 0.1
Copied KB/s : 204
Bucketed conversion code:
$ ./nctoh5 -vo /tmp/test.nc /tmp/test-3.h5
+=++=++=++=++=++=++=++=++=++=++=++=++=++=++=++=++=++=++=++=+
Starting conversion from /tmp/test.nc to /tmp/test-3.h5
Applying filters: None
+=++=++=++=++=++=++=++=++=++=++=++=++=++=++=++=++=++=++=++=+
Number of variables copied: 2
KBytes copied: 7812.969
Time copying: 5.55 s (real) 5.42 s (cpu) 98%
Copied variable/sec: 0.4
Copied KB/s : 1407
Maybe this is not important in many situations, but well, I think it is
not going to hurt either <wink>
Cheers,
--
Francesc Alted
----------------------------------------------------------
#!/usr/bin/env python
"""
convert netCDF file to HDF5 using Scientific.IO.NetCDF and PyTables.
Jeff Whitaker <jef...@no...>
Added some flags to select filters, as well as some small improvements.
Francesc Altet <fa...@ca...>
This requires Scientific from
http://starship.python.net/~hinsen/ScientificPython
"""
import Scientific.IO.NetCDF as NetCDF
import tables, sys, os.path, getopt, time

def nctoh5(ncfilename, h5filename, filters, overwritefile):
    # open netCDF file
    ncfile = NetCDF.NetCDFFile(ncfilename, mode = "r")
    # open h5 file
    if overwritefile:
        h5file = tables.openFile(h5filename, mode = "w")
    else:
        h5file = tables.openFile(h5filename, mode = "a")
    # loop over variables in netCDF file.
    nobjects = 0; nbytes = 0  # Initialize counters
    for varname in ncfile.variables.keys():
        var = ncfile.variables[varname]
        vardims = list(var.dimensions)
        vardimsizes = [ncfile.dimensions[vardim] for vardim in vardims]
        # Check if any dimension is enlargeable (its size is None)
        extdim = -1; ndim = 0
        for vardim in vardimsizes:
            if vardim is None:
                extdim = ndim
                break
            ndim += 1
        # use long_name for title.
        if hasattr(var, 'long_name'):
            title = var.long_name
        else:  # or, just use some bogus title.
            title = varname + ' array'
        # Create an EArray to keep the NetCDF variable
        if extdim < 0:
            # Make 0 the enlargeable dimension
            extdim = 0
        # Remember the current number of rows before zeroing the shape
        nexpectedrows = var.shape[extdim]
        vardimsizes[extdim] = 0
        dtype = var.typecode()
        if dtype == 'c':
            # Special case for Numeric character objects
            # (which is what Scientific Python is based on)
            atom = tables.StringAtom(shape=tuple(vardimsizes), length=1)
        else:
            atom = tables.Atom(dtype=dtype, shape=tuple(vardimsizes))
        vardata = h5file.createEArray(h5file.root, varname,
                                      atom, title, filters=filters,
                                      expectedrows=nexpectedrows)
        # write data to enlargeable array one chunk of records at a time
        # (so the whole array doesn't have to be kept in memory).
        nrowsinbuf = vardata._v_maxTuples
        # The slices parameter for var.__getitem__()
        slices = [slice(0, dim, 1) for dim in var.shape]
        # range to copy
        start = 0; stop = var.shape[extdim]; step = 1
        # Start the copy itself
        for start2 in range(start, stop, step*nrowsinbuf):
            # Save the records on disk
            stop2 = start2 + step*nrowsinbuf
            if stop2 > stop:
                stop2 = stop
            # Set the proper slice in the extensible dimension
            slices[extdim] = slice(start2, stop2, step)
            vardata.append(var[tuple(slices)])
        # Increment the counters
        nobjects += 1
        nbytes += reduce(lambda x, y: x*y, vardata.shape) * vardata.itemsize
        # set variable attributes.
        for key, val in var.__dict__.iteritems():
            setattr(vardata.attrs, key, val)
        setattr(vardata.attrs, 'dimensions', tuple(vardims))
    # set global (file) attributes.
    for key, val in ncfile.__dict__.iteritems():
        setattr(h5file.root._v_attrs, key, val)
    # Close the file
    h5file.close()
    return (nobjects, nbytes)
usage = """usage: %s [-h] [-v] [-o] [--complevel=(0-9)] [--complib=lib] [--shuffle=(0|1)] [--fletcher32=(0|1)] netcdffilename hdf5filename
-h -- Print usage message.
-v -- Show more information.
-o -- Overwite destination file.
--complevel=(0-9) -- Set a compression level (0 for no compression, which
is the default).
--complib=lib -- Set the compression library to be used during the copy.
lib can be set to "zlib", "lzo" or "ucl". Defaults to "zlib".
--shuffle=(0|1) -- Activate or not the shuffling filter (default is active
if complevel>0).
--fletcher32=(0|1) -- Whether to activate or not the fletcher32 filter (not
active by default).
\n""" % os.path.basename(sys.argv[0])

try:
    opts, pargs = getopt.getopt(sys.argv[1:], 'hvo',
                                ['complevel=',
                                 'complib=',
                                 'shuffle=',
                                 'fletcher32=',
                                 ])
except:
    (type, value, traceback) = sys.exc_info()
    print "Error parsing the options. The error was:", value
    sys.stderr.write(usage)
    sys.exit(0)

# default options
verbose = 0
overwritefile = 0
complevel = None
complib = None
shuffle = None
fletcher32 = None

# Get the options
for option in opts:
    if option[0] == '-h':
        sys.stderr.write(usage)
        sys.exit(0)
    elif option[0] == '-v':
        verbose = 1
    elif option[0] == '-o':
        overwritefile = 1
    elif option[0] == '--complevel':
        complevel = int(option[1])
    elif option[0] == '--complib':
        complib = option[1]
    elif option[0] == '--shuffle':
        shuffle = int(option[1])
    elif option[0] == '--fletcher32':
        fletcher32 = int(option[1])
    else:
        print option[0], ": Unrecognized option"
        sys.stderr.write(usage)
        sys.exit(0)

# if we pass a number of files different from 2, abort
if len(pargs) != 2:
    print "You need to pass both source and destination!"
    sys.stderr.write(usage)
    sys.exit(0)

# Catch the files passed as the last arguments
ncfilename = pargs[0]
h5filename = pargs[1]

# Build the Filters instance
if (complevel, complib, shuffle, fletcher32) == (None,)*4:
    filters = None
else:
    if complevel is None: complevel = 0
    if shuffle is None:
        # shuffle is active by default only when compressing
        if complevel > 0:
            shuffle = 1
        else:
            shuffle = 0
    if complib is None: complib = "zlib"
    if fletcher32 is None: fletcher32 = 0
    filters = tables.Filters(complevel=complevel, complib=complib,
                             shuffle=shuffle, fletcher32=fletcher32)

# Some timing
t1 = time.time()
cpu1 = time.clock()
# Copy the file
if verbose:
    print "+=+"*20
    print "Starting conversion from %s to %s" % (ncfilename, h5filename)
    print "Applying filters:", filters
    print "+=+"*20
# Do the conversion
(nobjects, nbytes) = nctoh5(ncfilename, h5filename, filters, overwritefile)
# Gather some statistics
t2 = time.time()
cpu2 = time.clock()
tcopy = round(t2-t1, 3)
cpucopy = round(cpu2-cpu1, 3)
tpercent = int(round(cpucopy/tcopy, 2)*100)
if verbose:
    print "Number of variables copied:", nobjects
    print "KBytes copied:", round(nbytes/1024., 3)
    print "Time copying: %s s (real) %s s (cpu) %s%%" % \
          (tcopy, cpucopy, tpercent)
    print "Copied variable/sec: ", round(nobjects / float(tcopy), 1)
    print "Copied KB/s :", int(nbytes / (tcopy * 1024))