Concatenation very slow

  • R. Checa-Garcia

    R. Checa-Garcia - 2018-09-06

    Hi,

    First, thanks for developing the NCO library!

    I am currently trying to use it to concatenate several files (about 2000). Each file is a tile/slice on a lat, lon grid with several variables (about 20), so I am concatenating first over longitudes and then over latitudes. I am using ncrcat because the size of each tile is not always the same. So basically I proceed as follows:

    for number in {0..40}; do
        npad=$(printf %02d $number) 
        echo $npad '...processing loop--------------------------------'
        for number2 in {0..48}; do
            n2pad=$(printf %02d $number2) 
            file='slice_'$n2pad'_'$npad'.nc'
            new="${file/.nc/_rec.nc}"
            echo '   Adjusting' $file ' to  '$new 
    
            # Delete unwanted variables and dim3
            ncks  -O -h -x -C -v vars90,vars70,dim3 $file $new  
    
            # Ensure that we reorder to have aggregating dimension first
            ncpdq -O -h -a lon,lat           $new $new
    
            # Aggregating dimension is defined as unlimited. 
            ncks  -O -h --mk_rec_dmn lon     $new $new
    
        done
        # Now we should have a list of *_rec.nc files to aggregate on lon dim.
        # and aggregate all these files on longitude
    
        ncrcat -O -h *_rec.nc lon_added.nc 
    
        echo 'ok aggregation lon'  #this works perfectly!!
    
        # now we revert the lon to typical non-record dimension
        ncks  --fix_rec_dmn lon lon_added.nc lon_added_fix.nc
    
        # now we reorder the variables to be lat,lon
        ncpdq -O -h -a lat,lon lon_added_fix.nc lon_added_$npad.nc
    
        # and finally we define lat as record/unlimited dimension
        ncks --mk_rec_dmn lat 'lon_added_'$npad'.nc' 'lon_added_'$npad'_rec_lat.nc'
    
        rm lon_added.nc lon_added_fix.nc
        rm slice_*_rec.nc  # we clean all the files before next loop step.
    done
    # all the files named 'lon_added_'$npad'_rec_lat.nc' can be aggregated on lat,
    # and no other files match the glob '*_rec_lat.nc', so
    
    ncrcat *_rec_lat.nc  latlon_added_temp.nc 
    
    # now we revert lat to a typical non-record dimension
    ncks --fix_rec_dmn lat latlon_added_temp.nc latlon_added_fix.nc  
    
    # finally we ensure that file is lat,lon order
    ncpdq -O -h -a lat,lon latlon_added_fix.nc latlon_added.nc 
    

    The initial ncinfo output for each slice file looks like:

    ncinfo HWSD_VARIABLES_slice_32_02.nc
    <type 'netCDF4._netCDF4.Dataset'>
    root group (NETCDF4 data model, file format HDF5):
        dimensions(sizes): lat(440), lon(880), dim3(12)
        variables(dimensions): float64 lat(lat), float64 lon(lon), int64 dim3(dim3), uint16 index_WSD(lat,lon), var1(lat,lon)......
        groups:
    

    After aggregating on lon and reordering dimensions:

    <type 'netCDF4._netCDF4.Dataset'>
    root group (NETCDF4 data model, file format HDF5):
        history: Wed Sep  5 23:03:29 2018: ncks --mk_rec_dmn lat lon_added_31.nc lon_added_31_rec_lat.nc
    Wed Sep  5 23:00:35 2018: ncks --fix_rec_dmn lon lon_added.nc lon_added_fix.nc
        NCO: "4.6.3"
        dimensions(sizes): lat(440), lon(43151)
        variables(dimensions): uint16 index_WSD(lat,lon), var1(lat,lon)......
        groups: 
    

    Each file is large (about 4 GB), but I simply tried to aggregate two of them, with dimensions
    dimensions(sizes): lat(440), lon(43151)
    dimensions(sizes): lat(441), lon(43151)
    by changing the first loop to:

    for number in {0..1}; do
    

    I waited more than 4 hours and it never finished. I also tried outside of any loop, with the same results. I tried compressing two files that I know are mostly zeros (shrinking them from about 4 GB to about 400 MB), and they also seem impossible to concatenate.

    Maybe I am doing something wrong, or is it simply a very slow process?

    Thanks in advance,
    Ramiro.

     

    • Charlie Zender

      Charlie Zender - 2018-09-06

      Your question is too intricate for me to follow all the information. However, a few general points about large files are in order:

      1. Read the manual about the --no_tmp_fl option and use it if warranted.
      2. netCDF4 chunking is a two-edged sword. You might try converting to netCDF3 first and then concatenating.
      3. It looks from your script like you have an advanced understanding of NCO. Feel free to post a narrower question, realizing there may be no better answer than the two suggestions I just made.
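
      For example, a minimal sketch of those two suggestions (the file names here are placeholders, not taken from the script above):

      # Convert each netCDF4/HDF5 file to netCDF3 classic before concatenating
      ncks -O -3 slice.nc slice_nc3.nc

      # Concatenate without writing an intermediate temporary output file
      ncrcat -O --no_tmp_fl slice_*_nc3.nc out.nc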
      cz

       
  • R. Checa-Garcia

    R. Checa-Garcia - 2018-09-07

    Thanks, Charlie, for the advice.

    Sorry for the long question I submitted. The narrower question would be whether there is any option that could potentially accelerate a concatenation of large netCDF files over two record dimensions (first one, then the other). I understand from your reply that netCDF3 might be faster, which is very useful information for me. I will also read the manual regarding --no_tmp_fl.

    Thanks,

    P.S. I don't know if I can safely concatenate netCDF files that are also compressed.

     
    • Charlie Zender

      Charlie Zender - 2018-09-07

      It's safe to proceed. But the compression may be slowing things down greatly. Converting to netCDF3 (and/or decompressing with ncks -L 0) might speed things up considerably.
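
      For instance, a minimal sketch (in.nc and out.nc are placeholder names):

      # Rewrite with deflate level 0, i.e., uncompressed netCDF4
      ncks -O -L 0 in.nc out.nc

      # Or convert to netCDF3 classic, which is always uncompressed
      ncks -O -3 in.nc out.nc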

       
