nco-4.4.1 segfaults on netCDF-4 files, both OS X SL and RHEL5

Developers
2014-01-31
2014-02-05
  • I installed nco-4.4.1 from the release tar.gz on OS X Snow Leopard (using a locally modified Portfile with the MacPorts libraries) to get around the 256-open-files problem, but the new version segfaults on netCDF-4 files. A version compiled from yesterday's snapshot on RHEL5 has the same problem:

    $ echo 20020131*day*
    20020131153322-NODC-L3C_GHRSST-SSTskin-AVHRR_Pathfinder-PFV5.2_NOAA16_G_\
    2002031_day-v02.0-fv01.0.nc
    $ ncdump -k 20020131*day*
    netCDF-4
    $ ncks -D 5 20020131*day*
    ncks: INFO nco_fl_open() reports nc__open() will request file buffer of default size
    ncks: INFO nco_fl_open() reports nc__open() opened file with buffer size = 0 bytes
    ncks: INFO Extended filetype of 20020131153322-NODC-L3C_GHRSST-SSTskin-AVHRR_\
    Pathfinder-PFV5.2_NOAA16_G_2002031_day-v02.0-fv01.0.nc is NC_FORMAT_UNDEFINED, mode = 0
    Segmentation fault
    

    On OS X (after deactivating 4.4.1 and activating 4.4.0) ncks from 4.4.0 still works for netCDF-4 files.
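As a cross-check that depends on neither ncdump nor ncks, the on-disk signature of the file can be inspected directly. The sketch below is only an illustration and not part of NCO: the function name nc_kind is made up, and it looks only at the first 8 bytes, so it cannot tell a netCDF-4 file from any other HDF5 file and would miss an HDF5 file that starts with a user block.

```shell
# nc_kind FILE: guess the container format from the leading magic bytes.
# Hypothetical helper; classic netCDF files start with "CDF" plus a
# version byte, HDF5-based files start with the 8-byte HDF5 signature.
nc_kind() (
    sig=$(od -An -tx1 -N8 "$1" | tr -d '[:space:]')
    case $sig in
        894844460d0a1a0a) echo "HDF5 (netCDF-4 family)" ;;
        43444601*)        echo "netCDF classic" ;;
        43444602*)        echo "netCDF 64-bit offset" ;;
        43444605*)        echo "netCDF CDF-5" ;;
        *)                echo "unknown" ;;
    esac
)
```

For a file that ncdump -k reports as netCDF-4, the HDF5 line would be the expected answer; anything else would suggest the file itself, rather than ncks, is the problem.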

     
  • Charlie Zender
    2014-01-31

    Unable to reproduce on Linux. Please post the input file. Thanks, cz

     
  • The file is from the GHRSST Pathfinder v 5.2 data set:

    http://data.nodc.noaa.gov/pathfinder/Version5.2/2002/20020131153322-NODC-L3C_GHRSST-SSTskin-AVHRR_Pathfinder-PFV5.2_NOAA16_G_2002031_day-v02.0-fv01.0.nc

    Both my OS X and RHEL5 compiles of 4.4.1 used the same configure options and libraries that worked for 4.4.0.

     
    • Pedro Vicente
      2014-02-01

      This was fixed.

      To get the current NCO snapshot, please do

      cvs -z3 -d:pserver:anonymous@nco.cvs.sf.net:/cvsroot/nco co -kk nco

      Pedro

       
      • Thanks. I tried building the current (9 PM AST) CVS snapshot on Ubuntu 13.10. ncks got a bit further, but still ended with a segfault:

        $ NETCDF4_ROOT=/usr ./configure --prefix=/opt/nco.sf.net \
        --enable-netcdf4
        [...]
        Configuration Parameters:
        AR_FLAGS............. cru
        CC................... gcc -std=gnu99
        CFLAGS............... -g -O2 -std=c99 -D_BSD_SOURCE -D_POSIX_SOURCE
        CPP.................. gcc -E
        CPPFLAGS............. -I/usr/include -DgFortran  -I/usr/include -I/usr/include
        CXX.................. g++
        CXXFLAGS............. -g -O2
        OPENMP_CFLAGS.........-fopenmp
        ENABLE_DAP_NETCDF.... yes
        ENABLE_DAP........... yes
        ENABLE_GSL........... yes
        HAVE_NETCDF4_H....... yes
        ENABLE_NETCDF4....... yes
        NETCDF4_ROOT......... /usr
        ENABLE_UDUNITS....... no
        ENABLE_UDUNITS2...... yes
        GSL_ROOT............. /usr
        HAVE_ANTLR........... runantlr
        HAVE_MAKEINFO.........no
        HOST................. 
        host................. x86_64-unknown-linux-gnu
        HOSTNAME............. UbuntuVM
        LDFLAGS.............. -L/usr/lib -lnetcdf  -L/usr/lib/x86_64-linux-gnu
        LIBS................. -ludunits2 -lexpat -lgsl -lm -lnetcdf -lnetcdf -lnetcdf  -lcurl -L/usr/lib -lgsl -lgslcblas -lm -ludunits2
        install prefix ...... /opt/nco.sf.net
        

        I didn't have the same test file available at home, but used another from the same GHRSST version:

        gwhite@UbuntuVM:~/Documents/nco/src/nco$ libtool --mode=execute gdb ./ncks
        GNU gdb (GDB) 7.6.1-ubuntu
        Copyright (C) 2013 Free Software Foundation, Inc.
        License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
        This is free software: you are free to change and redistribute it.
        There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
        and "show warranty" for details.
        This GDB was configured as "x86_64-linux-gnu".
        For bug reporting instructions, please see:
        <http://www.gnu.org/software/gdb/bugs/>...
        Reading symbols from /Data/Users/gwhite/Documents/nco/src/nco/.libs/lt-ncks...done.
        (gdb) run -D 5 /export/Data/GHRSST/PFV52/2004/20040628113136-NODC-L3C_GHRSST-SSTskin-AVHRR_Pathfinder-PFV5.2_NOAA17_G_2004180_day-v02.0-fv02.0.nc
        Starting program: /home/gwhite/Documents/nco/src/nco/.libs/lt-ncks -D 5 /export/Data/GHRSST/PFV52/2004/20040628113136-NODC-L3C_GHRSST-SSTskin-AVHRR_Pathfinder-PFV5.2_NOAA17_G_2004180_day-v02.0-fv02.0.nc
        [Thread debugging using libthread_db enabled]
        Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
        ncks: INFO nco_fl_open() reports nc__open() will request file buffer of default size
        ncks: INFO nco_fl_open() reports nc__open() opened file with buffer size = 0 bytes
        ncks: INFO Extended filetype of /export/Data/GHRSST/PFV52/2004/20040628113136-NODC-L3C_GHRSST-SSTskin-AVHRR_Pathfinder-PFV5.2_NOAA17_G_2004180_day-v02.0-fv02.0.nc is NC_FORMAT_UNDEFINED, mode = 0
        ncks: CONVENTION File "Conventions" attribute is "CF-1.5"
        Summary of /export/Data/GHRSST/PFV52/2004/20040628113136-NODC-L3C_GHRSST-SSTskin-AVHRR_Pathfinder-PFV5.2_NOAA17_G_2004180_day-v02.0-fv02.0.nc: filetype = NC_FORMAT_NETCDF4 (representation of extended/underlying filetype NC_FORMAT_UNDEFINED), 0 groups (max. depth = 0), 3 dimensions (2 fixed, 1 record), 14 variables (14 atomic-type, 0 non-atomic), 184 attributes (58 global, 0 group, 126 variable)
        

        So the filetype is now reported as netCDF-4, but continuing on:

        Root record dimension 0: name = time, size = 1
        
        Global attributes:
        
        Program received signal SIGSEGV, Segmentation fault.
        0x00007ffff75dbb7d in malloc_consolidate (
            av=av@entry=0x7ffff791d740 <main_arena>) at malloc.c:4102
        4102    malloc.c: No such file or directory.
        (gdb) bt
        #0  0x00007ffff75dbb7d in malloc_consolidate (
            av=av@entry=0x7ffff791d740 <main_arena>) at malloc.c:4102
        #1  0x00007ffff75dd0e1 in _int_malloc (av=0x7ffff791d740 <main_arena>, 
            bytes=2320) at malloc.c:3379
        #2  0x00007ffff75df4d0 in __GI___libc_malloc (bytes=2320) at malloc.c:2859
        #3  0x00007ffff7b87d58 in nco_malloc (sz=2320) at nco_mmr.c:108
        #4  0x00007ffff7b98354 in nco_prn_att (grp_id=65536, 
            prn_flg=prn_flg@entry=0x7fffffff6c10, var_id=var_id@entry=-1)
            at nco_prn.c:88
        #5  0x00007ffff7b6afb2 in nco_prn_att_trv (nc_id=65536, 
            prn_flg=prn_flg@entry=0x7fffffff6c10, trv_tbl=0x609190)
            at nco_grp_utl.c:456
        #6  0x0000000000404a25 in main (argc=<optimized out>, argv=<optimized out>)
            at ncks.c:909
        (gdb)
        
         
  • Pedro Vicente
    2014-02-02

    George, can you post a link to the new file or similar?

    With the previous GHRSST Pathfinder v 5.2 data set file the error does not happen anymore.

    Pedro

     
    • The link for the new file is:

      http://data.nodc.noaa.gov/pathfinder/Version5.2/2004/20040628113136-NODC-L3C_GHRSST-SSTskin-AVHRR_Pathfinder-PFV5.2_NOAA17_G_2004180_day-v02.0-fv02.0.nc

      I was able to do some tests with the old file and I get the same errors with either file on Ubuntu 13.10 and 12.04.

      I was also able to build nco (CVS of Feb. 2nd) on Mavericks using Apple's clang compilers. On Mavericks, ncks -D 5 20020131153322-NODC-L3C_GHRSST-SSTskin-AVHRR_Pathfinder-PFV5.2_NOAA16_G_2002031_day-v02.0-fv01.0.nc processed the complete file without errors, but nces failed with "malloc: *** error for object 0xfffffffb: pointer not allocated" and exit code 134 after moving the temporary output file.

      I just tried building the same Feb. 2nd CVS sources on Snow Leopard, this time using MacPorts' clang-3.3 rather than Apple's gcc, on a system where netcdf was built with gcc-4.7. Simple tests of both ncks and nces worked:

      ncks -D 5  20020131153322-NODC-L3C_GHRSST-SSTskin-AVHRR_Pathfinder-PFV5.2_NOAA16_G_2002031_day-v02.0-fv01.0.nc
      for i in $(gseq -w 0 999) ; do ln 20020131153322-NODC-L3C_GHRSST-SSTskin-AVHRR_Pathfinder-PFV5.2_NOAA16_G_2002031_day-v02.0-fv01.0.nc foo${i}.nc ; done
      nces -D 5 foo00[12].nc bar002.nc
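The link-farm loop above relies on gseq, the g-prefixed GNU seq that MacPorts installs with coreutils. A minimal sketch of the same idea without that dependency follows; the function name make_links and its argument order are made up for illustration, and hard links require the source and the links to live on the same filesystem.

```shell
# make_links SRC COUNT PREFIX: create COUNT hard links to SRC named
# PREFIX<n>.nc, with <n> zero-padded to a common width (like seq -w).
# Hypothetical helper, runs in a subshell so its variables do not leak.
make_links() (
    src=$1 count=$2 prefix=$3
    last=$((count - 1))
    fmt="%s%0${#last}d.nc"      # pad to the width of the largest index
    i=0
    while [ "$i" -lt "$count" ]; do
        ln "$src" "$(printf "$fmt" "$prefix" "$i")" || exit 1
        i=$((i + 1))
    done
)
```

Called as make_links FILE 1000 foo, this would produce foo000.nc through foo999.nc, the same names the gseq loop generates.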
      

      On an SL system with netcdf built using gcc-4.8, however, I get the malloc error:

      $ nces -D 5 foo00* bar010.nc
      nces: DEBUG Demoting variable n1_map from type NC_DOUBLE to type NC_INT
      nces: DEBUG Demoting variable n1_st from type NC_DOUBLE to type NC_INT
      nces: INFO Moving bar010.nc.pid93142.nces.tmp to bar010.nc...done
      nces(93142) malloc: *** error for object 0x2: pointer being freed was not allocated
      *** set a breakpoint in malloc_error_break to debug
      Abort trap
      $ echo $?
      134
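The exit status 134 is consistent with the "Abort trap" message: by shell convention a status above 128 means the process was killed by signal (status - 128), and 134 - 128 = 6 is SIGABRT. A small sketch for decoding such statuses is below; describe_status is a made-up name, and the interpretation is only a convention, since a program could also call exit(134) itself.

```shell
# describe_status N: interpret a shell exit status.
# Assumes the usual convention that 128+S means "killed by signal S".
describe_status() (
    status=$1
    if [ "$status" -gt 128 ]; then
        sig=$((status - 128))
        # kill -l NUMBER prints the signal name without the SIG prefix
        name=$(kill -l "$sig" 2>/dev/null) || name=unknown
        echo "killed by signal $sig ($name)"
    else
        echo "exited with status $status"
    fi
)
```

For example, describe_status $? right after the failing nces run would report signal 6, while a segfault like the one from ncks would normally show up as status 139 (signal 11).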
      

      With more testing it appears that gsl as well as hdf5 and netcdf must all be compiled with the same compiler (e.g., gcc47 or gcc48) to avoid malloc errors on Snow Leopard. A build using Apple llvm-gcc-4.2 and llvm-g++-42 passes my tests, and "make test" gives only 3 failures:

      Darwin ambrosia.mar.dfo-mpo.ca 10.8.0 Darwin Kernel Version 10.8.0: Tue Jun  7 16:32:41 PDT 2011; root:xnu-1504.15.3~1/RELEASE_X86_64 x86_64; i686-apple-darwin10-llvm-gcc-4.2 (GCC) 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.6);
      
                          Test Results                Seconds to complete
                   --------------------------   ----------------------------------------
            Test   Success    Failure   Total   WallClock    Real   User  System    Diff
           ncap2:       11                 11       0.31     0.00   0.00    0.00    0.00
         ncatted:        4          1       5       0.19     0.00   0.00    0.00    0.00
            ncbo:       15                 15       0.92     0.00   0.00    0.00    0.00
         ncflint:        5                  5       0.29     0.00   0.00    0.00    0.00
            nces:        9          1      10       0.28     0.00   0.00    0.00    0.00
          ncecat:        3                  3       0.17     0.00   0.00    0.00    0.00
            ncks:       18                 18       0.83     0.00   0.00    0.00    0.00
           ncpdq:       35                 35       1.12     0.00   0.00    0.00    0.00
            ncra:       22          1      23       0.68     0.00   0.00    0.00    0.00
          ncrcat:       23                 23       2.13    10.00   0.00    0.00   10.00
            ncwa:       43                 43       1.52     0.00   0.00    0.00    0.00
      

      These three "failures" appear harmless:

      ncatted test 04: Pad header with 1000B extra for future metadata (failure OK/expected since test ... FAILED!
         ERR: FAILURE in ncatted failure: Pad header with 1000B extra for future metadata (failure OK/expected since test depends on command-line length)
         ERR::EXPLAIN: Result: [] != Expected: [27]
      
      nces    test 08: Check op with OpenMP............................................................... !!FAILED
        $cmd_rsl_is_num = 1 and $xpc_is_num = 0
         ERR: FAILURE in nces failure: Check op with OpenMP
         ERR::EXPLAIN: Result: [] != Expected: [n2 = 1]
      
      ncra    test 21: Check op with OpenMP............................................................... !!FAILED
        $cmd_rsl_is_num = 1 and $xpc_is_num = 0
         ERR: FAILURE in ncra failure: Check op with OpenMP
         ERR::EXPLAIN: Result: [] != Expected: [n2 = 1]
      
       
      Last edit: George N. White III 2014-02-04
      • Pedro Vicente
        2014-02-03

        George, can you try the same

        nces -D 5 foo00* bar010.nc

        but without the -D 5, please?

        just

        nces foo00* bar010.nc

        Pedro

         
  • Pedro Vicente
    2014-02-03

    George, I fixed a bug related to the use of -D 5 and committed the new version to CVS.

    Can you give it a try?

    cvs -z3 -d:pserver:anonymous@nco.cvs.sf.net:/cvsroot/nco co -kk nco

    thanks

    Pedro

     
  • Pedro Vicente
    2014-02-03

    George, the error I found was only related to the nces run, not ncks.

    Let us know if you are still having problems with ncks.

    Pedro

     
  • Pedro Vicente
    2014-02-04

    > nces gave a "malloc: *** error for object 0xfffffffb: pointer not allocated" with exit code 134 after moving the temporary output file.

    I committed another version to CVS that fixes this.

    Pedro

     
  • Pedro Vicente
    2014-02-04

    Hi George

    Are you still having problems with the current CVS snapshot?

    The NCO distribution includes some netCDF files used for testing:

    /data/in.nc, a netCDF3 file
    /data/in_grp.nc, a netCDF4 file with groups

    In order for us to reproduce any bug, it would be helpful if you sent a simple command that exposes a problem with any of these files.

    Pedro
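One way to turn that request into a quick check is to run the same command over each bundled test file and record the exit status. The sketch below is illustrative only: run_each is a made-up helper, and the paths data/in.nc and data/in_grp.nc in the usage note assume the command is run from the top of the NCO source tree.

```shell
# run_each CMD FILE...: run CMD once per FILE, discard its output, and
# print one "FILE: exit N" line per file. Hypothetical helper.
run_each() (
    cmd=$1
    shift
    for f in "$@"; do
        # $cmd is left unquoted on purpose so "ncks -D 5" splits into words
        $cmd "$f" >/dev/null 2>&1
        echo "$f: exit $?"
    done
)
```

A call such as run_each "ncks -D 5" data/in.nc data/in_grp.nc would then show at a glance whether either file triggers a non-zero status, which is the kind of one-line reproducer asked for above.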

     
  • I'm still running tests with real data on platforms other than OS X, but we were able to complete the calculations that used many hundreds of files. Thanks for your efforts.

     
  • Charlie Zender
    2014-02-05

    Thank you, George. The combination of LLVM/clang and Mac OS X caught some problems that we had not tested for, and that were false negatives with our default GCC compiler options. We think we've addressed all of those now. We'll release 4.4.2 by the weekend unless new problems crop up. cz