nco_fl_utl.c: check that local system has read permissions can fail for reasons not listed; possible race condition

Developers
2014-01-29
2014-10-03
  • To see the problem on OS X Snow Leopard using macports nco-4.4.0:

    First, copy data/foo.nc (from an older source install) to /tmp/fopen_limit, then link to
    foo000.nc, ..., foo999.nc.

        $ cd /tmp
        $ cp /opt/NCO/src/nco-4.3.6/data/foo.nc .
        $ for i in $(gseq -w 0 999) do ln foo.nc foo${i}.nc ; done
    

    Run nces for varying numbers of files:

        $ nces -y avg foo00?.nc bar10.nc
        $ nces -y avg foo0??.nc bar100.nc
        $ nces -y avg foo[0-2]??.nc bar300.nc
        nces: ERROR User does not have read permission for foo252.nc, or file does not exist
    

    In this case, nces is telling fibs. The file does exist, and the permissions are fine, but
    fopen() limits the number of open files. I assume fl_nm_lcl gets opened again by the NetCDF library and must remain open for nces processing. [OS X Snow Leopard fopen() gets maxfiles=256 by
    default]
    .

    The error comes from nco_fl_utl.c:

        /* File is (now, anyway) truly local---does local system have read permission? */
        if((fp_in=fopen(fl_nm_lcl,"r")) == NULL){
          (void)fprintf(stderr,"%s: ERROR User does not have read permission for %s, or file does not exist\n",nco_prg_nm_get(),fl_nm_lcl);
          nco_exit(EXIT_FAILURE);
        }else{
          (void)fclose(fp_in);
        } /* end else */
    

    It would be helpful see the perror() message. Doesn't the NetCDF library perform additional checks (e.g., is the file a NetCDF)? In the past I have seen problems in programs that close files they intend to use again. Back when a 9GB SCSI disk cost $2500, we had collections of compressed files. Users had to decompress the files they were using, and then compress them to make space for the next task. On unix, once you open the file it remains "readable" even if another user compresses it, but you could have fopen() succeed and the file disappear before it is opened again for NetCDF.

     
    Last edit: George N. White III 2014-01-30
  • Charlie Zender
    Charlie Zender
    2014-01-29

    Thank you for reporting this important bug. The problem is that the close()
    statement for ncea/ncra was recently (since no earlier than July) inadvertently moved out of the loop where it belongs so files are not being closed and that triggers the bug you encountered. Will post again when fix is ready. This only affects NCO versions 4.3.9-4.4.0.
    cz

     
  • Charlie Zender
    Charlie Zender
    2014-01-29

    I have committed a fix. I you have the ability and inclination please build the latest CVS snapshot and see if it fixes the problem you reported and let us know. This is an important bug and we will release version 4.4.1 as soon as we know the fix works.
    Thanks again,
    c

     
    • Thanks for the quick response.

      I just downloaded the snapshot tar archive and built the test version. My tests now run with 1000 files (and the results appear to be correct!). Thanks again.

      Now testing the snapshot version on RHEL 5 (using devtoolset-2 gcc 4.8.1 compilers) and jobs that were working with are faulting:

      $ nces -D 5 -y avg foo00[1-2].nc nashi002.nc
      [...]
      nces: INFO Moving nashi002.nc.pid9187.nces.tmp to nashi002.nc...done
      Segmentation fault
      $ echo $?
      139
      

      The output file nashi002.nc seems fine. I tried running the same
      task under gdb and there was no error, but ptrace gives some info:

      $ strace nces -D 5 -y avg foo001.nc foo002.nc nashi002.nc
      [...]
      write(2, "nces: INFO Moving nashi002.nc.pi"..., 64nces: INFO Moving nashi002.nc.pid9206.nces.tmp to nashi002.nc...) = 64
      rt_sigaction(SIGINT, {SIG_IGN, [], SA_RESTORER, 0x3dd88302d0}, {SIG_DFL, [], 0}, 8) = 0
      rt_sigaction(SIGQUIT, {SIG_IGN, [], SA_RESTORER, 0x3dd88302d0}, {SIG_DFL, [], 0}, 8) = 0
      rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
      clone(child_stack=0, flags=CLONE_PARENT_SETTID|SIGCHLD, parent_tidptr=0x7fff12677d48) = 9208
      wait4(9208, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 9208
      rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER, 0x3dd88302d0}, NULL, 8) = 0
      rt_sigaction(SIGQUIT, {SIG_DFL, [], SA_RESTORER, 0x3dd88302d0}, NULL, 8) = 0
      rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
      --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=9208, si_status=0, si_utime=1, si_stime=2} ---
      write(2, "done\n", 5done
      )                   = 5
      --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x3cfffffff9} ---
      +++ killed by SIGSEGV +++
      Segmentation fault
      

      and

      $ valgrind nces -D 5 -y avg foo001.nc foo002.nc nashi002.nc
      [...]
      nces: INFO Moving nashi002.nc.pid10042.nces.tmp to nashi002.nc...done
      ==10042== Conditional jump or move depends on uninitialised value(s)
      ==10042==    at 0x40B55F: main (in /usr/local/bin/ncra)
      ==10042==
      ==10042== Conditional jump or move depends on uninitialised value(s)
      ==10042==    at 0x44C532: nco_free (in /usr/local/bin/ncra)
      ==10042==    by 0x40B56F: main (in /usr/local/bin/ncra)
      ==10042==
      ==10042== Conditional jump or move depends on uninitialised value(s)
      ==10042==    at 0x4A0739D: free (in /opt/rh/devtoolset-2/root/usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
      ==10042==    by 0x44C53F: nco_free (in /usr/local/bin/ncra)
      ==10042==    by 0x40B56F: main (in /usr/local/bin/ncra)
      ==10042==
      ==10042== Invalid free() / delete / delete[] / realloc()
      ==10042==    at 0x4A073EA: free (in /opt/rh/devtoolset-2/root/usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
      ==10042==    by 0x44C53F: nco_free (in /usr/local/bin/ncra)
      ==10042==    by 0x40B56F: main (in /usr/local/bin/ncra)
      ==10042==  Address 0x3d00000001 is not stack'd, malloc'd or (recently) free'd
      ==10042==
      nces: TIMER Wallclock-elapsed time for command is    1.52 s
      ==10042==
      ==10042== HEAP SUMMARY:
      ==10042==     in use at exit: 107,000 bytes in 1,912 blocks
      ==10042==   total heap usage: 28,068 allocs, 26,157 frees, 3,011,808 bytes allocated
      ==10042==
      ==10042== LEAK SUMMARY:
      ==10042==    definitely lost: 59,306 bytes in 720 blocks
      ==10042==    indirectly lost: 31,306 bytes in 1,189 blocks
      ==10042==      possibly lost: 0 bytes in 0 blocks
      ==10042==    still reachable: 16,388 bytes in 3 blocks
      ==10042==         suppressed: 0 bytes in 0 blocks
      ==10042== Rerun with --leak-check=full to see details of leaked memory
      ==10042==
      ==10042== For counts of detected and suppressed errors, rerun with: -v
      ==10042== Use --track-origins=yes to see where uninitialised values come from
      ==10042== ERROR SUMMARY: 4 errors from 4 contexts (suppressed: 7 from 7)
      

      I also ran make test and got many failures:

      Linux nashi.bio.dfo.ca 2.6.18-371.3.1.el5 #1 SMP Mon Nov 11 03:23:58 EST 2013 x86_64 x86_64 x86_64 GNU/Linux; gcc (GCC) 4.8.1 20130715 (Red Hat 4.8.1-4);
      
                          Test Results                Seconds to complete
                   --------------------------   ----------------------------------------
            Test   Success    Failure   Total   WallClock    Real   User  System    Diff
           ncap2:        0         11      11       0.00     0.00   0.00    0.00    0.00
         ncatted:        0         11      11       0.00     0.00   0.00    0.00    0.00
            ncbo:        1         23      24       0.03     0.00   0.00    0.00    0.00
         ncflint:        0          8       8       0.00     0.00   0.00    0.00    0.00
            nces:        0         17      17       0.02     0.00   0.00    0.00    0.00
          ncecat:        0         12      12       0.00     0.00   0.00    0.00    0.00
            ncks:       32         46      78       0.77     0.00   0.00    0.00    0.00
           ncpdq:        0         48      48       0.00     0.00   0.00    0.00    0.00
            ncra:        0         31      31       0.01     0.00   0.00    0.00    0.00
          ncrcat:        0         27      27       0.04     0.00   0.00    0.00    0.00
        ncrename:        0         22      22       0.00     0.00   0.00    0.00    0.00
            ncwa:        0         57      57       0.05     0.00   0.00    0.00    0.00
      

      The previous build (as well as the new Snow Leopard build (using Apple gcc-4.2) had few failures:

      This is 4.4.0 built earlier today (with the unclosed files bug):
      Linux nashi.bio.dfo.ca 2.6.18-371.3.1.el5 #1 SMP Mon Nov 11 03:23:58 EST 2013 x86_64 x86_64 x86_64 GNU/Linux; gcc (GCC) 4.8.1 20130715 (Red Hat 4.8.1-4);

      Test Results Seconds to complete
      -------------------------- ----------------------------------------
      Test Success Failure Total WallClock Real User System Diff
      ncap2: 11 11 0.29 0.00 0.00 0.00 0.00
      ncatted: 11 11 0.58 0.00 0.00 0.00 0.00
      ncbo: 24 24 1.17 0.00 0.00 0.00 0.00
      ncflint: 8 8 0.85 0.00 0.00 0.00 0.00
      nces: 16 1 17 0.58 0.00 0.00 0.00 0.00
      ncecat: 12 12 0.66 0.00 0.00 0.00 0.00
      ncks: 78 78 2.74 0.00 0.00 0.00 0.00
      ncpdq: 48 48 1.74 0.00 0.00 0.00 0.00
      ncra: 30 1 31 1.13 0.00 0.00 0.00 0.00
      ncrcat: 27 27 2.00 10.00 0.00 0.00 10.00
      ncrename: 19 3 22 0.92 0.00 0.00 0.00 0.00
      ncwa: 57 57 2.12 1.00 4.00 0.00 -3.00

      Recompiling pre-4.4.1 with the unclosed files patch with "-Og"
      gives the same failures. My nces tests also work, so this looks like a "-O2" bug in gcc-4.8.1.

       
      Last edit: George N. White III 2014-01-30
  • Simon
    Simon
    2014-10-03

    I have a similar error to the original post.
    nco 4.4.1
    nces 4.4.0
    Mas OSX Darwin Kernel Version 13.4.0

    nces says I don't have permission to read 3 files, but I clearly do. This is not a problem for hundreds of other files. I use the nco-recommended way of passing netcdf files to the averager nces in bash because otherwise the argument list is too long.

    $ /bin/ls | grep ei.oper.an.pl.regn128sc.'........'00.nc | xargs nces --fortran --variable T_GDS4_ISBL --dimension lv_ISBL0,6,6 --dimension g4_lat_1,105,152 --overwrite --output erai_T850_00.nc
    nces: ERROR User does not have read permission for ei.oper.an.pl.regn128sc.1979091000.nc, or file does not exist
    nces: ERROR User does not have read permission for ei.oper.an.pl.regn128sc.1991070900.nc, or file does not exist
    nces: ERROR User does not have read permission for ei.oper.an.pl.regn128sc.2003050700.nc, or file does not exist
    $ ls -l ei.oper.an.pl.regn128sc.1979091000.nc
    -rw-r--r--  1 user  staff  18880068 Sep  3  2013 ei.oper.an.pl.regn128sc.1979091000.nc
    $ ls -l ei.oper.an.pl.regn128sc.1991070900.nc
    -rw-r--r--  1 user  staff  18880068 Sep  3  2013 ei.oper.an.pl.regn128sc.1991070900.nc
    $ ls -l ei.oper.an.pl.regn128sc.2003050700.deszoeke43074.nc
    -rw-r--r--  1 user  staff  18880068 Sep  3  2013 ei.oper.an.pl.regn128sc.2003050700.deszoeke43074.n
    

    I suspect this is an error or inability of nces, and not a permission problem with the files. When I rename these three files to skip them, the next three consecutive files exhibit the error:

    $ /bin/ls | grep ei.oper.an.pl.regn128sc.'........'00.deszoeke43074.nc | xargs nces --fortran --variable T_GDS4_ISBL --dimension lv_ISBL0,6,6 --dimension g4_lat_1,105,152 --overwrite --output erai_T850_00.nc
    nces: ERROR User does not have read permission for ei.oper.an.pl.regn128sc.1979091100.nc, or file does not exist
    nces: ERROR User does not have read permission for ei.oper.an.pl.regn128sc.1991071100.nc, or file does not exist
    nces: ERROR User does not have read permission for ei.oper.an.pl.regn128sc.2003051000.nc, or file does not exist
    

    The number of files between errors is quasi-regular:
    1979091100 is file number 253
    1991070900 is file number 4573, difference 4320
    2003051000 is file number 8893, difference 4320

    I have updated to nco version 4.4.5 and nces is running...no error yet.

     
  • Charlie Zender
    Charlie Zender
    2014-10-03

    You seem to have triggered the same problem as above because you were running 4.4.0 and the problem is not fixed until the 4.4.1 release. If you encoutner any problems with 4.4.5 or 4.4.6 please let us know.
    cz