Unknown but reproducible bug in ncks -A

Developers
2004-01-20
2013-10-17
  • Charlie Zender

    Charlie Zender - 2004-01-20

    Hi All,

    I've finally isolated a reproducible symptom of a recent NCO bug.
    The following four commands (tracked as TODO #306)

    cd ~/nco/data;ncgen -b -o in.nc in.cdl
    ncwa -O -a time -v u,v in.nc foo.nc # Compute time mean of u,v
    ncrename -O -v u,uavg -v v,vavg foo.nc # Rename to avoid conflict
    ncks -A -C -v u,v in.nc foo.nc # Place originals with time means

    work fine when NCO is built with bld/Makefile but the last command
    generates this error when NCO is built with autotools:

    ncks: ERROR attempt to write 1 dimensional input variable u to 0
    dimensional space in output file.

    Can anyone reproduce this problem? Any idea what's going on?

    I checked a few things and found nothing wrong with the code, yet.
    Building with --disable-shared does not fix things with autotools.
    I'm kinda running out of ideas. I hate build-specific problems.

    Thanks,
    Charlie

     
    • Rorik Peterson

      Rorik Peterson - 2004-01-20

      Charlie,
      I did a quick test and could not reproduce the bug.  I'll look into it further tomorrow.  Is this bug repeatable on different OS's?

      rorik

       
      • Charlie Zender

        Charlie Zender - 2004-01-21

        Hi Rorik,

        Thanks for looking at this.

        > Is this bug repeatable on different OS's?

        Good question. Apparently not:

        The problem occurs as described on my Debian Sid i686 laptops.

        The symptoms do not occur (autotools builds work fine) on AIX and SGI,
        and, surprisingly, on my RedHat 9 Linux desktop (with or without DODS).

        I know that I am not the only one having this problem.
        The thread on "Using NCO to calculate variance"

        https://sourceforge.net/forum/forum.php?thread_id=1005783&forum_id=9829

        drew my attention to this problem, hence the demo code.
        In other words, the variance computations I've suggested seem to work
        unless you try them on my laptop (and, I believe, the bug poster's)
        with autotools builds of NCO.

        Hmm.

        Charlie

         
    • Rorik Peterson

      Rorik Peterson - 2004-01-21

      I still cannot reproduce this bug, even on my Debian/sid machine.  Furthermore, it seems to me that the error message

      ncks: ERROR attempt to write 1 dimensional input variable u to 0 dimensional space in output file.

      has to come from one of two places:
      nco_var_utl.c: lines 309-310
      or
      nco_msa.c: lines 497-498

      However, I never reach either of these two places when running the four-command test you gave me.  I think this is because no dimension indices are specified with the -d option.  Perhaps the bug is buried somewhere such that ncks thinks there is a dimension specification when there is not.

      Anyway, there is not much more I think I can do until I can reproduce this error.  Thinking about the autotools-specific nature of this, is there any chance that different netCDF libraries are getting linked in the two cases?  I see that there are two netCDF library calls right before the error message gets triggered.

      rorik

       
      • Charlie Zender

        Charlie Zender - 2004-01-22

        Hi Rorik,

        > I still cannot reproduce this bug, even on my Debian/sid machine.  Furthermore,
        > it seems to me that the error message

        Hmm.

        > ncks: ERROR attempt to write 1 dimensional input variable u to 0 dimensional
        > space in output file.
        >
        > has to come from one of two places:
        > nco_var_utl.c: lines 309-310
        > or
        > nco_msa.c: lines 497-498

        I disagree. I believe it comes from nco_cpy_var_val() in nco_var_utl.c,
        line 208. I did not include the "HINT" part of the error message in the
        original posting. Sorry.

        > Anyway, there is not much more I think I can do until I can reproduce this error.

        Agreed. My gut tells me that this is a real problem and that it has to
        do with NCO's behavior possibly differing when it is dynamically
        linked, not to autotools per se. I've never thoroughly audited the
        code to guarantee it is fully "re-entrant". For instance, it's
        possible there are static variables in the library that may get
        set to weird states when multiple applications use NCO at the same
        time.

        > Thinking about the autotools specific nature of this, is there any chance that
        > different netCDF libraries are getting linked in the two cases?  I see that
        > there are two netCDF library calls right before the error message gets
        > triggered.

        Here is the full error. It occurs when building with --disable-shared
        or --enable-shared. One difference between bld/Makefile and
        --disable-shared is that bld/Makefile is completely statically linked,
        whereas --disable-shared still dynamically links to libm and libc:

        zender@ashes:~/nco$ ldd `which ncks`
                libm.so.6 => /lib/libm.so.6 (0x40025000)
                libc.so.6 => /lib/libc.so.6 (0x40047000)
                /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)
        zender@ashes:~/nco/data$ ncks -r
        NCO netCDF Operators version "2.8.7" built Jan 22 2004 on ashes by zender
        Copyright (C) 1995--2004 Charlie Zender
        ncks version "2.8.7"
        NCO is free software and comes with ABSOLUTELY NO WARRANTY
        NCO is distributed under the terms of the GNU General Public License
        "RIP Ed McMullin (1941--2003): Musician, Singer, Songwriter, Teacher, Father, Husband. Keep on Gig'n. http://dust.ess.uci.edu/ed"
        Linked to netCDF library version 3.5.1-beta10, compiled Nov  7 2003 23:09:02
        Homepage URL: http://nco.sf.net
        User's Guide: http://nco.sf.net/nco.html
        Configuration Option:   Active? Reference:
        Debugging: Custom       No      Pedantic, bounds checking (slowest execution)
        Debugging: Symbols      No      Produce symbols for debuggers (e.g., dbx, gdb)
        DODS/OpenDAP clients    No      http://nco.sf.net/nco.html#DODS
        Internationalization    No      http://nco.sf.net/nco.html#i18n (not ready)
        OpenMP Multi-threading  No      http://nco.sf.net/nco.html#omp (alpha testing)
        Optimization: run-time  Yes     Fastest execution possible (slowest compilation)
        UDUnits conversions     Yes     http://nco.sf.net/nco.html#UDUnits
        Wildcarding (regex)     Yes     http://nco.sf.net/nco.html#rx
        zender@ashes:~/nco$ cd ~/nco/data;ncgen -b -o in.nc in.cdl
        zender@ashes:~/nco/data$ ncwa -O -a time -v u,v in.nc foo.nc # Compute time mean of u,v
        zender@ashes:~/nco/data$ ncrename -O -v u,uavg -v v,vavg foo.nc # Rename to avoid conflict
        zender@ashes:~/nco/data$ ncks -A -C -v u,v in.nc foo.nc # Place originals with time means
        ncks: WARNING Overwriting global attribute Conventions
        ncks: WARNING Overwriting global attribute history
        ncks: WARNING Overwriting global attribute julian_day
        ncks: WARNING Overwriting global attribute RCS_Header
        ncks: ERROR attempt to write 1 dimensional input variable u to 0 dimensional space in output file.
        HINT: When using -A (append) option, all appended variables must be the same rank in the input file as in the output file. ncwa operator is useful at ridding variables of extraneous (size = 1) dimensions. Read the manual to see how.

        Thanks,
        Charlie
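
For readers unfamiliar with the check behind this message: the HINT says that, with -A, an appended variable must have the same rank in the input file as in the output file. A minimal sketch of such a check in C follows. It is an illustration only, not NCO's actual code; the function name check_append_rank and its parameters are hypothetical. In the session above, foo.nc should no longer contain a variable named u after the rename, which is why the message is unexpected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not taken from NCO): compare the rank of a variable
 * in the input file with the rank of the same-named variable in the output
 * file. Returns 0 when the ranks match and -1 otherwise, in which case a
 * message modeled on the one quoted in this thread is written into msg. */
int check_append_rank(const char *var_nm, int rank_in, int rank_out,
                      char *msg, size_t msg_sz)
{
    assert(var_nm != NULL && msg != NULL && msg_sz > 0);
    if (rank_in == rank_out) {
        msg[0] = '\0';          /* ranks agree: nothing to report */
        return 0;
    }
    snprintf(msg, msg_sz,
             "ERROR attempt to write %d dimensional input variable %s "
             "to %d dimensional space in output file.",
             rank_in, var_nm, rank_out);
    return -1;
}
```

Calling check_append_rank("u", 1, 0, msg, sizeof msg) returns -1 and fills msg with text in the same form as the error above; with equal ranks it returns 0 and leaves msg empty.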

         
    • Rorik Peterson

      Rorik Peterson - 2004-01-22

      You get libm and libc to be statically linked using bld/Makefile?  I don't, and I'm not sure how that happens when using the linker flag -lm with gcc instead of manually adding /usr/lib/libm.a to the object files.  I also don't know how to statically link with libc in any circumstances (although I do have /usr/lib/libc.a).

      I don't follow how that could be happening with your GNU/Linux system. 

      rorik@chabuku:~/nco/bld$ make > make.log 2>&1
      rorik@chabuku:~/nco/bld$ ldd ../bin/ncks
              libnetcdf.so.3 => /usr/lib/libnetcdf.so.3 (0x40023000)
              libm.so.6 => /lib/libm.so.6 (0x40046000)
              libc.so.6 => /lib/libc.so.6 (0x40068000)
              /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)

      rorik

       
      • Charlie Zender

        Charlie Zender - 2004-01-27

        I'm not going to have time to get back to this
        for a few weeks. I'm glad it's not affecting everyone,
        though it's still affecting me.

        Charlie

         
      • Charlie Zender

        Charlie Zender - 2004-01-30

        Hi Rorik,

        First, you are right about the static/dynamic linking. It may be a red herring anyway.

        I've found what may be a difference between our builds that
        causes the "reproducible" problem to occur.

        The problem occurs on my fully updated Debian Sid box with

        ./configure --enable-optimize-custom --prefix=${HOME} --bindir=${MY_BIN_DIR} --datadir=${HOME}/nco/data --libdir=${MY_LIB_DIR} --mandir=${HOME}/nco/man > configure.${GNU_TRP}.foo 2>&1

        but not with

        ./configure --prefix=${HOME} --bindir=${MY_BIN_DIR} --datadir=${HOME}/nco/data --libdir=${MY_LIB_DIR} --mandir=${HOME}/nco/man > configure.${GNU_TRP}.foo 2>&1

        In other words, one or more of the many Linux gcc flags triggered by
        --enable-optimize-custom may cause the problem.
        Since I've already verified that the problem does not occur with AIX
        or SGI compilers, it makes sense to me that it's one of the GCC flags.
        Will you try building both ways and verify whether this is the case?
        If you can reproduce the problem then the next step is to figure out
        which gcc flag causes it and then we may pinpoint the offending code.

        Thanks,
        Charlie

         
    • Rorik Peterson

      Rorik Peterson - 2004-01-30

      Charlie,
      I think the problem is -fshort-enums.  According to the GCC manual

      "Warning: the -fshort-enums switch causes GCC to generate code that is not binary compatible with code generated without that switch. Use it to conform to a non-default application binary interface."

      The error occurs when NCO is built with -fshort-enums but libnetcdf is built without it.  I built netcdf-3.5.1-beta10 with and without the flag, and the error seems to go away when both NCO and netCDF have it and, obviously, when neither does.

      I think we should get rid of -fshort-enums.  The savings are not worth bothering people to rebuild libnetcdf.

      rorik

       
      • Charlie Zender

        Charlie Zender - 2004-01-30

        Excellent work!
        I will remove that switch and release 2.8.8 ASAP.

        Thanks,
        Charlie
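
Rorik's diagnosis can be illustrated with a small C sketch. It assumes GCC or Clang, where the packed attribute on an enum has the effect that -fshort-enums applies to every enum, and it assumes the library hands an enum type back through a pointer argument, as netCDF 3 headers appear to do with nc_type (an assumption not confirmed in this thread). All names below (wide_type, short_type, rank_after_mismatched_write) are hypothetical, and the byte buffer only simulates a caller's memory; the thread does not establish which variable was actually overwritten inside NCO.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Enum as a library built WITHOUT -fshort-enums sees it: int-sized on
 * typical GCC/Linux targets. Enumerator names are illustrative. */
typedef enum { W_BYTE = 1, W_CHAR = 2, W_SHORT = 3,
               W_INT = 4, W_FLOAT = 5, W_DOUBLE = 6 } wide_type;

/* Same values as a caller built WITH -fshort-enums sees them. The packed
 * attribute (a GCC/Clang extension) shrinks the enum to the smallest
 * integer type that can hold its values, here one byte. */
typedef enum __attribute__((packed)) { S_BYTE = 1, S_CHAR = 2, S_SHORT = 3,
                                       S_INT = 4, S_FLOAT = 5, S_DOUBLE = 6 } short_type;

size_t wide_enum_size(void)  { return sizeof(wide_type); }
size_t short_enum_size(void) { return sizeof(short_type); }

/* Simulate a library storing a full-width enum into memory where the
 * caller reserved only sizeof(short_type) bytes for it, followed directly
 * by a one-byte rank. Returns the rank the caller reads back afterwards. */
int rank_after_mismatched_write(void)
{
    unsigned char frame[sizeof(wide_type) + sizeof(short_type)] = {0};
    wide_type written = W_FLOAT;

    assert(sizeof frame > sizeof written);
    frame[sizeof(short_type)] = 1;            /* caller's rank: 1 dimension */
    memcpy(frame, &written, sizeof written);  /* library writes the wide enum */
    return frame[sizeof(short_type)];         /* rank as the caller now sees it */
}
```

On a typical x86 Linux build without -fshort-enums, short_enum_size() is 1, wide_enum_size() equals sizeof(int), and rank_after_mismatched_write() returns 0: the wider store spills into the neighboring byte and turns a rank of 1 into 0. If both sides agree on the enum size, the store stays within its slot and the rank survives, which is consistent with the error disappearing when NCO and netCDF are built with matching flags.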

         
