NCO, threading and filesystem access

Developers
2009-03-27
2013-10-17
  • Folks,
    you might know that GPFS on bluefire experiences some "freezes". We were able to trace those problems to the use of NCO with 32 OpenMP threads.
    To further investigate the problem on our side, I have some questions:

    1) What is the filesystem access pattern when running with 32 threads? I'm mostly interested in ncap2 and ncra.

    2) According to the manual, ncap2 does not have threading control. Does it use threads at all?
    2a) If so, does it use the OMP_NUM_THREADS environment variable?
    2b) If not, I wonder what's going on, since we see 32-way ncap2 processes around (maybe a bug?).

    Thanks,
    Davide Del Vento, Consulting Services Software Engineer
    NCAR Computational & Information Services Laboratory
    http://www.cisl.ucar.edu/hss/csg/
    office: Mesa Lab, Room 42B

     
    • henry Butowsky
      2009-03-27

      Hi there,
      I am very interested in your threading experiences with ncap2.
      ncap2 does indeed use threads.
      Each thread has:
        an input read file handle
        an output read file handle
        a common write output file handle.

      ncap2 takes a list of expressions and sorts them into independent blocks which can be evaluated in parallel (I will email you the full details of this later).
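To make the idea of "independent blocks" concrete, here is a minimal sketch in Python. It is not ncap2's actual algorithm (which lives in its ANTLR-based parser and is not described in this thread); the function name `split_into_blocks` and the simple rule "start a new block when a statement touches a variable already written in the current block" are assumptions made purely for illustration.

```python
import re

def split_into_blocks(statements):
    """Group simple 'lhs=rhs' statements into blocks of mutually independent statements.

    Hypothetical rule (an assumption, not ncap2's real logic): a statement
    starts a new block if it reads a variable written earlier in the current
    block, or writes a variable that the current block already reads or writes.
    """
    blocks = []
    current = []
    written = set()   # variables assigned in the current block
    read = set()      # variables read in the current block
    for stmt in statements:
        lhs, rhs = stmt.split("=", 1)
        target = lhs.strip()
        # Crude identifier scan; function names would also be picked up here.
        sources = set(re.findall(r"[A-Za-z_]\w*", rhs))
        conflict = bool(sources & written) or target in written or target in read
        if conflict and current:
            blocks.append(current)
            current, written, read = [], set(), set()
        current.append(stmt)
        written.add(target)
        read |= sources
    if current:
        blocks.append(current)
    return blocks

if __name__ == "__main__":
    for block in split_into_blocks(["a=x+1", "b=y*2", "c=a+b", "d=z"]):
        print(block)
```

Statements inside one block could, in principle, be handed to different threads, while the blocks themselves run one after another.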

      The writing of output is an OMP critical region, so only one thread can write at any time. So I'm guessing, but the freezes you are experiencing could be caused by 30 or so threads queuing up to write. So far I have only tested the threading on my humble Intel quad-core machine and UCI's ESMF. The other possible cause of freezing is that the ANTLR-generated code is not totally thread safe (but I didn't experience this problem with ncap2 on ESMF; N.B. compiled with xlc).
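The serialized-write pattern described above can be sketched as follows. This is an illustration in Python, not NCO's C/OpenMP code: a `threading.Lock` stands in for the OMP critical region, a list stands in for the shared output file, and the name `run_workers` is hypothetical.

```python
import threading

def run_workers(n_threads, items_per_thread):
    """Each worker computes independently but writes through one shared lock."""
    output = []                      # stands in for the shared output file
    write_lock = threading.Lock()    # plays the role of the critical region

    def worker(tid):
        for i in range(items_per_thread):
            value = tid * 1000 + i   # the "computation" needs no lock
            with write_lock:         # only one thread may write at a time
                output.append((tid, value))

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return output

if __name__ == "__main__":
    results = run_workers(4, 10)
    print(len(results), "records written")
```

With many threads and little computation per write, most threads spend their time waiting on the lock, which is the queuing effect suspected here.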

      I'm no expert on GPFS, but my understanding is that you can use netCDF on it and have parallel writes. Hence it may be possible to create a custom version of ncap2 with the critical regions removed, so that each thread has its own output write handle.
      Anyway, I'm sure Charlie will have some comments on this.

      If you run ncap2 with -D 2 and look carefully at the output, you will see how ncap2 "sorts" a list of expressions into independent blocks.

      Regards, Henry

       

       
      • Charlie Zender
        2009-03-27

        Hello Davide,

        Thank you for your report.
        Reports from users on 32-way nodes are very helpful in stress-testing.

        Your message was a little unclear.
        Is ncap2 freezing on GPFS?
        Or is GPFS freezing independently of ncap2, and you're using ncap2
        to diagnose it?
        I'll assume the former (please correct me if I'm wrong).

        My understanding:
        NCO threading works bug-free for all operators except ncap2.
        The only recent buggy threading in NCO occurred in version 3.9.5.
        You pointed out the problem and we found and fixed the bug.

        Yes, ncap2 documentation on threading is out of date.
        It works the same as in the other operators:
        it has an internal default, uses OMP_NUM_THREADS if available,
        and the command-line -t option takes highest precedence.
        Until today, ncap2, ncra, and ncwa had no internal maximum.
        Hence they ran with 32 threads when available.
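The precedence just described (internal default, then OMP_NUM_THREADS, then -t on the command line) can be sketched like this. It is a Python illustration only, not the logic in nco_omp.c; the function name `resolve_thread_count`, its parameters, and the choice to apply the internal maximum only when -t is absent are assumptions.

```python
import os

def resolve_thread_count(cli_threads=None, env=None, internal_default=1, internal_max=None):
    """Pick a thread count: -t beats OMP_NUM_THREADS, which beats the default.

    internal_max is a hypothetical per-operator cap; here it is assumed to
    limit only the non-command-line paths.
    """
    if env is None:
        env = os.environ
    if cli_threads is not None:
        return cli_threads                      # explicit -t wins outright
    env_value = env.get("OMP_NUM_THREADS", "").strip()
    if env_value.isdigit() and int(env_value) > 0:
        requested = int(env_value)              # environment overrides default
    else:
        requested = internal_default
    if internal_max is not None:
        requested = min(requested, internal_max)
    return requested

if __name__ == "__main__":
    # No cap: a 32-way node with OMP_NUM_THREADS=32 yields 32 threads.
    print(resolve_thread_count(env={"OMP_NUM_THREADS": "32"}))
    # With a cap of 4, the same environment yields 4.
    print(resolve_thread_count(env={"OMP_NUM_THREADS": "32"}, internal_max=4))
```

Without a cap, an operator on a 32-way node inherits all 32 threads from the environment, which matches the behavior reported in this thread.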

        Current ncap2 problems are traceable to the underlying ANTLR library.
        The ncap2 threading problems seem to be triggered only rarely:
        Yours, i.e., from NCAR on 32-way nodes, are the only reports.
        Unfortunately, we can't fix this problem until the ANTLR3 C++ front-end
        is finished (nobody knows when).

        Here is what I suggest:
        Turn off ncap2 threading by default.
        Restrict default threads for ncrcat,ncecat to 2.
        Restrict default threads for ncrcat,ncecat to 4.
        I have just committed a patch to nco_omp.c that does this.

        Feedback welcome. Changes will be in NCO 3.9.8.

        Regarding removal of critical regions for GPFS executables:
        I doubt this will work. I think we need to use netCDF4 I/O
        to get parallel writes working on parallel filesystems.

        Charlie

         
        • Well,
          actually we do not know exactly what's going on. We know that GPFS freezes once in a while. Since we saw that twice recently while ncap2 and ncra were running, it looked like a smoking gun (but it might not be).
          I will try to reproduce the problem, but your info has been helpful in understanding what might be happening (Henry, I'm interested in the additional details you mentioned).

          Thanks and have a nice weekend,

          Davide Del Vento, Consulting Services Software Engineer
          NCAR Computational & Information Services Laboratory
          http://www.cisl.ucar.edu/hss/csg/
          office: Mesa Lab, Room 42B

           
  • Dear all

    I'm working on netCDF files of SST and wind monthly means.

    I have daily data from 2005 to 2009 and would like to use NCO operators to average the daily data into monthly means.

    When I use ncea inputfile.nc1 inputfile.nc2 output, it works only for two files; with more than two files it gives wrong results.

    Can someone help me?

    Thanks

     
  • Charlie Zender
    2011-01-28

    if threading is the problem then running the operators with the

    -t 0

    option will turn off threading and fix the problem.

    please try that and report back.

    in any case, please generate a small reproducible example of the problem
    with the latest NCO (4.0.6) if you want me to look into it in more detail.

    thx,
    c