Recent thread problems

Developers
2005-06-07
2013-10-17
  • Charlie Zender

    Charlie Zender - 2005-06-07

    Hi All,

    I ran a series of date-stamped regression tests,
    using bisection to bracket the date when the ncra regressions
    started failing on the ESMF, e.g.,

    cvs -z3 -dzender@cvs.sf.net:/cvsroot/nco co -D 20050524 \
        -d nco-20050524 nco
    cd ~/nco-20050524/bld
    make dst_cln;make;make tst

    Results are

        20050605: Fail;
        20050524: Fail;
        20050521: Fail;
        20050520: Fail;
        20050519: Succeed;
        20050517: Succeed;
        20050510: Succeed;
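
    The bracketing step can be sketched in C as follows. This is only an
    illustration of the bisection logic, not the script actually used:
    first_failing(), passes_lookup(), and sample_dates are hypothetical names,
    and the predicate is a lookup that mirrors the results listed above rather
    than a real checkout followed by "make tst".

```c
#include <assert.h>
#include <stddef.h>

/* Predicate type: returns nonzero if the snapshot for `date` passes. */
typedef int (*passes_fn)(long date);

/* Snapshot dates from the results above, sorted oldest first. */
const long sample_dates[] = {
    20050510, 20050517, 20050519, 20050520, 20050521, 20050524, 20050605
};

/* Hypothetical stand-in for "check out snapshot, build, run make tst".
 * It simply encodes the observed results: 20050519 and earlier succeed. */
int passes_lookup(long date)
{
    return date <= 20050519;
}

/* Return the index of the first failing date.
 * Assumes the oldest date passes, the newest fails, and there is a
 * single pass-to-fail transition in between. */
size_t first_failing(const long *dates, size_t n, passes_fn passes)
{
    size_t lo = 0;      /* index known to pass */
    size_t hi = n - 1;  /* index known to fail */

    assert(n >= 2);
    assert(passes(dates[lo]) && !passes(dates[hi]));

    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (passes(dates[mid]))
            lo = mid;   /* regression is later than mid */
        else
            hi = mid;   /* regression is at or before mid */
    }
    return hi;
}
```

    With the dates above, the search narrows to the 20050519/20050520 pair
    after testing three snapshots instead of all seven.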

    ChangeLog for 20050519 is

    2005-05-19  Charlie Zender  <zender@uci.edu>

        * Change nco_get_var*() routines in nco_var_get() from SMP
        critical to non-critical. Not sure why they were originally
        critical.

        * Change nco_var_upk*() routine in nco_var_get() from SMP
        critical to non-critical. Not sure why it was originally
        critical.

        * Change nco_var_refresh() routine from SMP critical to
        non-critical. Not sure why it was originally critical.

    I thought I tested on the ESMF after I un-blocked the critical regions
    in the netCDF read routines. Apparently not.
    I will revert those patches and see if this fixes things.

    In the meantime, Harry, could you please install icc on soot?
    Debugging this threading with valgrind will probably be much
    easier there than on the ESMF.

    P.S. This bug may mean that libnetcdf.a is not thread safe
    on reads, contrary to our assumptions.

    Thanks,
    Charlie

     
    • Charlie Zender

      Charlie Zender - 2005-06-08

      Hi All,

      The current NCO snapshot makes netCDF read calls critical (again).
      From the ChangeLog:

          * Reverting the 20050519 critical region patches fixes ncra
          regressions on AIX OpenMP. Hypothesis that libnetcdf.a is not
          thread-safe on reads (or writes) gains considerable weight,
          since all the patches do, I think, is make reads critical.
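
      For readers unfamiliar with the construct, here is a minimal sketch of
      what "making reads critical" means in OpenMP. It is not NCO code:
      unsafe_read() is a hypothetical stand-in for a library read that relies
      on shared internal state, and sum_with_critical_reads() is an invented
      name. The read is serialized while the arithmetic stays parallel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a non-reentrant library read: it goes
 * through shared static state, so concurrent calls could interfere. */
static double unsafe_read(const double *src, size_t idx)
{
    static double scratch;
    scratch = src[idx];
    return scratch;
}

/* Sum of squares where each read is wrapped in a named critical region. */
double sum_with_critical_reads(const double *src, size_t n)
{
    double sum = 0.0;
    long i;

    assert(src != NULL || n == 0);

#pragma omp parallel for reduction(+:sum)
    for (i = 0; i < (long)n; i++) {
        double v;

        /* Only one thread at a time may execute the read. */
#pragma omp critical (read_var)
        {
            v = unsafe_read(src, (size_t)i);
        }

        /* Computation on the thread-local value runs in parallel. */
        sum += v * v;
    }
    return sum;
}
```

      Built without OpenMP support the pragmas are ignored and the loop runs
      serially with the same result. The cost of the critical region is that
      threads queue up on every read, so it trades speed for safety.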

      This raises the question whether any of the threaded operators
      is actually bug-free, or whether the rest (except ncwa) just appear
      bug-free because the timing errors are not triggered.

      The results from Harry's benchmark tests on the ESMF with the current
      snapshot will guide my decision on whether to change thr_nbr_max_fsh
      to 1 for all operators (currently only set to avoid threading bugs
      for ncwa).
      If anything would trigger timing errors (i.e., uncover thread
      contention bugs), the benchmark tests should.

      Please test and let me know whether there seem to be non-ncwa
      threading errors. Once I know what to do with thr_nbr_max_fsh,
      we can release NCO 3.0.1 and really crack down on understanding
      the threading during that release cycle.

      Thanks,
      Charlie

       
    • Charlie Zender

      Charlie Zender - 2005-06-09

      Hi All,

      Using icc and valgrind on soot, I tracked down and fixed all remaining
      threading problems, as far as I can tell.
      The gory details are in the ChangeLog.

      The key to ncwa was:

          * Change DO_CONFORM_MSK,DO_CONFORM_WGT from private to
          firstprivate and make sure both always have default values.
          This fixed all ncwa regressions with threads!
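
      The private/firstprivate distinction behind that fix can be sketched
      as follows. This is an illustrative example, not the ncwa source:
      weighted_sum() and its arguments are hypothetical, and do_conform_wgt
      only borrows its name from the flag mentioned above. With private, each
      thread's copy of the flag starts out uninitialized; with firstprivate,
      each copy starts with the value assigned before the parallel region.

```c
#include <assert.h>
#include <stddef.h>

/* Sum val[i], optionally scaled by wgt[i], depending on a flag that is
 * given a default value before the parallel region begins. */
double weighted_sum(const double *val, const double *wgt, size_t n,
                    int use_wgt)
{
    /* Always assign a default before entering the parallel region. */
    int do_conform_wgt = use_wgt ? 1 : 0;
    double sum = 0.0;
    long i;

    assert(val != NULL || n == 0);
    assert(wgt != NULL || !use_wgt || n == 0);

    /* private(do_conform_wgt) would leave each thread's copy undefined
     * here; firstprivate copies in the value assigned above. */
#pragma omp parallel for firstprivate(do_conform_wgt) reduction(+:sum)
    for (i = 0; i < (long)n; i++) {
        sum += do_conform_wgt ? val[i] * wgt[i] : val[i];
    }
    return sum;
}
```

      A flag that is read before any thread writes it is exactly the case
      where private gives timing-dependent garbage, which fits the kind of
      intermittent failure described in this thread.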

      It may be that the critical regions are not necessary on reads
      after all (i.e., it is possible that was a symptom of this bug).
      I will try un-reverting the critical-region-on-read patches tomorrow.

      Anyhow, the current code has no known problems with any number
      of threads on any operator.
      I updated thr_nbr_max_fsh for many operators.

      P.S. Thanks to Harry for installing the updated Intel compilers
      which made finding and fixing these problems a lot more fun.

      Charlie

       
