
ESMF batch benchmarks for 2, 4, and 8 threads

Developers
2005-06-17
2013-10-17
  • Charlie Zender

    Charlie Zender - 2005-06-17

    Hi,

    The following benchmarks come from runs in ESMF batch queue rg8.
    These were run by changing NTHREADS in nco_bm.sh and then submitting

    llsubmit nco_bm.sh
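
    For reference, the thread sweep could be scripted rather than edited by
    hand. Here is a minimal sketch; it assumes NTHREADS is set by a plain
    shell assignment near the top of nco_bm.sh, which may not match the
    actual script:

      for nth in 2 4 8 ; do
        # Generate a copy of the benchmark script with NTHREADS overridden,
        # then hand that copy to LoadLeveler
        sed "s/^NTHREADS=.*/NTHREADS=${nth}/" nco_bm.sh > nco_bm_${nth}.sh
        llsubmit nco_bm_${nth}.sh
      done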

    The idea of running in the batch queue is to guarantee processor
    availability and to minimize benchmark variability.
    Yet, as you will see, variability can still be quite high, e.g., ncpdq.
    Why is there any significant variability in queues?

    The benchmarks are for 2, 4, and 8 threads, each repeated twice.
    Some operators will require more (four? eight?) tests to get a
    statistically meaningful result.
    I'd like to understand why variability is so high so that we can
    reduce it if possible and therefore get away with fewer repetitions.
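
    As a rough way to decide how many repetitions are enough, the wall-clock
    times from several summary tables could be pooled per operator. Here is a
    minimal awk sketch that reports the mean and the max-minus-min spread of
    the last (time) column, assuming the summaries below have been saved to
    plain-text files (summary_*.txt is a placeholder name):

      awk '/^ *nc[a-z]+:/ {
             # accumulate count, sum, min, and max of the time column per operator
             n[$1]++ ; sum[$1] += $NF
             if (!($1 in min) || $NF < min[$1]) min[$1] = $NF
             if ($NF > max[$1]) max[$1] = $NF
           }
           END { for (op in n)
                   printf "%-9s mean %10.2f  spread %10.2f  (%d runs)\n",
                          op, sum[op]/n[op], max[op]-min[op], n[op] }' summary_*.txt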

    ncwa seems to scale well, as expected, though the results are less clear
    for the other threaded operators.
    Probably next week I will discuss with Harry revising the benchmarks
    to better highlight certain operations, and to summarize more concisely
    the information nco_bm.pl already measures.
    If you have suggestions for benchmark revisions, please post them
    soon, as we do not want to change the benchmarks very often.

    Thanks,
    Charlie

         Test   Success    Failure   Total       Time   (OMP threads = 8)
          ncap:        6          4      10     279.5215
       ncatted:        1                  1     0.5853
          ncbo:        8                  8     271.6180
       ncflint:        3                  3     0.8077
          ncea:        6                  6     99.4940
        ncecat:        1                  1     0.2696
          ncks:       15                 15     1.8946
         ncpdq:        6                  6     454.3183
          ncra:       15          2      17     109.3245
          ncwa:       37                 37     79.4942

         Test   Success    Failure   Total       Time   (OMP threads = 8)
          ncap:        6          4      10     265.7919
       ncatted:        1                  1     0.5488
          ncbo:        8                  8     260.2727
       ncflint:        3                  3     0.7674
          ncea:        6                  6     86.4602
        ncecat:        1                  1     0.2761
          ncks:       15                 15     1.8597
         ncpdq:        6                  6     358.6496
          ncra:       15          2      17     109.3547
          ncwa:       37                 37     77.3682

    loadleveler:
         Test   Success    Failure   Total       Time   (OMP threads = 4)
          ncap:        6          4      10     354.6031
       ncatted:        1                  1     0.5587
          ncbo:        8                  8     273.5378
       ncflint:        3                  3     0.7771
          ncea:        6                  6     109.0394
        ncecat:        1                  1     0.2613
          ncks:       15                 15     1.9057
         ncpdq:        6                  6     660.7932
          ncra:       15          2      17     135.5798
          ncwa:       37                 37     130.6776

    loadleveler:
         Test   Success    Failure   Total       Time   (OMP threads = 4)
          ncap:        6          4      10     312.1353
       ncatted:        1                  1     0.6110
          ncbo:        8                  8     262.2002
       ncflint:        3                  3     0.6750
          ncea:        6                  6     94.6699
        ncecat:        1                  1     0.3568
          ncks:       15                 15     1.9946
         ncpdq:        6                  6     461.3167
          ncra:       15          2      17     136.2394
          ncwa:       37                 37     121.1632

         Test   Success    Failure   Total       Time   (OMP threads = 2)
          ncap:        6          4      10     381.3692
       ncatted:        1                  1     0.5680
          ncbo:        8                  8     234.7483
       ncflint:        3                  3     0.6513
          ncea:        6                  6     117.4369
        ncecat:        1                  1     0.2669
          ncks:       15                 15     1.7837
         ncpdq:        6                  6     509.1903
          ncra:       15          2      17     153.3728
          ncwa:       37                 37     156.7412

         Test   Success    Failure   Total       Time   (OMP threads = 2)
          ncap:        6          4      10     359.3890
       ncatted:        1                  1     0.5667
          ncbo:        8                  8     224.5704
       ncflint:        3                  3     0.6571
          ncea:        6                  6     103.7265
        ncecat:        1                  1     0.2788
          ncks:       15                 15     1.9143
         ncpdq:        6                  6     480.8395
          ncra:       15          2      17     151.8112
          ncwa:       37                 37     156.5773

     
    • Nobody/Anonymous

      The batch queue guarantees CPU availability, but does it also guarantee I/O bandwidth to disk?  I don't know the I/O architecture of the ESMF machine, but it looks like the tests are waiting on disk, possibly contending for disk I/O with processes on other nodes.  Is that a possibility, or does each node get its own minimum disk bandwidth?  Do the nodes have some kind of local-to-the-node storage that all I/O goes through before being written out to permanent storage?

      If not, I think that contention for shared disk would explain the variability.

      hjm

       
      • Charlie Zender

        Charlie Zender - 2005-06-17

        Hi Harry,

        Of course it's possible this is a result of disk contention for /ptmp.
        I mean, what else could it be? I'm surprised this turns out to be
        such a large factor, though, because ESS jobs typically do not
        do a lot of I/O. Anyway, this theory can (and should) be tested
        by altering the batch benchmarks to write to /tmp (which is local)
        rather than /ptmp (which is shared), thus removing I/O contention
        with other jobs.
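
        A quick, NCO-independent sanity check of that theory is to time
        identical large writes to both filesystems from within a batch job.
        A minimal sketch with standard tools; the 1 GiB size and the /ptmp
        path are placeholders:

          # Compare streaming write time to the shared vs. the local filesystem
          for dir in /ptmp/$USER /tmp ; do
            echo "Writing 1 GiB to $dir"
            time dd if=/dev/zero of=$dir/nco_io_test bs=1048576 count=1024
            rm -f $dir/nco_io_test
          done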

        Also, it might be helpful to (re-)add either or both of the user and sys times to the benchmark summaries, to make the split between CPU time and time spent waiting on I/O clearer.
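
        In the meantime, the same breakdown can be obtained for a single
        command with the shell's timing facility; here ncwa and the file
        names are just placeholders:

          # The shell's time builtin reports real, user, and sys; a real time
          # much larger than user+sys suggests the run was mostly waiting on I/O
          time ncwa -O in.nc out.nc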

        Thanks,
        Charlie

         
