Hi,
The following benchmarks come from runs in ESMF batch queue rg8.
These were run by changing NTHREADS in nco_bm.sh and then submitting it with

    llsubmit nco_bm.sh
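(For concreteness, a minimal sketch of what such a job script might contain is below. The LoadLeveler directives and the NTHREADS hand-off are illustrative assumptions, not nco_bm.sh's actual contents; only the class name rg8 and the llsubmit command come from this thread.)

    #!/bin/sh
    # Hypothetical LoadLeveler script; the real nco_bm.sh directives may differ
    # @ class  = rg8
    # @ output = nco_bm.$(jobid).out
    # @ error  = nco_bm.$(jobid).err
    # @ queue
    NTHREADS=8                        # edited by hand for the 2-, 4-, and 8-thread runs
    export OMP_NUM_THREADS=$NTHREADS  # hand the thread count to the OpenMP runtime
    ./nco_bm.pl                       # benchmark driver; exact invocation omitted here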
The idea of running in the batch queue is to guarantee processor
availability and to minimize benchmark variability.
Yet, as you will see, variability can still be quite high (ncpdq, for
example). Why is there any significant variability at all in a batch queue?
The benchmarks are for 2, 4, and 8 threads, each repeated twice.
Some operators will require more (four? eight?) repetitions to get a
statistically meaningful result, since the standard error of the mean
shrinks only as 1/sqrt(n).
I'd like to understand why variability is so high, so that we can
reduce it if possible and thereby get away with fewer repetitions.
ncwa seems to scale well, as expected, though results are less clear
for the remaining threading operators.
Probably next week I will discuss with Harry revising the benchmarks
to better highlight certain operations, and to provide more concise
diagnostics from the information nco_bm.pl already measures.
If you have suggestions for benchmark revisions, please post them
soon, as we do not want to change the benchmarks very often.
Thanks,
Charlie
Test      Success  Failure  Total      Time (OMP threads = 8, run 1 of 2)
ncap:           6        4     10  279.5215
ncatted:        1        0      1    0.5853
ncbo:           8        0      8  271.6180
ncflint:        3        0      3    0.8077
ncea:           6        0      6   99.4940
ncecat:         1        0      1    0.2696
ncks:          15        0     15    1.8946
ncpdq:          6        0      6  454.3183
ncra:          15        2     17  109.3245
ncwa:          37        0     37   79.4942
Test      Success  Failure  Total      Time (OMP threads = 8, run 2 of 2)
ncap:           6        4     10  265.7919
ncatted:        1        0      1    0.5488
ncbo:           8        0      8  260.2727
ncflint:        3        0      3    0.7674
ncea:           6        0      6   86.4602
ncecat:         1        0      1    0.2761
ncks:          15        0     15    1.8597
ncpdq:          6        0      6  358.6496
ncra:          15        2     17  109.3547
ncwa:          37        0     37   77.3682
loadleveler:
Test      Success  Failure  Total      Time (OMP threads = 4, run 1 of 2)
ncap:           6        4     10  354.6031
ncatted:        1        0      1    0.5587
ncbo:           8        0      8  273.5378
ncflint:        3        0      3    0.7771
ncea:           6        0      6  109.0394
ncecat:         1        0      1    0.2613
ncks:          15        0     15    1.9057
ncpdq:          6        0      6  660.7932
ncra:          15        2     17  135.5798
ncwa:          37        0     37  130.6776
loadleveler:
Test      Success  Failure  Total      Time (OMP threads = 4, run 2 of 2)
ncap:           6        4     10  312.1353
ncatted:        1        0      1    0.6110
ncbo:           8        0      8  262.2002
ncflint:        3        0      3    0.6750
ncea:           6        0      6   94.6699
ncecat:         1        0      1    0.3568
ncks:          15        0     15    1.9946
ncpdq:          6        0      6  461.3167
ncra:          15        2     17  136.2394
ncwa:          37        0     37  121.1632
Test      Success  Failure  Total      Time (OMP threads = 2, run 1 of 2)
ncap:           6        4     10  381.3692
ncatted:        1        0      1    0.5680
ncbo:           8        0      8  234.7483
ncflint:        3        0      3    0.6513
ncea:           6        0      6  117.4369
ncecat:         1        0      1    0.2669
ncks:          15        0     15    1.7837
ncpdq:          6        0      6  509.1903
ncra:          15        2     17  153.3728
ncwa:          37        0     37  156.7412
Test      Success  Failure  Total      Time (OMP threads = 2, run 2 of 2)
ncap:           6        4     10  359.3890
ncatted:        1        0      1    0.5667
ncbo:           8        0      8  224.5704
ncflint:        3        0      3    0.6571
ncea:           6        0      6  103.7265
ncecat:         1        0      1    0.2788
ncks:          15        0     15    1.9143
ncpdq:          6        0      6  480.8395
ncra:          15        2     17  151.8112
ncwa:          37        0     37  156.5773
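To put numbers on the scaling claim above: averaging the two runs at each
thread count, ncwa drops from (156.7412 + 156.5773)/2 ≈ 156.7 s at 2 threads
to (79.4942 + 77.3682)/2 ≈ 78.4 s at 8 threads, a speedup of ≈ 2.0x for a 4x
increase in threads. By contrast, the two 4-thread ncpdq runs differ from
each other by ~200 s (660.79 vs 461.32), larger than ncwa's entire
2-to-8-thread gain, which is the variability problem in a nutshell.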
The batch queue guarantees CPU availability, but does it also guarantee
I/O bandwidth to disk? I don't know the I/O architecture of the ESMF,
but it looks like the tests are waiting on disk, possibly contending for
disk I/O with processes on other nodes. Is that a possibility, or does
each node get its own minimum disk bandwidth? Do the nodes have
node-local disk storage through which all I/O is staged before being
written out to permanent storage?
If not, I think that would explain the variability.
hjm
Hi Harry,
Of course it's possible this is a result of disk contention for /ptmp.
I mean, what else could it be? I'm surprised this turns out to be
such a large factor, though, because ESS jobs typically do not
do a lot of I/O. Anyway, this theory can (and should) be tested
by altering the batch benchmarks to write to /tmp (which is local)
rather than /ptmp (which is shared), thus removing I/O contention
from other jobs.
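(A minimal sketch of that test follows; NCO_BM_DIR is a hypothetical knob,
and /ptmp/$USER is a guess at the ptmp layout. If nco_bm.sh hard-codes its
paths, the equivalent one-line edit goes inside the script instead.)

    # Node-local run: no cross-job disk contention possible
    mkdir -p /tmp/nco_bm
    NCO_BM_DIR=/tmp/nco_bm ./nco_bm.sh > timing.tmp.log 2>&1

    # Shared-filesystem run: the current setup, exposed to other jobs' I/O
    mkdir -p /ptmp/$USER/nco_bm
    NCO_BM_DIR=/ptmp/$USER/nco_bm ./nco_bm.sh > timing.ptmp.log 2>&1

If the /tmp timings are both faster and markedly more repeatable, contention
on /ptmp is the culprit.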
Also, it might be helpful to (re-)add either or both of the user and
sys times to the benchmark summaries, to make the distinction between
CPU time and time spent waiting on I/O clearer.
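(In the meantime, the same signal is available by wrapping a single operator
in the shell's time builtin; in.nc and out.nc below are placeholder file
names, and the -a lat,lon averaging dimensions are just an example.)

    # real >> user + sys  =>  the process spent most of its time waiting on disk
    # real ~= user + sys  =>  the process was CPU-bound
    time ncwa -O -a lat,lon in.nc out.nc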
Thanks,
Charlie