[Assorted-commits] SF.net SVN: assorted: [604] numa-bench/trunk/doc/analysis.txt
From: <yan...@us...> - 2008-03-05 06:11:34
Revision: 604
          http://assorted.svn.sourceforge.net/assorted/?rev=604&view=rev
Author:   yangzhang
Date:     2008-03-04 22:11:35 -0800 (Tue, 04 Mar 2008)

Log Message:
-----------
updated for new results

Modified Paths:
--------------
    numa-bench/trunk/doc/analysis.txt

Modified: numa-bench/trunk/doc/analysis.txt
===================================================================
--- numa-bench/trunk/doc/analysis.txt	2008-03-05 06:11:23 UTC (rev 603)
+++ numa-bench/trunk/doc/analysis.txt	2008-03-05 06:11:35 UTC (rev 604)
@@ -1,6 +1,10 @@
 % NUMA Benchmarks Analysis
 % Yang Zhang
 
+__Updates__
+
+- 3/4/08: updated scalability experiments
+
 All tests were performed on `josmp.csail.mit.edu`. The [graphs](graphs) show
 the results of running several different experiments. The results are averaged
 across three trials for each experiment. The experiments varied the following
@@ -8,14 +12,15 @@
 - number of threads (CPUs, 1-16, usually 16 if not testing scalability)
 - size of the memory buffer to operate on (10MB, 100MB, or 1GB)
-- number of times to repeat the operation (usually one)
+- number of operations, i.e. reads/writes (frequently 10 million)
+- number of times to repeat the chewing (usually 1)
 - whether to chew through the memory sequentially or using random access
 - whether to run operations in parallel on all the CPUs
 - whether to explicitly pin the threads to a CPU (usually we do)
 - whether to operate on a global buffer or on our own buffer (that we
   allocate ourselves) or on buffers that all other nodes allocated (for
   cross-communication)
-- whether to perform writes to the buffer, otherwise just read
+- whether operations are writes or reads
 - in experiments varying the number of cores $k$ working concurrently:
   whether we're using cores 1 through $k$ or cores across the nodes in
   round-robin fashion
 
@@ -25,48 +30,71 @@
 - How much does working from another node affect throughput?
   - It doesn't make much difference for sequential scans - this shows
     hardware prefetching (and caching) at work. It still makes [a bit of
-    difference](graphs/ncores-16-size-100000000-nreps-1-shuffle-0-par-0-pin-1-local-0-write-1-cross-0.pdf).
+    difference](graphs/nworkers-16-size-100000000-nreps-1-shuffle-0-par-0-pin-1-local-0-write-1-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf).
   - However, for random accesses, the difference is much more
-    [pronounced](graphs/ncores-16-size-100000000-nreps-1-shuffle-1-par-0-pin-1-local-0-write-1-cross-0.pdf).
+    [pronounced](graphs/nworkers-16-size-100000000-nreps-1-shuffle-1-par-0-pin-1-local-0-write-1-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf).
 - How much difference is there between sequential scan and random access?
   - Substantial difference. Also magnifies NUMA effects. Compare
-    [a](graphs/ncores-16-size-100000000-nreps-1-shuffle-0-par-0-pin-1-local-0-write-1-cross-0.pdf)
+    [a](graphs/nworkers-16-size-100000000-nreps-1-shuffle-0-par-0-pin-1-local-0-write-1-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
     and
-    [b](graphs/ncores-16-size-100000000-nreps-1-shuffle-1-par-0-pin-1-local-0-write-1-cross-0.pdf)
+    [b](graphs/nworkers-16-size-100000000-nreps-1-shuffle-1-par-0-pin-1-local-0-write-1-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
 - Read vs. write
-  - Substantial difference. Random writes are ~2x slower than random reads.
+  - Substantial difference. Random writes are over 2x slower than random reads.
   - Compare
-    [a](graphs/ncores-16-size-100000000-nreps-1-shuffle-1-par-0-pin-1-local-0-write-0-cross-0.pdf)
+    [a](graphs/nworkers-16-size-100000000-nreps-1-shuffle-1-par-0-pin-1-local-0-write-0-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
     and
-    [b](graphs/ncores-16-size-100000000-nreps-1-shuffle-1-par-0-pin-1-local-0-write-1-cross-0.pdf)
+    [b](graphs/nworkers-16-size-100000000-nreps-1-shuffle-1-par-0-pin-1-local-0-write-1-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
 - Does `malloc` tend to allocate locally?
   - Yes, because working with memory allocated from the current thread shows
     improved times.
 - Scalability of: cross-node memory writes vs. shared memory writes vs. local node memory writes
-  - Graphs for each of these:
-    [a](graphs/scaling-size-10000000-nreps-1-shuffle-0-par-1-pin-1-local-0-write-1-cross-1.pdf)
+  - Throughputs for sequential scans:
+    [a](graphs/scaling-size-1000000000-opcount-10000000-nreps-1-shuffle-0-par-1-pin-1-local-0-write-1-cross-1-rrnodes-1-nnodes-4-ncpus-16.pdf)
     vs.
-    [b](graphs/scaling-size-10000000-nreps-1-shuffle-0-par-1-pin-1-local-0-write-1-cross-0.pdf)
+    [b](graphs/scaling-size-1000000000-opcount-10000000-nreps-1-shuffle-0-par-1-pin-1-local-0-write-1-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
     vs.
-    [c](graphs/scaling-size-10000000-nreps-1-shuffle-0-par-1-pin-1-local-1-write-1-cross-0.pdf)
-  - Local memory node access is best but still has problems scaling. The time
-    remains constant after some point. This is probably because increasing the
-    number of cores causes the load distribution to approach a more uniform
-    distribution.
+    [c](graphs/scaling-size-1000000000-opcount-10000000-nreps-1-shuffle-0-par-1-pin-1-local-1-write-1-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
+  - Speedup graphs:
+    [a](graphs/speedup-size-1000000000-opcount-10000000-nreps-1-shuffle-0-par-1-pin-1-local-0-write-1-cross-1-rrnodes-1-nnodes-4-ncpus-16.pdf)
+    vs.
+    [b](graphs/speedup-size-1000000000-opcount-10000000-nreps-1-shuffle-0-par-1-pin-1-local-0-write-1-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
+    vs.
+    [c](graphs/speedup-size-1000000000-opcount-10000000-nreps-1-shuffle-0-par-1-pin-1-local-1-write-1-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
+  - Throughputs for random access:
+    [a](graphs/scaling-size-1000000000-opcount-10000000-nreps-1-shuffle-1-par-1-pin-1-local-0-write-1-cross-1-rrnodes-1-nnodes-4-ncpus-16.pdf)
+    vs.
+    [b](graphs/scaling-size-1000000000-opcount-10000000-nreps-1-shuffle-1-par-1-pin-1-local-0-write-1-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
+    vs.
+    [c](graphs/scaling-size-1000000000-opcount-10000000-nreps-1-shuffle-1-par-1-pin-1-local-1-write-1-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
+  - Speedup graphs:
+    [a](graphs/speedup-size-1000000000-opcount-10000000-nreps-1-shuffle-1-par-1-pin-1-local-0-write-1-cross-1-rrnodes-1-nnodes-4-ncpus-16.pdf)
+    vs.
+    [b](graphs/speedup-size-1000000000-opcount-10000000-nreps-1-shuffle-1-par-1-pin-1-local-0-write-1-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
+    vs.
+    [c](graphs/speedup-size-1000000000-opcount-10000000-nreps-1-shuffle-1-par-1-pin-1-local-1-write-1-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
 - Scalability of: cross-node memory reads vs. shared memory reads vs. local node memory reads
-  - Graphs for each of these:
-    [a](graphs/scaling-size-10000000-nreps-1-shuffle-0-par-1-pin-1-local-0-write-0-cross-1.pdf)
+  - Throughputs for sequential scans:
+    [a](graphs/scaling-size-1000000000-opcount-10000000-nreps-1-shuffle-0-par-1-pin-1-local-0-write-0-cross-1-rrnodes-1-nnodes-4-ncpus-16.pdf)
     vs.
-    [b](graphs/scaling-size-10000000-nreps-1-shuffle-0-par-1-pin-1-local-0-write-0-cross-0.pdf)
+    [b](graphs/scaling-size-1000000000-opcount-10000000-nreps-1-shuffle-0-par-1-pin-1-local-0-write-0-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
     vs.
-    [c](graphs/scaling-size-10000000-nreps-1-shuffle-0-par-1-pin-1-local-1-write-0-cross-0.pdf)
-  - Cross-communicating performs worse, and local memory node access performs
-    the same as shared memory access. This is expected, since we aren't
-    performing writes, so the data is freely replicated to all caches (same
-    reason that there is little difference between the non-parallel reads from
-    local vs. remote).
+    [c](graphs/scaling-size-1000000000-opcount-10000000-nreps-1-shuffle-0-par-1-pin-1-local-1-write-0-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
+  - Speedup graphs:
+    [a](graphs/speedup-size-1000000000-opcount-10000000-nreps-1-shuffle-0-par-1-pin-1-local-0-write-0-cross-1-rrnodes-1-nnodes-4-ncpus-16.pdf)
+    vs.
+    [b](graphs/speedup-size-1000000000-opcount-10000000-nreps-1-shuffle-0-par-1-pin-1-local-0-write-0-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
+    vs.
+    [c](graphs/speedup-size-1000000000-opcount-10000000-nreps-1-shuffle-0-par-1-pin-1-local-1-write-0-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
+  - Throughputs for random access:
+    [a](graphs/scaling-size-1000000000-opcount-10000000-nreps-1-shuffle-1-par-1-pin-1-local-0-write-0-cross-1-rrnodes-1-nnodes-4-ncpus-16.pdf)
+    vs.
+    [b](graphs/scaling-size-1000000000-opcount-10000000-nreps-1-shuffle-1-par-1-pin-1-local-0-write-0-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
+    vs.
+    [c](graphs/scaling-size-1000000000-opcount-10000000-nreps-1-shuffle-1-par-1-pin-1-local-1-write-0-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
+  - Speedup graphs:
+    [a](graphs/speedup-size-1000000000-opcount-10000000-nreps-1-shuffle-1-par-1-pin-1-local-0-write-0-cross-1-rrnodes-1-nnodes-4-ncpus-16.pdf)
+    vs.
+    [b](graphs/speedup-size-1000000000-opcount-10000000-nreps-1-shuffle-1-par-1-pin-1-local-0-write-0-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
+    vs.
+    [c](graphs/speedup-size-1000000000-opcount-10000000-nreps-1-shuffle-1-par-1-pin-1-local-1-write-0-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
 
-There's still quite a bit of room to fill out this test suite. For instance,
-the experiments varying the number of cores all exercise the fewest number of
-chips; the results may be quite different for tests that distribute the loaded
-cores across all chips.
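
A note on the kernel these parameters drive: the parameter list in the diff above (pinning, sequential vs. shuffled chewing, reads vs. writes, packed vs. round-robin core placement) maps onto a fairly small inner loop. The sketch below is a hypothetical reconstruction under those assumptions, not the actual numa-bench source; `pin_to_cpu`, `cpu_for_worker`, and `chew` are illustrative names.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stddef.h>
#include <stdint.h>

/* Pin the calling thread to one CPU so the scheduler cannot migrate it
 * (the "pin" parameter in the list above). */
static void pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof set, &set);
}

/* Map worker i to a CPU: either packed onto the fewest chips (cores 0
 * through k-1) or round-robin across the nodes, assuming node-major CPU
 * numbering (an assumption; real topologies vary). */
static int cpu_for_worker(int i, int rrnodes, int nnodes, int ncpus) {
    int per_node = ncpus / nnodes;
    return rrnodes ? (i % nnodes) * per_node + i / nnodes : i;
}

/* Chew through buf with opcount reads or writes, visiting word idx[j]
 * at step j.  idx is the identity permutation for a sequential scan,
 * or a shuffled permutation for random access. */
static uint64_t chew(uint64_t *buf, const size_t *idx, size_t opcount,
                     int write) {
    uint64_t sum = 0;
    for (size_t j = 0; j < opcount; j++) {
        if (write) buf[idx[j]] = j;
        else       sum += buf[idx[j]];
    }
    return sum;  /* returned so the read loop is not optimized away */
}
```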
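On the `malloc` locality question: the analysis infers local allocation from timings alone. One direct way to check it on Linux is to fault in a freshly malloc'd buffer from a pinned thread and then query page placement with `move_pages`; this is a hypothetical side experiment, not part of the benchmark, but it uses only standard libnuma machinery.

```c
/* Query where a malloc'd buffer's pages actually landed.  Linux's
 * default first-touch policy should place them on the touching
 * thread's node.  Compile with -lnuma. */
#include <numaif.h>   /* move_pages */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    enum { NPAGES = 16 };
    char *buf = malloc(NPAGES * page);
    if (!buf) return 1;
    memset(buf, 1, NPAGES * page);   /* first touch faults the pages in */

    void *pages[NPAGES];
    int status[NPAGES];
    for (int i = 0; i < NPAGES; i++)
        pages[i] = buf + i * page;

    /* With nodes == NULL, move_pages is a pure query: status[i] is
     * set to the node currently holding page i. */
    if (move_pages(0, NPAGES, pages, NULL, status, 0) == 0)
        for (int i = 0; i < NPAGES; i++)
            printf("page %d -> node %d\n", i, status[i]);
    free(buf);
    return 0;
}
```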
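On the throughput/speedup graph pairs introduced by this revision: assuming the conventional definitions (an assumption; the benchmark may normalize differently), with $k$ pinned workers each issuing $n$ operations in wall-clock time $T_k$,

$$
\mathrm{throughput}(k) = \frac{k\,n}{T_k},
\qquad
\mathrm{speedup}(k) = \frac{\mathrm{throughput}(k)}{\mathrm{throughput}(1)} = \frac{k\,T_1}{T_k}.
$$

Under these definitions the two families of graphs carry the same information: a throughput curve that goes flat shows up as a speedup curve leveling off below the ideal value of $k$.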