[Assorted-commits] SF.net SVN: assorted: [418] numa-bench/trunk
From: <yan...@us...> - 2008-02-15 01:44:16
Revision: 418
          http://assorted.svn.sourceforge.net/assorted/?rev=418&view=rev
Author:   yangzhang
Date:     2008-02-14 17:44:23 -0800 (Thu, 14 Feb 2008)

Log Message:
-----------
added analysis and publishing makefile

Added Paths:
-----------
    numa-bench/trunk/doc/
    numa-bench/trunk/doc/Makefile
    numa-bench/trunk/doc/analysis.txt

Added: numa-bench/trunk/doc/Makefile
===================================================================
--- numa-bench/trunk/doc/Makefile	(rev 0)
+++ numa-bench/trunk/doc/Makefile	2008-02-15 01:44:23 UTC (rev 418)
@@ -0,0 +1,24 @@
+PROJECT := numa-bench
+WEBDIR := assorted/htdocs/$(PROJECT)
+HTMLFRAG := ../../../assorted-site/trunk
+PANDOC = pandoc -s -S --tab-stop=2 -c ../main.css -H $(HTMLFRAG)/header.html -A $(HTMLFRAG)/google-footer.html -o $@ $^
+
+all: index.html analysis.html
+
+index.html: ../README
+	$(PANDOC)
+
+analysis.html: analysis.txt
+	$(PANDOC)
+
+publish: analysis.html index.html
+	ssh shell-sf mkdir -p $(WEBDIR)/graphs/
+	scp $^ shell-sf:$(WEBDIR)/
+
+publish-data: ../tools/graphs/*.pdf
+	scp $^ shell-sf:$(WEBDIR)/graphs/
+
+clean:
+	rm -f index.html analysis.html
+
+.PHONY: clean publish publish-data

Added: numa-bench/trunk/doc/analysis.txt
===================================================================
--- numa-bench/trunk/doc/analysis.txt	(rev 0)
+++ numa-bench/trunk/doc/analysis.txt	2008-02-15 01:44:23 UTC (rev 418)
@@ -0,0 +1,68 @@
+% NUMA Benchmarks Analysis
+% Yang Zhang
+
+The [graphs](graphs) show the results of running several different experiments.
+The results are averaged across three trials for each experiment. The
+experiments varied the following parameters:
+
+- number of threads (CPUs, 1-16, usually 16 if not testing scalability)
+- size of the memory buffer to operate on (10MB, 100MB, or 1GB)
+- number of times to repeat the operation (usually one)
+- whether to chew through the memory sequentially or with random access
+- whether to run operations in parallel on all the CPUs
+- whether to explicitly pin the threads to a CPU (usually we do)
+- whether to operate on a global buffer, on our own buffer (that we allocate
+  ourselves), or on buffers that all the other nodes allocated (for
+  cross-communication)
+- whether to perform writes to the buffer or only reads
+
+Here are some questions these results help answer:
+
+- How much does working from another node affect throughput?
+  - It doesn't make much difference for sequential scans; this shows hardware
+    prefetching (and caching) at work. It still makes [a bit of
+    difference](graphs/ncores-16-size-100000000-nreps-1-shuffle-0-par-0-pin-1-local-0-write-1-cross-0.pdf).
+  - However, for random accesses, the difference is much more
+    [pronounced](graphs/ncores-16-size-100000000-nreps-1-shuffle-1-par-0-pin-1-local-0-write-1-cross-0.pdf).
+- How much difference is there between sequential scan and random access?
+  - A substantial one; random access also magnifies NUMA effects. Compare
+    [a](graphs/ncores-16-size-100000000-nreps-1-shuffle-0-par-0-pin-1-local-0-write-1-cross-0.pdf)
+    and
+    [b](graphs/ncores-16-size-100000000-nreps-1-shuffle-1-par-0-pin-1-local-0-write-1-cross-0.pdf).
+- Read vs. write
+  - Substantial difference. Random writes are ~2x slower than random reads.
+  - Compare
+    [a](graphs/ncores-16-size-1000000000-nreps-1-shuffle-0-par-0-pin-1-local-0-write-0-cross-0.pdf)
+    and
+    [b](graphs/ncores-16-size-1000000000-nreps-1-shuffle-0-par-0-pin-1-local-0-write-1-cross-0.pdf).
+- Does `malloc` tend to allocate locally?
+  - Yes: working with memory allocated from the current thread shows improved
+    times.
+- Scalability of: cross-node memory writes vs. shared memory writes vs. local node memory writes
+  - Graphs for each of these:
+    [a](graphs/scaling-size-10000000-nreps-1-shuffle-0-par-1-pin-1-local-0-write-1-cross-1.pdf)
+    vs.
+    [b](graphs/scaling-size-10000000-nreps-1-shuffle-0-par-1-pin-1-local-0-write-1-cross-0.pdf)
+    vs.
+    [c](graphs/scaling-size-10000000-nreps-1-shuffle-0-par-1-pin-1-local-1-write-1-cross-0.pdf)
+  - Local memory node access is best but still has trouble scaling: the time
+    levels off after some point. This is probably because increasing the
+    number of cores spreads the load toward a uniform distribution across
+    the nodes.
+- Scalability of: cross-node memory reads vs. shared memory reads vs. local node memory reads
+  - Graphs for each of these:
+    [a](graphs/scaling-size-10000000-nreps-1-shuffle-0-par-1-pin-1-local-0-write-0-cross-1.pdf)
+    vs.
+    [b](graphs/scaling-size-10000000-nreps-1-shuffle-0-par-1-pin-1-local-0-write-0-cross-0.pdf)
+    vs.
+    [c](graphs/scaling-size-10000000-nreps-1-shuffle-0-par-1-pin-1-local-1-write-0-cross-0.pdf)
+  - Cross-communicating performs worse, and local memory node access performs
+    the same as shared memory access. This is expected, since we aren't
+    performing writes, so the data is freely replicated to all caches (the
+    same reason there is little difference between the non-parallel reads
+    from local vs. remote).
+
+There's still quite a bit of room to fill out this test suite. For instance,
+the experiments varying the number of cores all exercise the fewest possible
+chips; the results may be quite different for tests that distribute the loaded
+cores across all chips.
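The "pin" and "local buffer" parameters interact with Linux's default first-touch placement, which is also why `malloc` appears to allocate locally. Below is a minimal sketch of that setup, assuming Linux/glibc and a `-pthread` build; the function and variable names are illustrative and not taken from the numa-bench sources.

```c
/* Sketch: pin a worker to one CPU, then malloc and touch a buffer from that
 * thread so first-touch placement puts its pages on the CPU's local node.
 * Illustrative only; not the numa-bench code. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdlib.h>
#include <string.h>

static void *pinned_worker(void *arg) {
    int cpu = (int)(long)arg;

    /* Restrict this thread to a single CPU. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof set, &set);

    /* Pages land on the node of the CPU that first writes them, so
     * malloc + memset from a pinned thread yields node-local memory. */
    size_t size = 100 * 1000 * 1000;   /* 100MB, one of the tested sizes */
    char *buf = malloc(size);
    memset(buf, 0, size);

    /* ... timed sequential or shuffled read/write passes over buf ... */

    free(buf);
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, pinned_worker, (void *)0L);  /* pin to CPU 0 */
    pthread_join(t, NULL);
    return 0;
}
```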
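The "shuffle" and "write" parameters come down to the order in which the buffer's words are visited and whether each visit loads or stores. A rough sketch of one way to implement such a pass follows, using a shuffled index array to defeat the hardware prefetcher; it is an illustration under those assumptions, not the actual numa-bench implementation.

```c
/* Sketch: visit every word of a buffer once, sequentially or in a shuffled
 * order, as either a read pass or a write pass. Illustrative only. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static uint64_t chew(uint64_t *buf, size_t nwords, int shuffle, int write) {
    size_t *order = malloc(nwords * sizeof *order);
    for (size_t i = 0; i < nwords; i++)
        order[i] = i;
    if (shuffle) {
        /* Fisher-Yates shuffle of the visit order. */
        for (size_t i = nwords - 1; i > 0; i--) {
            size_t j = (size_t)rand() % (i + 1);
            size_t tmp = order[i];
            order[i] = order[j];
            order[j] = tmp;
        }
    }
    uint64_t sum = 0;
    for (size_t i = 0; i < nwords; i++) {
        if (write)
            buf[order[i]] += 1;        /* write pass */
        else
            sum += buf[order[i]];      /* read pass */
    }
    free(order);
    return sum;  /* returned so the reads aren't optimized away */
}

int main(void) {
    size_t nwords = 100 * 1000 * 1000 / sizeof(uint64_t);  /* ~100MB buffer */
    uint64_t *buf = calloc(nwords, sizeof *buf);
    printf("%llu\n", (unsigned long long)chew(buf, nwords, 1, 0));
    free(buf);
    return 0;
}
```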
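For the cross-communication runs, the buffer has to live on another node rather than wherever first touch puts it. One way to force such placement is libnuma's explicit per-node allocator, sketched below; this assumes Linux with libnuma installed (link with `-lnuma`) and is likewise not taken from the numa-bench sources.

```c
/* Sketch: allocate a buffer on a specific (remote) node with libnuma so a
 * thread pinned elsewhere must access it across nodes. Illustrative only. */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this machine\n");
        return 1;
    }
    int last = numa_max_node();       /* highest node number */
    size_t size = 10 * 1000 * 1000;   /* 10MB, the smallest tested size */

    /* Place the pages explicitly on the last node; a thread pinned to a CPU
     * on node 0 would then read or write this buffer remotely. */
    char *remote = numa_alloc_onnode(size, last);
    memset(remote, 0, size);

    /* ... timed sequential or shuffled read/write passes over remote ... */

    numa_free(remote, size);
    return 0;
}
```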