[Assorted-commits] SF.net SVN: assorted: [418] numa-bench/trunk
From: <yan...@us...> - 2008-02-15 01:44:16
Revision: 418
          http://assorted.svn.sourceforge.net/assorted/?rev=418&view=rev
Author:   yangzhang
Date:     2008-02-14 17:44:23 -0800 (Thu, 14 Feb 2008)

Log Message:
-----------
added analysis and publishing makefile

Added Paths:
-----------
    numa-bench/trunk/doc/
    numa-bench/trunk/doc/Makefile
    numa-bench/trunk/doc/analysis.txt

Added: numa-bench/trunk/doc/Makefile
===================================================================
--- numa-bench/trunk/doc/Makefile	(rev 0)
+++ numa-bench/trunk/doc/Makefile	2008-02-15 01:44:23 UTC (rev 418)
@@ -0,0 +1,24 @@
+PROJECT := numa-bench
+WEBDIR := assorted/htdocs/$(PROJECT)
+HTMLFRAG := ../../../assorted-site/trunk
+PANDOC = pandoc -s -S --tab-stop=2 -c ../main.css -H $(HTMLFRAG)/header.html -A $(HTMLFRAG)/google-footer.html -o $@ $^
+
+all: index.html analysis.html
+
+index.html: ../README
+	$(PANDOC)
+
+analysis.html: analysis.txt
+	$(PANDOC)
+
+publish: analysis.html index.html
+	ssh shell-sf mkdir -p $(WEBDIR)/graphs/
+	scp $^ shell-sf:$(WEBDIR)/
+
+publish-data: ../tools/graphs/*.pdf
+	scp $^ shell-sf:$(WEBDIR)/graphs/
+
+clean:
+	rm -f index.html analysis.html
+
+.PHONY: clean publish publish-data

Added: numa-bench/trunk/doc/analysis.txt
===================================================================
--- numa-bench/trunk/doc/analysis.txt	(rev 0)
+++ numa-bench/trunk/doc/analysis.txt	2008-02-15 01:44:23 UTC (rev 418)
@@ -0,0 +1,68 @@
+% NUMA Benchmarks Analysis
+% Yang Zhang
+
+The [graphs](graphs) show the results of running several different experiments.
+The results are averaged across three trials for each experiment. The
+experiments varied the following parameters:
+
+- number of threads (CPUs, 1-16, usually 16 if not testing scalability)
+- size of the memory buffer to operate on (10MB, 100MB, or 1GB)
+- number of times to repeat the operation (usually one)
+- whether to chew through the memory sequentially or with random access
+- whether to run operations in parallel on all the CPUs
+- whether to explicitly pin the threads to a CPU (usually we do)
+- whether to operate on a global buffer, on our own buffer (that we allocate
+  ourselves), or on buffers that all the other nodes allocated (for
+  cross-communication)
+- whether to perform writes to the buffer or only reads
+
+Here are some questions these results help answer:
+
+- How much does working from another node affect throughput?
+  - It doesn't make much difference for sequential scans; this shows hardware
+    prefetching (and caching) at work. It still makes [a bit of
+    difference](graphs/ncores-16-size-100000000-nreps-1-shuffle-0-par-0-pin-1-local-0-write-1-cross-0.pdf).
+  - However, for random accesses, the difference is much more
+    [pronounced](graphs/ncores-16-size-100000000-nreps-1-shuffle-1-par-0-pin-1-local-0-write-1-cross-0.pdf).
+- How much difference is there between sequential scan and random access?
+  - A substantial one; random access also magnifies NUMA effects. Compare
+    [a](graphs/ncores-16-size-100000000-nreps-1-shuffle-0-par-0-pin-1-local-0-write-1-cross-0.pdf)
+    and
+    [b](graphs/ncores-16-size-100000000-nreps-1-shuffle-1-par-0-pin-1-local-0-write-1-cross-0.pdf).
+- Read vs. write
+  - Substantial difference. Random writes are ~2x slower than random reads.
+  - Compare
+    [a](graphs/ncores-16-size-1000000000-nreps-1-shuffle-0-par-0-pin-1-local-0-write-0-cross-0.pdf)
+    and
+    [b](graphs/ncores-16-size-1000000000-nreps-1-shuffle-0-par-0-pin-1-local-0-write-1-cross-0.pdf).
+- Does `malloc` tend to allocate locally?
+  - Yes: working with memory allocated from the current thread shows improved
+    times.
+- Scalability of: cross-node memory writes vs. shared memory writes vs. local node memory writes
+  - Graphs for each of these:
+    [a](graphs/scaling-size-10000000-nreps-1-shuffle-0-par-1-pin-1-local-0-write-1-cross-1.pdf)
+    vs.
+    [b](graphs/scaling-size-10000000-nreps-1-shuffle-0-par-1-pin-1-local-0-write-1-cross-0.pdf)
+    vs.
+    [c](graphs/scaling-size-10000000-nreps-1-shuffle-0-par-1-pin-1-local-1-write-1-cross-0.pdf)
+  - Local memory node access is best but still has trouble scaling: the time
+    levels off after some point. This is probably because increasing the
+    number of cores spreads the load toward a uniform distribution across
+    the nodes.
+- Scalability of: cross-node memory reads vs. shared memory reads vs. local node memory reads
+  - Graphs for each of these:
+    [a](graphs/scaling-size-10000000-nreps-1-shuffle-0-par-1-pin-1-local-0-write-0-cross-1.pdf)
+    vs.
+    [b](graphs/scaling-size-10000000-nreps-1-shuffle-0-par-1-pin-1-local-0-write-0-cross-0.pdf)
+    vs.
+    [c](graphs/scaling-size-10000000-nreps-1-shuffle-0-par-1-pin-1-local-1-write-0-cross-0.pdf)
+  - Cross-communicating performs worse, and local memory node access performs
+    the same as shared memory access. This is expected, since we aren't
+    performing writes, so the data is freely replicated to all caches (the
+    same reason there is little difference between the non-parallel reads
+    from local vs. remote).
+
+There's still quite a bit of room to fill out this test suite. For instance,
+the experiments varying the number of cores all exercise the fewest possible
+chips; the results may be quite different for tests that distribute the loaded
+cores across all chips.
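The "pin" and "local buffer" parameters interact with Linux's default first-touch placement, which is also why `malloc` appears to allocate locally. Below is a minimal sketch of that setup, assuming Linux/glibc and a `-pthread` build; the function and variable names are illustrative and not taken from the numa-bench sources.

```c
/* Sketch: pin a worker to one CPU, then malloc and touch a buffer from that
 * thread so first-touch placement puts its pages on the CPU's local node.
 * Illustrative only; not the numa-bench code. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdlib.h>
#include <string.h>

static void *pinned_worker(void *arg) {
    int cpu = (int)(long)arg;

    /* Restrict this thread to a single CPU. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof set, &set);

    /* Pages land on the node of the CPU that first writes them, so
     * malloc + memset from a pinned thread yields node-local memory. */
    size_t size = 100 * 1000 * 1000;   /* 100MB, one of the tested sizes */
    char *buf = malloc(size);
    memset(buf, 0, size);

    /* ... timed sequential or shuffled read/write passes over buf ... */

    free(buf);
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, pinned_worker, (void *)0L);  /* pin to CPU 0 */
    pthread_join(t, NULL);
    return 0;
}
```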
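The "shuffle" and "write" parameters come down to the order in which the buffer's words are visited and whether each visit loads or stores. A rough sketch of one way to implement such a pass follows, using a shuffled index array to defeat the hardware prefetcher; it is an illustration under those assumptions, not the actual numa-bench implementation.

```c
/* Sketch: visit every word of a buffer once, sequentially or in a shuffled
 * order, as either a read pass or a write pass. Illustrative only. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static uint64_t chew(uint64_t *buf, size_t nwords, int shuffle, int write) {
    size_t *order = malloc(nwords * sizeof *order);
    for (size_t i = 0; i < nwords; i++)
        order[i] = i;
    if (shuffle) {
        /* Fisher-Yates shuffle of the visit order. */
        for (size_t i = nwords - 1; i > 0; i--) {
            size_t j = (size_t)rand() % (i + 1);
            size_t tmp = order[i];
            order[i] = order[j];
            order[j] = tmp;
        }
    }
    uint64_t sum = 0;
    for (size_t i = 0; i < nwords; i++) {
        if (write)
            buf[order[i]] += 1;        /* write pass */
        else
            sum += buf[order[i]];      /* read pass */
    }
    free(order);
    return sum;  /* returned so the reads aren't optimized away */
}

int main(void) {
    size_t nwords = 100 * 1000 * 1000 / sizeof(uint64_t);  /* ~100MB buffer */
    uint64_t *buf = calloc(nwords, sizeof *buf);
    printf("%llu\n", (unsigned long long)chew(buf, nwords, 1, 0));
    free(buf);
    return 0;
}
```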
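For the cross-communication runs, the buffer has to live on another node rather than wherever first touch puts it. One way to force such placement is libnuma's explicit per-node allocator, sketched below; this assumes Linux with libnuma installed (link with `-lnuma`) and is likewise not taken from the numa-bench sources.

```c
/* Sketch: allocate a buffer on a specific (remote) node with libnuma so a
 * thread pinned elsewhere must access it across nodes. Illustrative only. */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this machine\n");
        return 1;
    }
    int last = numa_max_node();       /* highest node number */
    size_t size = 10 * 1000 * 1000;   /* 10MB, the smallest tested size */

    /* Place the pages explicitly on the last node; a thread pinned to a CPU
     * on node 0 would then read or write this buffer remotely. */
    char *remote = numa_alloc_onnode(size, last);
    memset(remote, 0, size);

    /* ... timed sequential or shuffled read/write passes over remote ... */

    numa_free(remote, size);
    return 0;
}
```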