[Assorted-commits] SF.net SVN: assorted: [604] numa-bench/trunk/doc/analysis.txt
From: <yan...@us...> - 2008-03-05 06:11:34
Revision: 604
          http://assorted.svn.sourceforge.net/assorted/?rev=604&view=rev
Author:   yangzhang
Date:     2008-03-04 22:11:35 -0800 (Tue, 04 Mar 2008)

Log Message:
-----------
updated for new results

Modified Paths:
--------------
    numa-bench/trunk/doc/analysis.txt

Modified: numa-bench/trunk/doc/analysis.txt
===================================================================
--- numa-bench/trunk/doc/analysis.txt	2008-03-05 06:11:23 UTC (rev 603)
+++ numa-bench/trunk/doc/analysis.txt	2008-03-05 06:11:35 UTC (rev 604)
@@ -1,6 +1,10 @@
 % NUMA Benchmarks Analysis
 % Yang Zhang
 
+__Updates__
+
+- 3/4/08: updated scalability experiments
+
 All tests were performed on `josmp.csail.mit.edu`. The [graphs](graphs) show
 the results of running several different experiments. The results are averaged
 across three trials for each experiment. The experiments varied the following
@@ -8,14 +12,15 @@
 - number of threads (CPUs, 1-16, usually 16 if not testing scalability)
 - size of the memory buffer to operate on (10MB, 100MB, or 1GB)
-- number of times to repeat the operation (usually one)
+- number of operations, i.e. reads/writes (frequently 10 million)
+- number of times to repeat the chewing (usually 1)
 - whether to chew through the memory sequentially or using random access
 - whether to run operations in parallel on all the CPUs
 - whether to explicitly pin the threads to a CPU (usually we do)
 - whether to operate on a global buffer or on our own buffer (that we
   allocate ourselves) or on buffers that all other nodes allocated (for
   cross-communication)
-- whether to perform writes to the buffer, otherwise just read
+- whether operations are writes or reads
 - in experiments varying the number of cores $k$ working concurrently:
   whether we're using cores 1 through $k$ or cores across the nodes in
   round-robin fashion
 
@@ -25,48 +30,71 @@
 - How much does working from another node affect throughput?
   - It doesn't make much difference for sequential scans - this shows
     hardware prefetching (and caching) at work. It still makes [a bit of
-    difference](graphs/ncores-16-size-100000000-nreps-1-shuffle-0-par-0-pin-1-local-0-write-1-cross-0.pdf).
+    difference](graphs/nworkers-16-size-100000000-nreps-1-shuffle-0-par-0-pin-1-local-0-write-1-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf).
   - However, for random accesses, the difference is much more
-    [pronounced](graphs/ncores-16-size-100000000-nreps-1-shuffle-1-par-0-pin-1-local-0-write-1-cross-0.pdf).
+    [pronounced](graphs/nworkers-16-size-100000000-nreps-1-shuffle-1-par-0-pin-1-local-0-write-1-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf).
 - How much difference is there between sequential scan and random access?
   - Substantial difference. Also magnifies NUMA effects. Compare
-    [a](graphs/ncores-16-size-100000000-nreps-1-shuffle-0-par-0-pin-1-local-0-write-1-cross-0.pdf)
+    [a](graphs/nworkers-16-size-100000000-nreps-1-shuffle-0-par-0-pin-1-local-0-write-1-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
     and
-    [b](graphs/ncores-16-size-100000000-nreps-1-shuffle-1-par-0-pin-1-local-0-write-1-cross-0.pdf)
+    [b](graphs/nworkers-16-size-100000000-nreps-1-shuffle-1-par-0-pin-1-local-0-write-1-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
 - Read vs. write
-  - Substantial difference. Random writes are ~2x slower than random reads.
+  - Substantial difference. Random writes are over 2x slower than random reads.
   - Compare
-    [a](graphs/ncores-16-size-100000000-nreps-1-shuffle-1-par-0-pin-1-local-0-write-0-cross-0.pdf)
+    [a](graphs/nworkers-16-size-100000000-nreps-1-shuffle-1-par-0-pin-1-local-0-write-0-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
     and
-    [b](graphs/ncores-16-size-100000000-nreps-1-shuffle-1-par-0-pin-1-local-0-write-1-cross-0.pdf)
+    [b](graphs/nworkers-16-size-100000000-nreps-1-shuffle-1-par-0-pin-1-local-0-write-1-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
 - Does `malloc` tend to allocate locally?
   - Yes, because working with memory allocated from the current thread shows
     improved times.
 - Scalability of: cross-node memory writes vs. shared memory writes vs. local node memory writes
-  - Graphs for each of these:
-    [a](graphs/scaling-size-10000000-nreps-1-shuffle-0-par-1-pin-1-local-0-write-1-cross-1.pdf)
+  - Throughputs for sequential scans:
+    [a](graphs/scaling-size-1000000000-opcount-10000000-nreps-1-shuffle-0-par-1-pin-1-local-0-write-1-cross-1-rrnodes-1-nnodes-4-ncpus-16.pdf)
     vs.
-    [b](graphs/scaling-size-10000000-nreps-1-shuffle-0-par-1-pin-1-local-0-write-1-cross-0.pdf)
+    [b](graphs/scaling-size-1000000000-opcount-10000000-nreps-1-shuffle-0-par-1-pin-1-local-0-write-1-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
     vs.
-    [c](graphs/scaling-size-10000000-nreps-1-shuffle-0-par-1-pin-1-local-1-write-1-cross-0.pdf)
-  - Local memory node access is best but still has problems scaling. The time
-    remains constant after some point. This is probably because increasing the
-    number of cores causes the load distribution to approach a more uniform
-    distribution.
+    [c](graphs/scaling-size-1000000000-opcount-10000000-nreps-1-shuffle-0-par-1-pin-1-local-1-write-1-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
+  - Speedup graphs:
+    [a](graphs/speedup-size-1000000000-opcount-10000000-nreps-1-shuffle-0-par-1-pin-1-local-0-write-1-cross-1-rrnodes-1-nnodes-4-ncpus-16.pdf)
+    vs.
+    [b](graphs/speedup-size-1000000000-opcount-10000000-nreps-1-shuffle-0-par-1-pin-1-local-0-write-1-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
+    vs.
+    [c](graphs/speedup-size-1000000000-opcount-10000000-nreps-1-shuffle-0-par-1-pin-1-local-1-write-1-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
+  - Throughputs for random access:
+    [a](graphs/scaling-size-1000000000-opcount-10000000-nreps-1-shuffle-1-par-1-pin-1-local-0-write-1-cross-1-rrnodes-1-nnodes-4-ncpus-16.pdf)
+    vs.
+    [b](graphs/scaling-size-1000000000-opcount-10000000-nreps-1-shuffle-1-par-1-pin-1-local-0-write-1-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
+    vs.
+    [c](graphs/scaling-size-1000000000-opcount-10000000-nreps-1-shuffle-1-par-1-pin-1-local-1-write-1-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
+  - Speedup graphs:
+    [a](graphs/speedup-size-1000000000-opcount-10000000-nreps-1-shuffle-1-par-1-pin-1-local-0-write-1-cross-1-rrnodes-1-nnodes-4-ncpus-16.pdf)
+    vs.
+    [b](graphs/speedup-size-1000000000-opcount-10000000-nreps-1-shuffle-1-par-1-pin-1-local-0-write-1-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
+    vs.
+    [c](graphs/speedup-size-1000000000-opcount-10000000-nreps-1-shuffle-1-par-1-pin-1-local-1-write-1-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
 - Scalability of: cross-node memory reads vs. shared memory reads vs. local node memory reads
-  - Graphs for each of these:
-    [a](graphs/scaling-size-10000000-nreps-1-shuffle-0-par-1-pin-1-local-0-write-0-cross-1.pdf)
+  - Throughputs for sequential scans:
+    [a](graphs/scaling-size-1000000000-opcount-10000000-nreps-1-shuffle-0-par-1-pin-1-local-0-write-0-cross-1-rrnodes-1-nnodes-4-ncpus-16.pdf)
     vs.
-    [b](graphs/scaling-size-10000000-nreps-1-shuffle-0-par-1-pin-1-local-0-write-0-cross-0.pdf)
+    [b](graphs/scaling-size-1000000000-opcount-10000000-nreps-1-shuffle-0-par-1-pin-1-local-0-write-0-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
     vs.
-    [c](graphs/scaling-size-10000000-nreps-1-shuffle-0-par-1-pin-1-local-1-write-0-cross-0.pdf)
-  - Cross-communicating performs worse, and local memory node access performs
-    the same as shared memory access. This is expected, since we aren't
-    performing writes, so the data is freely replicated to all caches (same
-    reason that there is little difference between the non-parallel reads from
-    local vs. remote).
+    [c](graphs/scaling-size-1000000000-opcount-10000000-nreps-1-shuffle-0-par-1-pin-1-local-1-write-0-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
+  - Speedup graphs:
+    [a](graphs/speedup-size-1000000000-opcount-10000000-nreps-1-shuffle-0-par-1-pin-1-local-0-write-0-cross-1-rrnodes-1-nnodes-4-ncpus-16.pdf)
+    vs.
+    [b](graphs/speedup-size-1000000000-opcount-10000000-nreps-1-shuffle-0-par-1-pin-1-local-0-write-0-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
+    vs.
+    [c](graphs/speedup-size-1000000000-opcount-10000000-nreps-1-shuffle-0-par-1-pin-1-local-1-write-0-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
+  - Throughputs for random access:
+    [a](graphs/scaling-size-1000000000-opcount-10000000-nreps-1-shuffle-1-par-1-pin-1-local-0-write-0-cross-1-rrnodes-1-nnodes-4-ncpus-16.pdf)
+    vs.
+    [b](graphs/scaling-size-1000000000-opcount-10000000-nreps-1-shuffle-1-par-1-pin-1-local-0-write-0-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
+    vs.
+    [c](graphs/scaling-size-1000000000-opcount-10000000-nreps-1-shuffle-1-par-1-pin-1-local-1-write-0-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
+  - Speedup graphs:
+    [a](graphs/speedup-size-1000000000-opcount-10000000-nreps-1-shuffle-1-par-1-pin-1-local-0-write-0-cross-1-rrnodes-1-nnodes-4-ncpus-16.pdf)
+    vs.
+    [b](graphs/speedup-size-1000000000-opcount-10000000-nreps-1-shuffle-1-par-1-pin-1-local-0-write-0-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
+    vs.
+    [c](graphs/speedup-size-1000000000-opcount-10000000-nreps-1-shuffle-1-par-1-pin-1-local-1-write-0-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
 
-There's still quite a bit of room to fill out this test suite. For instance,
-the experiments varying the number of cores all exercise the fewest number of
-chips; the results may be quite different for tests that distribute the loaded
-cores across all chips.
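
A note on the kernel these parameters drive: the parameter list in the diff above (pinning, sequential vs. shuffled chewing, reads vs. writes, packed vs. round-robin core placement) maps onto a fairly small inner loop. The sketch below is a hypothetical reconstruction under those assumptions, not the actual numa-bench source; `pin_to_cpu`, `cpu_for_worker`, and `chew` are illustrative names.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stddef.h>
#include <stdint.h>

/* Pin the calling thread to one CPU so the scheduler cannot migrate it
 * (the "pin" parameter in the list above). */
static void pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof set, &set);
}

/* Map worker i to a CPU: either packed onto the fewest chips (cores 0
 * through k-1) or round-robin across the nodes, assuming node-major CPU
 * numbering (an assumption; real topologies vary). */
static int cpu_for_worker(int i, int rrnodes, int nnodes, int ncpus) {
    int per_node = ncpus / nnodes;
    return rrnodes ? (i % nnodes) * per_node + i / nnodes : i;
}

/* Chew through buf with opcount reads or writes, visiting word idx[j]
 * at step j.  idx is the identity permutation for a sequential scan,
 * or a shuffled permutation for random access. */
static uint64_t chew(uint64_t *buf, const size_t *idx, size_t opcount,
                     int write) {
    uint64_t sum = 0;
    for (size_t j = 0; j < opcount; j++) {
        if (write) buf[idx[j]] = j;
        else       sum += buf[idx[j]];
    }
    return sum;  /* returned so the read loop is not optimized away */
}
```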
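On the `malloc` locality question: the analysis infers local allocation from timings alone. One direct way to check it on Linux is to fault in a freshly malloc'd buffer from a pinned thread and then query page placement with `move_pages`; this is a hypothetical side experiment, not part of the benchmark, but it uses only standard libnuma machinery.

```c
/* Query where a malloc'd buffer's pages actually landed.  Linux's
 * default first-touch policy should place them on the touching
 * thread's node.  Compile with -lnuma. */
#include <numaif.h>   /* move_pages */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    enum { NPAGES = 16 };
    char *buf = malloc(NPAGES * page);
    if (!buf) return 1;
    memset(buf, 1, NPAGES * page);   /* first touch faults the pages in */

    void *pages[NPAGES];
    int status[NPAGES];
    for (int i = 0; i < NPAGES; i++)
        pages[i] = buf + i * page;

    /* With nodes == NULL, move_pages is a pure query: status[i] is
     * set to the node currently holding page i. */
    if (move_pages(0, NPAGES, pages, NULL, status, 0) == 0)
        for (int i = 0; i < NPAGES; i++)
            printf("page %d -> node %d\n", i, status[i]);
    free(buf);
    return 0;
}
```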
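On the throughput/speedup graph pairs introduced by this revision: assuming the conventional definitions (an assumption; the benchmark may normalize differently), with $k$ pinned workers each issuing $n$ operations in wall-clock time $T_k$,

$$
\mathrm{throughput}(k) = \frac{k\,n}{T_k},
\qquad
\mathrm{speedup}(k) = \frac{\mathrm{throughput}(k)}{\mathrm{throughput}(1)} = \frac{k\,T_1}{T_k}.
$$

Under these definitions the two families of graphs carry the same information: a throughput curve that goes flat shows up as a speedup curve leveling off below the ideal value of $k$.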