Thread: [Assorted-commits] SF.net SVN: assorted: [451] numa-bench/trunk/doc/analysis.txt
From: <yan...@us...> - 2008-02-15 15:54:16
Revision: 451
          http://assorted.svn.sourceforge.net/assorted/?rev=451&view=rev
Author:   yangzhang
Date:     2008-02-15 07:54:21 -0800 (Fri, 15 Feb 2008)

Log Message:
-----------
noted josmp

Modified Paths:
--------------
    numa-bench/trunk/doc/analysis.txt

Modified: numa-bench/trunk/doc/analysis.txt
===================================================================
--- numa-bench/trunk/doc/analysis.txt	2008-02-15 06:39:22 UTC (rev 450)
+++ numa-bench/trunk/doc/analysis.txt	2008-02-15 15:54:21 UTC (rev 451)
@@ -66,3 +66,5 @@
 the experiments varying the number of cores all exercise the fewest number of
 chips; the results may be quite different for tests that distribute the loaded
 cores across all chips.
+
+*Update*: all tests were performed on josmp.csail.mit.edu.
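[Editor's note] The caveat in this diff, that the core-scaling runs exercise the fewest chips, depends on how CPU numbers map onto chips on josmp. A hedged sketch (not part of the benchmark, and assuming the libnuma 2.x API) for printing that mapping before deciding how to spread loaded cores:

```c
/* Print the CPU-to-NUMA-node layout, e.g. to see which cores share a
 * chip on a box like josmp. Illustrative only; build with:
 *   gcc topo.c -lnuma */
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this machine\n");
        return 1;
    }
    int nnodes = numa_max_node() + 1;
    int ncpus = numa_num_configured_cpus();
    printf("%d nodes, %d cpus\n", nnodes, ncpus);
    for (int cpu = 0; cpu < ncpus; cpu++)
        printf("cpu %2d -> node %d\n", cpu, numa_node_of_cpu(cpu));
    return 0;
}
```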
From: <yan...@us...> - 2008-02-26 20:49:51
Revision: 517
          http://assorted.svn.sourceforge.net/assorted/?rev=517&view=rev
Author:   yangzhang
Date:     2008-02-26 12:49:46 -0800 (Tue, 26 Feb 2008)

Log Message:
-----------
fixed links

Modified Paths:
--------------
    numa-bench/trunk/doc/analysis.txt

Modified: numa-bench/trunk/doc/analysis.txt
===================================================================
--- numa-bench/trunk/doc/analysis.txt	2008-02-26 20:48:46 UTC (rev 516)
+++ numa-bench/trunk/doc/analysis.txt	2008-02-26 20:49:46 UTC (rev 517)
@@ -32,9 +32,9 @@
 - Read vs. write
     - Substantial difference. Random writes are ~2x slower than random reads.
     - Compare
-      [a](graphs/ncores-16-size-1000000000-nreps-1-shuffle-0-par-0-pin-1-local-0-write-0-cross-0.pdf)
+      [a](graphs/ncores-16-size-100000000-nreps-1-shuffle-1-par-0-pin-1-local-0-write-0-cross-0.pdf)
       and
-      [b](graphs/ncores-16-size-1000000000-nreps-1-shuffle-0-par-0-pin-1-local-0-write-1-cross-0.pdf)
+      [b](graphs/ncores-16-size-100000000-nreps-1-shuffle-1-par-0-pin-1-local-0-write-1-cross-0.pdf)
 - Does `malloc` tend to allocate locally?
     - Yes, because working with memory allocated from the current thread shows
       improved times.
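[Editor's note] The `malloc`-locality answer in this diff is what Linux's first-touch page placement predicts: `malloc` only reserves virtual address space, and physical pages are committed on the NUMA node of the CPU that first writes them. A minimal sketch (illustrative, not the benchmark's actual source) that pins itself, touches a buffer, and asks the kernel where the pages landed:

```c
/* First-touch demo: pin to CPU 0, touch a buffer, then query the node
 * that the first page landed on. Build with:
 *   gcc -D_GNU_SOURCE first_touch.c -lnuma */
#include <numaif.h>     /* get_mempolicy, MPOL_F_NODE, MPOL_F_ADDR */
#include <sched.h>      /* sched_setaffinity, CPU_ZERO, CPU_SET */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    /* Pin to CPU 0 so the first touch comes from a known node. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    sched_setaffinity(0, sizeof set, &set);

    /* malloc reserves address space; the memset is the first touch that
     * commits physical pages, on the touching CPU's node. */
    size_t size = 100 * 1000 * 1000;   /* 100MB, as in the experiments */
    char *buf = malloc(size);
    memset(buf, 1, size);

    int node = -1;
    get_mempolicy(&node, NULL, 0, buf, MPOL_F_NODE | MPOL_F_ADDR);
    printf("first page of buf is on node %d\n", node);

    free(buf);
    return 0;
}
```

Running this from threads pinned to different CPUs is one way to confirm the "allocated from the current thread shows improved times" observation.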
From: <yan...@us...> - 2008-02-29 16:34:23
Revision: 546
          http://assorted.svn.sourceforge.net/assorted/?rev=546&view=rev
Author:   yangzhang
Date:     2008-02-29 08:34:24 -0800 (Fri, 29 Feb 2008)

Log Message:
-----------
updates to analysis

Modified Paths:
--------------
    numa-bench/trunk/doc/analysis.txt

Modified: numa-bench/trunk/doc/analysis.txt
===================================================================
--- numa-bench/trunk/doc/analysis.txt	2008-02-29 16:34:11 UTC (rev 545)
+++ numa-bench/trunk/doc/analysis.txt	2008-02-29 16:34:24 UTC (rev 546)
@@ -1,9 +1,10 @@
 % NUMA Benchmarks Analysis
 % Yang Zhang
 
-The [graphs](graphs) show the results of running several different experiments. The
-results are averaged across three trials for each experiment. The experiments
-varied the following parameters:
+All tests were performed on `josmp.csail.mit.edu`. The [graphs](graphs) show
+the results of running several different experiments. The results are averaged
+across three trials for each experiment. The experiments varied the following
+parameters:
 
 - number of threads (CPUs, 1-16, usually 16 if not testing scalability)
 - size of the memory buffer to operate on (10MB, 100MB, or 1GB)
@@ -15,6 +16,9 @@
   ourselves) or on buffers that all other nodes allocated (for
   cross-communication)
 - whether to perform writes to the buffer, otherwise just read
+- in experiments varying the number of cores $k$ working concurrently: whether
+  we're using cores 1 through $k$ or cores across the nodes in round-robin
+  fashion
 
 Here are some questions these results help answer:
 
@@ -66,5 +70,3 @@
 the experiments varying the number of cores all exercise the fewest number of
 chips; the results may be quite different for tests that distribute the loaded
 cores across all chips.
-
-*Update*: all tests were performed on josmp.csail.mit.edu.
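[Editor's note] The new round-robin parameter in this diff contrasts two placements of $k$ workers: packing cores 1 through $k$ onto the fewest chips versus striping them across the nodes. A hedged sketch of both policies; the node-major CPU numbering (cpu = node * cores_per_node + slot) is an assumption for illustration, not something the commit states, and real code should verify it against the actual topology (e.g. with libnuma):

```c
/* Two core-placement policies for k pinned workers. Build with:
 *   gcc -D_GNU_SOURCE rr.c -lpthread */
#include <pthread.h>
#include <sched.h>

#define NNODES 4    /* josmp: 4 chips... */
#define NCPUS  16   /* ...of 4 cores each (assumed layout) */

/* Worker i's CPU: sequential packs chip 0 first; round-robin stripes
 * workers across all chips. */
static int cpu_for_worker(int i, int rrnodes) {
    int per_node = NCPUS / NNODES;
    return rrnodes ? (i % NNODES) * per_node + i / NNODES : i;
}

static void *worker(void *arg) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET((int)(long)arg, &set);
    pthread_setaffinity_np(pthread_self(), sizeof set, &set);
    /* ... chew through the buffer from this CPU ... */
    return NULL;
}

int main(void) {
    int k = 8, rrnodes = 1;   /* e.g. 8 workers striped across 4 nodes */
    pthread_t t[NCPUS];
    for (long i = 0; i < k; i++)
        pthread_create(&t[i], NULL, worker,
                       (void *)(long)cpu_for_worker((int)i, rrnodes));
    for (int i = 0; i < k; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```

With `rrnodes` set, 8 workers land on CPUs 0, 4, 8, 12, 1, 5, 9, 13 (two per node) instead of 0 through 7 (two whole chips), which is exactly the difference the scaling graphs vary.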
From: <yan...@us...> - 2008-03-05 06:11:34
Revision: 604
          http://assorted.svn.sourceforge.net/assorted/?rev=604&view=rev
Author:   yangzhang
Date:     2008-03-04 22:11:35 -0800 (Tue, 04 Mar 2008)

Log Message:
-----------
updated for new results

Modified Paths:
--------------
    numa-bench/trunk/doc/analysis.txt

Modified: numa-bench/trunk/doc/analysis.txt
===================================================================
--- numa-bench/trunk/doc/analysis.txt	2008-03-05 06:11:23 UTC (rev 603)
+++ numa-bench/trunk/doc/analysis.txt	2008-03-05 06:11:35 UTC (rev 604)
@@ -1,6 +1,10 @@
 % NUMA Benchmarks Analysis
 % Yang Zhang
 
+__Updates__
+
+- 3/4/08: updated scalability experiments
+
 All tests were performed on `josmp.csail.mit.edu`. The [graphs](graphs) show
 the results of running several different experiments. The results are averaged
 across three trials for each experiment. The experiments varied the following
@@ -8,14 +12,15 @@
 
 - number of threads (CPUs, 1-16, usually 16 if not testing scalability)
 - size of the memory buffer to operate on (10MB, 100MB, or 1GB)
-- number of times to repeat the operation (usually one)
+- number of operations, i.e. reads/writes (frequently 10 million)
+- number of times to repeat the chewing (usually 1)
 - whether to chew through the memory sequentially or using random access
 - whether to run operations in parallel on all the CPUs
 - whether to explicitly pin the threads to a CPU (usually we do)
 - whether to operate on a global buffer or on our own buffer (that we allocate
   ourselves) or on buffers that all other nodes allocated (for
   cross-communication)
-- whether to perform writes to the buffer, otherwise just read
+- whether operations are writes or reads
 - in experiments varying the number of cores $k$ working concurrently: whether
   we're using cores 1 through $k$ or cores across the nodes in round-robin
   fashion
@@ -25,48 +30,71 @@
 Here are some questions these results help answer:
 
 - How much does working from another node affect throughput?
     - It doesn't make much difference for sequential scans - this shows
       hardware prefetching (and caching) at work. It still makes [a bit of
-      difference](graphs/ncores-16-size-100000000-nreps-1-shuffle-0-par-0-pin-1-local-0-write-1-cross-0.pdf).
+      difference](graphs/nworkers-16-size-100000000-nreps-1-shuffle-0-par-0-pin-1-local-0-write-1-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf).
     - However, for random accesses, the difference is much more
-      [pronounced](graphs/ncores-16-size-100000000-nreps-1-shuffle-1-par-0-pin-1-local-0-write-1-cross-0.pdf).
+      [pronounced](graphs/nworkers-16-size-100000000-nreps-1-shuffle-1-par-0-pin-1-local-0-write-1-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf).
 - How much difference is there between sequential scan and random access?
     - Substantial difference. Also magnifies NUMA effects. Compare
-      [a](graphs/ncores-16-size-100000000-nreps-1-shuffle-0-par-0-pin-1-local-0-write-1-cross-0.pdf)
+      [a](graphs/nworkers-16-size-100000000-nreps-1-shuffle-0-par-0-pin-1-local-0-write-1-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
       and
-      [b](graphs/ncores-16-size-100000000-nreps-1-shuffle-1-par-0-pin-1-local-0-write-1-cross-0.pdf)
+      [b](graphs/nworkers-16-size-100000000-nreps-1-shuffle-1-par-0-pin-1-local-0-write-1-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
 - Read vs. write
-    - Substantial difference. Random writes are ~2x slower than random reads.
+    - Substantial difference. Random writes are over 2x slower than random reads.
     - Compare
-      [a](graphs/ncores-16-size-100000000-nreps-1-shuffle-1-par-0-pin-1-local-0-write-0-cross-0.pdf)
+      [a](graphs/nworkers-16-size-100000000-nreps-1-shuffle-1-par-0-pin-1-local-0-write-0-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
       and
-      [b](graphs/ncores-16-size-100000000-nreps-1-shuffle-1-par-0-pin-1-local-0-write-1-cross-0.pdf)
+      [b](graphs/nworkers-16-size-100000000-nreps-1-shuffle-1-par-0-pin-1-local-0-write-1-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
 - Does `malloc` tend to allocate locally?
     - Yes, because working with memory allocated from the current thread shows
       improved times.
 - Scalability of: cross-node memory writes vs. shared memory writes vs. local
   node memory writes
-    - Graphs for each of these:
-      [a](graphs/scaling-size-10000000-nreps-1-shuffle-0-par-1-pin-1-local-0-write-1-cross-1.pdf)
+    - Throughputs for sequential scans:
+      [a](graphs/scaling-size-1000000000-opcount-10000000-nreps-1-shuffle-0-par-1-pin-1-local-0-write-1-cross-1-rrnodes-1-nnodes-4-ncpus-16.pdf)
       vs.
-      [b](graphs/scaling-size-10000000-nreps-1-shuffle-0-par-1-pin-1-local-0-write-1-cross-0.pdf)
+      [b](graphs/scaling-size-1000000000-opcount-10000000-nreps-1-shuffle-0-par-1-pin-1-local-0-write-1-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
       vs.
-      [c](graphs/scaling-size-10000000-nreps-1-shuffle-0-par-1-pin-1-local-1-write-1-cross-0.pdf)
-    - Local memory node access is best but still has problems scaling. The time
-      remains constant after some point. This is probably because increasing the
-      number of cores causes the load distribution to approach a more uniform
-      distribution.
+      [c](graphs/scaling-size-1000000000-opcount-10000000-nreps-1-shuffle-0-par-1-pin-1-local-1-write-1-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
+    - Speedup graphs:
+      [a](graphs/speedup-size-1000000000-opcount-10000000-nreps-1-shuffle-0-par-1-pin-1-local-0-write-1-cross-1-rrnodes-1-nnodes-4-ncpus-16.pdf)
+      vs.
+      [b](graphs/speedup-size-1000000000-opcount-10000000-nreps-1-shuffle-0-par-1-pin-1-local-0-write-1-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
+      vs.
+      [c](graphs/speedup-size-1000000000-opcount-10000000-nreps-1-shuffle-0-par-1-pin-1-local-1-write-1-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
+    - Throughputs for random access:
+      [a](graphs/scaling-size-1000000000-opcount-10000000-nreps-1-shuffle-1-par-1-pin-1-local-0-write-1-cross-1-rrnodes-1-nnodes-4-ncpus-16.pdf)
+      vs.
+      [b](graphs/scaling-size-1000000000-opcount-10000000-nreps-1-shuffle-1-par-1-pin-1-local-0-write-1-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
+      vs.
+      [c](graphs/scaling-size-1000000000-opcount-10000000-nreps-1-shuffle-1-par-1-pin-1-local-1-write-1-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
+    - Speedup graphs:
+      [a](graphs/speedup-size-1000000000-opcount-10000000-nreps-1-shuffle-1-par-1-pin-1-local-0-write-1-cross-1-rrnodes-1-nnodes-4-ncpus-16.pdf)
+      vs.
+      [b](graphs/speedup-size-1000000000-opcount-10000000-nreps-1-shuffle-1-par-1-pin-1-local-0-write-1-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
+      vs.
+      [c](graphs/speedup-size-1000000000-opcount-10000000-nreps-1-shuffle-1-par-1-pin-1-local-1-write-1-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
 - Scalability of: cross-node memory reads vs. shared memory reads vs. local
   node memory reads
-    - Graphs for each of these:
-      [a](graphs/scaling-size-10000000-nreps-1-shuffle-0-par-1-pin-1-local-0-write-0-cross-1.pdf)
+    - Throughputs for sequential scans:
+      [a](graphs/scaling-size-1000000000-opcount-10000000-nreps-1-shuffle-0-par-1-pin-1-local-0-write-0-cross-1-rrnodes-1-nnodes-4-ncpus-16.pdf)
       vs.
-      [b](graphs/scaling-size-10000000-nreps-1-shuffle-0-par-1-pin-1-local-0-write-0-cross-0.pdf)
+      [b](graphs/scaling-size-1000000000-opcount-10000000-nreps-1-shuffle-0-par-1-pin-1-local-0-write-0-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
       vs.
-      [c](graphs/scaling-size-10000000-nreps-1-shuffle-0-par-1-pin-1-local-1-write-0-cross-0.pdf)
-    - Cross-communicating performs worse, and local memory node access performs
-      the same as shared memory access. This is expected, since we aren't
-      performing writes, so the data is freely replicated to all caches (same
-      reason that there is little difference between the non-parallel reads from
-      local vs. remote).
+      [c](graphs/scaling-size-1000000000-opcount-10000000-nreps-1-shuffle-0-par-1-pin-1-local-1-write-0-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
+    - Speedup graphs:
+      [a](graphs/speedup-size-1000000000-opcount-10000000-nreps-1-shuffle-0-par-1-pin-1-local-0-write-0-cross-1-rrnodes-1-nnodes-4-ncpus-16.pdf)
+      vs.
+      [b](graphs/speedup-size-1000000000-opcount-10000000-nreps-1-shuffle-0-par-1-pin-1-local-0-write-0-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
+      vs.
+      [c](graphs/speedup-size-1000000000-opcount-10000000-nreps-1-shuffle-0-par-1-pin-1-local-1-write-0-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
+    - Throughputs for random access:
+      [a](graphs/scaling-size-1000000000-opcount-10000000-nreps-1-shuffle-1-par-1-pin-1-local-0-write-0-cross-1-rrnodes-1-nnodes-4-ncpus-16.pdf)
+      vs.
+      [b](graphs/scaling-size-1000000000-opcount-10000000-nreps-1-shuffle-1-par-1-pin-1-local-0-write-0-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
+      vs.
+      [c](graphs/scaling-size-1000000000-opcount-10000000-nreps-1-shuffle-1-par-1-pin-1-local-1-write-0-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
+    - Speedup graphs:
+      [a](graphs/speedup-size-1000000000-opcount-10000000-nreps-1-shuffle-1-par-1-pin-1-local-0-write-0-cross-1-rrnodes-1-nnodes-4-ncpus-16.pdf)
+      vs.
+      [b](graphs/speedup-size-1000000000-opcount-10000000-nreps-1-shuffle-1-par-1-pin-1-local-0-write-0-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
+      vs.
+      [c](graphs/speedup-size-1000000000-opcount-10000000-nreps-1-shuffle-1-par-1-pin-1-local-1-write-0-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
-
-There's still quite a bit of room to fill out this test suite. For instance,
-the experiments varying the number of cores all exercise the fewest number of
-chips; the results may be quite different for tests that distribute the loaded
-cores across all chips.
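[Editor's note] The renamed graph parameters in this diff (`opcount`, `shuffle`, `write`) correspond to a core loop shaped roughly like the sketch below. This is illustrative only, with hypothetical names, not the benchmark's actual source: `shuffle` selects a pre-shuffled index order (which defeats hardware prefetching and magnifies the NUMA effects noted above), and `write` switches the operation from a read-accumulate to a store.

```c
#include <stdint.h>
#include <stdlib.h>

/* One "chew" over a buffer of nwords 64-bit words: opcount operations,
 * walking sequentially when order == NULL, otherwise through a
 * pre-shuffled index array; write selects stores vs. loads. */
static uint64_t chew(uint64_t *buf, size_t nwords, size_t opcount,
                     const size_t *order, int write) {
    uint64_t sum = 0;
    for (size_t op = 0; op < opcount; op++) {
        size_t i = order ? order[op % nwords] : op % nwords;
        if (write)
            buf[i] = op;     /* store path (the write-1 graphs) */
        else
            sum += buf[i];   /* load path; accumulate so reads aren't elided */
    }
    return sum;
}

/* Build the random-access order once, up front (Fisher-Yates), so the
 * shuffling cost stays outside the timed region. */
static void make_order(size_t *order, size_t n) {
    for (size_t i = 0; i < n; i++)
        order[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t tmp = order[i];
        order[i] = order[j];
        order[j] = tmp;
    }
}

int main(void) {
    size_t nwords = 10 * 1000 * 1000;     /* ~80MB of words */
    size_t opcount = 10 * 1000 * 1000;    /* 10M ops, as in the graphs */
    uint64_t *buf = calloc(nwords, sizeof *buf);
    size_t *order = malloc(nwords * sizeof *order);
    make_order(order, nwords);
    uint64_t sum = chew(buf, nwords, opcount, order, 0);  /* random reads */
    chew(buf, nwords, opcount, order, 1);                 /* random writes */
    free(order);
    free(buf);
    return (int)(sum & 1);  /* consume the sum so nothing is optimized away */
}
```

Returning a value derived from `sum` keeps the compiler from discarding the read loop, a standard precaution when timing memory traffic.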
From: <yan...@us...> - 2008-05-08 19:17:19
Revision: 732
          http://assorted.svn.sourceforge.net/assorted/?rev=732&view=rev
Author:   yangzhang
Date:     2008-05-08 12:17:15 -0700 (Thu, 08 May 2008)

Log Message:
-----------
quick update

Modified Paths:
--------------
    numa-bench/trunk/doc/analysis.txt

Modified: numa-bench/trunk/doc/analysis.txt
===================================================================
--- numa-bench/trunk/doc/analysis.txt	2008-05-08 19:17:00 UTC (rev 731)
+++ numa-bench/trunk/doc/analysis.txt	2008-05-08 19:17:15 UTC (rev 732)
@@ -3,6 +3,7 @@
 
 __Updates__
 
+- 3/5/08: updated scalability experiments; added lessons learned
 - 3/4/08: updated scalability experiments
 
 All tests were performed on `josmp.csail.mit.edu`. The [graphs](graphs) show