Thread: [Assorted-commits] SF.net SVN: assorted: [451] numa-bench/trunk/doc/analysis.txt
From: <yan...@us...> - 2008-02-15 15:54:16
Revision: 451
          http://assorted.svn.sourceforge.net/assorted/?rev=451&view=rev
Author:   yangzhang
Date:     2008-02-15 07:54:21 -0800 (Fri, 15 Feb 2008)

Log Message:
-----------
noted josmp

Modified Paths:
--------------
    numa-bench/trunk/doc/analysis.txt

Modified: numa-bench/trunk/doc/analysis.txt
===================================================================
--- numa-bench/trunk/doc/analysis.txt	2008-02-15 06:39:22 UTC (rev 450)
+++ numa-bench/trunk/doc/analysis.txt	2008-02-15 15:54:21 UTC (rev 451)
@@ -66,3 +66,5 @@
 the experiments varying the number of cores all exercise the fewest number of
 chips; the results may be quite different for tests that distribute the loaded
 cores across all chips.
+
+*Update*: all tests were performed on josmp.csail.mit.edu.
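[Editor's note] The caveat in this diff, that the core-scaling runs exercise the fewest chips, depends on how CPU numbers map onto chips on josmp. A hedged sketch (not part of the benchmark, and assuming the libnuma 2.x API) for printing that mapping before deciding how to spread loaded cores:

```c
/* Print the CPU-to-NUMA-node layout, e.g. to see which cores share a
 * chip on a box like josmp. Illustrative only; build with:
 *   gcc topo.c -lnuma */
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this machine\n");
        return 1;
    }
    int nnodes = numa_max_node() + 1;
    int ncpus = numa_num_configured_cpus();
    printf("%d nodes, %d cpus\n", nnodes, ncpus);
    for (int cpu = 0; cpu < ncpus; cpu++)
        printf("cpu %2d -> node %d\n", cpu, numa_node_of_cpu(cpu));
    return 0;
}
```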
From: <yan...@us...> - 2008-02-26 20:49:51
Revision: 517
          http://assorted.svn.sourceforge.net/assorted/?rev=517&view=rev
Author:   yangzhang
Date:     2008-02-26 12:49:46 -0800 (Tue, 26 Feb 2008)

Log Message:
-----------
fixed links

Modified Paths:
--------------
    numa-bench/trunk/doc/analysis.txt

Modified: numa-bench/trunk/doc/analysis.txt
===================================================================
--- numa-bench/trunk/doc/analysis.txt	2008-02-26 20:48:46 UTC (rev 516)
+++ numa-bench/trunk/doc/analysis.txt	2008-02-26 20:49:46 UTC (rev 517)
@@ -32,9 +32,9 @@
 - Read vs. write
     - Substantial difference. Random writes are ~2x slower than random reads.
     - Compare
-      [a](graphs/ncores-16-size-1000000000-nreps-1-shuffle-0-par-0-pin-1-local-0-write-0-cross-0.pdf)
+      [a](graphs/ncores-16-size-100000000-nreps-1-shuffle-1-par-0-pin-1-local-0-write-0-cross-0.pdf)
       and
-      [b](graphs/ncores-16-size-1000000000-nreps-1-shuffle-0-par-0-pin-1-local-0-write-1-cross-0.pdf)
+      [b](graphs/ncores-16-size-100000000-nreps-1-shuffle-1-par-0-pin-1-local-0-write-1-cross-0.pdf)
 - Does `malloc` tend to allocate locally?
     - Yes, because working with memory allocated from the current thread shows
       improved times.
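[Editor's note] The `malloc`-locality answer in this diff is what Linux's first-touch page placement predicts: `malloc` only reserves virtual address space, and physical pages are committed on the NUMA node of the CPU that first writes them. A minimal sketch (illustrative, not the benchmark's actual source) that pins itself, touches a buffer, and asks the kernel where the pages landed:

```c
/* First-touch demo: pin to CPU 0, touch a buffer, then query the node
 * that the first page landed on. Build with:
 *   gcc -D_GNU_SOURCE first_touch.c -lnuma */
#include <numaif.h>     /* get_mempolicy, MPOL_F_NODE, MPOL_F_ADDR */
#include <sched.h>      /* sched_setaffinity, CPU_ZERO, CPU_SET */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    /* Pin to CPU 0 so the first touch comes from a known node. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    sched_setaffinity(0, sizeof set, &set);

    /* malloc reserves address space; the memset is the first touch that
     * commits physical pages, on the touching CPU's node. */
    size_t size = 100 * 1000 * 1000;   /* 100MB, as in the experiments */
    char *buf = malloc(size);
    memset(buf, 1, size);

    int node = -1;
    get_mempolicy(&node, NULL, 0, buf, MPOL_F_NODE | MPOL_F_ADDR);
    printf("first page of buf is on node %d\n", node);

    free(buf);
    return 0;
}
```

Running this from threads pinned to different CPUs is one way to confirm the "allocated from the current thread shows improved times" observation.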
From: <yan...@us...> - 2008-02-29 16:34:23
Revision: 546
          http://assorted.svn.sourceforge.net/assorted/?rev=546&view=rev
Author:   yangzhang
Date:     2008-02-29 08:34:24 -0800 (Fri, 29 Feb 2008)

Log Message:
-----------
updates to analysis

Modified Paths:
--------------
    numa-bench/trunk/doc/analysis.txt

Modified: numa-bench/trunk/doc/analysis.txt
===================================================================
--- numa-bench/trunk/doc/analysis.txt	2008-02-29 16:34:11 UTC (rev 545)
+++ numa-bench/trunk/doc/analysis.txt	2008-02-29 16:34:24 UTC (rev 546)
@@ -1,9 +1,10 @@
 % NUMA Benchmarks Analysis
 % Yang Zhang
 
-The [graphs](graphs) show the results of running several different experiments. The
-results are averaged across three trials for each experiment. The experiments
-varied the following parameters:
+All tests were performed on `josmp.csail.mit.edu`. The [graphs](graphs) show
+the results of running several different experiments. The results are averaged
+across three trials for each experiment. The experiments varied the following
+parameters:
 
 - number of threads (CPUs, 1-16, usually 16 if not testing scalability)
 - size of the memory buffer to operate on (10MB, 100MB, or 1GB)
@@ -15,6 +16,9 @@
   ourselves) or on buffers that all other nodes allocated (for
   cross-communication)
 - whether to perform writes to the buffer, otherwise just read
+- in experiments varying the number of cores $k$ working concurrently: whether
+  we're using cores 1 through $k$ or cores across the nodes in round-robin
+  fashion
 
 Here are some questions these results help answer:
 
@@ -66,5 +70,3 @@
 the experiments varying the number of cores all exercise the fewest number of
 chips; the results may be quite different for tests that distribute the loaded
 cores across all chips.
-
-*Update*: all tests were performed on josmp.csail.mit.edu.
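[Editor's note] The new round-robin parameter in this diff contrasts two placements of $k$ workers: packing cores 1 through $k$ onto the fewest chips versus striping them across the nodes. A hedged sketch of both policies; the node-major CPU numbering (cpu = node * cores_per_node + slot) is an assumption for illustration, not something the commit states, and real code should verify it against the actual topology (e.g. with libnuma):

```c
/* Two core-placement policies for k pinned workers. Build with:
 *   gcc -D_GNU_SOURCE rr.c -lpthread */
#include <pthread.h>
#include <sched.h>

#define NNODES 4    /* josmp: 4 chips... */
#define NCPUS  16   /* ...of 4 cores each (assumed layout) */

/* Worker i's CPU: sequential packs chip 0 first; round-robin stripes
 * workers across all chips. */
static int cpu_for_worker(int i, int rrnodes) {
    int per_node = NCPUS / NNODES;
    return rrnodes ? (i % NNODES) * per_node + i / NNODES : i;
}

static void *worker(void *arg) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET((int)(long)arg, &set);
    pthread_setaffinity_np(pthread_self(), sizeof set, &set);
    /* ... chew through the buffer from this CPU ... */
    return NULL;
}

int main(void) {
    int k = 8, rrnodes = 1;   /* e.g. 8 workers striped across 4 nodes */
    pthread_t t[NCPUS];
    for (long i = 0; i < k; i++)
        pthread_create(&t[i], NULL, worker,
                       (void *)(long)cpu_for_worker((int)i, rrnodes));
    for (int i = 0; i < k; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```

With `rrnodes` set, 8 workers land on CPUs 0, 4, 8, 12, 1, 5, 9, 13 (two per node) instead of 0 through 7 (two whole chips), which is exactly the difference the scaling graphs vary.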
From: <yan...@us...> - 2008-03-05 06:11:34
Revision: 604
          http://assorted.svn.sourceforge.net/assorted/?rev=604&view=rev
Author:   yangzhang
Date:     2008-03-04 22:11:35 -0800 (Tue, 04 Mar 2008)

Log Message:
-----------
updated for new results

Modified Paths:
--------------
    numa-bench/trunk/doc/analysis.txt

Modified: numa-bench/trunk/doc/analysis.txt
===================================================================
--- numa-bench/trunk/doc/analysis.txt	2008-03-05 06:11:23 UTC (rev 603)
+++ numa-bench/trunk/doc/analysis.txt	2008-03-05 06:11:35 UTC (rev 604)
@@ -1,6 +1,10 @@
 % NUMA Benchmarks Analysis
 % Yang Zhang
 
+__Updates__
+
+- 3/4/08: updated scalability experiments
+
 All tests were performed on `josmp.csail.mit.edu`. The [graphs](graphs) show
 the results of running several different experiments. The results are averaged
 across three trials for each experiment. The experiments varied the following
@@ -8,14 +12,15 @@
 
 - number of threads (CPUs, 1-16, usually 16 if not testing scalability)
 - size of the memory buffer to operate on (10MB, 100MB, or 1GB)
-- number of times to repeat the operation (usually one)
+- number of operations, i.e. reads/writes (frequently 10 million)
+- number of times to repeat the chewing (usually 1)
 - whether to chew through the memory sequentially or using random access
 - whether to run operations in parallel on all the CPUs
 - whether to explicitly pin the threads to a CPU (usually we do)
 - whether to operate on a global buffer or on our own buffer (that we allocate
   ourselves) or on buffers that all other nodes allocated (for
   cross-communication)
-- whether to perform writes to the buffer, otherwise just read
+- whether operations are writes or reads
 - in experiments varying the number of cores $k$ working concurrently: whether
   we're using cores 1 through $k$ or cores across the nodes in round-robin
   fashion
@@ -25,48 +30,71 @@
 Here are some questions these results help answer:
 
 - How much does working from another node affect throughput?
     - It doesn't make much difference for sequential scans - this shows
       hardware prefetching (and caching) at work. It still makes [a bit of
-      difference](graphs/ncores-16-size-100000000-nreps-1-shuffle-0-par-0-pin-1-local-0-write-1-cross-0.pdf).
+      difference](graphs/nworkers-16-size-100000000-nreps-1-shuffle-0-par-0-pin-1-local-0-write-1-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf).
     - However, for random accesses, the difference is much more
-      [pronounced](graphs/ncores-16-size-100000000-nreps-1-shuffle-1-par-0-pin-1-local-0-write-1-cross-0.pdf).
+      [pronounced](graphs/nworkers-16-size-100000000-nreps-1-shuffle-1-par-0-pin-1-local-0-write-1-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf).
 - How much difference is there between sequential scan and random access?
     - Substantial difference. Also magnifies NUMA effects. Compare
-      [a](graphs/ncores-16-size-100000000-nreps-1-shuffle-0-par-0-pin-1-local-0-write-1-cross-0.pdf)
+      [a](graphs/nworkers-16-size-100000000-nreps-1-shuffle-0-par-0-pin-1-local-0-write-1-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
       and
-      [b](graphs/ncores-16-size-100000000-nreps-1-shuffle-1-par-0-pin-1-local-0-write-1-cross-0.pdf)
+      [b](graphs/nworkers-16-size-100000000-nreps-1-shuffle-1-par-0-pin-1-local-0-write-1-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
 - Read vs. write
-    - Substantial difference. Random writes are ~2x slower than random reads.
+    - Substantial difference. Random writes are over 2x slower than random reads.
     - Compare
-      [a](graphs/ncores-16-size-100000000-nreps-1-shuffle-1-par-0-pin-1-local-0-write-0-cross-0.pdf)
+      [a](graphs/nworkers-16-size-100000000-nreps-1-shuffle-1-par-0-pin-1-local-0-write-0-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
       and
-      [b](graphs/ncores-16-size-100000000-nreps-1-shuffle-1-par-0-pin-1-local-0-write-1-cross-0.pdf)
+      [b](graphs/nworkers-16-size-100000000-nreps-1-shuffle-1-par-0-pin-1-local-0-write-1-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
 - Does `malloc` tend to allocate locally?
     - Yes, because working with memory allocated from the current thread shows
       improved times.
 - Scalability of: cross-node memory writes vs. shared memory writes vs. local
   node memory writes
-    - Graphs for each of these:
-      [a](graphs/scaling-size-10000000-nreps-1-shuffle-0-par-1-pin-1-local-0-write-1-cross-1.pdf)
+    - Throughputs for sequential scans:
+      [a](graphs/scaling-size-1000000000-opcount-10000000-nreps-1-shuffle-0-par-1-pin-1-local-0-write-1-cross-1-rrnodes-1-nnodes-4-ncpus-16.pdf)
       vs.
-      [b](graphs/scaling-size-10000000-nreps-1-shuffle-0-par-1-pin-1-local-0-write-1-cross-0.pdf)
+      [b](graphs/scaling-size-1000000000-opcount-10000000-nreps-1-shuffle-0-par-1-pin-1-local-0-write-1-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
       vs.
-      [c](graphs/scaling-size-10000000-nreps-1-shuffle-0-par-1-pin-1-local-1-write-1-cross-0.pdf)
-    - Local memory node access is best but still has problems scaling. The time
-      remains constant after some point. This is probably because increasing the
-      number of cores causes the load distribution to approach a more uniform
-      distribution.
+      [c](graphs/scaling-size-1000000000-opcount-10000000-nreps-1-shuffle-0-par-1-pin-1-local-1-write-1-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
+    - Speedup graphs:
+      [a](graphs/speedup-size-1000000000-opcount-10000000-nreps-1-shuffle-0-par-1-pin-1-local-0-write-1-cross-1-rrnodes-1-nnodes-4-ncpus-16.pdf)
+      vs.
+      [b](graphs/speedup-size-1000000000-opcount-10000000-nreps-1-shuffle-0-par-1-pin-1-local-0-write-1-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
+      vs.
+      [c](graphs/speedup-size-1000000000-opcount-10000000-nreps-1-shuffle-0-par-1-pin-1-local-1-write-1-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
+    - Throughputs for random access:
+      [a](graphs/scaling-size-1000000000-opcount-10000000-nreps-1-shuffle-1-par-1-pin-1-local-0-write-1-cross-1-rrnodes-1-nnodes-4-ncpus-16.pdf)
+      vs.
+      [b](graphs/scaling-size-1000000000-opcount-10000000-nreps-1-shuffle-1-par-1-pin-1-local-0-write-1-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
+      vs.
+      [c](graphs/scaling-size-1000000000-opcount-10000000-nreps-1-shuffle-1-par-1-pin-1-local-1-write-1-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
+    - Speedup graphs:
+      [a](graphs/speedup-size-1000000000-opcount-10000000-nreps-1-shuffle-1-par-1-pin-1-local-0-write-1-cross-1-rrnodes-1-nnodes-4-ncpus-16.pdf)
+      vs.
+      [b](graphs/speedup-size-1000000000-opcount-10000000-nreps-1-shuffle-1-par-1-pin-1-local-0-write-1-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
+      vs.
+      [c](graphs/speedup-size-1000000000-opcount-10000000-nreps-1-shuffle-1-par-1-pin-1-local-1-write-1-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
 - Scalability of: cross-node memory reads vs. shared memory reads vs. local
   node memory reads
-    - Graphs for each of these:
-      [a](graphs/scaling-size-10000000-nreps-1-shuffle-0-par-1-pin-1-local-0-write-0-cross-1.pdf)
+    - Throughputs for sequential scans:
+      [a](graphs/scaling-size-1000000000-opcount-10000000-nreps-1-shuffle-0-par-1-pin-1-local-0-write-0-cross-1-rrnodes-1-nnodes-4-ncpus-16.pdf)
       vs.
-      [b](graphs/scaling-size-10000000-nreps-1-shuffle-0-par-1-pin-1-local-0-write-0-cross-0.pdf)
+      [b](graphs/scaling-size-1000000000-opcount-10000000-nreps-1-shuffle-0-par-1-pin-1-local-0-write-0-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
       vs.
-      [c](graphs/scaling-size-10000000-nreps-1-shuffle-0-par-1-pin-1-local-1-write-0-cross-0.pdf)
-    - Cross-communicating performs worse, and local memory node access performs
-      the same as shared memory access. This is expected, since we aren't
-      performing writes, so the data is freely replicated to all caches (same
-      reason that there is little difference between the non-parallel reads from
-      local vs. remote).
+      [c](graphs/scaling-size-1000000000-opcount-10000000-nreps-1-shuffle-0-par-1-pin-1-local-1-write-0-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
+    - Speedup graphs:
+      [a](graphs/speedup-size-1000000000-opcount-10000000-nreps-1-shuffle-0-par-1-pin-1-local-0-write-0-cross-1-rrnodes-1-nnodes-4-ncpus-16.pdf)
+      vs.
+      [b](graphs/speedup-size-1000000000-opcount-10000000-nreps-1-shuffle-0-par-1-pin-1-local-0-write-0-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
+      vs.
+      [c](graphs/speedup-size-1000000000-opcount-10000000-nreps-1-shuffle-0-par-1-pin-1-local-1-write-0-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
+    - Throughputs for random access:
+      [a](graphs/scaling-size-1000000000-opcount-10000000-nreps-1-shuffle-1-par-1-pin-1-local-0-write-0-cross-1-rrnodes-1-nnodes-4-ncpus-16.pdf)
+      vs.
+      [b](graphs/scaling-size-1000000000-opcount-10000000-nreps-1-shuffle-1-par-1-pin-1-local-0-write-0-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
+      vs.
+      [c](graphs/scaling-size-1000000000-opcount-10000000-nreps-1-shuffle-1-par-1-pin-1-local-1-write-0-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
+    - Speedup graphs:
+      [a](graphs/speedup-size-1000000000-opcount-10000000-nreps-1-shuffle-1-par-1-pin-1-local-0-write-0-cross-1-rrnodes-1-nnodes-4-ncpus-16.pdf)
+      vs.
+      [b](graphs/speedup-size-1000000000-opcount-10000000-nreps-1-shuffle-1-par-1-pin-1-local-0-write-0-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
+      vs.
+      [c](graphs/speedup-size-1000000000-opcount-10000000-nreps-1-shuffle-1-par-1-pin-1-local-1-write-0-cross-0-rrnodes-1-nnodes-4-ncpus-16.pdf)
-
-There's still quite a bit of room to fill out this test suite. For instance,
-the experiments varying the number of cores all exercise the fewest number of
-chips; the results may be quite different for tests that distribute the loaded
-cores across all chips.
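[Editor's note] The renamed graph parameters in this diff (`opcount`, `shuffle`, `write`) correspond to a core loop shaped roughly like the sketch below. This is illustrative only, with hypothetical names, not the benchmark's actual source: `shuffle` selects a pre-shuffled index order (which defeats hardware prefetching and magnifies the NUMA effects noted above), and `write` switches the operation from a read-accumulate to a store.

```c
#include <stdint.h>
#include <stdlib.h>

/* One "chew" over a buffer of nwords 64-bit words: opcount operations,
 * walking sequentially when order == NULL, otherwise through a
 * pre-shuffled index array; write selects stores vs. loads. */
static uint64_t chew(uint64_t *buf, size_t nwords, size_t opcount,
                     const size_t *order, int write) {
    uint64_t sum = 0;
    for (size_t op = 0; op < opcount; op++) {
        size_t i = order ? order[op % nwords] : op % nwords;
        if (write)
            buf[i] = op;     /* store path (the write-1 graphs) */
        else
            sum += buf[i];   /* load path; accumulate so reads aren't elided */
    }
    return sum;
}

/* Build the random-access order once, up front (Fisher-Yates), so the
 * shuffling cost stays outside the timed region. */
static void make_order(size_t *order, size_t n) {
    for (size_t i = 0; i < n; i++)
        order[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t tmp = order[i];
        order[i] = order[j];
        order[j] = tmp;
    }
}

int main(void) {
    size_t nwords = 10 * 1000 * 1000;     /* ~80MB of words */
    size_t opcount = 10 * 1000 * 1000;    /* 10M ops, as in the graphs */
    uint64_t *buf = calloc(nwords, sizeof *buf);
    size_t *order = malloc(nwords * sizeof *order);
    make_order(order, nwords);
    uint64_t sum = chew(buf, nwords, opcount, order, 0);  /* random reads */
    chew(buf, nwords, opcount, order, 1);                 /* random writes */
    free(order);
    free(buf);
    return (int)(sum & 1);  /* consume the sum so nothing is optimized away */
}
```

Returning a value derived from `sum` keeps the compiler from discarding the read loop, a standard precaution when timing memory traffic.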
From: <yan...@us...> - 2008-05-08 19:17:19
Revision: 732
          http://assorted.svn.sourceforge.net/assorted/?rev=732&view=rev
Author:   yangzhang
Date:     2008-05-08 12:17:15 -0700 (Thu, 08 May 2008)

Log Message:
-----------
quick update

Modified Paths:
--------------
    numa-bench/trunk/doc/analysis.txt

Modified: numa-bench/trunk/doc/analysis.txt
===================================================================
--- numa-bench/trunk/doc/analysis.txt	2008-05-08 19:17:00 UTC (rev 731)
+++ numa-bench/trunk/doc/analysis.txt	2008-05-08 19:17:15 UTC (rev 732)
@@ -3,6 +3,7 @@
 
 __Updates__
 
+- 3/5/08: updated scalability experiments; added lessons learned
 - 3/4/08: updated scalability experiments
 
 All tests were performed on `josmp.csail.mit.edu`. The [graphs](graphs) show