The higher latency doesn't really surprise me. Things are often fast in MVAPICH1 first before the corresponding improvements make their way into MVAPICH2.
Without more details about what Ben suggests, I'm not sure I can offer any constructive comments. NUMA and other architectural quirks are weird and hard to predict. Also, it's unclear whether the on-node part should be done with MPI or through some threading model (OpenMP, pthreads, Boost threads, etc.). It's also unclear whether it's worth doing this sort of thing at all, given that MVAPICH is already multi-core aware and already does its on-node communication through shared memory.
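To make it concrete, here's roughly what I imagine Ben's two-level split would look like with serial METIS: partition the mesh graph across the nodes first, then re-partition each node's piece across its cores. This is only a rough sketch under my own assumptions (METIS 4.x calling convention, an unweighted symmetric CSR graph, and a made-up two_level_partition() helper); the real thing would presumably go through ParMETIS and libMesh's own partitioner classes.

  /* Two-level partitioning: first split the graph across the compute
   * nodes, then re-split each node's piece across its cores.  Sketch
   * only: METIS 4.x API, unweighted symmetric CSR graph, names made up. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <metis.h>

  void two_level_partition(int nvtxs, idxtype *xadj, idxtype *adjncy,
                           int n_nodes, int cores_per_node, idxtype *part)
  {
    int wgtflag = 0, numflag = 0, options[5] = {0}, edgecut;

    /* Level 1: partition into the number of nodes. */
    idxtype *node_part = malloc(nvtxs * sizeof(idxtype));
    METIS_PartGraphKway(&nvtxs, xadj, adjncy, NULL, NULL, &wgtflag,
                        &numflag, &n_nodes, options, &edgecut, node_part);

    /* Level 2: partition each node's subdomain into cores_per_node parts. */
    int *g2l = malloc(nvtxs * sizeof(int));   /* global -> local vertex map */
    for (int p = 0; p < n_nodes; p++) {
      int sub_n = 0;
      for (int v = 0; v < nvtxs; v++)
        g2l[v] = (node_part[v] == p) ? sub_n++ : -1;
      if (sub_n == 0) continue;

      /* Build the CSR subgraph induced by node p's vertices, keeping
       * only the on-node edges. */
      idxtype *sxadj   = malloc((sub_n + 1) * sizeof(idxtype));
      idxtype *sadjncy = malloc(xadj[nvtxs] * sizeof(idxtype));
      idxtype *spart   = malloc(sub_n * sizeof(idxtype));
      int ne = 0;
      sxadj[0] = 0;
      for (int v = 0; v < nvtxs; v++) {
        if (node_part[v] != p) continue;
        for (int j = xadj[v]; j < xadj[v + 1]; j++)
          if (node_part[adjncy[j]] == p)
            sadjncy[ne++] = g2l[adjncy[j]];
        sxadj[g2l[v] + 1] = ne;
      }

      METIS_PartGraphKway(&sub_n, sxadj, sadjncy, NULL, NULL, &wgtflag,
                          &numflag, &cores_per_node, options, &edgecut, spart);

      /* Final rank = node id * cores-per-node + on-node part, so each
       * node's ranks stay contiguous. */
      for (int v = 0; v < nvtxs; v++)
        if (node_part[v] == p)
          part[v] = p * cores_per_node + spart[g2l[v]];

      free(sxadj); free(sadjncy); free(spart);
    }
    free(node_part); free(g2l);
  }

  int main(void)
  {
    /* Toy problem: an 8-vertex ring split over 2 "nodes" x 2 "cores". */
    int n = 8;
    idxtype xadj[9], adjncy[16], part[8];
    for (int v = 0; v < n; v++) {
      xadj[v] = 2 * v;
      adjncy[2 * v]     = (v + n - 1) % n;
      adjncy[2 * v + 1] = (v + 1) % n;
    }
    xadj[n] = 2 * n;

    two_level_partition(n, xadj, adjncy, 2, 2, part);
    for (int v = 0; v < n; v++)
      printf("vertex %d -> rank %d\n", v, (int)part[v]);
    return 0;
  }

Numbering the ranks node-major like that is what keeps each subdomain's neighbors on the same physical node in the first place.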
The bigger concern, in my opinion, is the number of tasks trying to communicate off a node. It's not yet clear to me whether having one task per node do all of that node's communication is better than having every task on the node do its own thing. With the former, there's either an aggregation cost (for pure MPI codes) or a NUMA cost (for hybrid MPI/threads codes) to be paid. In exchange, you potentially save the NUMA cost of the HCA DMAing into memory that isn't attached to the core nearest the PCIe bus, the contention for HCA resources from having 8+ tasks share the HCA on a node, and the cost of the HCA having to talk to potentially 8+ times as many tasks elsewhere in the network. The NUMA costs are probably a wash, but the copy cost of aggregation vs. contention for the HCA is a harder comparison.
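For what it's worth, if we did want to try the one-communicating-task-per-node scheme in a pure MPI code, the skeleton isn't much: split MPI_COMM_WORLD into per-node communicators, gather to a node leader over shared memory, and let only the leaders talk across the HCA. A hypothetical sketch (none of this is libMesh code, and the hostname-hash coloring is just a stand-in):

  /* "One task per node talks off-node": split MPI_COMM_WORLD into
   * per-node communicators, gather each node's data to a node leader
   * over shared memory, and let only the leaders use the HCA. */
  #include <mpi.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Color ranks by hostname so ranks on the same node share a color. */
    char name[MPI_MAX_PROCESSOR_NAME];
    int len;
    MPI_Get_processor_name(name, &len);
    unsigned color = 5381;
    for (int i = 0; i < len; i++)
      color = color * 33 + (unsigned char)name[i];
    color &= 0x7fffffff;   /* MPI_Comm_split wants a non-negative color */

    MPI_Comm node_comm;
    MPI_Comm_split(MPI_COMM_WORLD, (int)color, rank, &node_comm);

    int node_rank, node_size;
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_size(node_comm, &node_size);

    /* Aggregation: every task contributes a chunk, the node leader
     * (node_rank == 0) collects it via on-node shared memory. */
    const int chunk = 1024;
    double *mine = malloc(chunk * sizeof(double));
    double *agg  = (node_rank == 0)
                     ? malloc(node_size * chunk * sizeof(double)) : NULL;
    for (int i = 0; i < chunk; i++) mine[i] = rank;

    MPI_Gather(mine, chunk, MPI_DOUBLE, agg, chunk, MPI_DOUBLE, 0, node_comm);

    /* Only the leaders join this communicator; the off-node exchange
     * among leaders would go here. */
    MPI_Comm leader_comm;
    MPI_Comm_split(MPI_COMM_WORLD, node_rank == 0 ? 0 : MPI_UNDEFINED,
                   rank, &leader_comm);

    if (leader_comm != MPI_COMM_NULL) MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
    free(mine); free(agg);
    MPI_Finalize();
    return 0;
  }

The MPI_Gather there is exactly the aggregation (copy) cost I'm talking about; whether it's cheaper than letting all 8+ tasks hit the HCA is the open question.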
I'm willing to bet that the answer is, as it is much more frequently these days, "It depends!" It depends mostly, I think, on the communication pattern in your code and the amount of memory each task is using. We already see on Ranger that MPI_Alltoall()-based codes would prefer a different task layout on the network than point-to-point codes, even though we nominally have a fully non-blocking switch. Likewise, if your messages are small, it's probably better to avoid the extra copies involved in having only one task per node do that node's communications (given the overheads).
What this really all means is that I should sit down and write some test code to figure it all out.
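Something along these lines, I mean -- a bare-bones ping-pong where the partner rank is steered on-node or off-node through the hostfile, so the same binary measures both latencies. The message size, rep count, and partner choice below are placeholders:

  /* Ping-pong: rank 0 exchanges small messages with a partner rank, so
   * the same binary measures on-node latency (partner on the same node)
   * or off-node latency (partner across the HCA), per the hostfile. */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) { MPI_Finalize(); return 0; }

    const int partner = size - 1;   /* last rank: on- or off-node per hostfile */
    const int reps = 10000;
    char buf[8] = {0};              /* small message to expose latency */

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
      if (rank == 0) {
        MPI_Send(buf, sizeof(buf), MPI_CHAR, partner, 0, MPI_COMM_WORLD);
        MPI_Recv(buf, sizeof(buf), MPI_CHAR, partner, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
      } else if (rank == partner) {
        MPI_Recv(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD);
      }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
      printf("avg one-way latency: %g usec\n",
             1.0e6 * (t1 - t0) / (2.0 * reps));

    MPI_Finalize();
    return 0;
  }

The next step would be the same loop with 8+ pairs hammering the HCA at once vs. a single aggregated exchange, which is the comparison I actually care about.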
Bill Barth, Ph.D., Manager HPC Applications Group
bbarth@... | Phone: (512) 232-7069
Office: ROC 1.405 | Fax: (512) 475-9445
> -----Original Message-----
> From: Derek Gaston [mailto:friedmud@...]
> Sent: Tuesday, July 22, 2008 10:28 AM
> To: Benjamin Kirk
> Cc: "libmesh-devel@..."; Bill Barth
> Subject: Re: [Libmesh-devel] Recursive partitioning
> Definitely interesting numbers. What I find most interesting is that
> MVAPICH2 has higher latency than MVAPICH1... any ideas about that?
> Do you have an idea about how you would actually implement this using
> Metis / ParMetis?
> On Jul 22, 2008, at 9:03 AM, Benjamin Kirk wrote:
> > Check out attached...
> > I've been doing some MPI profiling on my 4-socket, dual-core per node
> > Opteron cluster. I've been curious for a while about "multilevel
> > domain decomposition" for this class of architectures - e.g.
> > (1) partition into the number of nodes
> > (2) partition each subdomain into the number of processors per node
> > Since the on-node communication is cheaper than off-node communication,
> > it would seem there is performance to gain here (especially in terms of
> > latency).
> > What do y'all think? The mvapich-1 intra-node latency numbers are
> > really impressive!
> > -Ben
> > <latency.pdf>
> > <bw.pdf>