From: Greg H. <gh...@ps...> - 2006-11-04 00:58:20
Upinder S. Bhalla writes:
> Hi, Greg,
> Here's my first question: a general design issue for scaling. I'm
> thinking of giving each node a postmaster for all the nodes it connects
> to. Each postmaster has a buffer for outgoing and incoming data. Seems
> easy enough for small systems, but I'm not sure how to deal with scaling
> here. Is this something I need to worry about, and what kinds of designs
> do people use for really big systems? I know that Neuron uses global
> message sends, but that is for spike data and I don't think it would be
> so good for stuff that has to go on each timestep.
>
> -- Upi

Upi,

I'm not sure exactly how you are thinking of the postmasters, but I feel very strongly that there should not be postmaster objects of the sort that are present in PGENESIS. In PGENESIS, the postmaster is an object no different from any other object in the model: it is visible to users as part of the element tree, and "showmsg" shows connections as going to or coming from the postmaster. However, it fundamentally *is* different from other model constructs, and it has no place at the model level. It should be completely invisible to users, both in specifying models and in examining them. In other words, users should see internode connections as going from A to B, not as going from A to a postmaster and then from another postmaster to B. The postmaster concept may, however, be valid at the simulator level, and implementing it as a C++ object within MOOSE (but not as a MOOSE object itself) could be a legitimate realization of this concept.

OK, on to the question about scaling. The issues involved in scaling to node counts of 16 or fewer are not that critical, and many different approaches will work almost equally well. Things start to get bad with larger numbers of nodes, but how many is really model- and program-dependent.

For concreteness, let's consider a 1024-node system. Now imagine that on every timestep every node has to send a message (in the MPI sense, not the GENESIS sense) to every other node. Each node will have to send and receive 1024 probably small messages (let's assume 100 bytes apiece), and the overheads will kill performance.

One possible way of dealing with this is to organize the nodes into a 32x32 array, where information that has to get from one node to another does so not in one hop but in two: first vertical, then horizontal. After the first hop, the MPI messages are broken apart, the data are sorted and reassembled into new messages, and these are then sent onward to their final destinations. So for every simulation timestep we have two transfers of messages among the nodes, and on each transfer each node communicates with 32 other nodes. Thus we have decreased the total number of messages from 1024*1024 to 1024*32*2, a factor of 16. Each message will on average be about 32 times as large as before, or 3200 bytes, but the cost of sending a 3200-byte message may not be much greater than that of sending a 100-byte message, so the net effect might be a speedup of nearly 16 over the naive approach. This is just one possible solution; the choice of a particular solution in a real application depends on a lot of factors.
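In rough outline, the two-hop exchange might look something like this. This is an untested sketch that assumes fixed 100-byte payloads to keep the bookkeeping simple (real neural data would be variable-size, which means MPI_Alltoallv or carrying counts through the repacking step), and all the names are mine:

// Two-hop all-to-all on a 32x32 logical grid of 1024 nodes.
#include <mpi.h>
#include <cstring>

const int SIDE = 32;             // 32 x 32 = 1024 nodes
const int NODES = SIDE * SIDE;
const int PAYLOAD = 100;         // bytes destined for each node

static char outgoing[NODES][PAYLOAD]; // outgoing[d]: data bound for node d
static char stage[NODES][PAYLOAD];    // holding area after the vertical hop
static char repack[NODES][PAYLOAD];   // re-sorted between the two hops
static char incoming[NODES][PAYLOAD]; // final delivery

void twoHopExchange()
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int row = rank / SIDE, col = rank % SIDE;

    // One communicator per column and per row of the logical grid.
    MPI_Comm colComm, rowComm;
    MPI_Comm_split(MPI_COMM_WORLD, col, row, &colComm); // ranked by row
    MPI_Comm_split(MPI_COMM_WORLD, row, col, &rowComm); // ranked by col

    // Hop 1 (vertical): block r of 'outgoing' holds the SIDE payloads
    // destined for row r, so one all-to-all within the column delivers
    // each block to the node sitting on its destination row.
    MPI_Alltoall(outgoing, SIDE * PAYLOAD, MPI_BYTE,
                 stage, SIDE * PAYLOAD, MPI_BYTE, colComm);

    // Break apart and re-sort: stage[j*SIDE + c] came from node (j, col)
    // and wants to end up at (row, c); make each destination column's
    // payloads contiguous for the next hop.
    for (int j = 0; j < SIDE; j++)
        for (int c = 0; c < SIDE; c++)
            std::memcpy(repack[c * SIDE + j], stage[j * SIDE + c], PAYLOAD);

    // Hop 2 (horizontal): deliver each block to its destination column.
    MPI_Alltoall(repack, SIDE * PAYLOAD, MPI_BYTE,
                 incoming, SIDE * PAYLOAD, MPI_BYTE, rowComm);

    // incoming[c*SIDE + j] now holds the payload sent by node j*SIDE + c.
    MPI_Comm_free(&colComm);
    MPI_Comm_free(&rowComm);
}

The memcpy loop in the middle is exactly the "broken apart, sorted, and reassembled" step described above.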
Incidentally, you may have noticed that there is an MPI_Alltoall function that is supposed to efficiently transfer data among a set of processors, and it does. However, it assumes fixed-size messages, which may not be the case for a neural simulation, so one has to shoehorn what one wants to do into the semantics of MPI_Alltoall. Also, MPI_Alltoall is a collective operation, meaning that all nodes must participate simultaneously; that might be OK for a synchronous simulation (global timesteps), but it clashes with asynchronous styles of simulation.

--Greg

P.S. I hope you don't mind -- I am CC'ing the moose-g3 list, since we are getting into general design issues that other people might be interested in.
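P.P.S. For completeness: MPI does provide MPI_Alltoallv for variable-size messages, though it is still a collective operation. The usual shoehorn is to exchange the per-destination byte counts with a fixed-size MPI_Alltoall first, and then move the data. A minimal, untested sketch (the helper name and buffer layout are mine):

// Variable-size all-to-all: exchange counts first, then MPI_Alltoallv.
#include <mpi.h>
#include <algorithm>
#include <vector>

// outBufs[i] holds the bytes this node wants delivered to node i;
// on return, inBuf holds the bytes from node i at offset recvDispls[i].
void variableAlltoall(const std::vector<std::vector<char> >& outBufs,
                      std::vector<char>& inBuf, MPI_Comm comm)
{
    int nprocs;
    MPI_Comm_size(comm, &nprocs);

    std::vector<int> sendCounts(nprocs), recvCounts(nprocs);
    std::vector<int> sendDispls(nprocs), recvDispls(nprocs);
    for (int i = 0; i < nprocs; i++)
        sendCounts[i] = (int) outBufs[i].size();

    // Tell every node how many bytes to expect from us.
    MPI_Alltoall(&sendCounts[0], 1, MPI_INT,
                 &recvCounts[0], 1, MPI_INT, comm);

    // Compute displacements and pack the outgoing data contiguously.
    int sendTotal = 0, recvTotal = 0;
    for (int i = 0; i < nprocs; i++) {
        sendDispls[i] = sendTotal; sendTotal += sendCounts[i];
        recvDispls[i] = recvTotal; recvTotal += recvCounts[i];
    }
    std::vector<char> sendBuf(sendTotal);
    inBuf.resize(recvTotal);
    for (int i = 0; i < nprocs; i++)
        std::copy(outBufs[i].begin(), outBufs[i].end(),
                  sendBuf.begin() + sendDispls[i]);

    MPI_Alltoallv(&sendBuf[0], &sendCounts[0], &sendDispls[0], MPI_BYTE,
                  &inBuf[0], &recvCounts[0], &recvDispls[0], MPI_BYTE, comm);
}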