From: Greg H. <gh...@ps...> - 2006-11-04 00:58:20
Upinder S. Bhalla writes:
> Hi, Greg,
> Here's my first question: a general design issue for scaling. I'm
> thinking of giving each node a postmaster for all the nodes it connects
> to. Each postmaster has a buffer for outgoing and incoming data. Seems
> easy enough for small systems, but I'm not sure how to deal with scaling
> here. Is this something I need to worry about, and what kinds of designs
> do people use for really big systems? I know that Neuron uses global
> message sends, but that is for spike data and I don't think it would be
> so good for stuff that has to go on each timestep.
>
> -- Upi

Upi,

I'm not sure exactly how you are thinking of the postmasters, but I feel very strongly that there should not be postmaster objects of the sort that are present in PGENESIS. In PGENESIS, the postmaster is an object no different from any other object in the model: it is visible to users as part of the element tree, and "showmsg" shows connections as going to or coming from the postmaster. However, it fundamentally *is* different from other model constructs, and it has no place at the model level. It should be completely invisible to users, both in specifying models and in examining them. In other words, users should see internode connections as going from A to B, not as going from A to a postmaster and then from another postmaster to B. The postmaster concept may, however, be valid at the simulator level, and implementing it as a C++ object within MOOSE (but not as a MOOSE object itself) could be a legitimate realization of this concept.

OK, on to the question about scaling. The issues involved in scaling to node counts of 16 or fewer are not that critical, and many different approaches will work almost equally well. Things start to get bad with larger numbers of nodes, but how many is really model- and program-dependent.

For concreteness, let's consider a 1024-node system. Now imagine that on every timestep every node has to send a message (in the MPI sense, not the GENESIS sense) to every other node. Each node will have to send and receive 1024 probably small messages (let's assume 100 bytes apiece), and the overheads will kill performance.

One possible way of dealing with this is to organize the nodes into a 32x32 array, where information that has to get from one node to another does so not in one hop but in two: first vertical, then horizontal. After the first hop, the MPI messages are broken apart, the data are sorted and reassembled into new messages, and these are then sent onward to their final destinations. So for every simulation timestep we have two transfers of messages among the nodes, and on each transfer each node communicates with 32 other nodes. Thus we have decreased the total number of messages from 1024*1024 to 1024*32*2, a factor of 16. Each message will on average be about 32 times as large as before, or 3200 bytes, but the cost of sending a 3200-byte message may not be much greater than that of sending a 100-byte message, so the net effect might be a speedup of nearly 16 over the naive approach. This is just one possible solution; the choice of a particular solution in a real application depends on a lot of factors.
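In rough outline, the two-hop exchange might look something like this. This is an untested sketch that assumes fixed 100-byte payloads to keep the bookkeeping simple (real neural data would be variable-size, which means MPI_Alltoallv or carrying counts through the repacking step), and all the names are mine:

// Two-hop all-to-all on a 32x32 logical grid of 1024 nodes.
#include <mpi.h>
#include <cstring>

const int SIDE = 32;             // 32 x 32 = 1024 nodes
const int NODES = SIDE * SIDE;
const int PAYLOAD = 100;         // bytes destined for each node

static char outgoing[NODES][PAYLOAD]; // outgoing[d]: data bound for node d
static char stage[NODES][PAYLOAD];    // holding area after the vertical hop
static char repack[NODES][PAYLOAD];   // re-sorted between the two hops
static char incoming[NODES][PAYLOAD]; // final delivery

void twoHopExchange()
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int row = rank / SIDE, col = rank % SIDE;

    // One communicator per column and per row of the logical grid.
    MPI_Comm colComm, rowComm;
    MPI_Comm_split(MPI_COMM_WORLD, col, row, &colComm); // ranked by row
    MPI_Comm_split(MPI_COMM_WORLD, row, col, &rowComm); // ranked by col

    // Hop 1 (vertical): block r of 'outgoing' holds the SIDE payloads
    // destined for row r, so one all-to-all within the column delivers
    // each block to the node sitting on its destination row.
    MPI_Alltoall(outgoing, SIDE * PAYLOAD, MPI_BYTE,
                 stage, SIDE * PAYLOAD, MPI_BYTE, colComm);

    // Break apart and re-sort: stage[j*SIDE + c] came from node (j, col)
    // and wants to end up at (row, c); make each destination column's
    // payloads contiguous for the next hop.
    for (int j = 0; j < SIDE; j++)
        for (int c = 0; c < SIDE; c++)
            std::memcpy(repack[c * SIDE + j], stage[j * SIDE + c], PAYLOAD);

    // Hop 2 (horizontal): deliver each block to its destination column.
    MPI_Alltoall(repack, SIDE * PAYLOAD, MPI_BYTE,
                 incoming, SIDE * PAYLOAD, MPI_BYTE, rowComm);

    // incoming[c*SIDE + j] now holds the payload sent by node j*SIDE + c.
    MPI_Comm_free(&colComm);
    MPI_Comm_free(&rowComm);
}

The memcpy loop in the middle is exactly the "broken apart, sorted, and reassembled" step described above.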
Incidentally, you may have noticed that there is an MPI_Alltoall function that is supposed to efficiently transfer data among a set of processors, and it does. However, it assumes fixed-size messages, which may not be the case for a neural simulation, so one has to shoehorn what one wants to do into the semantics of MPI_Alltoall. Also, MPI_Alltoall is a collective operation, meaning that all nodes must participate simultaneously; that might be OK for a synchronous simulation (global timesteps), but it clashes with asynchronous styles of simulation.

--Greg

P.S. I hope you don't mind -- I am CC'ing the moose-g3 list, since we are getting into general design issues that other people might be interested in.
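P.P.S. For completeness: MPI does provide MPI_Alltoallv for variable-size messages, though it is still a collective operation. The usual shoehorn is to exchange the per-destination byte counts with a fixed-size MPI_Alltoall first, and then move the data. A minimal, untested sketch (the helper name and buffer layout are mine):

// Variable-size all-to-all: exchange counts first, then MPI_Alltoallv.
#include <mpi.h>
#include <algorithm>
#include <vector>

// outBufs[i] holds the bytes this node wants delivered to node i;
// on return, inBuf holds the bytes from node i at offset recvDispls[i].
void variableAlltoall(const std::vector<std::vector<char> >& outBufs,
                      std::vector<char>& inBuf, MPI_Comm comm)
{
    int nprocs;
    MPI_Comm_size(comm, &nprocs);

    std::vector<int> sendCounts(nprocs), recvCounts(nprocs);
    std::vector<int> sendDispls(nprocs), recvDispls(nprocs);
    for (int i = 0; i < nprocs; i++)
        sendCounts[i] = (int) outBufs[i].size();

    // Tell every node how many bytes to expect from us.
    MPI_Alltoall(&sendCounts[0], 1, MPI_INT,
                 &recvCounts[0], 1, MPI_INT, comm);

    // Compute displacements and pack the outgoing data contiguously.
    int sendTotal = 0, recvTotal = 0;
    for (int i = 0; i < nprocs; i++) {
        sendDispls[i] = sendTotal; sendTotal += sendCounts[i];
        recvDispls[i] = recvTotal; recvTotal += recvCounts[i];
    }
    std::vector<char> sendBuf(sendTotal);
    inBuf.resize(recvTotal);
    for (int i = 0; i < nprocs; i++)
        std::copy(outBufs[i].begin(), outBufs[i].end(),
                  sendBuf.begin() + sendDispls[i]);

    MPI_Alltoallv(&sendBuf[0], &sendCounts[0], &sendDispls[0], MPI_BYTE,
                  &inBuf[0], &recvCounts[0], &recvDispls[0], MPI_BYTE, comm);
}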