From: Greg H. <gh...@ps...> - 2006-08-28 23:17:00
Upi, I have been looking at the MOOSE code and thinking about certain issues involved in parallelizing it, and I have some serious concerns. The greatest concern is with the many places in the basecode that make an implicit assumption that elements are resident in the local node's memory, and that only one thread will be actively modifying them. For example, if the elements are distributed over many nodes, then Element::relativeFind() will potentially require information from two or more nodes. This will cause the code to block for indefinite periods of time while the interprocess communication is performed and the remote nodes do what they need to do.

The simplest way of dealing with this would be to allow only one active thread over the entire set of nodes on which MOOSE is running. However, this would be disastrous in terms of performance -- network setup would be much slower than doing it on a single node.

If we instead allow multiple active threads on each node to avoid the performance hit, then every method that directly or indirectly calls one of these methods requiring off-node information will potentially block. While this occurs, incoming requests from other nodes must be handled, and some of those may involve the Element in question. Some form of locking will thus be needed, probably on a per-Element basis. The difficult part is that each place in the code where a potentially blocking call occurs will have to release the Element lock, and must leave the Element (as well as any kernel data structures) in a safe and consistent state. I can't see this being done without rewriting many sections of code. The most troublesome situations will be when modifications are being made to the element tree, such as when new elements are being created or old ones destroyed.

Once the network is set up, things may not be so bad, but the network needs to get set up in order to run it. One solution may be to standardize at the .mh level.
The existing MOOSE code could support running models (i.e., a script plus a set of .mh files) on serial machines, and we could have a separately developed parallel version that runs the same models. A few changes would probably still be needed in the existing .mh files, but probably not many. This approach might make sense if nearly all the visualization and other add-on code sits at the .mh level or higher, but not if those things require major changes to the existing basecode.

If you have thought of solutions to any of these problems, I would be interested in hearing about them.

--Greg