This is a very short (for me) writeup on a proposal
for accelerating process migration for SIMD-style
applications. It will NOT be useful for MIMD or
other workloads where each node runs its own distinct code.
The way SIMD generally operates is to have an
identical process migrated to, or started on, a very
large number of nodes simultaneously. MPI is designed
to work this way; it is generally not used either
RPC-style or socket-style to connect to any arbitrary
piece of code with the right interface.
That being the case, it occurred to me that it makes
no sense whatsoever to start N processes on the master
node and then farm them out sequentially. They're all
the same, so starting one is no different from
starting a thousand - except that starting just one will be
faster and far cheaper in resources on the master node.
At the meeting, I proposed the following for the
migration mechanism. Start up a reference copy and
farm that out as many times as you need to all the
recipients. BUT, if you use NORM or FLUTE for the file
transfer mechanism, you only actually transmit one
copy - all intended recipients would listen for it,
load that copy into memory, assign the universal IDs
at that point, and then notify the originator of
those IDs, along with the usual state info.
(Normally, the originator would assign the IDs, but
it's better if the recipients do in the case of a
multicast transfer, since the originator never
addresses any recipient individually during the send.)
The originator then creates the necessary stubs and
populates them with the reported IDs and state.
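To make the handshake concrete, here's a minimal sketch in
Python, using plain UDP multicast as a stand-in for NORM or
FLUTE (which would supply the reliability and segmentation
this toy version lacks - a real image won't fit in one
datagram). All the names here (GROUP, originate, and so on)
are mine, purely for illustration:

    # Stand-in for NORM/FLUTE: one multicast send, N receivers.
    # All names are hypothetical; reliability is omitted.
    import json
    import socket
    import struct

    GROUP, DATA_PORT, REPLY_PORT = "224.1.1.1", 5007, 5008

    def originate(image, expected_nodes):
        # Originator: transmit ONE copy, then collect the
        # universal IDs the recipients assigned locally.
        out = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        out.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
        out.sendto(image, (GROUP, DATA_PORT))
        stubs = {}
        back = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        back.bind(("", REPLY_PORT))
        while len(stubs) < expected_nodes:
            reply, addr = back.recvfrom(4096)
            info = json.loads(reply)   # {"uid": ..., "state": ...}
            stubs[info["uid"]] = (addr[0], info["state"])
        return stubs                   # material for the stubs

    def receive_and_register(my_uid):
        # Recipient: listen for the one copy, load it into
        # memory, assign the ID, signal the originator.
        rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        rx.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        rx.bind(("", DATA_PORT))
        mreq = struct.pack("4sl", socket.inet_aton(GROUP),
                           socket.INADDR_ANY)
        rx.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
        image, origin = rx.recvfrom(1 << 16)
        reply = {"uid": my_uid, "state": {"size": len(image)}}
        tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        tx.sendto(json.dumps(reply).encode(), (origin[0], REPLY_PORT))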
Why is this a useful method?
It's not significantly more complex, as
implementations of NORM and FLUTE already exist. We
don't have to write those, and we don't have to care about
that part of the operation at all. In fact, in some
ways it is simpler, as you don't have to juggle very
large numbers of processes or flood the network with
N separate point-to-point copies of the same image.
However, it should be faster. Clusters with a large
number of nodes, or running very large SIMD programs,
would have far shorter startup times. It should also
be far more stable for very large clusters, where the
number of user processes queued for migration becomes large
enough to disrupt scheduling or resource management.
This is only at the start, though. Does it improve
performance after that?
Well, maybe. It depends on how we handle the stubs.
Each process must have its own set of state
information, but that can be done just as easily in a
list. Certainly, since one physical device can only be
doing one thing, we only need one local I/O thread.
Since such systems generally operate on the idea of
scatter/gather, you probably only need one
communications thread that can switch between targets.
This does improve performance, and it also improves
scalability, as your limit becomes kernel memory rather than
kernel thread table space.
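As a rough illustration of the one-communications-thread
idea, here's a sketch using Python's selectors module:
per-process state lives in an ordinary in-memory dict, and a
single thread switches between all the stub connections.
Again, the names (serve, state) are mine, not from any
existing implementation:

    # One thread multiplexing every stub target.
    import selectors
    import socket

    sel = selectors.DefaultSelector()
    state = {}   # fd -> per-process state info, plain memory

    def serve(port):
        srv = socket.socket()
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind(("", port))
        srv.listen()
        srv.setblocking(False)
        sel.register(srv, selectors.EVENT_READ, data=None)
        while True:
            for key, _ in sel.select():
                if key.data is None:       # a new stub connecting
                    conn, addr = key.fileobj.accept()
                    conn.setblocking(False)
                    state[conn.fileno()] = {"addr": addr, "buf": b""}
                    sel.register(conn, selectors.EVENT_READ,
                                 data="stub")
                else:                      # traffic from a known stub
                    conn = key.fileobj
                    chunk = conn.recv(4096)
                    if chunk:
                        state[conn.fileno()]["buf"] += chunk
                    else:                  # stub disconnected
                        sel.unregister(conn)
                        del state[conn.fileno()]
                        conn.close()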
Alternatively, you could slave all the stubs to the
master stub for that SIMD process. In that case, you
minimize changes but still get most of the performance
increases - both at the start and when running.
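For the slaved-stub alternative, the structure might look
something like the following, with hypothetical
MasterStub/SlaveStub types; the point is just that existing
code keeps talking to one master stub, which fans out to the
slaves:

    # Hypothetical shape of the master/slave stub arrangement.
    from dataclasses import dataclass, field

    @dataclass
    class SlaveStub:
        uid: int    # universal ID assigned by the recipient
        node: str   # which node holds the real process
        state: dict = field(default_factory=dict)

    @dataclass
    class MasterStub:
        slaves: list = field(default_factory=list)

        def broadcast(self, update):
            # Existing code touches only the master; one call
            # here fans the update out to every slaved stub.
            for s in self.slaves:
                s.state.update(update)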
As with all of these things, the possibilities are
endless. The challenge will be to find a possibility
that looks like it'll do what we want and home in on it.