From: Aleksander Korzynski <A.Korzynski@el...> - 2006-04-04 02:47:39
I have attempted to design and will implement elements of an automatic
distributed checkpoint-recovery subsystem for OpenSSI. A complete subsystem
would periodically take coordinated checkpoints and automatically recover in
case of a node failure. Output commit would be implemented by logging
non-deterministic events. Unsupported non-deterministic events would be
masked with minimal coordinated checkpointing.
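To make the output-commit idea above concrete, here is a toy single-process
sketch. It is an illustration only, not the design from the paper: the class
and method names are hypothetical, "state" is reduced to an integer, stable
storage is a Python list, and the cluster-wide coordination of checkpoints is
left out entirely. It shows only the rule that output is held back until the
events it depends on are either logged or covered by a checkpoint.

```python
class Process:
    """Toy model of output commit with event logging (hypothetical names)."""

    def __init__(self):
        self.state = 0
        self.checkpoint = 0        # last checkpointed state ("stable storage")
        self.stable_log = []       # logged events that survive a crash
        self.volatile_log = []     # logged events not yet flushed
        self.pending_output = []   # output held back until commit
        self.released = []         # output already visible to the outside world
        self.dirty = False         # True if state depends on an unloggable event

    def deliver(self, delta):
        """A loggable non-deterministic event: record it, then apply it."""
        self.volatile_log.append(delta)
        self.state += delta

    def deliver_unloggable(self, delta):
        """An event that cannot be logged, so it cannot be replayed later."""
        self.state += delta
        self.dirty = True

    def emit(self):
        """Produce output, but hold it back until it is safe to release."""
        self.pending_output.append(self.state)

    def take_checkpoint(self):
        """Save the state; events before the checkpoint need no replay."""
        self.checkpoint = self.state
        self.stable_log.clear()
        self.volatile_log.clear()
        self.dirty = False

    def commit_output(self):
        """Release output only once the state behind it is recoverable."""
        if self.dirty:
            # Mask the unloggable event by checkpointing instead of logging.
            self.take_checkpoint()
        self.stable_log.extend(self.volatile_log)
        self.volatile_log.clear()
        self.released.extend(self.pending_output)
        self.pending_output.clear()

    def crash_and_recover(self):
        """Lose volatile data, then replay the stable log from the checkpoint."""
        self.volatile_log.clear()
        self.pending_output.clear()
        self.dirty = False
        self.state = self.checkpoint + sum(self.stable_log)
```

In this model, any output that reached the outside world can always be
reproduced after a crash, because the state it was derived from is rebuilt
either from the checkpoint alone or from the checkpoint plus the stable log.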
In the implementation, inter-process communication will be assumed to occur
via the cluster-wide IPC mechanisms. Communication via network sockets will
not be supported, because OpenSSI currently supports migration only of network
sockets used for communication with the outside world (via the CVIP), and not
of network sockets bound to local IP addresses intended for intra-cluster
communication.
I have written a paper outlining the design of such a subsystem. It was written
for my master's thesis, which will consist of designing and implementing
elements of the proposed subsystem.
Only Sections 4 and 5 of the paper are likely to be of interest to you.
Would you be interested in incorporating the subsystem into mainline OpenSSI?
I would appreciate feedback on the design and on the implementation problems
that I describe in the paper.
Additionally, according to some old posts on the mailing list, some work on
checkpointing has already been done, although it is not usable. However, the
posts don't explain exactly what that work was. What was it, and why was it not
usable?