[Planetlab-tools] keeping services running tools (was 'Planetlab related questions')
From: Adams, R. <rob...@in...> - 2003-10-06 21:46:24
Below, Vivek describes some of the mechanisms his team has created to keep the CoDeeN service running on all nodes. What are the rest of the PlanetLab users doing to keep their services running? If several people could describe what they're doing (and possibly provide some code), I'll pull together a HOWTO.

-- RA

-----Original Message-----
From: pla...@li... [mailto:pla...@li...] On Behalf Of Bowman, Mic
Sent: Monday, October 06, 2003 9:04 AM
To: Vivek Pai; Brent N. Chun
Cc: sk...@cs...; pla...@li...
Subject: [Planetlab-users] RE: [Planetlab-support] Re: Planetlab related questions

moving this discussion to planetlab-users...

--Mic

-----Original Message-----
From: pla...@li... [mailto:pla...@li...] On Behalf Of Vivek Pai
Sent: Saturday, October 04, 2003 08:20 PM
To: Brent N. Chun
Cc: sk...@cs...; pla...@li...
Subject: Re: [Planetlab-support] Re: Planetlab related questions

Brent N. Chun wrote:
> I'm sure Vivek also has a bunch of machinery to keep CoDeeN running
> all the time on PlanetLab. Vivek?

This may be more than what you bargained for :-) We have the following:

a) a monitoring process on each node that tries to make sure that all of the CoDeeN processes are alive, and restarts them if they aren't

b) a centrally-run sweep that checks every node every five minutes to make sure that the monitoring process is alive, and restarts everything if that process is dead

c) version numbers in the intra-CoDeeN communications protocol, so that nodes running different versions ignore each other

d) a daily sweep of all "important" files in CoDeeN - we checksum each file on each node and decide majorities, quorums, etc.

The last two items make sure that if a node is unreachable for a while (especially while we do an upgrade), it won't cause too much damage when it comes back up. It'll generally be ignored by the other nodes for a day, and we'll catch it in our admin e-mail the next morning. We don't do any automatic "get the latest version" kind of checks, because we will often stage our rollout of new versions, or test our alpha code on a few live nodes from time to time.

Our update process consists of scp'ing a set of files to all of the nodes, and then doing (on each node):

1) stop all processes
2) copy the new files into place
3) restart all processes

This lets us have downtimes of about 20 seconds per node when we roll out new code, and it works pretty well. We did have a weird case where a node died in step 2, leaving only some files updated. When it came back up, it refused to be restarted by step (b) above, but when step (d) did its checks, we saw the problem right away.

The one thing we don't do right now is grab all of our log files and store them centrally, but this has not been much of an issue yet. We'll probably start doing that soon, though, just so we can free up disk space - our compressed logs are approaching 500MB on some nodes.

-Vivek
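
A minimal sketch of the per-node monitor described in (a), in Python. The process names, restart commands, and polling interval are placeholders, not CoDeeN's actual configuration; the point is simply to poll for each expected process and restart anything that has died.

    #!/usr/bin/env python
    # Hypothetical per-node watchdog: check that each expected service
    # process is running and restart anything that has died.
    import subprocess
    import time

    # Process names and restart commands are placeholders.
    SERVICES = {
        "proxy-main":   "/home/codeen/bin/start-proxy.sh",
        "proxy-helper": "/home/codeen/bin/start-helper.sh",
    }

    def is_running(name):
        # pgrep exits 0 if at least one process with this exact name exists.
        return subprocess.call(["pgrep", "-x", name],
                               stdout=subprocess.DEVNULL) == 0

    def main():
        while True:
            for name, restart_cmd in SERVICES.items():
                if not is_running(name):
                    subprocess.call(restart_cmd, shell=True)
            time.sleep(30)   # polling interval is a guess

    if __name__ == "__main__":
        main()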
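The centrally-run sweep in (b) can be sketched the same way, assuming password-less ssh from the central host to each node. The node list, watchdog name, and restart script below are invented for illustration.

    #!/usr/bin/env python
    # Hypothetical central sweep: every five minutes, verify that the
    # per-node watchdog is alive on every node, and restart the whole
    # service stack on any node where it is not.
    import subprocess
    import time

    NODES = ["node1.example.org", "node2.example.org"]   # placeholder list
    WATCHDOG = "watchdog.py"                              # placeholder name
    RESTART_ALL = "/home/codeen/bin/restart-all.sh"       # placeholder script

    def watchdog_alive(node):
        # ssh returns non-zero if pgrep finds nothing or the node is unreachable.
        return subprocess.call(
            ["ssh", "-o", "ConnectTimeout=10", node, "pgrep", "-f", WATCHDOG],
            stdout=subprocess.DEVNULL) == 0

    def sweep():
        for node in NODES:
            if not watchdog_alive(node):
                subprocess.call(["ssh", node, RESTART_ALL])

    while True:
        sweep()
        time.sleep(300)   # five-minute interval, per the description above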
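Mechanism (c) amounts to a version gate on incoming peer messages. The wire format below (JSON with a "version" field) is an assumption made only to show the idea; the real intra-CoDeeN protocol is not described here.

    # Hypothetical version gate on peer messages: the field name and
    # version constant are invented for illustration.
    import json

    PROTOCOL_VERSION = 7   # bumped on every incompatible rollout

    def handle_peer_message(raw):
        msg = json.loads(raw)
        if msg.get("version") != PROTOCOL_VERSION:
            return None          # silently ignore peers on other versions
        return msg               # otherwise process normally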
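For the daily file sweep in (d), one simple approach is to collect a checksum of each important file from every node and flag any node that disagrees with the majority. The file list, node list, and use of md5sum are assumptions, not CoDeeN's actual tooling.

    #!/usr/bin/env python
    # Hypothetical daily sweep: checksum the "important" files on every node,
    # then report any node whose checksum differs from the majority value.
    import subprocess
    from collections import Counter

    NODES = ["node1.example.org", "node2.example.org", "node3.example.org"]
    IMPORTANT_FILES = ["/home/codeen/bin/proxy", "/home/codeen/etc/peers.conf"]

    def remote_md5(node, path):
        out = subprocess.run(["ssh", node, "md5sum", path],
                             capture_output=True, text=True)
        return out.stdout.split()[0] if out.returncode == 0 else None

    def sweep():
        for path in IMPORTANT_FILES:
            sums = {node: remote_md5(node, path) for node in NODES}
            digests = [d for d in sums.values() if d]
            if not digests:
                continue          # every node unreachable; nothing to compare
            majority, _ = Counter(digests).most_common(1)[0]
            for node, digest in sums.items():
                if digest != majority:
                    print("MISMATCH %s on %s (have %s, majority %s)"
                          % (path, node, digest, majority))

    if __name__ == "__main__":
        sweep()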
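Finally, the stop / copy / restart rollout can be scripted roughly as below. The paths and helper scripts (stop-all.sh, start-all.sh) are placeholders; staging the files with scp before touching anything that is running is what keeps the per-node downtime to the roughly 20 seconds described above.

    #!/usr/bin/env python
    # Hypothetical rollout following the three-step recipe above:
    # stage the new files, then stop / copy into place / restart per node.
    import subprocess

    NODES = ["node1.example.org", "node2.example.org"]
    NEW_FILES = ["proxy", "peers.conf"]        # assumed to sit in the cwd
    STAGING = "/home/codeen/staging/"
    INSTALL = "/home/codeen/bin/"

    def rollout(node):
        # Stage the new files first so the downtime window stays short.
        subprocess.check_call(["scp"] + NEW_FILES + ["%s:%s" % (node, STAGING)])
        # Stop, copy into place, restart.
        subprocess.check_call(["ssh", node,
            "stop-all.sh && cp %s* %s && start-all.sh" % (STAGING, INSTALL)])

    for node in NODES:
        rollout(node)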