[Planetlab-tools] keeping services running tools (was 'Planetlab related questions')
From: Adams, R. <rob...@in...> - 2003-10-06 21:46:24
Below, Vivek describes some of the mechanisms his team has created to keep the CoDeeN service running on all nodes. What are the rest of the PlanetLab users doing to keep their services running? If several people could describe what they're doing (and possibly provide some code), I'll pull together a HOWTO.

-- RA

-----Original Message-----
From: pla...@li... [mailto:pla...@li...] On Behalf Of Bowman, Mic
Sent: Monday, October 06, 2003 9:04 AM
To: Vivek Pai; Brent N. Chun
Cc: sk...@cs...; pla...@li...
Subject: [Planetlab-users] RE: [Planetlab-support] Re: Planetlab related questions

moving this discussion to planetlab-users...

--Mic

-----Original Message-----
From: pla...@li... [mailto:pla...@li...] On Behalf Of Vivek Pai
Sent: Saturday, October 04, 2003 08:20 PM
To: Brent N. Chun
Cc: sk...@cs...; pla...@li...
Subject: Re: [Planetlab-support] Re: Planetlab related questions

Brent N. Chun wrote:
> I'm sure Vivek also has a bunch of machinery to keep CoDeeN running
> all the time on PlanetLab. Vivek?

This may be more than what you bargained for :-) We have the following:

a) a monitoring process on each node that tries to make sure that all of the CoDeeN processes are alive, and restarts them if they aren't

b) a centrally-run sweep that checks every node every five minutes to make sure that the monitoring process is alive, and restarts everything if that process is dead

c) version numbers in the intra-CoDeeN communications protocol, so that nodes running different versions ignore each other

d) a daily sweep of all "important" files in CoDeeN - we checksum each file on each node and decide majorities, quorums, etc.

The last two items make sure that if a node is unreachable for a while (especially while we do an upgrade), it won't cause too much damage when it comes back up. It'll generally be ignored by the other nodes for a day, and we'll catch it in our admin e-mail the next morning. We don't do any automatic "get the latest version" kind of checks, because we will often stage our rollout of new versions, or test our alpha code on a few live nodes from time to time.

Our update process consists of scp'ing a set of files to all of the nodes, and then doing (on each node):

1) stop all processes
2) copy the new files into place
3) restart all processes

This lets us have downtimes of about 20 seconds per node when we roll out new code, and it works pretty well. We did have a weird case where a node died in step 2, leaving only some files updated. When it came back up, it refused to be restarted by step (b) above, but when step (d) did its checks, we saw the problem right away.

The one thing we don't do right now is grab all of our log files and store them centrally, but this has not been much of an issue yet. We'll probably start doing that soon, though, just so we can free up disk space - our compressed logs are approaching 500MB on some nodes.

-Vivek
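
A minimal sketch of the per-node monitor described in (a), in Python. The process names, restart commands, and polling interval are placeholders, not CoDeeN's actual configuration; the point is simply to poll for each expected process and restart anything that has died.

    #!/usr/bin/env python
    # Hypothetical per-node watchdog: check that each expected service
    # process is running and restart anything that has died.
    import subprocess
    import time

    # Process names and restart commands are placeholders.
    SERVICES = {
        "proxy-main":   "/home/codeen/bin/start-proxy.sh",
        "proxy-helper": "/home/codeen/bin/start-helper.sh",
    }

    def is_running(name):
        # pgrep exits 0 if at least one process with this exact name exists.
        return subprocess.call(["pgrep", "-x", name],
                               stdout=subprocess.DEVNULL) == 0

    def main():
        while True:
            for name, restart_cmd in SERVICES.items():
                if not is_running(name):
                    subprocess.call(restart_cmd, shell=True)
            time.sleep(30)   # polling interval is a guess

    if __name__ == "__main__":
        main()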
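The centrally-run sweep in (b) can be sketched the same way, assuming password-less ssh from the central host to each node. The node list, watchdog name, and restart script below are invented for illustration.

    #!/usr/bin/env python
    # Hypothetical central sweep: every five minutes, verify that the
    # per-node watchdog is alive on every node, and restart the whole
    # service stack on any node where it is not.
    import subprocess
    import time

    NODES = ["node1.example.org", "node2.example.org"]   # placeholder list
    WATCHDOG = "watchdog.py"                              # placeholder name
    RESTART_ALL = "/home/codeen/bin/restart-all.sh"       # placeholder script

    def watchdog_alive(node):
        # ssh returns non-zero if pgrep finds nothing or the node is unreachable.
        return subprocess.call(
            ["ssh", "-o", "ConnectTimeout=10", node, "pgrep", "-f", WATCHDOG],
            stdout=subprocess.DEVNULL) == 0

    def sweep():
        for node in NODES:
            if not watchdog_alive(node):
                subprocess.call(["ssh", node, RESTART_ALL])

    while True:
        sweep()
        time.sleep(300)   # five-minute interval, per the description above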
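Mechanism (c) amounts to a version gate on incoming peer messages. The wire format below (JSON with a "version" field) is an assumption made only to show the idea; the real intra-CoDeeN protocol is not described here.

    # Hypothetical version gate on peer messages: the field name and
    # version constant are invented for illustration.
    import json

    PROTOCOL_VERSION = 7   # bumped on every incompatible rollout

    def handle_peer_message(raw):
        msg = json.loads(raw)
        if msg.get("version") != PROTOCOL_VERSION:
            return None          # silently ignore peers on other versions
        return msg               # otherwise process normally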
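For the daily file sweep in (d), one simple approach is to collect a checksum of each important file from every node and flag any node that disagrees with the majority. The file list, node list, and use of md5sum are assumptions, not CoDeeN's actual tooling.

    #!/usr/bin/env python
    # Hypothetical daily sweep: checksum the "important" files on every node,
    # then report any node whose checksum differs from the majority value.
    import subprocess
    from collections import Counter

    NODES = ["node1.example.org", "node2.example.org", "node3.example.org"]
    IMPORTANT_FILES = ["/home/codeen/bin/proxy", "/home/codeen/etc/peers.conf"]

    def remote_md5(node, path):
        out = subprocess.run(["ssh", node, "md5sum", path],
                             capture_output=True, text=True)
        return out.stdout.split()[0] if out.returncode == 0 else None

    def sweep():
        for path in IMPORTANT_FILES:
            sums = {node: remote_md5(node, path) for node in NODES}
            digests = [d for d in sums.values() if d]
            if not digests:
                continue          # every node unreachable; nothing to compare
            majority, _ = Counter(digests).most_common(1)[0]
            for node, digest in sums.items():
                if digest != majority:
                    print("MISMATCH %s on %s (have %s, majority %s)"
                          % (path, node, digest, majority))

    if __name__ == "__main__":
        sweep()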
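Finally, the stop / copy / restart rollout can be scripted roughly as below. The paths and helper scripts (stop-all.sh, start-all.sh) are placeholders; staging the files with scp before touching anything that is running is what keeps the per-node downtime to the roughly 20 seconds described above.

    #!/usr/bin/env python
    # Hypothetical rollout following the three-step recipe above:
    # stage the new files, then stop / copy into place / restart per node.
    import subprocess

    NODES = ["node1.example.org", "node2.example.org"]
    NEW_FILES = ["proxy", "peers.conf"]        # assumed to sit in the cwd
    STAGING = "/home/codeen/staging/"
    INSTALL = "/home/codeen/bin/"

    def rollout(node):
        # Stage the new files first so the downtime window stays short.
        subprocess.check_call(["scp"] + NEW_FILES + ["%s:%s" % (node, STAGING)])
        # Stop, copy into place, restart.
        subprocess.check_call(["ssh", node,
            "stop-all.sh && cp %s* %s && start-all.sh" % (STAGING, INSTALL)])

    for node in NODES:
        rollout(node)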