Re: [Clockwork-developers] Implementation of a decentralized schedule

----- Original Message -----
From: "Joel Loudermilk" <jo...@lo...>
To: <clo...@li...>
Sent: Saturday, January 12, 2002 4:11 PM
Subject: [Clockwork-developers] Implementation of a decentralized schedule

> Don't get me wrong, I really like the idea of not needing a central server
> process and database (multiplied by two for redundancy), but I can't quite
> figure out how to make this work without them.

How about this:  a "client" (used for monitoring) that connects to each
server in an environment, based on either a predetermined configuration, or
some sort of autodiscovery method.  Yes, the client would need to talk to
each server, but I think some way to centralize monitoring is going to be a
requirement, whether the data is centralized or not.  The way I see it, we
have the following options if we want to decentralize things:

1) Allow the client to have a predetermined list of servers (possibly in a
configuration file), and have it make TCP connections to each server as
needed, sending all communications over those connections.  This isn't
difficult to implement, so even if we go with a more complex design, I'd
suggest implementing something like this, if only for testing purposes.

2) Have the client broadcast on its local networks (UDP), and autodiscover
servers.  This might work, especially if we make each of the servers know
about each of the other servers in its environment.  If you think about it,
the servers will need to know which other servers are part of their
environment just to execute the schedule and to know if servers are down.
The client could broadcast on its local network, grab a server list from
one server that responds, and then open TCP connections to communicate
with the individual servers.

3) Implement two methods of communication:  a protocol for the client to
directly request information from the servers (this would be done using
unicast TCP connections), and IP multicast for sending event-related
information between the servers and the clients.  This would keep us from
having to use broadcasts (to start up the client, method 1 or 2 could be
used), and make it easy for all the servers to know about events that
occur.

4) Use broadcasts for communications which apply to more than one server,
and use unicast TCP for point-to-point communication.
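
To make option 3 a bit more concrete, here is a rough Python sketch of what
the multicast event channel might look like.  Everything specific in it is
an assumption made for illustration: the group address and port, the JSON
encoding, the field names, and the function names.  None of that has been
decided; it only shows that joining a group and sending a datagram takes a
handful of standard socket calls.

```python
import json
import socket
import struct

# Assumed values for illustration only; nothing in this thread fixes them.
GROUP = "239.255.42.42"   # an administratively scoped multicast address
PORT = 5007


def encode_event(kind, server, job, status=None):
    """Serialize one event into a single UDP datagram payload (hypothetical format)."""
    event = {"kind": kind, "server": server, "job": job, "status": status}
    return json.dumps(event, sort_keys=True).encode("utf-8")


def decode_event(payload):
    """Inverse of encode_event: bytes back to a dict."""
    return json.loads(payload.decode("utf-8"))


def open_event_sender(ttl=1):
    """UDP socket for multicasting events; a TTL of 1 keeps them on the local subnet."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, struct.pack("b", ttl))
    return sock


def open_event_listener(group=GROUP, port=PORT):
    """UDP socket joined to the event group; every server and client would open one."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    # Join the group on the default interface (0.0.0.0).
    mreq = struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton("0.0.0.0"))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    return sock


def send_event(sock, payload, group=GROUP, port=PORT):
    """Send one already-encoded event to the whole group."""
    sock.sendto(payload, (group, port))
```

A server would call open_event_listener() once at startup and loop on
recvfrom(), while anything that has news to report calls send_event().
Since UDP multicast gives no delivery guarantee, a real protocol would
probably also need sequence numbers or some way to ask for a resend over
the unicast TCP channel.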

Of these, if we want to decentralize the application, I'm most in favor of
the third option:  using multicast to send event-related messages, and TCP
unicast for point-to-point communications.  Here's how it would work in some
scenarios:

Scenario 1:  Job failure.  The server on which the job failed sends a
multicast message to the other servers in its environment (and any clients
which may be running) to inform them of the failure.  From there, servers
could respond to the failure as necessary...running other jobs, displaying
an alert (on the client), automatically notifying administrators, whatever.

Scenario 2:  Job dependency between two servers.  On completion of the first
job, its scheduling daemon sends a multicast message to the group informing
them of the completion of the first job and its status.  The server on which
the second job runs receives this information and begins the dependent job
if the first job was successful.

Scenario 3:  Force Start of a Job.  While monitoring the distributed
application, an administrator wants to start a job on demand.  The client
opens a unicast TCP connection to the server in question and issues the
command to start the job.  That server then sends a multicast message so
that the other servers can act on the starting of that job if necessary.
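
Here is a small Python sketch of how a scheduling daemon might react to
these three scenarios.  It deliberately leaves out the networking and only
shows the decision logic.  The event layout (a dict with "kind", "server",
"job", and "status"), the class name, and the method names are all made up
for this example, not a proposal for the actual protocol.

```python
from dataclasses import dataclass, field


@dataclass
class LocalScheduler:
    """Hypothetical per-server daemon state, driven by events from the group."""
    name: str
    # Local job name -> set of (server, job) pairs that must succeed first.
    dependencies: dict = field(default_factory=dict)
    completed: set = field(default_factory=set)
    started: list = field(default_factory=list)
    alerts: list = field(default_factory=list)

    def handle_event(self, event):
        """React to one event received on the multicast group."""
        key = (event["server"], event["job"])
        if event["kind"] == "job_failed":
            # Scenario 1: record the failure so something can alert or notify.
            self.alerts.append(key)
        elif event["kind"] == "job_completed" and event.get("status") == "success":
            # Scenario 2: start any local job whose upstream jobs have all succeeded.
            self.completed.add(key)
            for job, upstream in self.dependencies.items():
                if job not in self.started and upstream <= self.completed:
                    self.started.append(job)

    def force_start(self, job):
        """Scenario 3: start a job on a unicast request, return the event to multicast."""
        self.started.append(job)
        return {"kind": "job_started", "server": self.name, "job": job, "status": None}


srv2 = LocalScheduler("srv2", dependencies={"load": {("srv1", "extract")}})
srv2.handle_event({"kind": "job_completed", "server": "srv1",
                   "job": "extract", "status": "success"})
print(srv2.started)
```

Note that the sender never needs to know who depends on its job.  It just
announces the result, and each server decides for itself whether the news
matters to it.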

I hope this is enough for everyone to see how the idea might work.  Using
plain old TCP will work too, but will require a lot of connections between a
lot of servers.  If we can design two protocols so that we use multicast and
unicast together, each where it makes the most sense to do so, I think we
can accomplish the decentralized design.

I don't think relying solely on point-to-point communications will be very
scalable in a decentralized design.  The reason is that, for an environment
of n servers, we might need up to nC2 = n(n-1)/2 connections.  Using
multicast, each server listens to a single multicast group address, and a
lot of the traffic would go over that group.  When point-to-point
connections are required (which should be relatively infrequent), they can
be opened and then closed.
Multicast might be a little more difficult to implement, but it will reward
us in terms of scalability and resource usage.
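
To put rough numbers on that, here is a quick Python calculation comparing
the worst-case full mesh of TCP connections against one group membership
per server.  The function names are just for this example, and the server
counts are arbitrary.

```python
def full_mesh_connections(n):
    """Worst case for point-to-point only: one connection per pair, n(n-1)/2."""
    return n * (n - 1) // 2


def multicast_memberships(n):
    """With multicast, each server joins the single event group once."""
    return n


for n in (5, 20, 100):
    print(n, full_mesh_connections(n), multicast_memberships(n))
# 5 10 5
# 20 190 20
# 100 4950 100
```

The mesh grows with the square of the number of servers, while the
multicast memberships grow linearly.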

So, what does everyone think?

-Brian