Re: [Clockwork-developers] Architecture of job scheduler

----- Original Message -----
From: "Joel Loudermilk" <jo...@lo...>
To: <clo...@li...>
Sent: Thursday, January 10, 2002 9:46 PM
Subject: [Clockwork-developers] Architecture of job scheduler

> In the limited research I did looking at features of commercial job
> schedulers, I found an interesting idea. I had always taken for granted
> that any distributed job scheduler would need to have a central server
> process to manage the schedule and distribute the work of the schedule to
> other systems. But Orsyp's scheduler (called "Dollar Universe") has no
> central server -- although the schedule is still managed centrally.
>
> Does anyone have an opinion about this kind of design? Like I said, until
> I saw Dollar Universe, I had just assumed there would have to be a central
> event processor (to borrow a term from AutoSys). And a SQL-based database
> would make things easier to work with from a development perspective, but
> not until now did I realize that it might make the system less attractive
> for a user, since they would have to manage another database.

In my experience with Tibco software (namely Hawk, their system monitoring
tool), I've encountered a similar configuration.  Hawk is designed to
provide and act on statistical information about systems.  It can be
configured centrally (in other words, it's possible to configure an entire
environment from one GUI), but it has no central server.  I'm not sure
exactly how this is implemented, beyond the following bits of information:

1) Each server contains its own database of configuration data, and
presumably at least some configuration data for other servers.

2) The system itself is monitored by running an application which listens
for broadcast or multicast traffic from the agents running on each node.

3) Information about, say, a server going down is either discovered by any
client running at the time (Hey, this server went away!), or is reported by
agents on other nodes noticing that the server went away.
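
To make point 3 concrete, here is a minimal sketch of how a listener might
decide that a node "went away": it remembers when it last heard from each
agent and flags any that have been silent too long.  This is only an
illustration, not how Hawk actually works; the class name, the 30-second
timeout, and the idea of a periodic heartbeat are all assumptions, and the
network side (receiving the broadcast or multicast packets) is left out.

```python
import time


class HeartbeatTracker:
    """Remembers when each node's agent was last heard from (hypothetical)."""

    def __init__(self, timeout_seconds=30.0):
        # Assumed timeout: a node silent for longer than this is "missing".
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # node name -> timestamp of last heartbeat

    def record(self, node, now=None):
        """Note that a heartbeat from `node` has just arrived."""
        self.last_seen[node] = time.time() if now is None else now

    def missing(self, now=None):
        """Return the sorted names of nodes not heard from within the timeout."""
        now = time.time() if now is None else now
        return sorted(
            node
            for node, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )
```

Any agent or monitoring client could run the same tracker, which is what
lets the failure be noticed without a central server.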

My opinions on this type of design are as follows:

First off, we'd have to understand that in order to take away the need for a
user to manage a central database, we'd need to find a way to create (and
maintain automatically) individual databases on each node.  Perhaps we could
create an XML schema that would allow us to store the database information
on each server.  We'd want to keep in mind that using this type of design
will likely preclude us from using more advanced database features (such as
triggers, internalized locking mechanisms, etc.).  If we want these
features, and we want to have the information decentralized, we might look
into whether there are any "mini-database" systems available.  I believe I
recall reading some time ago about a version of MySQL that had been made for
implementations such as this -- where you wouldn't necessarily want to have
an actual database server running, but you might want to be able to use a
slimmed-down SQL database.
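
As an illustration of the "mini-database" idea, the sketch below uses
SQLite through Python's standard sqlite3 module.  That is a substitution
on my part, not the MySQL variant mentioned above: SQLite is an embedded
SQL engine that keeps the whole database in a single file and needs no
server process, and it still offers triggers and its own locking.  The
table layout and function names are hypothetical.

```python
import sqlite3


def open_node_store(path=":memory:"):
    """Open (and create if needed) this node's local job database.

    In practice `path` would be a file on the node; ":memory:" is used
    here only so the example runs without touching disk.
    """
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS jobs (
               name    TEXT PRIMARY KEY,
               command TEXT NOT NULL,
               state   TEXT NOT NULL DEFAULT 'idle'
           )"""
    )
    conn.commit()
    return conn


def add_job(conn, name, command):
    """Insert a job definition; the `with` block commits or rolls back."""
    with conn:
        conn.execute(
            "INSERT INTO jobs (name, command) VALUES (?, ?)", (name, command)
        )


def job_state(conn, name):
    """Return the stored state of a job, or None if it is not defined."""
    row = conn.execute(
        "SELECT state FROM jobs WHERE name = ?", (name,)
    ).fetchone()
    return row[0] if row else None
```

Since the database is created automatically on first use, the user would
not have to install or administer anything extra on each node.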

Provided we can come up with an acceptable database framework, I think
decentralizing the database is a pretty good idea.  The client-server
mechanism that AutoSys uses seems somewhat flawed to me.  For starters,
what's the point of storing all the jobs that are going to run on server A
on server B?  What happens when server A goes down?  Server B has to deal
with this fact.  If the configurations were decentralized and server A went
down, server A would simply have to worry about recovering its own jobs.
AutoSys suffers in this regard.  We are all familiar with AutoSys chase
alarms occurring when a server is down.  These are (in my opinion) a result
of the client-server design -- the server relies on a (pretty much)
stateless client to provide state information, which doesn't persist between
reboots.

If there were no server, the "clients" (probably actually "agents") would
_have_ to maintain all their own relevant state information, so when a
system crashes, recovery would be no more difficult than noting which jobs
were running at the time of the crash, marking them as "failed" (or
"terminated"), and optionally restarting them.
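
The recovery step could look roughly like the sketch below, which an agent
would run once at startup against the job states it had saved before the
crash.  The state names ("running", "failed", "pending") and the notion of
a per-job restart list are assumptions made for the example.

```python
def recover_jobs(saved_states, restart=()):
    """Work out post-crash job states from the states saved before the crash.

    saved_states: dict mapping job name -> last recorded state.
    restart: names of jobs that should be re-queued rather than failed.
    Returns a new dict; the input is left untouched.
    """
    restart = set(restart)
    recovered = {}
    for job, state in saved_states.items():
        if state != "running":
            # Jobs that were not mid-run are unaffected by the crash.
            recovered[job] = state
        elif job in restart:
            # Interrupted, but configured to be started again.
            recovered[job] = "pending"
        else:
            # Interrupted and not restartable: record the failure.
            recovered[job] = "failed"
    return recovered
```

For example, with "a" and "c" running and "b" already done, calling
recover_jobs(states, restart=["c"]) marks "a" as failed, re-queues "c",
and leaves "b" alone.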

We'd also want to think about how we'd implement a monitoring application.
How would it know which servers to connect to?  Maybe a configuration file?
How network-intensive will it be to monitor a large network of servers with
such a tool?  This might be one disadvantage of such an architecture.  Would
it be possible to generate an overall visual picture of job flows using this
architecture?

Before I shut up, I wanted to bring up another idea I had:  Could we include
the capability to have a single logical "job" run on multiple machines?  In
other words, considering all the jobs in use for CHRONOS, when we run jobs
in a "distributed" fashion on multiple application servers, could we allow
users somehow to consider those as a unit?  This might make visualization of
the schedule simpler.
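
One way to picture this: each machine reports the state of its own piece,
and the schedule view shows a single rolled-up state for the logical job.
The sketch below is just one possible set of roll-up rules; the state
names and their precedence are assumptions, not a settled design.

```python
def combined_state(per_host_states):
    """Roll the per-host states of one logical job up into a single state.

    per_host_states: dict mapping host name -> state of the job on that host.
    Assumed precedence: any failure wins, then anything still running,
    then "done" only if every host is done; otherwise "pending".
    """
    states = set(per_host_states.values())
    if not states:
        return "unknown"
    if "failed" in states:
        return "failed"
    if "running" in states:
        return "running"
    if states == {"done"}:
        return "done"
    return "pending"
```

A monitoring tool could then draw one box per logical job and only expand
it into per-host detail when asked.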

OK, I'm done.

-Brian