Re: [Clockwork-developers] Architecture of job scheduler
From: Charles B. H. <br...@do...> - 2002-01-11 15:35:05
----- Original Message -----
From: "Joel Loudermilk" <jo...@lo...>
To: <clo...@li...>
Sent: Thursday, January 10, 2002 9:46 PM
Subject: [Clockwork-developers] Architecture of job scheduler

> In the limited research I did looking at features of commercial job
> schedulers, I found an interesting idea. I had always taken for granted
> that any distributed job scheduler would need to have a central server
> process to manage the schedule and distribute the work of the schedule to
> other systems. But Orsyp's scheduler (called "Dollar Universe") has no
> central server -- although the schedule is still managed centrally.
>
> Does anyone have an opinion about this kind of design? Like I said, until
> I saw Dollar Universe, I had just assumed there would have to be a central
> event processor (to borrow a term from AutoSys). And a SQL-based database
> would make things easier to work with from a development perspective, but
> not until now did I realize that it might make the system less attractive
> for a user, since they would have to manage another database.

In my experience with Tibco software (namely the system monitoring tool they call Hawk), I've encountered a similar configuration. Hawk is designed to provide and act on statistical information about systems. It can be configured centrally (in other words, it's possible to configure an entire environment from one GUI), but it has no central server. I'm not sure exactly how this is implemented, except for the following bits of information:

1) Each server contains its own database of configuration data, and presumably at least some configuration data for other servers.
2) The system itself is monitored by running an application which listens for broadcast or multicast traffic from the agents running on each node.
3) Information about, say, a server going down is either discovered by any client running at the time (Hey, this server went away!), or is reported by agents on other nodes noticing that the server went away.

My opinions on this type of design are as follows:

First off, we'd have to understand that in order to take away the need for a user to manage a central database, we'd need to find a way to create (and maintain automatically) individual databases on each node. Perhaps we could create an XML schema that would allow us to store the database information on each server. We'd want to keep in mind that this type of design will likely preclude us from using more advanced database features (such as triggers, internalized locking mechanisms, etc.). If we want these features, and we want to have the information decentralized, we might look into whether there are any "mini-database" systems available. I believe I recall reading some time ago about a version of MySQL that had been made for implementations such as this -- where you wouldn't necessarily want to have an actual database server running, but you might want to be able to use a slimmed-down SQL database.

Given that we can come up with an acceptable database framework to use, I think decentralizing the database is a pretty good idea. The client-server mechanism that AutoSys uses seems somewhat flawed to me. For starters, what's the point of storing all the jobs that are going to run on server A on server B? What happens when server A goes down? Server B has to deal with this fact. If the configurations were decentralized, and server A went down, server A would simply have to worry about recovering its own jobs. AutoSys suffers in this regard. We are all familiar with AutoSys chase alarms occurring when a server is down. These are (in my opinion) a result of the client-server design -- the server relies on a (pretty much) stateless client to provide state information, which doesn't persist between reboots.
If there were no server, the "clients" (probably actually "agents") would _have_ to maintain all their own relevant state information, so when the system crashes, recovery would be no more difficult than noting which jobs were running at the time of the crash, marking them as "failed" (or "terminated"), and optionally restarting them.

We'd also want to think about how we'd implement a monitoring application. How would it know which servers to connect to? Maybe a configuration file? How network-intensive will it be to monitor a large network of servers with such a tool? This might be one disadvantage of such an architecture. Would it be possible to generate an overall visual picture of job flows using this architecture?

Before I shut up, I wanted to bring up another idea I had: Could we include the capability to have a single logical "job" run on multiple machines? In other words, considering all the jobs in use for CHRONOS, when we run jobs in a "distributed" fashion on multiple application servers, could we somehow allow users to consider those as a unit? This might make visualization of the schedule simpler.

OK, I'm done.

-Brian
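
P.S. To make the agent-side recovery idea above a bit more concrete, here is a rough sketch in Python. It assumes the standard sqlite3 module as a stand-in for the "mini-database" on each node; the table layout, the job states, and every function name are made up for illustration and are not part of any agreed Clockwork design.

```python
import sqlite3

# Hypothetical per-node schema: one row per job this agent is responsible for.
SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    name    TEXT PRIMARY KEY,
    state   TEXT NOT NULL,              -- e.g. 'running', 'succeeded', 'failed'
    restart INTEGER NOT NULL DEFAULT 0  -- 1 = re-run this job after a crash
)
"""


def open_store(path):
    """Open (or create) the agent's local state database."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    conn.commit()
    return conn


def record_job(conn, name, state, restart=False):
    """Persist a job's current state so it survives a reboot."""
    conn.execute(
        "INSERT OR REPLACE INTO jobs (name, state, restart) VALUES (?, ?, ?)",
        (name, state, int(restart)),
    )
    conn.commit()


def recover_after_crash(conn):
    """Run at agent start-up.

    Any job still recorded as 'running' must have been interrupted by the
    crash: mark it 'failed' and return the names flagged for restart.
    """
    rows = conn.execute(
        "SELECT name, restart FROM jobs WHERE state = 'running' ORDER BY name"
    ).fetchall()
    conn.execute("UPDATE jobs SET state = 'failed' WHERE state = 'running'")
    conn.commit()
    return [name for name, restart in rows if restart]
```

An agent would call open_store() with a file on its own disk, call record_job() on every state change, and call recover_after_crash() once at start-up, re-submitting whatever names come back. Nothing here needs a database server process, which is the point of the decentralized approach.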