[Clockwork-developers] More design ideas
Status: Planning
Brought to you by:
jlouder
From: Joel L. <jo...@lo...> - 2002-12-14 23:48:07
I'd like to propose a more detailed design for the job scheduler. This is based on my ideas in the last email, and some more research into Berkeley DB. In this mail, I'll attempt to explain how the scheduler could be implemented as a multiple-master system using Berkeley DB.

In every collection of systems running a job schedule, there must always exist at least two "master" nodes. As I mentioned previously, these nodes don't have to be dedicated systems. Being a master node just means that the system has some additional responsibilities in the schedule.

The databases representing jobs, events, and such would be Berkeley DB databases. Exactly one of the master nodes would be responsible for handling database writes (a requirement of Berkeley DB is that only one system can make updates). This system could be called the "write master." The other master nodes have copies of the databases, which they update when they receive replication messages from the write master. They can make read-only queries against their data, but they can't update it -- updates must be made on the write master.

Let me pause for a moment to explain how replication and failover work using the Berkeley DB API. We are responsible for setting up and maintaining a communications infrastructure between the write master and the other masters. We provide a function to the Berkeley DB API that it can call on the write master to send data to another system. We are in control of who is the write master, although the API will help us figure it out by supporting elections. There is a well-defined procedure for adding a new subscriber system, which gets that system caught up on the current state of the database(s) and starts feeding it updates. We're also responsible for making sure writes only happen on the write master. If we want, the API will guarantee that an update is committed to all replica databases before returning from the commit. This may be useful, but may make writes too slow.
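Here is a rough sketch of that replication model as a toy Python program. It does not use the real Berkeley DB API; the class names `WriteMaster` and `Replica`, the message format, the `send` callback signature, and the `sync` flag are all made up for illustration. The point is only to show who writes, who reads, and what the synchronous option costs.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical replication message: (sequence number, key, value).
Message = Tuple[int, str, str]


class Replica:
    """A non-write master: applies replication messages, serves reads only."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.data: Dict[str, str] = {}
        self.applied_seq = 0

    def apply(self, msg: Message) -> None:
        seq, key, value = msg
        self.data[key] = value
        self.applied_seq = seq


class WriteMaster:
    """The only node that is allowed to update the database."""

    def __init__(self, send: Callable[[str, Message], None],
                 replicas: List[str], sync: bool = True) -> None:
        self.send = send  # transport function that we supply (assumed signature)
        self.replicas = replicas
        self.sync = sync
        self.data: Dict[str, str] = {}
        self.seq = 0
        self.pending: List[Tuple[str, Message]] = []

    def put(self, key: str, value: str) -> None:
        self.seq += 1
        self.data[key] = value
        msg = (self.seq, key, value)
        for name in self.replicas:
            if self.sync:
                # Synchronous: the write returns only after every replica has it.
                self.send(name, msg)
            else:
                # Asynchronous: replicas lag until flush() delivers the message.
                self.pending.append((name, msg))

    def flush(self) -> None:
        for name, msg in self.pending:
            self.send(name, msg)
        self.pending.clear()
```

With `sync=True` a replica can answer a read as soon as `put` returns. With `sync=False` the write is faster, but a replica may serve stale data until the pending messages are delivered.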
Having at least two master nodes -- a write master and one or more additional masters -- meets our HA requirement. If one goes down, we can promote another system in the environment to master and, if necessary, hold an election to determine the write master.

In addition to managing the databases of jobs, the masters are responsible for deciding when it's time to run a job and for running it. Much of this can be distributed among all the masters, letting them share the burden of this work.

Suppose that the job schedule defines that at 5:00 PM, 100 jobs are supposed to run on various systems in the environment. The write master would be responsible for checking the time and determining that it's time to start the 100 jobs (I haven't figured out how to distribute that part). It periodically checks the clock to see if time-based jobs should start, and when it sees that these 100 jobs should start, it sticks 100 events in the "jobs-to-be-started" queue. This queue is one of the databases, which means that all the masters have access to it. (One of the database types that Berkeley DB supports is a queue, with an atomic "eat-from-the-head" operation.)

All the masters periodically check the jobs-to-be-started queue for work. When one sees some work, it grabs some number of jobs off the head of the queue (it must talk to the write master to do this, since this is not a read-only query). Each master takes a portion of the jobs to be started, tries to start them by talking to the client system, and updates the write master again with the status of the job (either "started" or "failed to start").

When the write master gets updates as to the final status of jobs when they complete (I haven't covered exactly how that happens, but assume that word eventually gets back to the write master), it sticks the name of each newly-finished job on the "newly-finished" queue.
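The jobs-to-be-started flow described above could look roughly like the following Python sketch. It is a toy model with in-memory structures standing in for the Berkeley DB databases; the names `WriteMasterQueues`, `check_clock`, `take_jobs`, `report`, and `run_master_once`, the "HH:MM" time keys, and the chunk size are all assumptions for illustration.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Set


class WriteMasterQueues:
    """Toy stand-in for the databases that only the write master updates."""

    def __init__(self, schedule: Dict[str, List[str]]) -> None:
        self.schedule = schedule  # database keyed by start time -> job names
        self.to_be_started: Deque[str] = deque()
        self.status: Dict[str, str] = {}
        self.fired: Set[str] = set()  # start times already handled

    def check_clock(self, now: str) -> int:
        """Queue every job scheduled for `now`; return how many were queued."""
        if now in self.fired:
            return 0
        self.fired.add(now)
        jobs = self.schedule.get(now, [])
        self.to_be_started.extend(jobs)
        return len(jobs)

    def take_jobs(self, max_jobs: int) -> List[str]:
        """Remove up to `max_jobs` jobs from the head of the queue."""
        taken: List[str] = []
        while self.to_be_started and len(taken) < max_jobs:
            taken.append(self.to_be_started.popleft())
        return taken

    def report(self, job: str, status: str) -> None:
        self.status[job] = status


def run_master_once(write_master: WriteMasterQueues,
                    start_job: Callable[[str], bool],
                    chunk: int = 10) -> List[str]:
    """One polling pass by any master: claim a chunk, start it, report back."""
    jobs = write_master.take_jobs(chunk)
    for job in jobs:
        ok = start_job(job)  # talks to the client system in the real design
        write_master.report(job, "started" if ok else "failed to start")
    return jobs
```

Every call a master makes on `WriteMasterQueues` here corresponds to a round trip to the write master, which is where the chunk size matters: bigger chunks mean fewer round trips but a less even spread of work.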
The purpose of the newly-finished queue is to distribute the work of determining whether the completion of a job means that any other jobs should now be started. All the masters are checking this queue periodically also, and will grab a chunk of jobs to process from it.

Determining the successors of jobs efficiently will require that we keep a database of successors, keyed by completing job. For example, if jobs B and C should start when job A finishes, looking up job A in the successors database will return the names of jobs B and C. So in order to process work from the newly-finished queue, all a master must do is look up the finished job in the successors database, determine if each dependency is met (for example, it is not met if job A finished with failure but the dependency is success-only), and if it is, stick more job-start events on the to-be-started queue (by talking to the write master again). These job-start events will in turn get distributed among multiple masters for processing.

It might sound like the write master has a far greater burden of the work, but it doesn't necessarily have to be that way. If we configure the database replication to be synchronous, we can guarantee that all the other masters are as up-to-date as the write master. Then, those masters can query their own, local copies of the databases to find out job definitions and successor information. They'll only need to involve the write master when an update needs to be made.

Sure, there's a performance penalty from making the replication synchronous, but if all the masters need to be working from the guaranteed-latest data, then we have to pay that penalty somehow. Either the database layer can do it for us, or we can do it ourselves by making all the other masters talk to the write master even for read-only queries, to be sure the data is current. The overhead is probably the same, so I'd rather let the database do the work.
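Going back to the successors lookup, here is a minimal Python sketch of processing one entry from the newly-finished queue. The layout of the successors database (a list of successor/condition pairs per job) and the condition names "success" and "any" are assumptions, not something settled in this design. It also ignores successors that wait on more than one predecessor.

```python
from typing import Dict, List, Tuple

# Hypothetical successors database, keyed by completing job.
# Each entry is (successor job, condition); condition is "success" or "any".
Successors = Dict[str, List[Tuple[str, str]]]


def jobs_to_start(successors: Successors, finished_job: str,
                  finished_ok: bool) -> List[str]:
    """Return the successors whose dependency on `finished_job` is met."""
    ready: List[str] = []
    for successor, condition in successors.get(finished_job, []):
        if condition == "any" or (condition == "success" and finished_ok):
            ready.append(successor)
    return ready
```

A master would run this against its local, read-only copy of the successors database and then hand the returned names to the write master to be put on the to-be-started queue.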
I haven't fully fleshed out how the client updates will get back to the write master, but I was thinking of something like having the clients get a list of all the masters when a job is started. When the job finishes, the client will try to contact one of the systems in that list to report status. If that system isn't the write master, it will take care of getting the update to the write master. If that system is down (or no longer a master), the client will move on to the next system in the list. As long as at least one system in that list is still a master when the job finishes, this will work.

And assuming all masters have current job status data, a JobScape-like GUI could attach to any of them. Being able to distribute this work will really help out, especially if there are a lot of GUI users.

So the only things the write master does that the other masters can't share in are making all the database updates (and we can't split that up) and checking the clock periodically to see if any time-based jobs need to start. If we keep another database keyed by time, it shouldn't be too much work to do that task. That's why I consider this somewhere in between the single-master and distributed-across-all-nodes approaches. I really think this will scale quite well.

Do any of you have any thoughts on this design? I need to try and shoot holes in it and see if I spent enough time thinking this up.

-- Joel