[Clockwork-developers] More design ideas

I'd like to propose a more detailed design for the job scheduler. This is
based on my ideas in the last email, and some more research into Berkeley DB.
In this mail, I'll attempt to explain how the scheduler could be implemented
as a multiple-master system using Berkeley DB.

In every collection of systems running a job schedule, there must always
exist at least two "master" nodes. As I mentioned previously, these nodes
don't have to be dedicated systems. Being a master node just means that the
system has some additional responsibilities in the schedule.

The databases representing jobs, events, and such would be Berkeley DB
databases. Exactly one of the master nodes would be responsible for handling
database writes (a requirement of Berkeley DB is that only one system can
make updates). This system could be called the "write master." The other
master nodes have copies of the databases, which they update when they
receive replication messages from the write master. They can make read-only
queries against their data, but they can't update it -- updates must be
made to the write master.

Let me pause for a moment to explain how replication and failover work
using the Berkeley DB API. We are responsible for setting up and maintaining
a communications infrastructure between the write master and the other
masters. We provide a function to the Berkeley DB API that it can call on
the write master to send data to another system. We are in control of who
is the write master, although the API will help us figure it out by
supporting elections. There is a well-defined procedure for adding a new
subscriber system which will get that system caught up on the current state
of the database(s) and will start feeding the system updates. We're also
responsible for making sure writes only happen on the write master. If we
want, the API will guarantee that an update is committed to all replica
databases before returning from the commit. This may be useful, but may make
writes too slow.
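
To make the division of labor concrete, here is a rough sketch of the idea in
Python. It is a toy model only, not the Berkeley DB API: the names
WriteMaster, Replica, send and commit are all made up for illustration. It
shows a write master that pushes each update through a send function we
supply, and a commit that can optionally wait for every replica to
acknowledge.

```python
from typing import Callable, Dict, List, Tuple

# All names here are hypothetical; this models the idea, not Berkeley DB.
Record = Tuple[str, str]                  # one (key, value) update
SendFn = Callable[[str, Record], bool]    # (replica_id, record) -> acked?


class Replica:
    """A non-write master: applies replicated updates, serves local reads."""

    def __init__(self) -> None:
        self.data: Dict[str, str] = {}

    def apply(self, record: Record) -> bool:
        key, value = record
        self.data[key] = value
        return True  # acknowledge the update


class WriteMaster:
    """The only node allowed to originate updates."""

    def __init__(self, send: SendFn, replica_ids: List[str]) -> None:
        self.data: Dict[str, str] = {}
        self.send = send                  # transport function we provide
        self.replica_ids = replica_ids

    def commit(self, key: str, value: str, synchronous: bool = True) -> bool:
        self.data[key] = value
        # Push the update to every replica through our own transport.
        acks = [self.send(rid, (key, value)) for rid in self.replica_ids]
        # Synchronous mode succeeds only if every replica acknowledged.
        return all(acks) if synchronous else True


replicas = {"m2": Replica(), "m3": Replica()}


def send(replica_id: str, record: Record) -> bool:
    # In a real deployment this would cross the network; here it is a call.
    return replicas[replica_id].apply(record)


master = WriteMaster(send, list(replicas))
ok = master.commit("job:A", "defined")
```

In this model the cost of the synchronous mode is visible directly: commit
does not return until the slowest replica has answered.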

Having at least two master nodes -- a write master and one or more additional
masters -- meets our HA requirement. If one goes down, we can promote another
system in the environment to master and, if necessary, hold an election to
determine the write master.
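
As an illustration of what an election has to decide, here is a minimal
sketch. The rule used (prefer the live master that has applied the most
updates, then the highest priority) is my assumption about a sensible policy,
and MasterState and elect_write_master are made-up names. It is not a
description of how the Berkeley DB election actually works.

```python
from typing import Dict, NamedTuple, Optional


class MasterState(NamedTuple):
    last_applied: int   # sequence number of the last update applied
    priority: int       # administrator-assigned preference
    alive: bool


def elect_write_master(masters: Dict[str, MasterState]) -> Optional[str]:
    """Pick a new write master among the live masters (assumed policy)."""
    candidates = {name: s for name, s in masters.items() if s.alive}
    if not candidates:
        return None
    # Most up-to-date wins; priority, then name, break ties deterministically.
    return max(
        candidates,
        key=lambda n: (candidates[n].last_applied, candidates[n].priority, n),
    )


masters = {
    "m1": MasterState(last_applied=100, priority=5, alive=False),  # failed
    "m2": MasterState(last_applied=98, priority=1, alive=True),
    "m3": MasterState(last_applied=99, priority=1, alive=True),
}
winner = elect_write_master(masters)
```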

In addition to managing the databases of jobs, the masters are responsible for
running the jobs and deciding when it's time to run a job. Much of this can
be distributed among all the masters, letting them share the burden of this
work.

Suppose that the job schedule defines that at 5:00 PM, 100 jobs are supposed
to run on various systems in the environment. The write master would be
responsible for checking the time and determining that it's time to start
the 100 jobs (I haven't figured out how to distribute that part). He's
occasionally checking the clock to see if time-based jobs should start, and
when he sees that these 100 jobs should start, he sticks 100 events in the
"jobs-to-be-started" queue. This queue is one of the databases, which means
that all the masters have access to it. (One of the database types that
Berkeley DB supports is a queue, with an atomic "eat-from-the-head" operation.)

The jobs-to-be-started queue is seen by all the masters, which are
occasionally checking it for work. When one sees some work, it grabs some
number of jobs off the head of the queue (it must talk to the write master
to do this, since this is not a read-only query). Each master takes a portion
of the jobs to be started, tries to start them by talking to the client
system, and updates the write master again with the status of the job (either
"started" or "failed to start").
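
Here is a small sketch of the "grab a chunk off the head" step. It uses an
in-memory deque guarded by a lock as a stand-in for the queue database on the
write master; StartQueue, put_many and take are hypothetical names, not
Berkeley DB calls.

```python
import threading
from collections import deque
from typing import Deque, Iterable, List


class StartQueue:
    """Stand-in for the jobs-to-be-started queue held by the write master."""

    def __init__(self) -> None:
        self._items: Deque[str] = deque()
        self._lock = threading.Lock()

    def put_many(self, jobs: Iterable[str]) -> None:
        with self._lock:
            self._items.extend(jobs)

    def take(self, max_jobs: int) -> List[str]:
        # The lock makes the whole grab atomic, so two masters can never
        # receive the same job.
        with self._lock:
            count = min(max_jobs, len(self._items))
            return [self._items.popleft() for _ in range(count)]


queue = StartQueue()
queue.put_many(f"job{i:03d}" for i in range(100))   # the 5:00 PM batch

batch_a = queue.take(10)   # one master grabs ten jobs
batch_b = queue.take(10)   # another master gets the next ten
```

The chunk size is a tuning knob: bigger chunks mean fewer round trips to the
write master, smaller chunks spread the work more evenly.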

When the write master learns the final status of jobs that have completed
(I haven't covered exactly how that happens, but assume that word
eventually gets back to the write master), it sticks the name of each
newly-finished job on the "newly-finished" queue. The purpose of
this queue is to distribute the work of determining whether the completion
of a job means that any other jobs should now be started. All the masters
are checking this queue periodically also, and will grab a chunk of jobs to
process from it.

Determining the successors of jobs efficiently will require that we keep a
database of successors, keyed by completing job. For example, if jobs B
and C should start when job A finishes, looking up job A in the successors
database will return the names of jobs B and C. So in order to process
work from the newly-finished queue, all a master must do is look up the
finished job in the successors database, determine whether each dependency
is met (e.g., it is not met if job A finished with failure but the
dependency is success-only), and if it is, stick more job-start events on
the to-be-started queue (by talking to the write master again). These
job-start events will in turn get distributed among multiple masters for
processing.
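
The lookup described above might look something like this. It is a sketch
only: a plain dictionary stands in for the successors database, and the
Successor type, the condition values ("success", "failure", "any") and
jobs_to_start are names I made up for the example.

```python
from typing import Dict, List, NamedTuple


class Successor(NamedTuple):
    job: str
    condition: str   # "success", "failure", or "any" (assumed values)


# Stand-in for the successors database, keyed by the completing job.
successors: Dict[str, List[Successor]] = {
    "A": [Successor("B", "success"), Successor("C", "any")],
}


def jobs_to_start(finished_job: str, status: str) -> List[str]:
    """Return the successors whose dependency is met by this completion."""
    ready = []
    for succ in successors.get(finished_job, []):
        if succ.condition == "any" or succ.condition == status:
            ready.append(succ.job)
    return ready


after_success = jobs_to_start("A", "success")   # B and C are both eligible
after_failure = jobs_to_start("A", "failure")   # only C; B is success-only
```

This only handles a job with a single predecessor. A job that waits on
several predecessors would also need a check that all of them have finished.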

It might sound like the write master has a far greater burden of the work,
but it doesn't necessarily have to be that way. If we configure the database
replication to be synchronous, we can guarantee that all the other masters
are as up-to-date as the write master. Then, those
masters can query their own, local copies of the databases to find out
job definitions and successor information. They'll only need to involve the
write master when an update needs to be made.

Sure, there's a performance penalty from making the replication synchronous,
but if all the masters need to be working from the guaranteed-latest data,
then we have to pay that penalty somehow. Either the database layer can
do it for us, or we can do it ourselves by making all the other masters
talk to the write master for even read-only queries, to be sure the data
is current. The overhead is probably the same, so I'd rather let the
database do the work.

I haven't fully fleshed out how the client updates will get back to the
write master, but I was thinking of something like having the clients get a
list of all the masters when a job is started. When the job finishes, the
client will try to contact one of the systems in that list to report status.
If that system isn't the write master, it will take care of getting the
update to the write master. If that system is down (or no longer a master),
the client will move on to the next system in the list. As long as at least
one system in the list is still a master when the job finishes, this will
work.
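
The client side of that could be as simple as the loop below. This is a
sketch under assumptions: report_status and the try_send callback are
hypothetical, with try_send returning False when the contacted system is no
longer a master and raising ConnectionError when it is down.

```python
from typing import Callable, List, Optional


def report_status(
    masters: List[str],
    job: str,
    status: str,
    try_send: Callable[[str, str, str], bool],
) -> Optional[str]:
    """Try each known master in turn; return the one that took the report."""
    for master in masters:
        try:
            if try_send(master, job, status):
                return master
        except ConnectionError:
            pass   # that system is down; move on to the next one
    return None    # every master in the list is gone


received = []


def fake_send(master: str, job: str, status: str) -> bool:
    # Test double standing in for the real network call.
    if master == "m1":
        raise ConnectionError("m1 is down")
    if master == "m2":
        return False   # m2 is up but no longer a master
    received.append((master, job, status))
    return True


accepted_by = report_status(["m1", "m2", "m3"], "A", "success", fake_send)
```

If the function returns None, the client would have to hold on to the status
and retry later, which is the case where the whole list has gone stale.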

And assuming all masters have current job status data, a JobScape-like
GUI could attach to any of them. Being able to distribute this work will
really help out, especially if there are a lot of GUI users.

So the only things the write master does that the other masters can't
share in are making all the database updates (and we can't split that up)
and checking the clock periodically to see if any time-based jobs need to
start.
If we keep another database keyed by time, it shouldn't be too much work to
do that task. That's why I consider this somewhere in between the single-master
and distributed-across-all-nodes approaches. I really think this will
scale quite well.
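
For the database keyed by time, a sketch of the clock check could look like
this. A heap stands in for the time-keyed database, times are minutes since
midnight, and TimeIndex, add and pop_due are made-up names for the example.

```python
import heapq
from typing import List, Tuple


class TimeIndex:
    """Stand-in for a database of time-based jobs keyed by start time."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, str]] = []

    def add(self, start_minute: int, job: str) -> None:
        heapq.heappush(self._heap, (start_minute, job))

    def pop_due(self, now_minute: int) -> List[str]:
        # Only the earliest entries are examined, so a check that finds
        # nothing due costs a single comparison.
        due = []
        while self._heap and self._heap[0][0] <= now_minute:
            due.append(heapq.heappop(self._heap)[1])
        return due


index = TimeIndex()
index.add(17 * 60, "nightly")       # 5:00 PM
index.add(17 * 60, "report")        # 5:00 PM
index.add(18 * 60, "late")          # 6:00 PM

before = index.pop_due(17 * 60 - 1)   # 4:59 PM: nothing is due yet
at_five = index.pop_due(17 * 60)      # 5:00 PM: both 5:00 PM jobs
```

The jobs returned by pop_due are what the write master would stick on the
jobs-to-be-started queue.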

Do any of you have any thoughts on this design? I need to try and shoot holes
in it and see if I spent enough time thinking this up.

-- 
Joel