[Clockwork-developers] Centralized vs. decentralized design issues

First off, Happy Thanksgiving to everyone on the list! Hopefully you won't
mind some actual traffic on this mailing list (in addition to the monthly
mailman announcement).

I was doing some thinking the past couple of weeks about the clockwork
project, believe it or not, and Shawn McMahon and I had a few minutes to talk
about design the other day. As you may recall, things pretty much stalled out
while we were trying to work out whether to make the scheduler centralized
(like AutoSys) or decentralized (like something else we've not really worked
with, but think would be better).

Everyone agrees that AutoSys has some pretty bad bottlenecks, and I think
that's what has made many of us (myself included) want to steer clear of
a centralized design. But as Shawn pointed out to me last week, there's a
good chance that some well-applied multithreading could make AutoSys scale
a whole lot better. I've heard that there's some maximum number of events
per second that can be processed by an event processor, regardless of
how much horsepower you have. To me, this sounds like there's some important
stuff in AutoSys that isn't multithreaded.

The appeal to me of the single-master/centralized design is its simplicity.
The distributed design sounds great, but it also sounds very complex,
possibly requiring us to do multicast notifications and implement a tiny
little publish/subscribe system. A single-master design would make things
simpler to implement.

What are the things we hate about AutoSys' single-master design?
(1) It won't scale past 5,000 jobs. As I said before, I think we can fix that
with multithreading.
(2) It requires a dedicated pair of scheduling machines. We can eliminate
this requirement for small schedules if the event processor is fast enough
and we make configuration easy enough. An administrator could elect to
"promote" a couple of the managed systems to run the event processor.
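
To make the multithreading idea in point (1) a bit more concrete, here is a
minimal sketch in Python. It is not how AutoSys works internally (I don't
know that); it only shows a pool of worker threads dispatching events side
by side. The names dispatch and process_events, the event dictionaries, and
the "STARTED" status are all made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def dispatch(event):
    # Hypothetical placeholder for the real work: contact the client
    # machine and start the job. Here we only report a status.
    return (event["job"], "STARTED")


def process_events(events, workers=4):
    # Each event is handed to one of `workers` threads, so one slow
    # client does not hold up every other event in the queue.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input.
        return list(pool.map(dispatch, events))
```

Since most of the dispatch time is presumably spent waiting on the network,
even plain threads should let a single event processor overlap a lot of
that waiting.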

We could even design the multithreading so that some of the event processing
work could be done not just by another thread on the scheduling server,
but by another scheduling machine altogether. For instance, when you look at
a job in AutoSys that's about to be started, the state of the event is
briefly "PG" for "ProcessinG" while the EP dispatches it and talks to the
client. Imagine if there were multiple machines processing events, and
the status were set to "Processing by machine A." Something as simple
as that could off-load part of the burden of event-processing to multiple
systems.
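
The "Processing by machine A" idea could be sketched roughly as below,
assuming the events live in a SQL table that every event-processing machine
can reach. The table layout, the 'PENDING' status and the function names are
assumptions made for the example; only "PG" comes from AutoSys. An in-memory
SQLite database is used purely so the sketch runs on its own; with several
machines it would have to be a shared database server.

```python
import sqlite3


def make_db():
    # Hypothetical schema: one row per event, with the machine that
    # claimed it stored in `owner`.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events ("
        "id INTEGER PRIMARY KEY, job TEXT, status TEXT, owner TEXT)"
    )
    return conn


def claim_event(conn, event_id, machine):
    """Try to claim one pending event; return True if `machine` won it."""
    with conn:  # commit on success, roll back on an exception
        cur = conn.execute(
            "UPDATE events SET status = 'PG', owner = ? "
            "WHERE id = ? AND status = 'PENDING'",
            (machine, event_id),
        )
        # The WHERE clause only matches an unclaimed event, so a second
        # machine asking for the same event updates zero rows.
        return cur.rowcount == 1
```

The point of the design is that the claim is a single conditional UPDATE, so
the database decides which machine gets the event and the machines never
have to talk to each other.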

And if the system were flexible enough to allow the scheduling servers to
be easily set up (unlike AutoSys), they could be easily moved around either
by the administrator or perhaps automatically, based on load averages. Now
we've got a system that behaves like the distributed model, but isn't too
much more complicated than the plain single-master model.

There's also the issue of databases. Our AutoSys administrators don't like
its requirement of a SQL database because then they have to get DBA support.
But a SQL database sure makes some things easier for the programmers. I
looked briefly at SQLite [1], an embedded SQL database engine. It's kind
of neat -- you get SQL queries and even transaction support fully contained
within your application; the database lives in a file on the filesystem.
But it doesn't enforce data types on columns (any column can hold
anything), and it's unclear how well it holds up under a load of concurrent
users (all its benchmarks are single-user).
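
For anyone who hasn't seen an embedded SQL database, here is a small sketch
of the points above, using Python's standard sqlite3 module (which wraps
whatever SQLite version ships with Python, not necessarily the one I looked
at). The jobs table and its columns are invented for the example. It shows
the database living in an ordinary file, a transaction being rolled back,
and a text value landing in a column declared INTEGER.

```python
import os
import sqlite3
import tempfile


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "clockwork.db")  # hypothetical file name
        conn = sqlite3.connect(path)
        conn.execute("CREATE TABLE jobs (name TEXT, max_runtime INTEGER)")
        conn.commit()

        # A failure inside the transaction rolls the insert back.
        try:
            with conn:
                conn.execute(
                    "INSERT INTO jobs VALUES (?, ?)", ("nightly_backup", 3600)
                )
                raise RuntimeError("simulated failure")
        except RuntimeError:
            pass
        rows_after_rollback = conn.execute(
            "SELECT COUNT(*) FROM jobs"
        ).fetchone()[0]

        # The column is declared INTEGER, but a string is accepted as-is.
        with conn:
            conn.execute(
                "INSERT INTO jobs VALUES (?, ?)", ("weekly_report", "not a number")
            )
        stored = conn.execute("SELECT max_runtime FROM jobs").fetchone()[0]

        conn.close()
        is_file = os.path.isfile(path)
    return rows_after_rollback, stored, is_file
```

This says nothing about the concurrency question, which would need a real
multi-user test.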

There's also Berkeley DB (which I know Shawn McMahon despises), which claims
to support transactions and failover. If this is robust enough and easy
enough to work with, then it might be the answer -- giving the programmers
something that does the work of a database while appearing invisible to the
end user.

Of course, most people have a SQL database running *somewhere*, and is it
really a big deal if we tell them they need to host another database on it,
particularly if we didn't require a specific vendor's database?

I'll spend some more time thinking about exactly what the responsibilities
of the event processor are and trying to find ways to easily distribute them
over a few machines. In the meantime, if you have any thoughts, please
send them to the list.

The bottom line is that I really think that if we take the AutoSys EP model
and apply some well-placed multithreading and distributed computing, we'll
have a system with a simple design and the scalability we want.

[1] http://www.hwaci.com/sw/sqlite

-- 
Joel