clockwork-developers Mailing List for Clockwork


I'm gonna go out on a limb and say we're not actually gonna do this. Right?

I've remembered a pet peeve about Autosys.  If you have a job on hold,
and you want to change it to on ice, and the flow has come to that job,
the job starts instead of going on ice.

That is not what most people expect when they try to put the job
on ice.  Let's make sure clockwork doesn't do that.  Instead, the job
should go on ice, and the flow should proceed past it.
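
A minimal sketch of that rule, not a design decision. The class, state, and
method names are made up for illustration, and it assumes a simple linear flow
in which a held job blocks everything after it:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: class, state, and method names are illustrative only.
class OnIceDemo {
    enum State { READY, ON_HOLD, ON_ICE, DONE }

    static class Job {
        final String name;
        State state;

        Job(String name, State state) {
            this.name = name;
            this.state = state;
        }
    }

    // Moving a held job to "on ice" only changes its state. It never starts
    // the job, even if the flow has already arrived at it.
    static void putOnIce(Job job) {
        job.state = State.ON_ICE;
    }

    // Walks a linear flow and returns the names of the jobs it started.
    // Assumption: a held job blocks what follows it; an iced job is skipped
    // and the flow proceeds past it.
    static List<String> runFlow(List<Job> flow) {
        List<String> started = new ArrayList<>();
        for (Job job : flow) {
            if (job.state == State.ON_HOLD) {
                break;
            }
            if (job.state == State.ON_ICE || job.state == State.DONE) {
                continue;
            }
            job.state = State.DONE; // pretend the job ran to completion
            started.add(job.name);
        }
        return started;
    }
}
```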

-- 
Shawn McMahon           | All US citizens should immediately start open-
Episode IV Consulting   | signing their email messages as a voluntary act
System Administrator    | of patriotic duty. - Dr. Michael L. Love
and all-around nice guy | http://gnu-darwin.sourceforge.net/war.html

Shawn McMahon and I had a conversation weeks ago about logging. Depending
on how our users do their monitoring, they may need our software to log to
any number of different formats/destinations.

Shawn came up with a pretty comprehensive list:
 * file
 * syslog
 * SNMP trap
 * Windows event log

That should satisfy just about everyone. I can't imagine anyone who's serious
about monitoring their hosts and applications not being able to pick up
error messages from at least one of those sources.

The user should be able to configure multiple logging destinations, and
set a minimum severity for each. For example, it should be possible to
configure logging like:
 + DEBUG or higher -> /var/log/clockwork.log
 + INFO or higher  -> syslog
 + WARNING or higher -> send as SNMP trap
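
One way to picture the per-destination minimum severity is the sketch below.
It is plain Java with invented names (LogRouterDemo, Router, MemoryDestination)
and an in-memory stand-in for the real destinations; it does not use Commons
Logging or Log4J:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these names come from an existing library.
class LogRouterDemo {
    // Declared from least to most severe so compareTo() orders them.
    enum Severity { DEBUG, INFO, WARNING, ERROR }

    interface Destination {
        void write(Severity severity, String message);
    }

    // In-memory stand-in for a file, syslog, or SNMP trap sender.
    static class MemoryDestination implements Destination {
        final String name;
        final List<String> lines = new ArrayList<>();

        MemoryDestination(String name) {
            this.name = name;
        }

        public void write(Severity severity, String message) {
            lines.add(severity + " " + message);
        }
    }

    static class Route {
        final Severity minimum;
        final Destination destination;

        Route(Severity minimum, Destination destination) {
            this.minimum = minimum;
            this.destination = destination;
        }
    }

    static class Router {
        final List<Route> routes = new ArrayList<>();

        void addRoute(Severity minimum, Destination destination) {
            routes.add(new Route(minimum, destination));
        }

        // Deliver the message to every destination whose minimum it meets.
        void log(Severity severity, String message) {
            for (Route route : routes) {
                if (severity.compareTo(route.minimum) >= 0) {
                    route.destination.write(severity, message);
                }
            }
        }
    }
}
```

With routes for DEBUG, INFO, and WARNING as in the list above, a WARNING
message reaches all three destinations and a DEBUG message reaches only the
first.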

Having different places to log, each with their own semantics for severity,
will require some abstraction so that Clockwork can log to a common logging
API, and that thing can figure out how to translate it to the API of
a particular backend. Luckily, the nice folks at Apache's Jakarta project
have already done just this with the Commons Logging component. It's described
as "an ultra-thin bridge between different logging libraries." They
have created the common logging API, and make it easy for you to write a little
code that plugs in whatever backend you want to use to their logging
framework.

Incidentally, they've already got the "plug-in" for Log4J written, which
handles file and syslog logging. So all that's left is SNMP traps and
Windows event log. I did a little digging, and I'm afraid there's no
way to do the Windows event log in pure Java -- a .DLL must be involved.

If you're interested in doing some reading, here are a couple of references:
http://jakarta.apache.org/commons/logging.html
http://jakarta.apache.org/log4j/index.html

-- 
Joel

The other day I was thinking about how the Clockwork servers would store
their configuration data, which would be replicated between all the servers.
I made some notes on this, and will share them with everyone.

First off, when I talk about the configuration of the server, I'm not talking
about the job schedule. I'm talking about settings like: which mail server
to use, lists of clients and port numbers, which ciphers to use for SSL.

Since this configuration applies to the entire schedule "instance", it needs
to be replicated between all the servers. That makes it a natural fit for
one of the Berkeley databases, which are already getting replicated.

So now all that needs to be determined is how to modify the configuration.
For this, I'll borrow from some features of Veritas Cluster Server, which lets
you modify just about any aspect of configuration while the server is running
by using commands that really have no idea what you're tweaking. Here's what
I mean:

We put all the configuration data into a separate Berkeley DB, called "config"
for example. Into the config database we store each item of the configuration,
storing the item name and then its data. The data could be one of a few
"types": scalar, list, or hash. All any program that manipulates the config
database needs to know is how to add/modify/remove these configuration items.

Here's an example: when the scheduler needs to know what mail server to use,
it pulls the configuration item called "SMTPServer" out of the config
database. Its value is a scalar, and is the hostname of the SMTP server.
When the scheduler needs to know who all the clients are, it pulls out the
value of "Clients", which is a list of hostnames.

So the interface that reads and writes the configuration can be very simple,
and independent of the configuration data.
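
To make the "simple interface" concrete, here is a rough sketch. It uses an
in-memory map as a stand-in for the config Berkeley DB, and the class and
method names are invented; only the item names SMTPServer and Clients come
from the description above:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: an in-memory map stands in for the "config" database.
class ConfigStoreDemo {
    private final Map<String, Object> items = new TreeMap<>();

    void putScalar(String name, String value) {
        items.put(name, value);
    }

    void putList(String name, List<String> values) {
        items.put(name, new ArrayList<>(values));
    }

    void putHash(String name, Map<String, String> values) {
        items.put(name, new LinkedHashMap<>(values));
    }

    void remove(String name) {
        items.remove(name);
    }

    boolean contains(String name) {
        return items.containsKey(name);
    }

    String getScalar(String name) {
        return (String) require(name, String.class);
    }

    @SuppressWarnings("unchecked")
    List<String> getList(String name) {
        return (List<String>) require(name, List.class);
    }

    @SuppressWarnings("unchecked")
    Map<String, String> getHash(String name) {
        return (Map<String, String>) require(name, Map.class);
    }

    // The store knows only the three value types, never what an item means.
    private Object require(String name, Class<?> type) {
        Object value = items.get(name);
        if (!type.isInstance(value)) {
            throw new IllegalArgumentException(
                name + " is missing or is not a " + type.getSimpleName());
        }
        return value;
    }
}
```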

I've glossed over exactly who it is that's updating the config database --
is it the running server, or is it a command-line program that's directly
manipulating the database file?

Most of the time, it should be the running server. The config database is
replicated, so it's a bad idea to go mucking around with the database file
behind the server's back. But there could be times when all servers are down,
and you need to make a modification without starting the server. So we'll
need that capability as well.

As a safeguard, the config database should include a flag to indicate whether
the server is running. If a user tries to use the file-modifying configuration
method and the flag is set, the utility can refuse to make the change. Of
course, there will need to be a way for the user to explicitly ask to clear
the flag, for situations where the server didn't stop gracefully, so it
didn't get a chance to clear the flag.
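
A sketch of that safeguard, again with a plain map standing in for the
database file. The item name "ServerRunning" and the method names are
assumptions made up for this example:

```java
import java.util.Map;

// Hypothetical sketch of the offline (file-modifying) configuration path.
class OfflineConfigDemo {
    // Assumed name for the "server is running" flag item.
    static final String RUNNING_FLAG = "ServerRunning";

    // Applies the change only when the flag is not set.
    // Returns true if the change was made, false if it was refused.
    static boolean offlineSet(Map<String, String> db, String name, String value) {
        if ("true".equals(db.get(RUNNING_FLAG))) {
            return false;
        }
        db.put(name, value);
        return true;
    }

    // Explicit override for when the server died without clearing the flag.
    static void clearRunningFlag(Map<String, String> db) {
        db.put(RUNNING_FLAG, "false");
    }
}
```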

If necessary, we can also provide a way to use the file-modifying configuration
method to load the configuration from a file, for users who are making lots
of changes at once. Changing the SMTP server is no big deal, but a user
who's configuring their server for the first time may have lots of
settings to change, and may not like the idea of running the configuration
command 30 or so times with different arguments.

This is all for the servers' configuration, of course. Clients will need
some configuration, too, but since it's unique to each client it's not
worth the effort to do something like this. The clients can just use
configuration files.

-- 
Joel

It's been a while, and I've played a bit more with Java's SSL support, and I'm
now convinced that SSL is the way to go for the client-to-server communication.

Interestingly enough, this week a vulnerability was found in SSL when used
in certain scenarios. But it doesn't look like it would impact Clockwork,
because it requires:
* The cipher must be used in CBC mode. The "best" cipher Java has available
for SSL is RC4, which isn't vulnerable to this attack because it's a stream
cipher.
* The protocol on top of SSL needs to have a fixed "password" at a certain
spot. As far as I can tell, Clockwork's protocol won't have this, because
there's no password to send. The attack works on things like IMAP, where the
connection always opens with "LOGIN username password" or something like that.
* The SSL implementation must "leak" error information by treating padding
errors and decryption errors differently. I'm sure that by the time we're
ready to release any files, Sun will have a patch for this (if their
implementation contains this flaw).

At first glance, SSL seems painful for Clockwork users to use, since each
node needs to have a private key, and must have the public key of all the
entities that will communicate with it. But I think this could be done rather
simply with just two key pairs:
(1) A server key pair
(2) A client key pair

Each server would have the server private key, and the client public key.
Likewise, each client would have the client private key and the server public
key. Sharing them makes things operationally simpler for the user, but
it increases the risk if a private key is compromised. So we'll build in a
way for the administrator to easily add keys to the "keyring" of the client
and the server, to make transitioning to a new client key pair (or server
key pair) easy, without taking down the scheduler. This would also make it
possible for the administrator to use multiple client keys or server keys 
if he chooses.

Now that I've beat security to death, I've got some ideas on other topics. But
I'll send those in a separate mail.

-- 
Joel

I have a few more ideas about security and want to float them past everyone.
I also have some more thoughts on a couple of other topics, but I'll leave
them to separate emails.

Currently, the plan for security calls for all conversations to be encrypted.
Shawn suggested the Rijndael cipher, which was selected to be the new AES
standard.

We'd use public key cryptography to let each end of the connection verify
the other end and securely exchange a Rijndael key to use for the rest of
the conversation (because public key ciphers are too slow for bulk
encryption).
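
The key exchange step can be sketched as below. This assumes a current JDK
(not the 1.4 release discussed in this thread), RSA as the public key cipher,
and an AES session key; the class and method names are invented:

```java
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

// Hypothetical sketch: both ends of the exchange run in one method here.
class SessionKeyDemo {
    static final String WRAP = "RSA/ECB/OAEPWithSHA-256AndMGF1Padding";

    // Returns true if the receiver recovers the same session key.
    static boolean exchangeSessionKey() {
        try {
            // Receiver's long-lived key pair.
            KeyPairGenerator pairGen = KeyPairGenerator.getInstance("RSA");
            pairGen.initialize(2048);
            KeyPair receiver = pairGen.generateKeyPair();

            // Sender picks a fresh symmetric key for this conversation.
            KeyGenerator keyGen = KeyGenerator.getInstance("AES");
            keyGen.init(128);
            SecretKey sessionKey = keyGen.generateKey();

            // Sender wraps the session key with the receiver's public key.
            Cipher wrapper = Cipher.getInstance(WRAP);
            wrapper.init(Cipher.WRAP_MODE, receiver.getPublic());
            byte[] wrapped = wrapper.wrap(sessionKey);

            // Receiver unwraps it with the matching private key.
            Cipher unwrapper = Cipher.getInstance(WRAP);
            unwrapper.init(Cipher.UNWRAP_MODE, receiver.getPrivate());
            SecretKey recovered =
                (SecretKey) unwrapper.unwrap(wrapped, "AES", Cipher.SECRET_KEY);

            return Arrays.equals(sessionKey.getEncoded(), recovered.getEncoded());
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Only the small wrapped key goes through the slow public key cipher; the rest
of the conversation would use the symmetric session key.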

After doing some more reading, I've learned that this is essentially SSL.
In addition, the SSL specification also includes cipher negotiation, so the
two ends can list the ciphers they support and choose the best mutually
available one. Java has support for SSL built-in (to Java 1.4, that is).

In most HTTP-over-SSL applications (like secure web sites), the client
checks the server's identity, but the server doesn't check the client's (at
the SSL layer). But SSL does have support for both ends verifying the other --
that's just not used as much. We'd need to do that.

The upside of using SSL is that we don't have to roll our own code for
exchanging keys, checking identities, and negotiating ciphers. Also, the
issue of changing the client/server keys I touched on a while back (without
giving an easy solution) gets easier, since Java has a concept of a keystore
and certificate store that we could easily add more keys to and later (when
the old keys should no longer be accepted) remove keys from.

The downside? It looks like the SSL support is restricted to only use
certain ciphers: RC4, DES, and Triple-DES (these are the symmetric algorithms,
I didn't list the public key algorithms). At least that's what the built-in
SSL provider from Sun supports. I'll have to dig around some more and see
if another provider is available, or if you can just plug in a cipher.

-- 
Joel

While the JCE (Java Cryptography Extension) provider built into the JDK 1.4
(SunJCE provider) doesn't support a tremendous number of ciphers, I've
found another provider that does. Check out Cryptix JCE at:

http://cryptix.org/products/jce/index.html

This is a pure Java implementation of a JCE provider that supports lots
more ciphers, including AES, all under a free license. The catch: it's
"early access" quality.

If you're interested in seeing how JCE is used, I've attached a tiny
example program I used to encrypt a short string with AES. I had to beat
my head against the wall for quite a while to make this work before I
figured out that instead of asking for the "AES" algorithm I should ask for
the "Rijndael" algorithm.
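
For readers without the attachment, the following is not that program, just a
sketch of the same idea. It assumes a current JDK and its built-in provider
(where the algorithm is requested as "AES"), not Cryptix, and it uses GCM mode
as an assumption of this example:

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

// Hypothetical sketch: encrypt and decrypt a short string with AES via JCE.
class AesStringDemo {
    static final String TRANSFORMATION = "AES/GCM/NoPadding";

    static SecretKey newKey() {
        try {
            KeyGenerator generator = KeyGenerator.getInstance("AES");
            generator.init(128);
            return generator.generateKey();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // GCM needs a fresh initialization vector for every message.
    static byte[] newIv() {
        byte[] iv = new byte[12];
        new SecureRandom().nextBytes(iv);
        return iv;
    }

    static byte[] encrypt(SecretKey key, byte[] iv, String plaintext) {
        try {
            Cipher cipher = Cipher.getInstance(TRANSFORMATION);
            cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
            return cipher.doFinal(plaintext.getBytes(StandardCharsets.UTF_8));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    static String decrypt(SecretKey key, byte[] iv, byte[] ciphertext) {
        try {
            Cipher cipher = Cipher.getInstance(TRANSFORMATION);
            cipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, iv));
            return new String(cipher.doFinal(ciphertext), StandardCharsets.UTF_8);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```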

-- 
Joel

On Sun, Jan 26, 2003 at 03:36:43PM -0500, Joel Loudermilk said:
>
> key. Shawn explained to me that the keys are symmetric, meaning that you
> can either encrypt a message with the public key and decrypt it with the
> private key, or you can encrypt with the private key and decrypt with the
> public key.

Actually, that makes them asymmetric, but you have the technical
gist correct.

> message. I suppose we could make this communication not encrypted, but just
> with a digital signature in it, but it seems awful easy to make the whole
> conversation encrypted, and then there are no issues with someone snooping
> secret data (although it's probably not very important data).

Since we have to set up a secure channel anyway, it may not be
much of a performance hit to use it for commands.  AES, DES, et
al. were designed to be used on machines as stupid as smart
cards, so we should get acceptable performance.

> (2) How multiple servers work with authentication. My examples above talk
> about multiple agents and one server, but we know there will be multiple
> servers. My first thought is for all servers to have a copy of the
> private key. I think this would still be secure, and would allow everything
> to continue to work as I described.

I agree.

> that level of security could be optional. Also, the administrator needs to
> be able to change the schedule's key pair if it's compromised. It would be
> nice if this could be done without stopping the servers (outages are bad,
> right?). If we make it difficult to change the key pair, the administrator
> will be less likely to do it.

We could make a mechanism for distribution of new keys over the
existing channel, but you wouldn't want it to be the only way,
in case your reason for sending new keys was because the old ones
were compromised.  But as a prophylactic measure it'd be useful.

-- 
Shawn McMahon         | Every time you walk out of the house
FedEx Services        | with clothes on, you give up freedom
DSS-MCO Security Lead | for temporary safety.

Shawn McMahon and I had a detailed discussion on security on Thursday. While
I don't have a handle on everything involved yet, I do think we made some
headway in the area of security. Here are the details of what we discussed.
(Shawn may have to correct me if I misstate anything.)

When an administrator sets up a schedule (or an "instance," to borrow
from AutoSys's terminology), one of the things that will be done is the
creation of a public/private key pair for the schedule. The scheduling server
will keep the private key (secured by appropriate filesystem permissions),
and all the clients (both GUIs and the agent software that runs on the
managed nodes to make them valid targets of jobs) will get the public
key. Shawn explained to me that the keys are symmetric, meaning that you
can either encrypt a message with the public key and decrypt it with the
private key, or you can encrypt with the private key and decrypt with the
public key.

First, how to make GUI clients (and command-line client programs used by
users) secure. As I mentioned before, I don't want to have to store user
and password lists inside Clockwork, and would like to at least give the
administrator the option of authenticating against something else. But this
means that the password from the GUI (or client program, rather, since it
might not be a GUI), will have to be sent to the server. We can't do a
challenge-response kind of thing, since if we're going to pass the password
off to some other backend we need to actually know what it is. So the
communications channel needs to be encrypted so we can safely pass the user's
password along. (I suppose the administrator will need to trust that we're
not secretly recording all the passwords. If he has doubts, he can just
read the source.)

When the client program and the server talk, they use public key encryption
to secretly exchange a "session key" that will be used to encrypt the
rest of their conversation. I suppose they could just use their existing
key pair to encrypt the entire conversation, but Shawn said that's way slower
than "secret key" encryption methods, so most people use the slow public
key encryption to securely exchange a secret key, which they then use to 
encrypt the rest of their conversation.

Now that the client and server have an encrypted communications channel, the
client can prompt the user for his username/password and send it to the server
without fear that it will be snooped or replayed. Some client programs (for
example, a command-line program to send an event or to display job history)
may be more useful if they don't ask for a password interactively, so the
administrator can set up a web page that runs those commands for instance.
In that case, we can make it so that the client programs can take the username
and password from the command line or from a file. This will make things
flexible enough so that the administrator can do whatever he wants with the
client programs.

While we don't want to force the administrator to create a list of
Clockwork-only usernames and passwords, some people might want to do just
that. To make things easy for them, we could make one of our authentication
backends check against an internal database. Berkeley DB provides an easy
way to encrypt a database, too. Other backends I have in mind are PAM (for
the UNIX folks) and calling an external program (for those who want to use
neither PAM nor the internal list). That should satisfy just about everyone.

The other part of security is the communications between the server and the
agents (the piece of Clockwork that runs on every machine that is able to
run jobs). Obviously, the agent must be able to tell that commands are
coming from the real scheduling server, or this would be a huge root exploit.
So when the scheduling server sends commands to the agents, it will
encrypt them using the schedule's private key. If the agent can decrypt the
message using the schedule's public key, then it knows it's a valid
message. I suppose we could make this communication not encrypted, but just
with a digital signature in it, but it seems awful easy to make the whole
conversation encrypted, and then there are no issues with someone snooping
secret data (although it's probably not very important data).
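
The sketch below shows the digital signature variant mentioned above, not the
encrypt-with-the-private-key wording. It assumes an RSA key pair and the
SHA256withRSA algorithm on a current JDK; the class, method, and command
strings are invented for illustration:

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.Signature;

// Hypothetical sketch: the server signs a command, the agent verifies it.
class SignedCommandDemo {
    static final String ALGORITHM = "SHA256withRSA";

    static KeyPair newScheduleKeyPair() {
        try {
            KeyPairGenerator generator = KeyPairGenerator.getInstance("RSA");
            generator.initialize(2048);
            return generator.generateKeyPair();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Server side: sign the command with the schedule's private key.
    static byte[] sign(PrivateKey privateKey, String command) {
        try {
            Signature signer = Signature.getInstance(ALGORITHM);
            signer.initSign(privateKey);
            signer.update(command.getBytes(StandardCharsets.UTF_8));
            return signer.sign();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Agent side: accept the command only if the signature checks out
    // against the schedule's public key.
    static boolean verify(PublicKey publicKey, String command, byte[] signature) {
        try {
            Signature verifier = Signature.getInstance(ALGORITHM);
            verifier.initVerify(publicKey);
            verifier.update(command.getBytes(StandardCharsets.UTF_8));
            return verifier.verify(signature);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A signature alone does not stop an old, validly signed command from being
replayed; this sketch leaves that problem out.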

That's it for the server-to-agent connections, but there will also
be agent-to-server connections. Just like in AutoSys, if a job
is going to run for hours, we probably don't want to hold the TCP connection
open that long, just so that the agent can say at the end what the job's
exit code was. So the agent will have to initiate a connection back to the
server (or, a server, since there may be multiple servers) to provide the
exit code. In this scenario, the server needs to be able to make sure the
system that just connected to it is a valid agent. So the agent will encrypt
the message with the schedule's public key (which all agents have) and the
server will decrypt it with the schedule's private key. For added security,
we can allow the administrator to set up a list of agent IP addresses/networks.
The incoming connection can be checked against this list, too. For even
more security, the incoming connection can be checked against the IP address
of the system where the job ran. There's little reason that anyone other
than "hosta" should be reporting on the final status of a job that ran on
"hosta." But this level of security might be difficult to manage, since
most hosts have several IP addresses, and the one the server connected to
to get the job started might not be the one all the traffic out of the system
goes across.

What I haven't addressed here are:
(1) Authorization. We've checked to make sure the users are who they say they
are, but how do we know who should be allowed to do what? I was thinking
of a three-level access model: administrators can change things, operators
can only stop/start/hold jobs, and guests can look but not modify anything.
We may need to apply these to something more granular than the entire schedule,
since it may be useful to have a set of users who can only start/stop the
backup jobs, or can only modify a certain set of jobs.
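
The three-level model could look something like this sketch. The action names
are assumptions, and it applies roles to the whole schedule only; the more
granular per-job-set permissions are not covered:

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch of the administrator/operator/guest access levels.
class AccessDemo {
    enum Action { VIEW, START, STOP, HOLD, MODIFY }

    enum Role {
        // Guests can look but not change anything.
        GUEST(EnumSet.of(Action.VIEW)),
        // Operators can additionally stop, start, and hold jobs.
        OPERATOR(EnumSet.of(Action.VIEW, Action.START, Action.STOP, Action.HOLD)),
        // Administrators can do everything, including modifying jobs.
        ADMINISTRATOR(EnumSet.allOf(Action.class));

        final Set<Action> allowed;

        Role(Set<Action> allowed) {
            this.allowed = allowed;
        }

        boolean allows(Action action) {
            return allowed.contains(action);
        }
    }
}
```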

(2) How multiple servers work with authentication. My examples above talk
about multiple agents and one server, but we know there will be multiple
servers. My first thought is for all servers to have a copy of the
private key. I think this would still be secure, and would allow everything
to continue to work as I described.

(3) Ease of maintenance. It needs to be easy for an administrator to change
a system's IP address. It's okay if we have this stored in our databases
somewhere, as long as we provide an easy way to change it when a host's
IP changes. Honestly, though, I'd prefer that we not be on the list of
things an administrator needs to do when he changes IP addresses. Maybe
that level of security could be optional. Also, the administrator needs to
be able to change the schedule's key pair if it's compromised. It would be
nice if this could be done without stopping the servers (outages are bad,
right?). If we make it difficult to change the key pair, the administrator
will be less likely to do it.

Comments, anyone?

-- 
Joel

Security is a big area to tackle in the scheduler, and it's important that it
be designed in correctly from the start. I see security used in two different
ways:
 (1) Making sure the scheduler processes on different systems communicating
     with each other are genuine and haven't been replaced with something
     that acts like the scheduler but really isn't.
 (2) Making sure that users running commands on the scheduling servers and
     through the GUI are who they say they are, so we can enforce restrictions.

I haven't devoted much thought to area #1, but I think the tougher part will
be area #2.

For the moment, let's assume that permissions restrictions are already set up,
meaning that there are rules in the scheduler saying things like "user
jlouder can do this." What I'm attempting to address is how we determine who
the user is -- without requiring the scheduler administrator to set up a
Clockwork-only user/password database.

For GUI users, we prompt them for a username and password. On the server
side, we let the administrator configure what we'll do to check usernames
and passwords. We could provide a couple of backends -- for example, one that
takes the username and password and feeds it to PAM (using a service name
of "clockwork"), and perhaps also one that feeds the username/password to
an external program, which the administrator could use to check the password
against just about anything (LDAP, NT domain, homegrown database, etc.).
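
The pluggable backend idea might be shaped like the sketch below. The
interface and class names are invented, and the in-memory backend is only a
stand-in so the example runs; a real backend would hand the credentials to
PAM or to an external program instead:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: the server talks to one interface, and the
// administrator chooses which implementation sits behind it.
class AuthBackendDemo {
    interface AuthBackend {
        boolean check(String username, String password);
    }

    // Stand-in backend for the example only. It keeps plain-text passwords
    // in memory, which a real backend should never do.
    static class InMemoryBackend implements AuthBackend {
        final Map<String, String> passwords = new HashMap<>();

        void addUser(String username, String password) {
            passwords.put(username, password);
        }

        public boolean check(String username, String password) {
            return password != null && password.equals(passwords.get(username));
        }
    }

    static class Authenticator {
        final AuthBackend backend;

        Authenticator(AuthBackend backend) {
            this.backend = backend;
        }

        // The server never needs to know where the password is checked.
        boolean login(String username, String password) {
            return backend.check(username, password);
        }
    }
}
```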

For command-line users, we don't prompt for a password, but simply take the
identity of the user who's running the program. For example, if user 'jlouder'
is running the process, then we'll do all authorization checks with the
username 'jlouder'.

Of course, for a command-line program to do much of anything (say, put a
start-job event in the queue), it probably will need to talk to a scheduler
daemon running with elevated privileges. So there will need to be a check of
security credentials in that conversation. This might prevent us from not
asking local users for passwords, since we'll have to supply something to the
daemon to get checked. It's not safe to have the client tell the daemon
"I'm user XYZ, just trust me!" So obviously I need to do some more thinking
on this.

But my main goal is to *not* have to house a username/password database. I
hate applications that do that.

If you have any thoughts or comments on security, I'd love to hear them. As
you can see, I have about 20% of an idea here.

-- 
Joel

I've made a change to the "Database Design" document on the SourceForge site.
The start_times database, which had time-of-day as keys and job names as
data, has been replaced by two new databases: next_start_by_time and
next_start_by_job.

The purpose of these databases (and the one they replaced) is to handle
getting a job started when its start time rolls around. The old design would
have required the scheduler to look up "9:21 PM" in the database, start
any jobs for that time, and check again in one minute. This didn't support
jobs that run at odd intervals well (e.g., every 13 minutes), and it also
meant that if the scheduler didn't check for "9:22 PM" jobs for some reason
(maybe it had a lot of jobs to start at 9:21 PM), they'd
get missed. Also, if the scheduler is down, it would be hard to figure out
what starts got missed when it came back up.

The replacement databases are Btrees, which are automatically sorted by key.
If we track the next start time of every job (like AutoSys does), and store
it in next_start_by_time with the key being the UNIX-style time, then the
database will get sorted in chronological order.

Now scheduled starts are easy. Using a cursor on the database, it's simple to
say "give me the first record." That will be the first job to be started.
If it's not time to start that job, do nothing. If it is, start it and try
the next one. A side benefit of this technique is that if the scheduler is
down and misses a bunch of job starts, they'll all get started as soon as it
comes back up without any special coding for that scenario.

The next_start_by_job exists for the cases where we want to pull up the
next start of a particular job. I imagine the GUI will want to display this
to the user. This will be a secondary index on the first database, so Berkeley
DB will take care of keeping it in sync for us.
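
The two databases described above can be imitated in memory to show the
lookup pattern. In this sketch a TreeMap stands in for the Btree and a HashMap
for the secondary index, and it keeps the two in sync by hand, which Berkeley
DB would do for us. All class and method names are invented:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for next_start_by_time and next_start_by_job.
class NextStartDemo {
    // Sorted by UNIX-style start time, like the Btree.
    final TreeMap<Long, List<String>> nextStartByTime = new TreeMap<>();
    // Job name to its next start time, like the secondary index.
    final Map<String, Long> nextStartByJob = new HashMap<>();

    void schedule(String job, long startTime) {
        Long previous = nextStartByJob.put(job, startTime);
        if (previous != null) {
            removeFromTimeIndex(previous, job);
        }
        nextStartByTime.computeIfAbsent(startTime, k -> new ArrayList<>()).add(job);
    }

    Long nextStart(String job) {
        return nextStartByJob.get(job);
    }

    // Takes records from the front while they are due. Starts missed during
    // an outage are simply the oldest records, so they go first.
    List<String> startDueJobs(long now) {
        List<String> started = new ArrayList<>();
        while (!nextStartByTime.isEmpty() && nextStartByTime.firstKey() <= now) {
            Map.Entry<Long, List<String>> entry = nextStartByTime.pollFirstEntry();
            for (String job : entry.getValue()) {
                nextStartByJob.remove(job);
                started.add(job);
            }
        }
        return started;
    }

    private void removeFromTimeIndex(long time, String job) {
        List<String> jobs = nextStartByTime.get(time);
        if (jobs != null) {
            jobs.remove(job);
            if (jobs.isEmpty()) {
                nextStartByTime.remove(time);
            }
        }
    }
}
```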

In other news, I've constructed a simple-minded Java example that uses the
Berkeley DB Queue format with multiple readers and one transaction-protected
writer, simulating the adding of events on to event_queue, and processing
of the events by the worker threads. After beating my head against the wall
several times, it's now working just fine.

I'm still unable to get gcj to compile a Java program that uses Berkeley DB
to native code. I'm not sure what's wrong. But I'm not too worried about that.

I believe the next thing to work on is to get a more detailed understanding
of how high availability will work, synchronizing databases and such. In a
previous mail, I had mentioned that I wanted to use the Berkeley DB
feature of making commits not return until the data had been pushed to all
replica databases. That simplifies the programming somewhat, but the more I
think about it the more that scares me. I suppose we have to deal with the
same scenario all database replication environments have -- what do you do
when the primary database fails, but some of the latest updates haven't been
applied to the replica database? I'll do some more thinking on this and
get back to everyone.

-- 
Joel

There's an article on Linux Journal about using GCJ to compile Java
code to either bytecode or native code that was just posted yesterday.
The guy who wrote it is one of the original GCJ developers:

http://www.linuxjournal.com/article.php?sid=4860

I was playing around with this earlier, and got the obligatory "Hello,
World!" example to compile into native code and run. I'm having a little
difficulty getting an example built that uses Berkeley DB, probably because
JNI (native interface) is involved. But the article says JNI works under
GCJ, so I probably just need to fiddle around with it.

When a program is compiled to native code with GCJ, it's linked against
libgcj, which provides (almost) all the classes that Sun's JRE provides (the
most notable exception is the AWT). There's even a garbage collector.

I need to do some more research to make sure that libgcj provides all the
classes/features we would want. I also hope that it's no more bug-ridden
than Sun's classes.

-- 
Joel

+- On Thursday (1/2/2003 9:20) Shawn McMahon <smc...@ei...> Wrote-
| I don't like the idea of packaging a JRE, they're huge, but if you don't
| you find that Java is not write-once run-anywhere, not even nearly so.

You're correct that we shouldn't expect to compile some Java bytecode
and let it loose on every platform, expecting it to work with whatever
JRE is installed there.

What I'm referring to, though, is the way that using Java would free the
programmer from needing to think much about the platform while writing the
code. It would be really nice not to have to look up every function to
make sure that POSIX guarantees it's there. And since we said we want to
support Windows (in some capacity), it would be _really_ nice not to have
to wrap code in "#ifdef WINDOWS". And it's just nicer to be able to open
a socket connection in 2 lines instead of 20.

| However, with the GNU Compiler, we could produce compiled native Java
| applications, on every target platform.  This would actually be easier
| than finding a JRE for all the target platforms and keeping it
| synchronized when we need new features.

If this works, it sounds great. I'd rather deliver native code for each
platform anyway, especially since using Berkeley DB would require that.

| I vote for whatever language you and Brian are best at coding.

Good point. I was working on the assumption (based on responses to emails
on this list and previous information I was told) that both Brian and
Shawn D. don't have very much free time to contribute, so I'd be doing the
vast majority of the coding. While I'm better at C than Java, I'd rather
not do a project this large in C because I think I'd spend too much time
coding things that are already there for you in Java and debugging memory
leaks. And I just plain don't like C++.

Brian and Shawn D. can correct me if I'm wrong about their planned involvement
in coding.

-- 
Joel

On Wed, Jan 01, 2003 at 05:41:08PM -0500, Joel Loudermilk said:
>
> One of the drawbacks of Java is that you can't really depend on a version
> of the JRE to be installed on a target system. But we could simply package
> the software with an acceptable JRE for each platform (which is what
> commercial vendors do with Java and with Perl). It would even be possible
> to make a version of the software that comes without a JRE for folks who
> don't want another one.

I don't like the idea of packaging a JRE, they're huge, but if you don't
you find that Java is not write-once run-anywhere, not even nearly so.
It's beaten by a country mile by perl, python, and probably half a dozen
others I don't know about.

However, with the GNU Compiler, we could produce compiled native Java
applications, on every target platform.  This would actually be easier
than finding a JRE for all the target platforms and keeping it
synchronized when we need new features.

Of course, if you're going to go there, why do Java in the first place,
unless all your coders are best at it?

I vote for whatever language you and Brian are best at coding.  The
whole point to the cross-platform ability of Java is lost until they
define a standard and stick to it, and they are years away from that
presently, if they even get there at all without turning it over to a
standards body.

-- 
Shawn McMahon            | Emacs: It's a nice OS, but to compete with
AIM work: spmcmahonfedex | Linux or Windows it needs a better text
AIM home: smcmahoneiv    | editor. - Alexander Duscheleit

I've been kicking around some ideas about the language to use to write
Clockwork, and while I was resistant at first, I think Java might work well.

Java would take care of much of the portability for us (remember that we
stated in the requirements that Clockwork needs to run on Windows, too). It's
also made easy the tasks of memory management, data structures, network
communication, threading, and even logging (Jakarta's log4j is quite good).
It really seems like Java would let us spend more time writing code that
actually runs the scheduler, without having to stop and first create a
message-passing system, a generalized logging system, and other low-level
stuff.

One of the drawbacks of Java is that you can't really depend on a version
of the JRE to be installed on a target system. But we could simply package
the software with an acceptable JRE for each platform (which is what
commercial vendors do with Java and with Perl). It would even be possible
to make a version of the software that comes without a JRE for folks who
don't want another one.

And there's the foolishness with $CLASSPATH and the fact that Java programs
don't look quite like other programs in 'ps' when they run, but we could
easily wrap them with shell scripts to make startup and shutdown easier.

I really think that if a user is able to pkgadd/apt-get/rpm/swinstall our
software package, edit a couple of configuration files, run
'/etc/init.d/clockwork start' and get the software running, then it doesn't
matter if it's Java and we had to bundle it with a JRE. If install and
setup is easy, users will use it. LimeWire is a great example of this. It
can install its own JRE if you want, and it's almost transparent. You just
run an installer and it works.

The Berkeley DB code has a Java interface, but it uses native code as the
backend. So we'd need to have separate software packages for each platform
for that reason. But I don't think this is a really big deal. Rather than
assuming, as RedHat's RPM does, that db >= version _x_ is installed and
in your $LD_LIBRARY_PATH, which forever ties together your db and rpm
software, I'd rather just stick the db libraries for whatever version we've
tested with in a lib/ directory along with the rest of our software.

-- 
Joel

I put down an idea for the number, type, and contents of the required
Berkeley DB databases for Clockwork this weekend, as well as my idea for
possible job states.

Rather than send everything through email, I posted them in the DocManager
on the SourceForge site:

https://sourceforge.net/docman/index.php?group_id=40038

These are by no means final, and I would encourage anyone who's interested
to review them, find problems, and poke fun at me.

-- 
Joel

I'd like to propose a more detailed design for the job scheduler. This is
based on my ideas in the last email, and some more research into Berkeley DB.
In this mail, I'll attempt to explain how the scheduler could be implemented
as a multiple-master system using Berkeley DB.

In every collection of systems running a job schedule, there must always
exist at least two "master" nodes. As I mentioned previously, these nodes
don't have to be dedicated systems. Being a master node just means that the
system has some additional responsibilities in the schedule.

The databases representing jobs, events, and such would be Berkeley DB
databases. Exactly one of the master nodes would be responsible for handling
database writes (a requirement of Berkeley DB is that only one system can
make updates). This system could be called the "write master." The other
master nodes have copies of the databases, which they update when they
receive replication messages from the write master. They can make read-only
queries against their data, but they can't update it -- updates must be
made to the write master.

Let me pause for a moment to explain how replication and failover work
using the Berkeley DB API. We are responsible for setting up and maintaining
a communications infrastructure between the write master and the other
masters. We provide a function to the Berkeley DB API that it can call on
the write master to send data to another system. We are in control of who
is the write master, although the API will help us figure it out by
supporting elections. There is a well-defined procedure for adding a new
subscriber system which will get that system caught up on the current state
of the database(s) and will start feeding the system updates. We're also
responsible for making sure writes only happen on the write master. If we
want, the API will guarantee that an update is committed to all replica
databases before returning from the commit. This may be useful, but may make
writes too slow.
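
A minimal sketch of the arrangement described above, written in Python as a
toy model. It does not use the real Berkeley DB API; the class names, the
`send` callback, and `direct_send` are all hypothetical stand-ins for the
write master, the other masters, and the application-supplied transport.

```python
class Replica:
    """A non-write master holding a read-only copy of the data."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def apply(self, key, value):
        # Called only when a replication message arrives.
        self.data[key] = value


class ReplicatingMaster:
    """Toy write master that pushes each update through a send callback."""

    def __init__(self, replicas, send):
        self.data = {}
        self._replicas = replicas
        self._send = send  # application-supplied transport (hypothetical)

    def commit(self, key, value):
        self.data[key] = value
        # Synchronous mode: do not return until every replica has applied it.
        for replica in self._replicas:
            self._send(replica, key, value)


def direct_send(replica, key, value):
    # Stand-in for a network call; a real transport would serialize and ship it.
    replica.apply(key, value)


replicas = [Replica("m2"), Replica("m3")]
master = ReplicatingMaster(replicas, direct_send)
master.commit("job:backup", "RUNNING")
```

Because `commit` waits for every replica, a slow or unreachable replica slows
every write, which is the trade-off mentioned above.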

Having at least two master nodes -- a write master and one or more additional
masters -- meets our HA requirement. If one goes down, we can promote another
system in the environment to master and, if necessary, hold an election to
determine the write master.

In addition to managing the databases of jobs, the masters are responsible for
running the jobs and deciding when it's time to run a job. Much of this can
be distributed among all the masters, letting them share the burden of this
work.

Suppose that the job schedule defines that at 5:00 PM, 100 jobs are supposed
to run on various systems in the environment. The write master would be
responsible for checking the time and determining that it's time to start
the 100 jobs (I haven't figured out how to distribute that part). He's
occasionally checking the clock to see if time-based jobs should start, and
when he sees that these 100 jobs should start, he sticks 100 events in the
"jobs-to-be-started" queue. This queue is one of the databases, which means
that all the masters have access to it. (One of the database types that
Berkeley DB supports is a queue, with an atomic "eat-from-the-head" operation.)
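
A minimal sketch of that clock check, in Python. An in-memory deque stands in
for the Berkeley DB queue and a dict keyed by start time stands in for a
time-keyed database; `enqueue_due_jobs`, `schedule`, and the job names are
hypothetical.

```python
from collections import deque
from datetime import datetime


def enqueue_due_jobs(schedule, now, to_be_started):
    """Move every job whose start time has arrived onto the queue.

    schedule maps a start datetime to a list of job names. Entries that are
    due are removed from it. Returns the number of events enqueued.
    """
    due_times = sorted(t for t in schedule if t <= now)
    count = 0
    for t in due_times:
        for job in schedule.pop(t):
            to_be_started.append(job)
            count += 1
    return count


five_pm = datetime(2003, 1, 6, 17, 0)
six_pm = datetime(2003, 1, 6, 18, 0)
schedule = {
    five_pm: [f"job{i}" for i in range(100)],
    six_pm: ["late_job"],
}
queue = deque()
enqueued = enqueue_due_jobs(schedule, five_pm, queue)
```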

The jobs-to-be-started queue is seen by all the masters, which are
occasionally checking it for work. When one sees some work, it grabs some
number of jobs off the head of the queue (it must talk to the write master
to do this, since this is not a read-only query). Each master takes a portion
of the jobs to be started, tries to start them by talking to the client
system, and updates the write master again with the status of the job (either
"started" or "failed to start").

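The chunked hand-off above can be sketched as follows. This is a toy model in
Python, assuming threads in one process stand in for separate master systems
and a lock stands in for the queue's atomic head operation; `WriteMaster`,
`take_from_head`, and `master_loop` are hypothetical names.

```python
import threading
from collections import deque


class WriteMaster:
    """Toy write master: owns the queue and serializes every update."""

    def __init__(self, jobs):
        self._queue = deque(jobs)
        self._lock = threading.Lock()
        self.status = {}

    def take_from_head(self, max_jobs):
        # Atomic removal from the head, so no two masters get the same job.
        with self._lock:
            n = min(max_jobs, len(self._queue))
            return [self._queue.popleft() for _ in range(n)]

    def report(self, job, state):
        with self._lock:
            self.status[job] = state


def master_loop(write_master, start_job, chunk=10):
    """Drain the queue in chunks, reporting each job's start status."""
    while True:
        jobs = write_master.take_from_head(chunk)
        if not jobs:
            return
        for job in jobs:
            ok = start_job(job)
            write_master.report(job, "started" if ok else "failed to start")


wm = WriteMaster([f"job{i}" for i in range(100)])
workers = [
    threading.Thread(target=master_loop, args=(wm, lambda job: True))
    for _ in range(3)
]
for w in workers:
    w.start()
for w in workers:
    w.join()
```
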
When the write master gets updates as to the final status of jobs when they
complete (I haven't covered exactly how that happens, but assume that word
eventually gets back to the write master), the write master sticks the name
of each newly-finished job on the "newly-finished" queue. The purpose of
this queue is to distribute the work of determining whether the completion
of a job means that any other jobs should now be started. All the masters
are checking this queue periodically also, and will grab a chunk of jobs to
process from it.

Determining the successors of jobs efficiently will require that we keep a
database of successors, keyed by completing job. For example, if jobs B
and C should start when job A finishes, looking up job A in the successors
database will return the names of jobs B and C. So in order to process
work from the newly-finished queue, all a master must do is search through
the successors database, determine if the dependency is met (i.e., did job
A finish with failure, but the dependency is success-only), and if it is,
stick more job-start events on the to-be-started queue (by talking to the
write master again). These job-start events will in turn get distributed
among multiple masters for processing.
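
A minimal sketch of the successor lookup, in Python. A plain dict stands in
for the successors database, and the condition labels ("success" and "done")
are assumptions made for illustration. It handles only a single predecessor
per job; a job that waits on several predecessors would need each of them
checked.

```python
# Hypothetical successors database, keyed by completing job.
# Each entry is (successor, condition), where condition is
# "success" (start only if the predecessor succeeded) or
# "done" (start whether it succeeded or failed).
successors = {
    "A": [("B", "success"), ("C", "done")],
}


def jobs_to_start(finished_job, exit_status, successors):
    """Return the successors of finished_job whose dependency is met."""
    ready = []
    for name, condition in successors.get(finished_job, []):
        if condition == "done" or exit_status == "success":
            ready.append(name)
    return ready


after_success = jobs_to_start("A", "success", successors)
after_failure = jobs_to_start("A", "failure", successors)
```

Each name returned would become a job-start event on the to-be-started queue.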

It might sound like the write master has a far greater burden of the work,
but it doesn't necessarily have to be that way. If we configure the database
replication to be synchronous, we can guarantee that all the other masters
are as up-to-date as the write master. Then, those
masters can query their own, local copies of the databases to find out
job definitions and successor information. They'll only need to involve the
write master when an update needs to be made.

Sure, there's a performance penalty from making the replication synchronous,
but if all the masters need to be working from the guaranteed-latest data,
then we have to pay that penalty somehow. Either the database layer can
do it for us, or we can do it ourselves by making all the other masters
have to talk to the write master to make even read-only queries to be
sure the data is current. The overhead is probably the same, so I'd rather
let the database do the work.

I haven't fully fleshed out how the client updates will get back to the
write master, but I was thinking of something like having the clients get a
list of all the masters when a job is started. When it finishes, the client
will try to contact one of the systems in that list to report status.
If that system isn't the write master, it will take care of getting the
update to the write master. If that system is down (or no longer a master),
the client will move on to the next system in the list. As long as the
set of masters doesn't change by 100% while a job runs, then this will work.
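
A minimal sketch of that client-side retry, in Python. The `send` transport,
the `MasterUnavailable` exception, and the host and job names are all
hypothetical; forwarding from a non-write master to the write master is not
modeled.

```python
class MasterUnavailable(Exception):
    """Raised by a transport when a system is down or no longer a master."""


def report_status(masters, job, status, send):
    """Try each master in order; return the one that accepted the report.

    send(master, job, status) is a hypothetical transport function that
    raises MasterUnavailable on failure.
    """
    for master in masters:
        try:
            send(master, job, status)
            return master
        except MasterUnavailable:
            continue  # move on to the next system in the list
    raise MasterUnavailable(f"no master reachable for {job}")


# Usage with a fake transport where the first master has gone away.
down = {"m1"}
received = []


def fake_send(master, job, status):
    if master in down:
        raise MasterUnavailable(master)
    received.append((master, job, status))


accepted_by = report_status(
    ["m1", "m2", "m3"], "nightly_backup", "success", fake_send
)
```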

And assuming all masters have current job status data, a JobScape-like
GUI could attach to any of them. Being able to distribute this work will
really help out, especially if there are a lot of GUI users.

So the only things the write master does that the other masters can't
share in are making all the database updates (and we can't split that up)
and checking the clock periodically to see if any time-based jobs need to start.
If we keep another database keyed by time, it shouldn't be too much work to
do that task. That's why I consider this somewhere in between the single-master
and distributed-across-all-nodes approaches. I really think this will
scale quite well.

Do any of you have any thoughts on this design? I need to try and shoot holes
in it and see if I spent enough time thinking this up.

-- 
Joel

I don't know how familiar everyone is with Berkeley DB (I'm really not), but
I was just doing some reading this evening about its capabilities and API
and thought I'd share some thoughts with the group.

Apparently, Berkeley DB will support not only transactions, but also
failover and replication. When you combine that with the fact that it's
embedded in your application so there's no need to mess with your
Oracle or MySQL installation, it starts to sound attractive.

Unfortunately, you can't make queries even approaching the complexity of
a SQL query. Every database is just a collection of {key, value} pairs
(although both can be of arbitrary length, and of any format), so your
queries are all either "give me the next record" or "give me the record
whose key is X."
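
A toy illustration of those two query styles, in Python. This is not the
Berkeley DB API; `KeyValueStore` and the record layout are hypothetical. It
shows that anything beyond a key lookup, such as "which jobs run on host
alpha", turns into a scan over every record.

```python
import bisect


class KeyValueStore:
    """Toy ordered key/value store supporting only two kinds of query."""

    def __init__(self):
        self._keys = []    # kept sorted so the cursor has a stable order
        self._values = {}

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key):
        # "Give me the record whose key is X."
        return self._values.get(key)

    def cursor(self):
        # "Give me the next record."
        for key in self._keys:
            yield key, self._values[key]


jobs = KeyValueStore()
jobs.put(b"job:backup", b"host=alpha")
jobs.put(b"job:report", b"host=beta")
jobs.put(b"job:archive", b"host=alpha")

by_key = jobs.get(b"job:report")
# No WHERE clause: filtering by value means walking the whole database.
on_alpha = [k for k, v in jobs.cursor() if v == b"host=alpha"]
```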

And if you want more than one set of {key, value} pairs, you need to create
a database. I suppose this is why there are about 20 of these in
/var/lib/rpm -- one for each "table." Fortunately, your transactions can
be across databases, and it does log replay on a crash ... the whole bit.

Being able to put _any_ type of data in the database could be very cool.
Imagine if there was a Java object (or a C structure) representing a job
definition, then the database of these job definitions could be accessed
by simply reading and writing the serialized objects (or the C structures).
That would certainly make things easy, and there are already APIs for
C, C++, Java, and Perl.
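
A minimal sketch of the serialized-object idea, in Python rather than Java or
C. A dict of bytes stands in for the database, JSON is an arbitrary choice of
serialization, and the `JobDefinition` fields are hypothetical.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class JobDefinition:
    name: str
    command: str
    machine: str


def to_record(job):
    """Serialize a job definition into a (key, value) pair of bytes."""
    return job.name.encode(), json.dumps(asdict(job)).encode()


def from_record(value):
    """Rebuild a job definition from a stored value."""
    return JobDefinition(**json.loads(value.decode()))


db = {}  # stand-in for a key/value database of job definitions
job = JobDefinition("nightly_backup", "/usr/local/bin/backup.sh", "alpha")
key, value = to_record(job)
db[key] = value
restored = from_record(db[b"nightly_backup"])
```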

A specific advantage Berkeley DB has over a SQL database for our application
is the replication and failover. It's built into the product, whereas
if we went with a SQL database, to make it highly available we'd either
have to assume the user is using a database replication product from his
vendor or roll our own replication (like AutoSys, and I don't think anyone
is excited about the AutoSys homegrown replication).

According to the web site (http://www.sleepycat.com/), the HA Berkeley DB
package will log updates, which are all made to a single master, and
distribute them to other systems. In the event of a failure, one of the
other systems is promoted to a master. I assume this is with minimal hassle
to the application, since it claims this is transparent to the end-user,
but I haven't read any code that uses those features yet.

I'll read some more, but at this point it sounds to me like for all it could
buy us (no external database required, and built-in database failover), I'd
be willing to give up querying the database with SQL.

-- 
Joel

First off, Happy Thanksgiving to everyone on the list! Hopefully you won't
mind some actual traffic on this mailing list (in addition to the monthly
mailman announcement).

I was doing some thinking the past couple of weeks about the clockwork
project, believe it or not, and Shawn McMahon and I had a few minutes to talk
about design the other day. As you may recall, things pretty much stalled out
while we were trying to work out whether or not to make the scheduler
centralized (like AutoSys) or decentralized (like something else we've not
really worked with, but think would be better).

Everyone agrees that AutoSys has some pretty bad bottlenecks, and I think
that's what has made many of us (myself included) want to steer clear of
a centralized design. But as Shawn pointed out to me last week, there's a
good chance that some well-applied multithreading could make AutoSys scale
a whole lot better. I've heard that there's some maximum number of events
per second that can be processed by an event processor, regardless of
how much horsepower you have. To me, this sounds like there's some important
stuff in AutoSys that isn't multithreaded.

The appeal to me of the single-master/centralized design is its simplicity.
The distributed design sounds great, but it also sounds very complex,
possibly requiring us to do multicast notifications and implement a tiny
little publish/subscribe system. A single-master design would make things
simpler to implement.

What are the things we hate about AutoSys' single-master design?
(1) It won't scale past 5,000 jobs. As I said before, I think we can fix that
with multithreading.
(2) It requires a dedicated pair of scheduling machines. We can eliminate
this requirement for small schedules if the event processor is fast enough
and we make configuration easy enough. An administrator could elect to
"promote" a couple of the managed systems to run the event processor.

We could even design the multithreading so that some of the event processing
work could be done not just by another thread on the scheduling server,
but by another scheduling machine altogether. For instance, when you look at
a job in AutoSys that's about to be started, the state of the event is
briefly "PG" for "ProcessinG" while the EP dispatches it and talks to the
client. Imagine if there were multiple machines processing events, and
the status were set to "Processing by machine A." Something as simple
as that could off-load part of the burden of event-processing to multiple
systems.
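
A minimal sketch of that claim step, in Python. It assumes a shared event
table with an atomic check-and-set, modeled here with a lock inside one
process; `EventTable`, `claim`, and the state strings are hypothetical and
are not AutoSys behavior.

```python
import threading


class EventTable:
    """Toy shared table of events and their processing state."""

    def __init__(self, events):
        self._state = {event: "pending" for event in events}
        self._lock = threading.Lock()

    def claim(self, event, machine):
        """Mark event as being processed by machine.

        Returns False if the event is unknown or already claimed.
        """
        with self._lock:
            if self._state.get(event) != "pending":
                return False
            self._state[event] = f"processing by {machine}"
            return True

    def state(self, event):
        with self._lock:
            return self._state[event]


table = EventTable(["start job1"])
first = table.claim("start job1", "machine A")
second = table.claim("start job1", "machine B")
```

Only the first claim succeeds, so each event is dispatched by exactly one
machine.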

And if the system were flexible enough to allow the scheduling servers to
be easily set up (unlike AutoSys), they could be easily moved around either
by the administrator or perhaps automatically, based on load averages. Now
we've got a system that behaves as the distributed model, but isn't too much
more complicated than the plain single-master model.

There's also the issue of databases. Our AutoSys administrators don't like
its requirement of a SQL database because then they have to get DBA support.
But a SQL database sure makes some things easier for the programmers. I
looked briefly at SQLite [1], an embedded SQL database engine. It's kind
of neat -- you get SQL queries and even transaction support fully contained
within your application; the database lives in a file on the filesystem.
But it doesn't support object types on columns (any column can hold
anything), and it's unclear how well it holds up under a load of concurrent
users (all its benchmarks are single-user).

There's also Berkeley DB (which I know Shawn McMahon despises), which claims
to support transactions and failover. If this is robust enough and easy
enough to work with, then it might be the answer -- giving the programmers
something that does the work of a database while appearing invisible to the
end user.

Of course, most people have a SQL database running *somewhere*, and is it
really a big deal if we tell them they need to host another database on it,
particularly if we didn't require a specific vendor's database?

I'll spend some more time thinking about exactly what the responsibilities
of the event processor are and trying to find ways to easily distribute them
over a few machines. In the meantime, if you have any thoughts, please
send them to the list.

The bottom line is that I really think that if we take the AutoSys EP model
and apply some well-placed multithreading and distributed computing, we'd have
a system with a simple design and the scalability we want.

[1] http://www.hwaci.com/sw/sqlite

-- 
Joel

Happy Mailman Day, everybody!

-- 
Shawn McMahon                    |        Help spread accurate information
AIM: spmcmahonfedex, smcmahoneiv |about Xenu and the Church of Scientology.
             <a href="http://xenu.net/">Scientology</a> on your web site.

begin  quoting what Joel Loudermilk said on Thu, Mar 28, 2002 at 07:08:16PM -0500:
>
> It doesn't seem too difficult to take the AutoSys multi-instance design
> and make the management tool smart enough to talk to all the instances
> behind the scenes, and present the user with one view of *all* the jobs.
> And it should also be possible to devise a way to have dependencies across
> instances without the user needing to specify the instance.

At first glance it seems possible, and it certainly would rule if we
can make it work.

+- On Wednesday (3/13/2002 8:47) Shawn McMahon <smc...@ei...> Wrote-
| Decentralized would have "N" equal to the number of clients, I'd think.
| 
| I'm not opposed to multiple centers, I just think it only works if at
| any given moment there's unambiguous "ownership" of each job.

It occurred to me today that the "multiple centers" design is essentially
what AutoSys is in a multi-instance configuration. But the drawback of
AutoSys' multiple centers is that each has to be managed separately. Another
(less severe, perhaps) limitation is that cross-instance dependencies need
to explicitly specify the instance where the remote job lives. These two
things combined mean you need to not only manage the schedulers separately,
but design them as separate entities as well.

It doesn't seem too difficult to take the AutoSys multi-instance design
and make the management tool smart enough to talk to all the instances
behind the scenes, and present the user with one view of *all* the jobs.
And it should also be possible to devise a way to have dependencies across
instances without the user needing to specify the instance.

Wouldn't that be the best of both worlds? This question is really aimed at
Shawn Dvorak, since he was shooting for the decentralized design.

-- 
Joel

This one time, at band camp, Joel Loudermilk wrote:
>
> But it sure would be nice if you could distribute the schedule-processing
> load just by, say, activating a few more nodes as schedulers. Sort of like
> Legato Cluster -- you can promote as many nodes as you want to primary, and
> they'll just start replicating the configuration among themselves.
>
> I'm not opposed to this design, but I'd want to flesh out a plan for how
> the GUI would manage the jobs and how the jobs would be dispatched before
> committing to this approach.

That's still centralized, it's just got N centers.

Decentralized would have "N" equal to the number of clients, I'd think.

I'm not opposed to multiple centers, I just think it only works if at
any given moment there's unambiguous "ownership" of each job.

+- On Tuesday (3/12/2002 20:40) "Shawn Dvorak" <sd...@cf...> Wrote-
| Perhaps we could make a compromise, with
| some number of distributed servers responsible for collecting scheduling
| statuses from a subset of servers.  These collector servers wouldn't do any
| dispatching; they'd only collect job status broadcasts/unicasts from the

If multiple systems would do the dispatching, then how would we determine
which system would dispatch a given job?

I'm in agreement that it would be great not to have the single-dispatcher
bottleneck, but all the solutions I can think of involve a lot of
constant communication between the scheduler nodes, and it seems like there
would be a fair amount of risk that something would get out of sync. Then
again, even the centralized design would require some sort of failover
replication, so maybe it wouldn't be _that_ much more work to design
a system that has N schedulers instead of 2.

But it sure would be nice if you could distribute the schedule-processing
load just by, say, activating a few more nodes as schedulers. Sort of like
Legato Cluster -- you can promote as many nodes as you want to primary, and
they'll just start replicating the configuration among themselves.

I'm not opposed to this design, but I'd want to flesh out a plan for how
the GUI would manage the jobs and how the jobs would be dispatched before
committing to this approach.

Have you got any more detailed ideas for how to approach this?

-- 
Joel

My preference is for a decentralized scheduler.  I think that the only real
downside is the load involved in getting a complete real-time view of the
entire schedule.  The advantages in removing the bottlenecks caused by a
dedicated server outweigh this.  Perhaps we could make a compromise, with
some number of distributed servers responsible for collecting scheduling
statuses from a subset of servers.  These collector servers wouldn't do any
dispatching; they'd only collect job status broadcasts/unicasts from the
nodes in the subset.  Then the GUI management tool would only have to poll
these few collector servers to get the complete view.

Shawn

----- Original Message -----
From: "Joel Loudermilk" <jo...@lo...>
To: <clo...@li...>
Sent: Monday, March 11, 2002 8:47 PM
Subject: [Clockwork-developers] Decision time

> Shawn D. or Brian,
>
> Do you guys have an opinion on the centralized/decentralized question?
>
> I share Shawn M's feeling that we ought to choose a path and get on with
> the project.
>
> --
> Joel
