[Clockwork-developers] More on security

Shawn McMahon and I had a detailed discussion on security on Thursday. While
I don't have a handle on everything involved yet, I do think we made some
headway in the area of security. Here are the details of what we discussed.
(Shawn may have to correct me if I misstate anything.)

When an administrator sets up a schedule (or an "instance," to borrow
from AutoSys's terminology), one of the things that will be done is the
creation of a public/private key pair for the schedule. The scheduling server
will keep the private key (secured by appropriate filesystem permissions),
and all the clients (both GUIs and the agent software that runs on the
managed nodes to make them valid targets of jobs) will get the public
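key. Shawn explained to me that the two keys are complementary (at least
with an algorithm like RSA), meaning that you can either encrypt a message
with the public key and decrypt it with the private key, or you can encrypt
with the private key and decrypt with the public key. (This is not the same
thing as "symmetric" encryption, where both sides share a single secret key.)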
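
To make the "works in either direction" property concrete, here is a minimal sketch using textbook RSA with tiny primes. RSA is my assumption (no algorithm has been chosen for Clockwork yet), and the numbers are far too small to be secure; a real implementation would use a vetted crypto library with proper padding.

```python
# Toy RSA with tiny textbook primes: for illustration only, NOT secure.
p, q = 61, 53
n = p * q                      # modulus shared by both keys (3233)
phi = (p - 1) * (q - 1)        # 3120
e = 17                         # public exponent
d = pow(e, -1, phi)            # private exponent (needs Python 3.8+)


def apply_key(value: int, exponent: int) -> int:
    """Apply one half of the key pair to a number smaller than n."""
    return pow(value, exponent, n)


message = 65

# Public key encrypts, private key decrypts (keeps the message secret).
sealed = apply_key(message, e)
assert apply_key(sealed, d) == message

# Private key "encrypts", public key decrypts (proves who sent it).
signed = apply_key(message, d)
assert apply_key(signed, e) == message
```

The second direction is what digital signatures are built on: anyone holding the public key can undo the operation, so it proves origin rather than hiding the contents.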

First, how to make GUI clients (and command-line client programs used by
users) secure. As I mentioned before, I don't want to have to store user
and password lists inside Clockwork, and would like to at least give the
administrator the option of authenticating against something else. But this
means that the password from the GUI (or client program, rather, since it
might not be a GUI) will have to be sent to the server. We can't do a
challenge-response kind of thing, since if we're going to pass the password
off to some other backend we need to actually know what it is. So the
communications channel needs to be encrypted so we can safely pass the user's
password along. (I suppose the administrator will need to trust that we're
not secretly recording all the passwords. If he has doubts, he can just
read the source.)

When the client program and the server talk, they use public key encryption
to secretly exchange a "session key" that will be used to encrypt the
rest of their conversation. I suppose they could just use their existing
key pair to encrypt the entire conversation, but Shawn said that's way slower
than "secret key" encryption methods, so most people use the slow public
key encryption to securely exchange a secret key, which they then use to 
encrypt the rest of their conversation.
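
The session-key exchange described above could look roughly like the sketch below. Everything here is a stand-in chosen so the example runs with only the Python standard library: the key pair is toy RSA built from two known Mersenne primes, and `keystream_xor` is a hypothetical toy cipher, not something to ship. A real implementation would use an established library or protocol.

```python
import hashlib
import secrets

# Toy RSA key pair from two known Mersenne primes. Illustration only.
P, Q = 2**61 - 1, 2**89 - 1
N = P * Q                              # part of the public key
E = 65537                              # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))      # private exponent (Python 3.8+)


def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy symmetric cipher: XOR data with SHA-256(key || counter) blocks.

    Applying it twice with the same key returns the original data.
    """
    out = bytearray()
    for offset in range(0, len(data), 32):
        counter = (offset // 32).to_bytes(8, "big")
        block = hashlib.sha256(key + counter).digest()
        chunk = data[offset:offset + 32]
        out.extend(b ^ k for b, k in zip(chunk, block))
    return bytes(out)


# Client: pick a random 128-bit session key and wrap it with the public key.
session_key = secrets.token_bytes(16)
wrapped = pow(int.from_bytes(session_key, "big"), E, N)

# Server: unwrap it with the private key (the slow step, done only once).
unwrapped = pow(wrapped, D, N).to_bytes(16, "big")
assert unwrapped == session_key

# Both sides now share a secret and switch to the fast symmetric cipher.
request = b"user=joel password=example"
ciphertext = keystream_xor(session_key, request)
plaintext = keystream_xor(unwrapped, ciphertext)
```

The slow public-key operation happens once per connection, and everything after that uses the cheap symmetric cipher.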

Now that the client and server have an encrypted communications channel, the
client can prompt the user for his username/password and send it to the server
without fear that it will be snooped or replayed. Some client programs (for
example, a command-line program to send an event or to display job history)
may be more useful if they don't ask for a password interactively, so that
the administrator can, for instance, set up a web page that runs those commands.
In that case, we can make it so that the client programs can take the username
and password from the command line or from a file. This will make things
flexible enough so that the administrator can do whatever he wants with the
client programs.
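
A minimal sketch of how a client program might accept credentials non-interactively. The program name `cw-sendevent` and the option names are hypothetical, made up for this example. One thing worth noting: a password given on the command line is usually visible to other users in the process list, so the file option is the safer of the two.

```python
import argparse
import getpass
from typing import List, Tuple


def build_parser() -> argparse.ArgumentParser:
    # "cw-sendevent" and these option names are placeholders, not a real CLI.
    parser = argparse.ArgumentParser(prog="cw-sendevent")
    parser.add_argument("--user", help="username to authenticate as")
    source = parser.add_mutually_exclusive_group()
    source.add_argument("--password", help="password (visible in process list)")
    source.add_argument("--password-file", help="file whose first line is the password")
    return parser


def resolve_credentials(argv: List[str]) -> Tuple[str, str]:
    """Return (username, password), prompting only for what was not supplied."""
    args = build_parser().parse_args(argv)
    user = args.user if args.user is not None else input("Username: ")
    if args.password is not None:
        password = args.password
    elif args.password_file is not None:
        with open(args.password_file, encoding="utf-8") as handle:
            password = handle.readline().rstrip("\n")
    else:
        # Interactive fallback: read the password without echoing it.
        password = getpass.getpass("Password: ")
    return user, password
```

A script or web page would call the client with `--user` and `--password-file`, while a person at a terminal could leave both off and be prompted.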

While we don't want to force the administrator to create a list of
Clockwork-only usernames and passwords, some people might want to do just
that. To make things easy for them, we could make one of our authentication
backends check against an internal database. Berkeley DB provides an easy
way to encrypt a database, too. Other backends I have in mind are PAM (for
the UNIX folks) and calling an external program (for those who want to use
neither PAM nor the internal list). That should satisfy just about everyone.
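
Here is one way the pluggable backends could be shaped, as a sketch only. The class names and the "exit status 0 means accepted" convention for the external program are my assumptions, nothing is decided. The internal list is a plain in-memory dictionary standing in for whatever database we end up using, and PAM is left out because Python's standard library has no PAM binding.

```python
import hashlib
import hmac
import os
import subprocess
from typing import Dict, List, Tuple


class InternalListBackend:
    """Clockwork-only user list. Stores salted hashes, never the passwords."""

    def __init__(self) -> None:
        self._users: Dict[str, Tuple[bytes, bytes]] = {}

    @staticmethod
    def _hash(password: str, salt: bytes) -> bytes:
        return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)

    def add_user(self, username: str, password: str) -> None:
        salt = os.urandom(16)
        self._users[username] = (salt, self._hash(password, salt))

    def authenticate(self, username: str, password: str) -> bool:
        entry = self._users.get(username)
        if entry is None:
            return False
        salt, expected = entry
        # Constant-time comparison so timing does not leak the hash.
        return hmac.compare_digest(expected, self._hash(password, salt))


class ExternalProgramBackend:
    """Hands the credentials to an administrator-supplied program."""

    def __init__(self, argv: List[str]) -> None:
        self._argv = argv

    def authenticate(self, username: str, password: str) -> bool:
        # Credentials go over stdin so they never show up in the process list.
        result = subprocess.run(
            self._argv,
            input=f"{username}\n{password}\n".encode("utf-8"),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return result.returncode == 0
```

Because both classes expose the same `authenticate(username, password)` method, the server can be configured with whichever one the administrator picks. This is also why the server needs the actual password rather than a challenge-response proof.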

The other part of security is the communications between the server and the
agents (the piece of Clockwork that runs on every machine that is able to
run jobs). Obviously, the agent must be able to tell that commands are
coming from the real scheduling server, or this would be a huge root exploit.
So when the scheduling server sends commands to the agents, it will
encrypt them using the schedule's private key. If the agent can decrypt the
message using the schedule's public key, then it knows it's a valid
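message. I suppose we could leave this communication unencrypted and just
put a digital signature in it, but it seems awfully easy to encrypt the whole
conversation, and then there are fewer issues with someone snooping secret
data (although it's probably not very important data). One caveat: anything
encrypted with the private key can be read by anyone who holds the public
key, so this only hides the data from people who don't have that key.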

That's it for the server-to-agent connections, but there will also
be agent-to-server connections. Just like in AutoSys, if a job
is going to run for hours, we probably don't want to hold the TCP connection
open that long, just so that the agent can say at the end what the job's
exit code was. So the agent will have to initiate a connection back to the
server (or, a server, since there may be multiple servers) to provide the
exit code. In this scenario, the server needs to be able to make sure the
system that just connected to it is a valid agent. So the agent will encrypt
the message with the schedule's public key (which all agents have) and the
server will decrypt it with the schedule's private key. For added security,
we can allow the administrator to set up a list of agent IP addresses/networks.
The incoming connection can be checked against this list, too. For even
more security, the incoming connection can be checked against the IP address
of the system where the job ran. There's little reason that anyone other
than "hosta" should be reporting on the final status of a job that ran on
"hosta." But this level of security might be difficult to manage, since
most hosts have several IP addresses, and the one the server connected to
to get the job started might not be the one all the traffic out of the system
goes across.
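
The two optional address checks could be sketched like this with Python's `ipaddress` module. The names `AgentAllowList` and `may_report_status` are hypothetical. The stricter check takes a list of addresses per host, since one host may have several.

```python
import ipaddress
from typing import Iterable


class AgentAllowList:
    """Administrator-maintained list of agent addresses and networks."""

    def __init__(self, entries: Iterable[str]) -> None:
        # A bare address such as "10.1.2.3" becomes a single-host network.
        self._networks = [ipaddress.ip_network(e, strict=False) for e in entries]

    def permits(self, peer_ip: str) -> bool:
        address = ipaddress.ip_address(peer_ip)
        return any(address in network for network in self._networks)


def may_report_status(
    peer_ip: str,
    job_host_ips: Iterable[str],
    allow_list: AgentAllowList,
    strict_host_check: bool = False,
) -> bool:
    """Decide whether a connecting agent may report a job's exit code."""
    if not allow_list.permits(peer_ip):
        return False
    if strict_host_check:
        # Optional extra level: the peer must be the host that ran the job.
        known = {ipaddress.ip_address(ip) for ip in job_host_ips}
        return ipaddress.ip_address(peer_ip) in known
    return True
```

Keeping `strict_host_check` off by default reflects the worry above that per-host matching may be hard to manage on multi-homed systems.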

What I haven't addressed here are:
(1) Authorization. We've checked to make sure the users are who they say they
are, but how do we know who should be allowed to do what? I was thinking
of a three-level access model: administrators can change things, operators
can only stop/start/hold jobs, and guests can look but not modify anything.
We may need to apply these to something more granular than the entire schedule,
since it may be useful to have a set of users who can only start/stop the
backup jobs, or can only modify a certain set of jobs.

(2) How multiple servers work with authentication. My examples above talk
about multiple agents and one server, but we know there will be multiple
servers. My first thought is for all servers to have a copy of the
private key. I think this would still be secure, and would allow everything
to continue to work as I described.

(3) Ease of maintenance. It needs to be easy for an administrator to change
a system's IP address. It's okay if we have this stored in our databases
somewhere, as long as we provide an easy way to change it when a host's
IP changes. Honestly, though, I'd prefer that we not be on the list of
things an administrator needs to do when he changes IP addresses. Maybe
that level of security could be optional. Also, the administrator needs to
be able to change the schedule's key pair if it's compromised. It would be
nice if this could be done without stopping the servers (outages are bad,
right?). If we make it difficult to change the key pair, the administrator
will be less likely to do it.
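
For the authorization question in (1), here is a rough sketch of the three-level model applied per job rather than per schedule. The action names, the shell-style job-name patterns, and the `AccessPolicy` class are all assumptions made for the example.

```python
import fnmatch
from enum import IntEnum
from typing import List, Optional, Tuple


class Role(IntEnum):
    GUEST = 1          # can look but not modify
    OPERATOR = 2       # can also stop/start/hold jobs
    ADMINISTRATOR = 3  # can also change things


# Minimum role needed for each (hypothetical) action name.
REQUIRED_ROLE = {
    "view": Role.GUEST,
    "start": Role.OPERATOR,
    "stop": Role.OPERATOR,
    "hold": Role.OPERATOR,
    "modify": Role.ADMINISTRATOR,
}


class AccessPolicy:
    def __init__(self) -> None:
        # Each grant is (username, job-name pattern, role).
        self._grants: List[Tuple[str, str, Role]] = []

    def grant(self, username: str, job_pattern: str, role: Role) -> None:
        self._grants.append((username, job_pattern, role))

    def role_for(self, username: str, job_name: str) -> Optional[Role]:
        """Highest role the user holds on this job, or None if no grant matches."""
        matching = [
            role
            for user, pattern, role in self._grants
            if user == username and fnmatch.fnmatchcase(job_name, pattern)
        ]
        return max(matching) if matching else None

    def allows(self, username: str, action: str, job_name: str) -> bool:
        role = self.role_for(username, job_name)
        return role is not None and role >= REQUIRED_ROLE[action]
```

With this shape, `policy.grant("pat", "backup_*", Role.OPERATOR)` gives a user start/stop rights on the backup jobs only, and a grant on `"*"` covers the whole schedule.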

Comments, anyone?

-- 
Joel