[Queue-developers] new design details
From: Koni <mh...@co...> - 2005-05-09 21:45:35
As promised before, here are some more details about the new system I mentioned. Werner: this is more or less identical to what I sent you previously.

I envision 4 separate programs working together in this system:

qs: Users use this program (like "queue" or "qsh" in GNU Queue) to submit jobs. [ presently not implemented at all ]

qm: The queue manager running on some central host. qs sends job requests to qm. [ ~60% implemented ]

qd: Daemon running on slave or "compute" nodes, possibly on the same host as qm as well. More than one qd may run on any host, and there may be any number of them on any number of hosts. Only qm talks to qd's, sending jobs as available. The distribution protocol works as an offer/volunteer system: qm sends offers to multiple qd's at once for the same job, and willing qd's respond with a volunteer. qm then assigns the job to exactly one qd. The qd may refuse at this point too (which resets the job to the offer stage), or commit, receive the transfer of the job, and begin execution. The important point is that qd's decide autonomously whether they can spare resources for the job. qm keeps some state about the availability of the qd's it knows about and does not send offers to qd's it knows are fully committed, but qm does not need an accurate picture; it is the qd's decision. (A rough sketch of the message types and signing involved is given below.) [ ~70% implemented ]

qe: Execution agent forked and exec'd by the qd process for running a job. qe is responsible for setting up the environment, calling back to the waiting qs if foreground mode is selected (called interactive mode in GNU Queue, I think), validating and changing to the user of the job, monitoring for the termination of the job and its return code, etc. qe is the only part of this system that needs to be setuid root; qd and qm may need to start as root to read the system-wide key file (see below) but can drop privilege permanently after that. (A sketch of this privilege drop is also given below.) [ currently only a trivial program which returns immediately is implemented ]

Some design goals/choices:

NFS is not used for communication and distribution of the jobs. This was a primary goal in the design for me. After getting into it, I have a new appreciation for the design of GNU Queue though. :)

Stateless UDP is used for communication between qm and qd, which results in some complexity in the code due to the possibility of lost messages. This is a deliberate choice: persistent TCP connections consume file descriptors, limiting the number of qd's that can be connected to qm, and I would like this to scale well beyond typical limits for open file descriptors.

All messages between qd and qm are cryptographically signed using keyed SHA-1. [ this is already fully implemented ] On connection, a registration protocol verifies the authenticity of both qm and qd by proving knowledge of a system-wide key. After registration, each qd is assigned a session key used to sign messages from then on. qs will communicate username/password information (encrypted) to qm; this is ultimately passed through qd to qe, which authenticates it before switching to the requested user.

Much effort is being put into low-latency distribution of jobs. Experimenting with the version of GNU Queue I have, after making several changes to get it to go at all, it takes a second or more between submission of a job and onset of execution on an idle cluster. Much of this, I think, is due to built-in deliberate delays that work around NFS race conditions, hence my interest in eliminating NFS as a communication layer between submitting users and execution agents.
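To make the offer/volunteer exchange and the message signing a bit more concrete, here is a minimal sketch, in C, of what a signed qm/qd message header could look like. It is only a sketch under my own assumptions: the message names, the field layout, and the use of OpenSSL's HMAC-SHA1 as the keyed-SHA-1 construction are illustrative and do not describe the actual wire format of the code above.

/*
 * Illustrative sketch only: the message names, field layout, and the use
 * of OpenSSL's HMAC-SHA1 as the "keyed SHA-1" construction are assumptions
 * for the purpose of this example, not the actual wire format.
 */
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <openssl/evp.h>
#include <openssl/hmac.h>

enum msg_type {                /* the offer/volunteer exchange described above */
    MSG_OFFER = 1,             /* qm -> qd: a job is available                 */
    MSG_VOLUNTEER,             /* qd -> qm: willing to take the job            */
    MSG_ASSIGN,                /* qm -> qd: the job is yours                   */
    MSG_REFUSE,                /* qd -> qm: changed my mind, back to offers    */
    MSG_COMMIT                 /* qd -> qm: committed, ready for transfer      */
};

struct msg_header {
    uint32_t type;             /* one of msg_type                              */
    uint32_t job_id;           /* job this message refers to                   */
    uint32_t seq;              /* sequence number, to spot lost/duplicate UDP  */
    unsigned char mac[20];     /* keyed SHA-1 over the fields above            */
};

/* Sign a header in place with the session (or system-wide) key. */
static void sign_header(struct msg_header *h,
                        const unsigned char *key, int keylen)
{
    unsigned int maclen = 0;

    HMAC(EVP_sha1(), key, keylen,
         (const unsigned char *)h, offsetof(struct msg_header, mac),
         h->mac, &maclen);
}

/* Verify a received header; returns 1 if the MAC matches. */
static int verify_header(const struct msg_header *h,
                         const unsigned char *key, int keylen)
{
    unsigned char expect[20];
    unsigned int maclen = 0;

    HMAC(EVP_sha1(), key, keylen,
         (const unsigned char *)h, offsetof(struct msg_header, mac),
         expect, &maclen);
    /* A real implementation would use a constant-time comparison here. */
    return memcmp(expect, h->mac, sizeof expect) == 0;
}

The point of the sketch is only that every datagram carries a MAC keyed with either the system-wide key (during registration) or the per-qd session key, so a forged or corrupted message can simply be ignored.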
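And for the qe side, here is a minimal sketch of the kind of permanent privilege drop an execution agent performs before running the job as the submitting user, assuming the username/password has already been verified. The helper name, error handling, and exact environment handling are illustrative, not taken from the real qe.

/*
 * Illustrative sketch only: the helper name, error handling, and the
 * environment variables set here are assumptions for this example and
 * are not taken from the actual qe source.
 */
#include <grp.h>
#include <pwd.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Switch permanently to the submitting user before exec'ing the job.
 * Assumes the username/password has already been authenticated and
 * that the caller is still running as root (qe is setuid root). */
static int become_user(const char *username)
{
    struct passwd *pw = getpwnam(username);

    if (pw == NULL) {
        fprintf(stderr, "qe: unknown user %s\n", username);
        return -1;
    }

    /* Order matters: drop the supplementary groups and gid while still
     * root, then drop the uid last, so nothing can fail after root
     * privileges are already gone. */
    if (initgroups(pw->pw_name, pw->pw_gid) != 0 ||
        setgid(pw->pw_gid) != 0 ||
        setuid(pw->pw_uid) != 0) {
        perror("qe: dropping privileges");
        return -1;
    }

    if (chdir(pw->pw_dir) != 0)
        perror("qe: chdir to home directory");   /* non-fatal */

    setenv("HOME", pw->pw_dir, 1);
    setenv("USER", pw->pw_name, 1);
    setenv("LOGNAME", pw->pw_name, 1);
    return 0;
}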
As for latency: my present system is seemingly instantaneous on an idle cluster (but much is not implemented yet). My goal is to have the latency for executing, say, 1000 no-op jobs on a system with a single qd agent be comparable to that of a shell script doing the same directly.

Some drawbacks:

Security rests ultimately with the privacy of the system-wide key file, which must be installed on or accessible to both qm and all qd agents.

All systems running qd must have access to the same authentication system for validating the username/password of submitting users. NIS or something equivalent is probably the easiest, both for me as developer and for administrators at large who might use this thing. We could potentially support a custom arrangement through PAM too.

NFS or another shared network filesystem is still required for user jobs to read/write input and output, unless they only want to use stdin/stdout, in which case qs can handle it. I don't consider this a real problem for dedicated systems.

Job transfer takes place over a transient TCP connection, but I've noticed this can cause a hiccup (qm pauses for several seconds but eventually resumes rapid distribution of jobs) if the TCP SYN packet is lost, which seems to happen after about 30,000 jobs have been sent and executed as fast as possible. The TIME_WAIT state of closed TCP connections hogs resources on the qm host, potentially blocking the opening of new connections until resources are available. This is only a problem in the pathological case of >30,000 no-op jobs at once, surely not a real-world problem. Presently the system will pause if the SYN packet is dropped when forming a new connection, and will wait until enough old TIME_WAIT connections have cleared and the SYN retransmit timer expires, at which point the connection is established and distribution commences again.

This system has a central manager, qm, which the present GNU Queue does not. A failure at qm will cause the whole cluster to stop executing jobs after finishing their present assignments. This does not happen with GNU Queue, unless the NFS server goes down; however, when NFS comes back, provided there is no corruption to the filesystem, everything continues. My system will need some crash-recovery complexity for qm. qd's can die and come back all they like.

Comments are welcome. If you want to peek at the code, reply back to this list. If there is interest and no objections, I will post a copy of the source as-is to the list. It doesn't do much for the moment except implement the qm -> qd -> qe chain of events and demonstrate the distribution of jobs.

Cheers,
Koni

On Sun, 2005-05-01 at 19:38 -0400, Richard Stallman wrote:
>     Anyhow, I suggested in my email to Koni and Mike that
>     we wait a week or two for Mike to respond.
>
> I think that is a reasonable plan. The program needs a maintainer who
> will make releases, and more generally, who will give the program
> proper attention.
>
>     At some point, we'd post publically, and then wait about 30
>     days or some reasonable time.
>
> I don't understand that part. Wait 30 days for what?
>
>     What's the standard procedure for reclaiming an
>     abandoned GNU project?
>
> I can appoint (and remove) maintainers at any time.
> So once the situation is clear, I can simply appoint
> a new maintainer for GNU Queue.