[Queue-developers] new design details
From: Koni <mh...@co...> - 2005-05-09 21:45:35
As promised before, here are some more details about the new system I mentioned. Werner: this is more or less identical to what I sent you previously.

I envision 4 separate programs working together in this system:

qs: Users use this program (like "queue" or "qsh" in GNU Queue) to submit jobs. [ presently not implemented at all ]

qm: The queue manager running on some central host. qs sends job requests to qm. [ ~60% implemented ]

qd: Daemon running on slave or "compute" nodes, possibly on the same host as qm as well. More than one qd may run on any host, and there may be any number of them on any number of hosts. Only qm talks to qd's, sending jobs as available. The distribution protocol works as an offer/volunteer system: qm sends offers to multiple qd's at once for the same job, and willing qd's respond with a volunteer. qm then assigns the job to exactly one qd. The qd may refuse at this point too (which resets the job to the offer stage), or commit, receive the transfer of the job, and begin execution. The important point is that qd's decide autonomously whether they can spare resources for the job. qm keeps some state about the availability of the qd's it knows about and does not send offers to qd's it knows are fully committed, but qm does not need an accurate picture; it is the qd's decision. (A rough sketch of the message types and signing involved is given below.) [ ~70% implemented ]

qe: Execution agent forked and exec'd by the qd process for running a job. qe is responsible for setting up the environment, calling back to the waiting qs if foreground mode is selected (called interactive mode in GNU Queue, I think), validating and changing to the user of the job, monitoring for the termination of the job and its return code, etc. qe is the only part of this system that needs to be setuid root; qd and qm may need to start as root to read the system-wide key file (see below) but can drop privilege permanently after that. (A sketch of this privilege drop is also given below.) [ currently only a trivial program which returns immediately is implemented ]

Some design goals/choices:

NFS is not used for communication and distribution of the jobs. This was a primary goal in the design for me. After getting into it, I have a new appreciation for the design of GNU Queue though. :)

Stateless UDP is used for communication between qm and qd, which results in some complexity in the code due to the possibility of lost messages. This is a deliberate choice: persistent TCP connections consume file descriptors, limiting the number of qd's that can be connected to qm, and I would like this to scale well beyond typical limits for open file descriptors.

All messages between qd and qm are cryptographically signed using keyed SHA-1. [ this is already fully implemented ] On connection, a registration protocol verifies the authenticity of both qm and qd by proving knowledge of a system-wide key. After registration, each qd is assigned a session key used to sign messages from then on. qs will communicate username/password information (encrypted) to qm; this is ultimately passed through qd to qe, which authenticates it before switching to the requested user.

Much effort is being put into low-latency distribution of jobs. Experimenting with the version of GNU Queue I have, after making several changes to get it to go at all, it takes a second or more between submission of a job and onset of execution on an idle cluster. Much of this, I think, is due to built-in deliberate delays that work around NFS race conditions, hence my interest in eliminating NFS as a communication layer between submitting users and execution agents.
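To make the offer/volunteer exchange and the message signing a bit more concrete, here is a minimal sketch, in C, of what a signed qm/qd message header could look like. It is only a sketch under my own assumptions: the message names, the field layout, and the use of OpenSSL's HMAC-SHA1 as the keyed-SHA-1 construction are illustrative and do not describe the actual wire format of the code above.

/*
 * Illustrative sketch only: the message names, field layout, and the use
 * of OpenSSL's HMAC-SHA1 as the "keyed SHA-1" construction are assumptions
 * for the purpose of this example, not the actual wire format.
 */
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <openssl/evp.h>
#include <openssl/hmac.h>

enum msg_type {                /* the offer/volunteer exchange described above */
    MSG_OFFER = 1,             /* qm -> qd: a job is available                 */
    MSG_VOLUNTEER,             /* qd -> qm: willing to take the job            */
    MSG_ASSIGN,                /* qm -> qd: the job is yours                   */
    MSG_REFUSE,                /* qd -> qm: changed my mind, back to offers    */
    MSG_COMMIT                 /* qd -> qm: committed, ready for transfer      */
};

struct msg_header {
    uint32_t type;             /* one of msg_type                              */
    uint32_t job_id;           /* job this message refers to                   */
    uint32_t seq;              /* sequence number, to spot lost/duplicate UDP  */
    unsigned char mac[20];     /* keyed SHA-1 over the fields above            */
};

/* Sign a header in place with the session (or system-wide) key. */
static void sign_header(struct msg_header *h,
                        const unsigned char *key, int keylen)
{
    unsigned int maclen = 0;

    HMAC(EVP_sha1(), key, keylen,
         (const unsigned char *)h, offsetof(struct msg_header, mac),
         h->mac, &maclen);
}

/* Verify a received header; returns 1 if the MAC matches. */
static int verify_header(const struct msg_header *h,
                         const unsigned char *key, int keylen)
{
    unsigned char expect[20];
    unsigned int maclen = 0;

    HMAC(EVP_sha1(), key, keylen,
         (const unsigned char *)h, offsetof(struct msg_header, mac),
         expect, &maclen);
    /* A real implementation would use a constant-time comparison here. */
    return memcmp(expect, h->mac, sizeof expect) == 0;
}

The point of the sketch is only that every datagram carries a MAC keyed with either the system-wide key (during registration) or the per-qd session key, so a forged or corrupted message can simply be ignored.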
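And for the qe side, here is a minimal sketch of the kind of permanent privilege drop an execution agent performs before running the job as the submitting user, assuming the username/password has already been verified. The helper name, error handling, and exact environment handling are illustrative, not taken from the real qe.

/*
 * Illustrative sketch only: the helper name, error handling, and the
 * environment variables set here are assumptions for this example and
 * are not taken from the actual qe source.
 */
#include <grp.h>
#include <pwd.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Switch permanently to the submitting user before exec'ing the job.
 * Assumes the username/password has already been authenticated and
 * that the caller is still running as root (qe is setuid root). */
static int become_user(const char *username)
{
    struct passwd *pw = getpwnam(username);

    if (pw == NULL) {
        fprintf(stderr, "qe: unknown user %s\n", username);
        return -1;
    }

    /* Order matters: drop the supplementary groups and gid while still
     * root, then drop the uid last, so nothing can fail after root
     * privileges are already gone. */
    if (initgroups(pw->pw_name, pw->pw_gid) != 0 ||
        setgid(pw->pw_gid) != 0 ||
        setuid(pw->pw_uid) != 0) {
        perror("qe: dropping privileges");
        return -1;
    }

    if (chdir(pw->pw_dir) != 0)
        perror("qe: chdir to home directory");   /* non-fatal */

    setenv("HOME", pw->pw_dir, 1);
    setenv("USER", pw->pw_name, 1);
    setenv("LOGNAME", pw->pw_name, 1);
    return 0;
}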
As for latency: my present system is seemingly instantaneous on an idle cluster (but much is not implemented yet). My goal is to have the latency for executing, say, 1000 no-op jobs on a system with a single qd agent be comparable to that of a shell script doing the same directly.

Some drawbacks:

Security rests ultimately with the privacy of the system-wide key file, which must be installed on or accessible to both qm and all qd agents.

All systems running qd must have access to the same authentication system for validating the username/password of submitting users. NIS or something equivalent is probably the easiest, both for me as developer and for administrators at large who might use this thing. We could potentially support a custom arrangement through PAM too.

NFS or another shared network filesystem is still required for user jobs to read/write input and output, unless they only want to use stdin/stdout, in which case qs can handle it. I don't consider this a real problem for dedicated systems.

Job transfer takes place over a transient TCP connection, but I've noticed this can cause a hiccup (qm pauses for several seconds but eventually resumes rapid distribution of jobs) if the TCP SYN packet is lost, which seems to happen after about 30,000 jobs have been sent and executed as fast as possible. The TIME_WAIT state of closed TCP connections hogs resources on the qm host, potentially blocking the opening of new connections until resources are available. This is only a problem in the pathological case of >30,000 no-op jobs at once, surely not a real-world problem. Presently the system will pause if the SYN packet is dropped when forming a new connection, and will wait until enough old TIME_WAIT connections have cleared and the SYN retransmit timer expires, at which point the connection is established and distribution commences again.

This system has a central manager, qm, which the present GNU Queue does not. A failure at qm will cause the whole cluster to stop executing jobs after finishing their present assignments. This does not happen with GNU Queue, unless the NFS server goes down; however, when NFS comes back, provided there is no corruption to the filesystem, everything continues. My system will need some crash-recovery complexity for qm. qd's can die and come back all they like.

Comments are welcome. If you want to peek at the code, reply back to this list. If there is interest and no objections, I will post a copy of the source as-is to the list. It doesn't do much for the moment except implement the qm -> qd -> qe chain of events and demonstrate the distribution of jobs.

Cheers,
Koni

On Sun, 2005-05-01 at 19:38 -0400, Richard Stallman wrote:
>     Anyhow, I suggested in my email to Koni and Mike that
>     we wait a week or two for Mike to respond.
>
> I think that is a reasonable plan. The program needs a maintainer who
> will make releases, and more generally, who will give the program
> proper attention.
>
>     At some point, we'd post publically, and then wait about 30
>     days or some reasonable time.
>
> I don't understand that part. Wait 30 days for what?
>
>     What's the standard procedure for reclaiming an
>     abandoned GNU project?
>
> I can appoint (and remove) maintainers at any time.
> So once the situation is clear, I can simply appoint
> a new maintainer for GNU Queue.