[Queue-developers] comments on mike's message

> Richard, Werner, and Mark:
> 
> My apologies for the issues I have caused.
> 
> When I first took over Queue, I had ample spare time, being unemployed at
> the time.   I had expected to make some quick work of the project.
> 
> But, as I got into it, it turned out to be more work than I anticipated,
> for reasons I'll get into below.
> 
> Then, of course, I found a job.  Between work and raising a family, it
> turned out I was never able to find the time to properly fix the problems.
> Heck, finding time to properly reply to these messages from Werner and
> RMS was just another indication of the problem.
> 
> I could not, as much as Werner requested it, simply do a release with
> just my name as the new maintainer.  Releasing something that wouldn't
> even compile just wasn't right, in my opinion.
> 
> Some work I did get done on Queue is the following:
> 
> Updated to modern autoconf.  I think all of that is taken care of.
> It should work with any autoconf-2.59+.

Excellent. In the past I have had a lot of trouble understanding how to
use the autoconf/automake system as a developer. Hopefully I can follow
the work you have done with autoconf and carry it forward into a new
code base.

> 
> It does compile again, but all of the terminal allocation code is now absent
> (that is, you HAVE to use -n now).
> 
> Two main issues and several minor ones (plans really) still exist:
> 
> 1) As I mentioned, no terminal code.  The previous stuff was too outdated
> to work on modern systems.  I could have just borrowed code from a package
> like screen, expect, or script or something.  While screen and expect are
> GPL, I was actually hoping to get something owned by FSF, and use it.
> To that end, a post to one of the GNU lists asking for pointers would
> probably be a good start.

For the new development plans, I think we'll have to start the same
way, without the terminal capabilities, and add them once job
distribution, scheduling, and hopefully some management tools that let
users control the system (i.e., list, delete, or delay jobs) are
functional.

> 
> 2) The more important issue, I think, is that the protocol, as currently
> implemented, is subject to race conditions.  I can deadlock in less
> than 3 seconds with nothing more complicated than the `date' command.
> This requires a complete overhaul, which is where I got caught up.

I have observed the deadlock problem in the queue-stable branch and
traced it to a couple of different race conditions. Some of these can
be worked around by inserting delays, but guaranteeing correctness is
complex. I agree that a complete overhaul of job distribution is the
preferred approach.

> 
> 3) Starting with the minor issues, or would be nices, would be a migration
> from SF to Savannah.  At one time, when I thought I was going to give
> energy to Queue, I was going to do this migration, then they had the
> security issue in 2003.

I too would like to move away from SourceForge, though I still need to
check out Savannah. My main complaint with SourceForge is that it's too
busy and complicated, with way too much irrelevant stuff on every web
page except the project "home" pages. There are 4 or 5 different ways
to post something, which seems unnecessary and confusing. For us as
developers, it's difficult to keep track of all the different places a
user might post a bug, patch, or comment.

> 
> 4) Slowly rewrite all of GQ to enable a definitive set of authorship,
> to enable safely transferring the code to FSF ownership.  (This was why
> I didn't want to just pull terminal code from expect, or even screen,
> as neither of those are FSF owned either.)

I'm not sure how I feel about FSF ownership of copyright. I'm not
opposed to it, but I'm not particularly in favor either, because I'm
not sure what it gains for GNU queue, and it may restrict what I might
want to do in the future with my own code. I asked RMS about some
details; he simply indicated that FSF ownership is not required for
GNU queue. I guess we can think about it more when we have a new code
base out and distributed under the GPL with a known author list.

> 
> 5) I'd had some grand ideas about rewriting both the config files and
> communication protocols using some sort of XML structure.  I'm less
> convinced of that now.  But the current set of configuration items are
> too system specific, and every time I see a double go across the wire,
> I wince.  I really think that there should be more emphasis on heterogenous
> environments, including configurations shared by multiple architectures.

For the new stuff I've already defined the config format as a
context-free grammar with a flex/bison parser, and likewise for the
transfer of job information to execution agents. I don't know XML and
am loath to learn it (shame on me), as the syntax of the crap is just
so damn ugly. See below about heterogeneous environments.

> 
> 6) I'd also thought it'd be cool to have some sort of library suitable for
> use with linking into GNU Make for remote processing when using `make -j'.
> I now think this can be accomplished by using SH=/some/wrapper/if/not/qsh
> instead.  I may be wrong though.

distcc covers this territory. As for queue, if every node in the
cluster is already busy, compilation would end up submitting jobs that
wait in the queue. I'm not sure we want to deal with the complexity of
"if resources available immediately, distribute job, else run locally",
but it would be a cool feature.
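
To make that rule concrete, here is a rough sketch in Python. It is not
Queue code: the Node record, the free_slots field, and choose_target
are names I made up for illustration, and a real implementation would
have to get the slot counts from the scheduler without racing against
other submitters.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical view of one cluster node as seen by the submitter."""
    name: str
    free_slots: int


def choose_target(nodes, local_name="localhost"):
    """Pick where a job should run.

    Returns the name of the first node that has a free slot right now,
    or local_name when every node is busy (run locally, do not queue).
    """
    for node in nodes:
        if node.free_slots > 0:
            return node.name
    return local_name


if __name__ == "__main__":
    cluster = [Node("n1", 0), Node("n2", 2)]
    print(choose_target(cluster))
    print(choose_target([Node("n1", 0)]))
```

The complexity I am worried about is not in this function but in
keeping the slot information current between the check and the actual
job start.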

> 
> In looking over Mark's proposals, some of this may be addressed soon already,
> particularly the protocol race-condition issue.  At one point the question
> was raised on whether or not any code from Queue could be reused or not
> to implement some of his ideas.  My gut reaction is probably not.  Ideas,
> sure.  But probably not any code.  Not to implement what he had in mind.
> A re-write from scratch would probably be easier than trying to retrofit
> some of those ideas on top of the current code base.  Well, reusing some of
> the autotools stuff should work.
> 
> I would like to emphasize heterogeneity again.  One thing that I read in
> Mark's proposal was a seeming focus on Linux-kernel based systems.  Or at
> least homogenous environments.  I strongly feel this is a big mistake.
> All the world is not a VAX.  Let's not continue to relearn this lesson.
> 

By homogeneous systems I mean a dedicated collection of systems with the
same architecture and operating system. I think this is the most common
setup that potential users of queue will have, based on my own
experience and what I see others around here doing. I would like queue
to run on any such environment, not specifically Linux/x86 setups. The
lab down the hall has a homogeneous 100 node G5 cluster with Mac OS X,
for instance. They use Sun GRID. Cornell Theory Center has several
Windows-based clusters; they use their own queuing system. I'm not sure
how many new purchases of Solaris clusters or other Unix systems there
will be, given that Linux is highly competitive price/performance wise
for anyone who is considering the investment and wants a Unix
environment. Anyone buying Sun would probably use Sun GRID, I suppose.
Anyway, I would like queue to be a viable option no matter what the
hardware or operating system is.

That said, I really only have access to Linux/x86 based clusters here,
so I can only develop/test/deploy with that. I hope that volunteers from
the community will help test/adapt to other environments.

As for mixed setups, say an old cluster on one architecture and a new
one on a different architecture, wearing the sys admin hat I would just
opt for running two GNU queue installations and keeping them separate.
Alternatively, a single queue installation can control both clusters
and require the user to specify the architecture/environment required
to run the job. This ultimately passes responsibility to the user, who
has to understand that the same program on different systems, even if
proper binaries are installed locally on those systems, may not produce
output readable on other architectures due to the binary formats of
integers, floating point numbers, etc. Some groups here develop in Java
though and want to farm out those jobs to get better throughput. HPC
with Java doesn't make much sense to me (like trying to haul a big load
using an army of snails), but that kind of application could farm out
to any system.
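
As a small illustration of the binary format problem, here is a Python
sketch using the standard struct module. The helper names
encode_result and decode_result are made up for this example and have
nothing to do with the Queue protocol; the point is only that raw bytes
written in one byte order are misread in the other unless both sides
agree on an explicit encoding.

```python
import struct

# The same 32-bit integer laid out in the two common byte orders.
little = struct.pack("<i", 1)   # little-endian layout (e.g. x86)
big = struct.pack(">i", 1)      # big-endian layout (e.g. PowerPC G5)

# Reading little-endian bytes as if they were big-endian gives garbage.
misread = struct.unpack(">i", little)[0]


def encode_result(count, value):
    """Hypothetical helper: pack an integer and a double portably.

    "!" selects network (big-endian) byte order with standard sizes;
    "q" is an 8-byte signed integer and "d" an 8-byte IEEE 754 double.
    """
    return struct.pack("!qd", count, value)


def decode_result(data):
    """Inverse of encode_result; gives the same answer on any host."""
    return struct.unpack("!qd", data)


if __name__ == "__main__":
    print(little, big, misread)
    print(decode_result(encode_result(3, 0.5)))
```

A program that writes its output with a fixed encoding like this stays
readable across architectures; one that dumps its in-memory structures
does not, and queue cannot fix that for the user.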

For near-term development, I want to first focus on the simpler
problem of homogeneous environments and do this well, build an active
user base and development team, then broaden the range of application
for GNU queue to include more complicated setups.

> Werner, if you've not yet done so, please go ahead and remove me from the
> SF project.  I don't foresee having any time to even participate in a role
> as being able to compile+test, much less contribute any code.
> 
> Again, my apologies.  It shouldn't have required an external event like
> this to kick start the process.
> 
> Good luck, Mark!
> 
> Cheers, mrc