[Queue-developers] comments on mike's message
From: Koni <mh...@co...> - 2005-06-15 14:38:48
> Richard, Werner, and Mark:
>
> My apologies for the issues I have caused.
>
> When I first took over Queue, I had ample spare time, being unemployed at
> the time. I had expected to make some quick work of the project.
>
> But, as I got into it, it turned out to be more work than I anticipated,
> for reasons I'll get into below.
>
> Then, of course, I found a job. Between work and raising a family, it
> turned out I was never able to find the time to properly fix the problems.
> Heck, finding time to properly reply to these messages from Werner and
> RMS was just another indication of the problem.
>
> I could not, as much as Werner requested it, simply do a release with
> just my name as the new maintainer. Releasing something that wouldn't
> even compile just wasn't right, in my opinion.
>
> Some work I did get done on Queue is the following:
>
> Updated to modern autoconf. I think all of that is taken care of.
> It should work with any autoconf-2.59+.

Excellent. In the past I have had a lot of trouble understanding how to use the autoconf/automake system as a developer; hopefully I can follow the work you have done with autoconf and carry it forward into a new code base.

> It does compile again, but all of the terminal allocation code is now absent
> (that is, you HAVE to use -n now).
>
> Two main issues and several minor ones (plans really) still exist:
>
> 1) As I mentioned, no terminal code. The previous stuff was too outdated
> to work on modern systems. I could have just borrowed code from a package
> like screen, expect, or script or something. While screen and expect at
> GPL, I was actually hoping to get something owned by FSF, and use it.
> To that end, a post to one of the GNU lists asking for pointers would
> probably be a good start.

For the new development plans, I think we'll have to start in the same way, without the terminal capabilities, and then add these features once job distribution, scheduling, and hopefully some management tools that let users control the system (i.e., the ability to list, delete, or delay jobs) are functional.

> 2) The more important issue, I think, is that the protocol, as currently
> implemented, is subject to race conditions. I can deadlock in less
> than 3 seconds with nothing more complicated than the `date' command.
> This requires a complete overhaul, which is where I got caught up.

I have observed the deadlock problem in the queue-stable branch and traced its sources to a couple of different race conditions. Some of these can be worked around by inserting delays, but guaranteeing correctness is complex. I agree that a complete overhaul of job distribution is the preferred approach.

> 3) Starting with the minor issues, or would be nices, would be a migration
> from SF to Savannah. At one time, when I thought I was going to give
> energy to Queue, I was going to do this migration, then they had the
> security issue in 2003.

I too would like to move away from SourceForge. I need to check out Savannah though. My main complaint with SourceForge is that it's too busy and complicated, with way too much irrelevant stuff on any webpage except for the "home" pages for the project. There are 4 or 5 different ways to post something, which seems unnecessary and confusing. For us as developers, it's difficult to keep track of all the different places a user might post a bug, patch, or comment.

> 4) Slowly rewrite all of GQ to enable a definitive set of authorship,
> to enable safely transferring the code to FSF ownership. (This was why
> I didn't want to just pull terminal code from expect, or even screen,
> as neither of those are FSF owned either.)

I'm not sure how I feel about FSF ownership of copyright.
I'm not opposed to it, but I am not particularly in favor either, because I am not sure what it gains for GNU queue, and it may restrict what I might want to do in the future with my own code. When I asked RMS about some details, he simply indicated that FSF ownership is not required for GNU queue. I guess we can think about it more when we have a new code base out and distributed under the GPL with a known author list.

> 5) I'd had some grand ideas about rewriting both the config files and
> communication protocols using some sort of XML structure. I'm less
> convinced of that now. But the current set of configuration items are
> too system specific, and every time I see a double go across the wire,
> I wince. I really think that there should be more emphasis on heterogenous
> environments, including configurations shared by multiple architectures.

For the new stuff I've already defined the config formats as a context-free grammar (CFG) with a flex/bison parser, and likewise for the transfer of job information to execution agents. I don't know XML and am loath to learn it (shame on me) as the syntax of the crap is just so damn ugly. See below about heterogeneous environments.

> 6) I'd also thought it's be cool to have some sort of library suitable for
> use with linking into GNU Make for remote processing when using `make -j'.
> I now think this can be accomplished by using SH=/some/wrapper/if/not/qsh
> instead. I may be wrong though.

distcc covers this territory. As for queue, if the cluster is already busy at all nodes, compilation would end up submitting jobs that wait in the queue. I'm not sure if we want to deal with the complexity of "if resources are available immediately, distribute the job, else run locally", but it would be a cool feature.

> In looking over Mark's proposals, some of this may be addressed soon already,
> particularly the protocol race-condition issue.
> At one point the question
> was raised on whether or not any code from Queue could be reused or not
> to implement some of his ideas. My gut reaction is probably not. Ideas,
> sure. But probably not any code. Not to implement what he had in mind.
> A re-write from scratch would probably be easier than trying to retrofit
> some those ideas on top of the current code base. Well, reusing some of
> the autotools stuff should work.
>
> I would like to emphasize heterogeneity again. Once thing that I read in
> Mark's proposal was a seeming focus on Linux-kernel based systems. Or at
> least homogenous environments. I strongly feel this is a big mistake.
> All the world is not a VAX. Let's not continue to relearn this lesson.

By homogeneous systems I mean a dedicated collection of systems with the same architecture and operating system. I think this is the most common setup that potential users of queue will have, based on my own experience and what I see others around here doing. I would like queue to run on any such environment, not specifically Linux/x86 setups. The lab down the hall has a homogeneous 100-node G5 cluster running Mac OS X, for instance. They use Sun GRID. Cornell Theory Center has several Windows-based clusters; they use their own queuing system. I'm not sure how many new purchases of Solaris clusters or other Unix systems there will be, given that Linux is highly competitive on price/performance for anyone who is considering the investment and wants a Unix environment. Anyone buying Sun would probably use Sun GRID, I suppose.

Anyway, I would like queue to be a viable option no matter what the hardware or operating system is. That said, I really only have access to Linux/x86-based clusters here, so I can only develop, test, and deploy with that. I hope that volunteers from the community will help test on and adapt to other environments.

As for mixed setups, say an old cluster on one architecture and a new one on a different architecture: wearing the sysadmin hat, I would just opt for running two GNU queue installations and keeping them separate. Otherwise, a single queue installation can control both clusters and require the user to specify the architecture/environment required to run the job, but this ultimately passes responsibility to the user to understand that the same program on different systems, even if proper binaries are installed locally on those systems, may not produce output readable on other architectures due to the binary formats of integers, floating-point values, etc.

Some groups here develop in Java though and want to farm out those jobs to get better throughput. HPC with Java doesn't make much sense to me (like trying to haul a big load using an army of snails), but that kind of application could farm out to any system.

For near-future development, I want to first focus on the simpler problem of homogeneous environments and do this well, build an active user base and development team, then broaden the range of application for GNU queue to include more complicated setups.

> Werner, if you've not yet done so, please go ahead and remove me from the
> SF project. I don't for see having any time to even participate in a role
> as being able to compile+test, much less contribute any code.
>
> Again, my apologies. It shouldn't have required an external event like
> this to kick start the process.
>
> Good luck, Mark!
>
> Cheers, mrc
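
To illustrate the concern raised above about a double going across the wire and about binary formats differing between architectures, here is a minimal Python sketch. It is not taken from the Queue code base; the value and variable names are made up for illustration. It only shows that the same double has a different byte layout under little-endian and big-endian encodings, and that agreeing on one wire order avoids the misread.

```python
import struct

# Hypothetical value for illustration; not from the Queue sources.
value = 1.5

# '<' forces little-endian layout, '>' forces big-endian layout.
little = struct.pack("<d", value)
big = struct.pack(">d", value)

# The two encodings of the same double are byte-for-byte mirror images.
assert little == big[::-1]

# Decoding big-endian bytes as if they were little-endian raises no error;
# it silently produces a different number.
misread = struct.unpack("<d", big)[0]

# Agreeing on one wire order ('!' is network byte order, i.e. big-endian)
# lets the value round-trip regardless of the host architecture.
wire = struct.pack("!d", value)
decoded = struct.unpack("!d", wire)[0]

print("misread:", misread)
print("decoded:", decoded)
```

A text-based job description sidesteps the same problem, at the cost of parsing on the receiving side.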