[Queue-developers] Re: questions about Queue
From: W. G. K. <wer...@ya...> - 2000-08-18 22:23:24
Probably the best way to get answers to questions like these is to write
to the Queue developers at que...@li... (you must subscribe first,
because that's how the spam-proofing works) or to the new (currently
much smaller) support list at que...@li..., set up for the anticipated
higher-volume support traffic.

Boris Lorbeer wrote:

> Hi, my name is Boris Lorbeer and I am working for a company which is
> looking for a good load-sharing tool that is open source and not as
> expensive as LSF. So we came across the Queue program, and after
> spending some time with it I have a couple of questions.

Again, Open Source software works because the costs (often chiefly
time) of developing and maintaining useful programs are distributed
over many organizations via the Internet. Hence, if you have thoughtful
questions or complaints (aside from security bugs, if there are any),
it is important that these be posted to one of the lists so that
maximal resources can be brought to bear on the problem. A key issue is
dividing up the lists (currently there are three, along with the
discussion forums on SourceForge) so that the signal-to-noise ratio of
each list is maximized and the right people are getting the right list
traffic.

> First of all, when I installed Queue as root, it did not acknowledge
> my settings for the option --sharedstatedir to the configure program;
> it always chose the default, which was /usr/local/var/queue (not
> /usr/local/com/queue, which is what the docs were claiming).

The ./configure tool is generated by autoconf and should work; see
./configure --help for usage. --sharedstatedir and the other directory
options should be recognized. I think the docs are out of date,
however: --sharedstatedir is indeed "com/queue", but after 1.20.1 there
no longer is a sharedstatedir, so state is stored in "var/queue", which
is --localstatedir. The manuals should be updated. They are currently
FAQ-O-MATIC HTML pages at http://bioinfo.mbb.yale.edu/fom/cache/1.html ,
so they can be updated by anyone, but eventually I should put them on
SourceForge so that our developers can update them.

> Next, when I changed the entries in the profile file and wanted the
> tool to read it, the way proposed by the docs (kill -HUP) did not
> work; this signal always terminated the program. The docs say:
> "queued will also periodically check for modifications to these
> files". What does 'periodically' mean?

Sigh. Again, the docs are out of date, especially after 1.20 came out.
"Periodically" used to be every 120 seconds, although this has changed:
the program checks for modifications more frequently when there is
activity in the batch queue. kill -ALRM to queued should work, however.
If queued doesn't see your changes within 120 seconds, or after
"kill -ALRM", something is wrong.

> For testing I chose 3 Linux machines. The first setting for maxexec
> was 2, which caused the daemons and the client queue to hang whenever
> I started more than, say, 5 programs at once.

3 * 2 = 6, so there are 6 job slots throughout the cluster. If you run
five jobs, there is only one free slot left in the cluster. You are
probably using 1.20.1 and seeing the deadlock bug; upgrade to the
version in the CVS archive (unpack the installation, cd to the
top-level directory, and type "cvs update", with CVS installed; or get
cvs off of ftp.gnu.org). Sounds like it is time for 1.20.2 to come out.
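In case it helps, the whole upgrade sequence looks roughly like this (a
sketch only; the directory name is whatever your unpacked source tree
happens to be called, and the --localstatedir value shown is just the
usual autoconf default):

    cd queue-1.20.1        # top of the unpacked source tree (name assumed)
    cvs update             # pull the post-1.20.1 fixes (needs cvs installed)
    ./configure --localstatedir=/usr/local/var
    make
    make install           # as root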
> When I increased maxexec to 20 these problems were gone, even when I
> was starting 40 processes at once:
> repeat 40 queue -n -q -d test_queue -r -- /usr/local/bin/perl -e
> '$SIG{ALRM} = sub {print "finished\n"; exit;}; alarm(20); while(1){};'

20 * 3 = 60 job slots for only 40 jobs, leaving 20 spare slots; at 20/3
spare slots per machine there is a high probability of finding a free
slot on any machine, so the deadlock bug is not triggered.

> But now I got other problems. First I should say that I had closed
> one host in this queue, so that there were only 2 left. For these 40
> jobs I received 80 mails: each of the 40 jobs appeared twice. One
> mail was sent from running the job on one host, and one was sent from
> running the same job (or rather a job with the same job id) on the
> other host. One of the mails had a proper-looking message, but the
> other one always complained:
> Can't stat output file test_queue/ofm433421925
> (probably because its twin finished first).
>
> Furthermore, the documentation is not very clear about the variables
> vmaxexec, nexec, and maxexec. One time it was:
> 1-min load average / ((max(0, vmaxexec - nexec) + 1) * pfactor)
> and in another piece of the docs I read:
> 1-min load average / ((max(0, vmaxexec - maxexec) + 1) * pfactor)
> Is maxexec the maximum number of jobs sent to one host from this
> queue, or is it the maximum number of jobs in this queue (i.e.,
> summed up over all the hosts)?

Time to update and upgrade the documentation. (Volunteers? It's
possible to set up the docs so that they can be edited by developers on
SourceForge. Currently, they can be edited by anyone on the FAQ-O-MATIC
page, http://bioinfo.mbb.yale.edu/fom/cache/1.html . If you see
something you think should be changed, feel free to go ahead and change
it without asking; I get emails of all changes that are made, anyway.)

nexec is the C-language variable used in the program; maxexec is the
same variable as named in "profile" files, so the two formulas are
equivalent. The documentation should use "maxexec" throughout for
consistency and clarity. maxexec and vmaxexec are per node. It has been
suggested that GNU Queue use a central database to establish variables
over the entire network, but no one has implemented this so far.

> The docs also say:
> 'These options, if present, will only override the user's values (via
> queue) for these limits if they are lower than what the user has set
> (or larger in the case of nice).'
> (this was related to options in the profile file like 'maxfree' and
> so on)
> How can I specify these options via the client 'queue'?

These limits involve operating-system resource limits (such as "nice",
which specifies the priority a program should run at). They can be
specified both by the user (via the "limit" command in csh scripts,
e.g., limiting the size of coredumps to zero to turn off coredumping)
and in "profile" as well. The lower value is chosen, so an
administrator can turn off coredumping for a specific batch queue even
if the user allows coredumping (see the P.S. below for a concrete
sketch). maxexec and vmaxexec cannot be set by the user in current
versions of GNU Queue.

> I saw in queue.c an option 'j'. What is it used for?

You can attach an optional comment to your jobs. This may be used in
email output, although it is really a vestige of very old days.

> I would really appreciate it if you could help me with these
> problems. Thank you very much.
> Boris Lorbeer.
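P.S. A minimal sketch of the user side of the resource-limit mechanism,
assuming a csh-style shell ("myjob" is a made-up job name; the queue
flags are the same ones used in your test above):

    % limit coredumpsize 0                      # user turns off core dumps
    % queue -n -q -d test_queue -r -- ./myjob   # submit as usual

If the batch queue's profile also constrains coredumpsize, queued uses
the lower of the two values, as described above.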