Thread: [Queue-developers] some questions
From: Gert V. d. E. <gvd...@sc...> - 2001-02-22 11:16:42

Dear users/developers of Queue,

The last few days I've been experimenting with Queue on a five-node Linux cluster (SuSE 7.0 out of the box, *without* installing the Queue package from SuSE 7.0). I have encountered several problems. I've been browsing the mailing list for more information; maybe I missed it...

* Using the latest release, 1.30.1 (after fixing the RLIMIT bug as mentioned in the bug tracker and compiling), all seemed to go well. But when I submitted a large number of jobs to one queue (exceeding the total sum of the different maxexec values on the nodes, so a couple of the jobs had to wait for a 'free slot'), I observed that Queue fills up all queues on the machines very nicely and puts the other jobs on hold. However, when a job has finished, the waiting jobs keep on waiting. They do not get a free slot, and when I query the queuestat files, one of them says that they are running, but nothing is happening on that host.

* I also tried to get the latest CVS going (queue-development). Something strange is going on during configuration and compilation. After the usual ./configure --enable-root and make, all seems to be well. When I do make install, it starts reconfiguring (and effectively changing config.h) and recompiling, and then the compilation breaks due to a missing cleanutent. It seems that during the reconfigure the support for rxvt utmp was added, but the file logging.c is not in the sources list in the makefile. The reconfiguration also changed the install directories for queue. Before, Queue's queues were in /usr/local/var/queue; now they were to go in /usr/local/var/spool/queue. The directory for the qhostfile changed from /usr/local/share to /usr/local/share/queue.

* I managed to fix the above to have it compile, no probs. When I start this CVS queued without the debugging option (on one host, just for testing; the other nodes are down) and I submit jobs to the now queue (the classical hostname job from the manual), I get emails like this:

----
Date: Thu, 22 Feb 2001 11:41:50 +0100
From: The Queue Daemon <ro...@fe...>
To: gvd...@fe...
Subject: batch queue_b on fermi: queued queued.c sendmail(): SENDMAIL: From: "queued" SENDMAIL: To: "gvdeynde"

queued queued.c sendmail():
SENDMAIL: From: "queued"
SENDMAIL: To: "gvdeynde"
----

and using the verbose option from queue gives me this:

---
Requesting load average for queue "now" on host "fermi"...
The host "fermi"is not able to serve queue "now".
Failed to submit job in queue "now" to host "fermi".
---

However, if I start queued in debug mode (queued --debug), I get this from queue --verbose:

---
Requesting load average for queue "now" on host "fermi"...
Host "fermi" appears to be able to serve queue "now".
Ok, connecting to QueueD at it.
Trying "fermi"...
Going to submit job to queue "now" on host "fermi".
queue.c: main(): tty(in/out/err): 1 1 1.
queued handle.c handle(): going to try to run "hostname".
queued handle.c handle(): assembled full path: "/bin/hostname".
queued handle.c handle(): going to execve(/bin/hostname).
fermi
---

My questions:

- Is the 1.30.1 release still relevant (is there a patch to fix the apparent hang of queued)?
- Is the developers version reliable? (I know it's a developers version, but I am aware of projects where it is better to stick to the developers version than to the stable releases.)

Thank you for your time and a very promising tool. I'm really looking forward to using queue on our system...

Gert Van den Eynde
SCK-CEN
Reactor Physics & Myrrha dept.
Neutronics Calculation Section
Belgium
From: W. G. K. <wer...@ya...> - 2001-02-22 13:59:13

Quoting Gert Van den Eynde <gvd...@sc...>:

> Dear users/developers of Queue,
>
> The last few days I've been experimenting with Queue on a five node Linux
> cluster (SuSE 7.0 out of the box *without* installing the Queue from SuSE 7.0).
> I have encountered several problems. I've been browsing the maillist for more
> information, maybe I missed it...
>
> * using the latest release 1.30.1 (after fixing the RLIMIT bug as mentioned in
> the bugtrack and compiling), all seemed to go well. But... when I submitted a
> large number of jobs to one queue (exceeding the total sum of different maxexec
> on the nodes, so a couple of the jobs had to wait for a 'free slot'), I
> observed that Queue fills up all queues on the machines very nice and puts the
> other jobs on hold. However, when a job has finished, the waiting jobs keep on
> waiting. They do not get a free slot and when I query the queuestat files, it
> says in one that they are running, but nothing is happening on that host.
>
> * I also tried to get the latest CVS going (queue-development). Something
> strange is going on during configuration and compilation. After the usual
> ./configure --enable-root and make, all seems to be well. When I do make
> install, it starts reconfiguring (and effectively changing config.h),
> recompiling and then the compilation breaks due to a missing cleanutent. It
> seems that during the reconfigure the support for rxvt utmp was added, but the
> file logging.c is not in the sources list in the makefile. The reconfiguration
> also changed the install directories for queue. Before, queue queue's were in
> /usr/local/var/queue, now they were to go in /usr/local/var/spool/queue. The
> directory for the qhostfile changed from /usr/local/share to
> /usr/local/share/queue.
>
> * I managed to fix the above to have it compile, no probs. When I start this
> CVS queued without debugging option (on one host, just for testing, the other
> nodes are down) and I submit jobs to the now queue (the classical hostname job
> from the manual), I get emails like this:
>
> ----
> Date: Thu, 22 Feb 2001 11:41:50 +0100
> From: The Queue Daemon <ro...@fe...>
> To: gvd...@fe...
> Subject: batch queue_b on fermi: queued queued.c sendmail(): SENDMAIL: From:
> "queued" SENDMAIL: To:
> "gvdeynde"
>
> queued queued.c sendmail():
> SENDMAIL: From: "queued"
> SENDMAIL: To: "gvdeynde"
> ----
>
> and using the verbose option from queue gives me this:
>
> ---
> Requesting load average for queue "now" on host "fermi"...
> The host "fermi"is not able to serve queue "now".
> Failed to submit job in queue "now" to host "fermi".
> ---
>
> However, if I start queued in debug mode (queued --debug), I get this from
> queue --verbose ....
>
> ---
> Requesting load average for queue "now" on host "fermi"...
> Host "fermi" appears to be able to serve queue "now".
> Ok, connecting to QueueD at it.
> Trying "fermi"...
> Going to submit job to queue "now" on host "fermi".
> queue.c: main(): tty(in/out/err): 1 1 1.
> queued handle.c handle(): going to try to run "hostname".
> queued handle.c handle(): assembled full path: "/bin/hostname".
> queued handle.c handle(): going to execve(/bin/hostname).
> fermi
> ---
>
>
> My questions:
>
> - Is the 1.30.1 release still relevant (is there a patch to fix the apparant
> hang of queued ?)

This patch has been rolled into the CVS release.

> - Is the developers version reliable (I know it's a developers version, but I
> am aware of projects where it is best to stick to the developers version than
> to the stable releases) ?

Unfortunately, our project is in such a phase right now. It is probably easiest to figure out what is wrong with the CVS version and get this working, as opposed to playing with 1.30.1. Once this works a little better, I hope to roll out 1.30.2 from the CVS version anyway, so that 1.30.1 should become obsolete soon. (The other thing you could try would be the pre-1.20.1 releases.)

> Thank you for your time and a very promising tool. I'm really looking forward
> to using queue on our system...
>
> Gert Van den Eynde
> SCK-CEN
> Reactor Physics & Myrrha dept.
> Neutronics Calculation Section
> Belgium
>
>
> _______________________________________________
> Queue-developers mailing list Que...@li...
> To unsubscribe, subscribe, or set options:
> http://lists.sourceforge.net/lists/listinfo/queue-developers
From: QingLong <qin...@Bo...> - 2001-02-22 14:10:38

Hello!

I am only working on the queue-development branch, so I'll only talk about it.

On Thu, Feb 22, 2001 at 12:14:41PM +0100, Gert Van den Eynde wrote:
>
> * I also tried to get the latest CVS going (queue-development).
> Something strange is going on during configuration and compilation.
> After the usual ./configure --enable-root and make, all seems to be well.
> When I do make install, it starts reconfiguring (and effectively changing
> config.h), recompiling and then the compilation breaks due to a missing
> cleanutent. It seems that during the reconfigure the support for rxvt utmp
> was added, but the file logging.c is not in the sources list in the makefile.
>

My fault. I had assumed that everyone building queue-development would run aclocal, automake and autoconf beforehand. I have only been committing changes to Makefile.am, configure.in and acconfig.h, skipping all derived stuff.

>
> * I managed to fix the above to have it compile, no probs.
> When I start this CVS queued without debugging option
> (on one host, just for testing, the other nodes are down)
> and I submit jobs to the now queue (the classical hostname job
> from the manual), I get emails like this:
[...]
> and using the verbose option from queue gives me this:
[...]
> However, if I start queued in debug mode (queued --debug), I get this from queue --verbose ....
>

Would you be so kind as to provide the exact commands you have issued?

>
> - Is the developers version reliable (I know it's a developers version,
> but I am aware of projects where it is best to stick to the developers
> version than to the stable releases) ?
>

Well, all the changes I've made were to get it to work (more reliably). I have managed to make it work rather reliably for me, although, of course, I haven't tried out all possible combinations of command line options, environment etc. Please help me to get queue to reproduce the faulty behaviour you were talking about.

Thank you.

QingLong.
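
The regeneration step QingLong describes can be sketched roughly as follows. This is only a sketch under assumptions: the message names aclocal, automake and autoconf; the autoheader call is an addition here (it is the tool that normally consumes acconfig.h), and the `bootstrap` function name and the `--add-missing` flag are illustrative choices, not something the thread confirms the project uses.

```shell
#!/bin/sh
# Sketch: regenerate the derived autotools files in a CVS checkout that only
# carries up-to-date Makefile.am, configure.in and acconfig.h.
# "bootstrap" is a hypothetical helper name, not part of Queue.

bootstrap() {
    aclocal                || return 1   # collect macros into aclocal.m4
    autoheader             || return 1   # assumption: regenerate config.h.in from acconfig.h
    automake --add-missing || return 1   # regenerate Makefile.in from Makefile.am
    autoconf               || return 1   # regenerate ./configure from configure.in
}

# Usage, from the top of the source tree:
#   bootstrap && ./configure --enable-root && make
```

Stopping at the first failing tool keeps a half-regenerated tree from being configured by accident.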
From: Mike C. <da...@ix...> - 2001-02-22 19:35:53

On Thu, Feb 22, 2001 at 05:18:34PM +0300, QingLong wrote:
> My fault. I've been believing that everyone building queue-development
> would run aclocal, automake and autoconf beforehand. I've been only
> committing changes to Makefile.am, configure.in and acconfig.h,
> skipping all derived stuff.

If this is the case, you should probably remove configure from the repository.

Either don't put derived files in the repository, or make sure they stay up to date.

mrc
--
Mike Castle               Life is like a clock: You can work constantly
da...@ix...               and be right all the time, or not work at all
www.netcom.com/~dalgoda/  and be right at least twice a day. -- mrc
We are all of us living in the shadow of Manhattan. -- Watchmen
From: W. G. K. <wer...@ya...> - 2001-02-23 01:33:53

Mike Castle wrote:

> On Thu, Feb 22, 2001 at 05:18:34PM +0300, QingLong wrote:
> > My fault. I've been believing that everyone building queue-development
> > would run aclocal, automake and autoconf beforehand. I've been only
> > committing changes to Makefile.am, configure.in and acconfig.h,
> > skipping all derived stuff.
>
> If this is the case, you should probably remove configure from the
> repository.

./configure stays in the repository. It makes my life (and, presumably, that of everyone else who uses the CVS repository) much simpler.

I like to try to keep track of everything that actually goes into a finished release. ./configure is part of that process (and most users don't run autoconf), so there needs to be a history of it.

> Either don't put derived files in the repository, or make sure they stay up
> to date.

This is a good rule. It's a good idea to keep ./configure consistent so that testers can pull a copy and get it to work more easily. The easier it is for people to test configurations, the more feedback there is, and the better the final result. This is another reason why I like configure in there as well.

But I don't want too much discussion of CVS repository rules on here. I want to make sure developers here focus on what's important:

1. coding
2. documenting changes.

It's very important that developers feel this is a supportive environment where they can be creative. Periodically (for legal as well as other reasons) I'll make sure the repository is consistent.

> mrc
> --
> Mike Castle               Life is like a clock: You can work constantly
> da...@ix...               and be right all the time, or not work at all
> www.netcom.com/~dalgoda/  and be right at least twice a day. -- mrc
> We are all of us living in the shadow of Manhattan. -- Watchmen
From: Gert V. d. E. <gvd...@sc...> - 2001-02-23 07:46:57

Hi QingLong,

Thank you for the quick response...

On Thu, 22 Feb 2001 17:18:34 +0300, QingLong said:

[snip]

> Would you be so kind as to provide exact commands you have issued?

root@fermi:~> queued

gvdeynde@fermi:~ > queue -v -i -w -p -h fermi -- hostname
Requesting load average for queue "now" on host "fermi"...
The host "fermi"is not able to serve queue "now".
Failed to submit job in queue "now" to host "fermi".

root@fermi:~> killall queued
root@fermi:~/queue-development > queued --debug --foreground

gvdeynde@fermi:~ > queue -v -i -w -p -h fermi -- hostname
Requesting load average for queue "now" on host "fermi"...
Host "fermi" appears to be able to serve queue "now".
Ok, connecting to QueueD at it.
Trying "fermi"...
Going to submit job to queue "now" on host "fermi".
queue.c: main(): tty(in/out/err): 1 1 1.
queued handle.c handle(): going to try to run "hostname".
queued handle.c handle(): assembled full path: "/bin/hostname".
queued handle.c handle(): going to execve(/bin/hostname).
fermi

and the root window gives this as the output of queued:

SENDMAIL: To 'gvdeynde' from 'queued':
Subject: batch queue_b on fermi: now/CFDIR/cfm694234292: Job is starting now.
now/CFDIR/cfm694234292: Job is starting now.

Concerning your remark on sleeptime: in my version of queued.c this is at line 799, I believe: alarm(sleeptime). I don't set the sleeptime on the queued command line, so it should be at the default of 120 s. If I set it manually to 2 s, with no debug options, and I wait for 2 seconds before I run the queue command, it works. If I run it sooner, I get the same error as before.

root@fermi:~ > queued -t 2

[After 2 seconds waiting]

gvdeynde@fermi:~ > queue -v -i -w -p -h fermi -- hostname
Requesting load average for queue "now" on host "fermi"...
Host "fermi" appears to be able to serve queue "now".
Ok, connecting to QueueD at it.
Trying "fermi"...
Going to submit job to queue "now" on host "fermi".
queue.c: main(): tty(in/out/err): 1 1 1.
fermi

So in theory, if I start queued without options, I have to wait 120 seconds before I can submit a job. If I submit before that time, I get an error that the queue cannot be served. However, if I set sleeptime manually to 2 seconds, I can start jobs right away (or as good as), but queued wakes up every 2 seconds (I have no idea of the overhead cost of that)...

Again, thanks for taking the time to help me out,

Gert
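
To put rough numbers on the trade-off Gert raises, the sketch below counts how often a daemon that wakes once per sleeptime would wake in a day. It says nothing about what each wake-up costs, since the thread does not describe the work queued does per cycle; `wakeups_per_day` is a hypothetical helper name.

```shell
# wakeups_per_day SLEEPTIME_SECONDS
# Number of wake-ups in 24 hours (86400 s) for a given sleeptime, assuming
# exactly one wake-up per sleeptime interval.
wakeups_per_day() {
    echo $((86400 / $1))
}

wakeups_per_day 120   # default sleeptime from the thread: prints 720
wakeups_per_day 2     # the -t 2 setting: prints 43200
```

So the 2-second setting wakes the daemon sixty times as often as the default; whether that matters depends on the per-cycle cost, which would have to be measured.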
From: QingLong <qin...@Bo...> - 2001-02-22 14:58:30

On Thu, Feb 22, 2001 at 12:14:41PM +0100, Gert Van den Eynde wrote:
>
> and using the verbose option from queue gives me this:
>
> ---
> Requesting load average for queue "now" on host "fermi"...
> The host "fermi"is not able to serve queue "now".
> Failed to submit job in queue "now" to host "fermi".
> ---
>
> However, if I start queued in debug mode (queued --debug), I get this from queue --verbose ....
>
> ---
> Requesting load average for queue "now" on host "fermi"...
> Host "fermi" appears to be able to serve queue "now".
> Ok, connecting to QueueD at it.
> Trying "fermi"...
> Going to submit job to queue "now" on host "fermi".
> queue.c: main(): tty(in/out/err): 1 1 1.
> queued handle.c handle(): going to try to run "hostname".
> queued handle.c handle(): assembled full path: "/bin/hostname".
> queued handle.c handle(): going to execve(/bin/hostname).
> fermi
> ---
>

I believe I know what's happening. IMHO, the problem is the ``having a sleep() before going to job'' code in queued.c (approx. line 890), which is skipped in debug mode. It looks like the code isn't elaborate and isn't working as it ought to. I have already pointed this problem out here.

The only known solution for now is: wait for a while (to be precise: sleeptime) and queued will wake up and start working.

Please tell me if my guess is correct.

QingLong.
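
The "wait for sleeptime" workaround could be wrapped in a small retry loop along these lines. This is a sketch, not part of Queue: `retry_submit` is a hypothetical helper, and it assumes the "not able to serve queue" text shown in this thread is what a too-early submission prints. Any other output, including other errors, is treated as a successful submission.

```shell
# retry_submit MAX_WAIT INTERVAL COMMAND [ARGS...]
# Re-run COMMAND every INTERVAL seconds until its output no longer contains
# the "not able to serve queue" message, or until MAX_WAIT seconds have passed.
retry_submit() {
    max_wait=$1
    interval=$2
    shift 2
    waited=0
    while :; do
        out=$("$@" 2>&1)
        case $out in
            *"not able to serve queue"*)
                # Daemon is presumably still asleep: fall through and retry.
                ;;
            *)
                printf '%s\n' "$out"
                return 0
                ;;
        esac
        if [ "$waited" -ge "$max_wait" ]; then
            printf '%s\n' "$out" >&2   # give up and show the last failure
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
}

# Usage with the command from this thread and the default sleeptime of 120 s:
#   retry_submit 120 5 queue -v -i -w -p -h fermi -- hostname
```

Because the output is captured to inspect it, the job's own output only appears once the command has finished, which is fine for a quick test like hostname but less so for long interactive jobs.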