Thread: [Queue-developers] Re: Re: Re: new intermediate development Queue version
From: QingLong <qin...@Bo...> - 2001-03-06 12:33:19
On Mon, Mar 05, 2001 at 09:35:12AM +0100, Gert Van den Eynde wrote:
> On Sat, 3 Mar 2001 05:34:27 +0300, QingLong said:
>>
>> I've made some changes to getrldavg() code that may influence
>> the misbehaviour you have reported recently. Please try updated code.
>
> Updated queue and queued and did the same tests as last week.
> Queue still locks up (or continuously keeps on trying) to get the load
> on the machines.
>
> Queued gives this as 'error' output:
>
> qlib.c Queue_net_connect(): connect()ing to 192.168.1.2:1423 ...
> qlib.c Queue_net_connect(): connect()ed to 192.168.1.2:1423 on socket 7.
> qlib.c Queue_nonblocking_rw(): failed to select() on fd 7:
> select(): Interrupted system call
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> qlib.c Queue_net_rw(): failed to get 1 4-byte items on fd 7; got 0 bytes.
> wakeup.c getrldavg(): failed to fread() from fd 7.
> wakeup.c getrldavg(): close(7).
> wakeup.c getrldavg(): ### failed to get load from dirac
> ### returning 1.00e+08 as rejection designator.
> qlib.c Queue_net_connect(): connect()ing to 192.168.1.3:1423 ...
> qlib.c Queue_net_connect(): connect()ed to 192.168.1.3:1423 on socket 7.
>

I suspect I know what the matter is.
AFAIR, you have a short sleeptime (2 seconds?), don't you?
Please perform a small test: try to run it with the default value of 120s.
Does it change anything? If it does, then the problem is alarms
(used to schedule jobs) interrupting select() on the network socket.

I am going to put a workaround for scheduled alarms into the network I/O
code. It will become unnecessary once we get rid of streams on network
sockets (and of using alarm() to time out reading/writing those streams)
and instead use select() on a bind()ed, listen()ing socket (put in
non-blocking mode) to multiplex the tasks of scheduling jobs and
accepting network connections.

QingLong.
From: Gert V. d. E. <gvd...@sc...> - 2001-03-06 13:09:19
Hi QingLong,

> I suspect I know what's the matter.
> AFAIR, you have a short sleeptime (2 seconds?), do you?

Yes...

> Please perform a small test: try to run it with default value of 120s.

Just did this (started queued with options --foreground --debug)

> Does it change anything? If it does, then the problem matter is alarms

I'm afraid not:

qlib.c Queue_net_connect(): connect()ing to 192.168.1.1:1423 ...
qlib.c Queue_net_connect(): connect()ed to 192.168.1.1:1423 on socket 7.
wakeup.c getrldavg(): close(7).
wakeup.c getrldavg(): fermi returned load 1.15e+00.
qlib.c Queue_net_connect(): connect()ing to 192.168.1.3:1423 ...
qlib.c Queue_net_connect(): connect()ed to 192.168.1.3:1423 on socket 7.
wakeup.c getrldavg(): close(7).
wakeup.c getrldavg(): bohr returned load 1.08e+00.
qlib.c Queue_net_connect(): connect()ing to 192.168.1.4:1423 ...
qlib.c Queue_net_connect(): connect()ed to 192.168.1.4:1423 on socket 7.
qlib.c Queue_nonblocking_rw(): failed to select() on fd 7:
select(): Interrupted system call
qlib.c Queue_net_rw(): failed to get 1 4-byte items on fd 7; got 0 bytes.
wakeup.c getrldavg(): failed to fread() from fd 7.
wakeup.c getrldavg(): close(7).
wakeup.c getrldavg(): ### failed to get load from wigner
### returning 1.00e+08 as rejection designator.

Hope to hear from you soon

Gert
From: Gert V. d. E. <gvd...@sc...> - 2001-03-06 13:45:26
Hi QingLong,

I have an even stranger problem... (Yes, that's possible :-) )

If I start queued simply with --foreground (so no --debug),
the "now" queue doesn't accept any jobs.
I get this as output from queue -i -v -w -p -h fermi -- hostname

Requesting load average for queue "now" on host "fermi"...
The host "fermi" is not able to serve queue "now".
Failed to submit job in queue "now" to host "fermi".

If I start queued with --foreground --debug,
I get from queue -i -v -w -p -h fermi -- hostname

Requesting load average for queue "now" on host "fermi"...
Host "fermi" appears to be able to serve queue "now".
Ok, connecting to QueueD at it.
Trying "fermi"...
Going to submit job to queue "now" on host "fermi".
queue.c: main(): tty(in/out/err): 1 1 1.
fermi

Gert
From: QingLong <qin...@Bo...> - 2001-03-06 16:36:32
On Tue, Mar 06, 2001 at 02:25:40PM +0100, Gert Van den Eynde wrote:
>
> I have an even stranger problem... (Yes, that's possible :-) )
>
> If I start queued simply with --foreground (so no --debug),
> the "now" queue doesn't accept any jobs.
> I get this as output from queue -i -v -w -p -h fermi -- hostname
>
> Requesting load average for queue "now" on host "fermi"...
> The host "fermi" is not able to serve queue "now".
> Failed to submit job in queue "now" to host "fermi".
>

This is not strange; look in queued main() (queued.c around line 900):
|
|  /*
|   * Go to sleep for a while before flooding the system with
|   * jobs, in case it crashes again right away, or the
|   * system manager wants to prevent jobs from running.
|   * Send a SIGALRM to give it a kick-start.
|   */
|
|  if (!debug) {
|    alarm(sleeptime);
|
|    /* WGK: Rather than do a sigpause(), here, we do a check_query
|       here, which will cause us to wake up immediately if someone
|       submits a new job in the first few minutes. This could cause
|       the batchd to flood the system with new jobs in the event of an
|       immediate query, but is unlikely to cause any real problems.*/
|
|    check_query();
|
|    (void) alarm(0);
|  }
|
One has to wait for sleeptime seconds after starting queued in non-debug
mode before it begins accepting jobs. Haven't I already pointed this out
here?

>
> If I start queued with --foreground --debug,
> I get from queue -i -v -w -p -h fermi -- hostname
>
> Requesting load average for queue "now" on host "fermi"...
> Host "fermi" appears to be able to serve queue "now".
> Ok, connecting to QueueD at it.
> Trying "fermi"...
> Going to submit job to queue "now" on host "fermi".
> queue.c: main(): tty(in/out/err): 1 1 1.
> fermi
>

Isn't this the expected output?
What is the problem here?

QingLong.
From: W. G. K. <wer...@ya...> - 2001-03-06 22:11:50
Perhaps we should change this behavior. (It's actually very old and
somewhat historical by now.)

It's still a good idea not to start flooding the system with jobs after a
crash, but if the queue is empty, jobs should be processed immediately.
Jobs sitting in the queue could be started slowly, one after another,
until a certain grace period expires.

QingLong wrote:

> On Tue, Mar 06, 2001 at 02:25:40PM +0100, Gert Van den Eynde wrote:
> >
> > I have an even stranger problem... (Yes, that's possible :-) )
> >
> > If I start queued simply with --foreground (so no --debug),
> > the "now" queue doesn't accept any jobs.
> > I get this as output from queue -i -v -w -p -h fermi -- hostname
> >
> > Requesting load average for queue "now" on host "fermi"...
> > The host "fermi" is not able to serve queue "now".
> > Failed to submit job in queue "now" to host "fermi".
> >
>
> This is not strange; look in queued main() (queued.c around line 900):
> |
> |  /*
> |   * Go to sleep for a while before flooding the system with
> |   * jobs, in case it crashes again right away, or the
> |   * system manager wants to prevent jobs from running.
> |   * Send a SIGALRM to give it a kick-start.
> |   */
> |
> |  if (!debug) {
> |    alarm(sleeptime);
> |
> |    /* WGK: Rather than do a sigpause(), here, we do a check_query
> |       here, which will cause us to wake up immediately if someone
> |       submits a new job in the first few minutes. This could cause
> |       the batchd to flood the system with new jobs in the event of an
> |       immediate query, but is unlikely to cause any real problems.*/
> |
> |    check_query();
> |
> |    (void) alarm(0);
> |  }
> |
> One has to wait for sleeptime seconds after starting queued in non-debug
> mode before it begins accepting jobs. Haven't I already pointed this out
> here?
>
> >
> > If I start queued with --foreground --debug,
> > I get from queue -i -v -w -p -h fermi -- hostname
> >
> > Requesting load average for queue "now" on host "fermi"...
> > Host "fermi" appears to be able to serve queue "now".
> > Ok, connecting to QueueD at it.
> > Trying "fermi"...
> > Going to submit job to queue "now" on host "fermi".
> > queue.c: main(): tty(in/out/err): 1 1 1.
> > fermi
> >
>
> Isn't this the expected output?
> What is the problem here?
>
> QingLong.
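The startup policy Werner proposes at the top of his message (start jobs at once when nothing was waiting at startup, otherwise release the backlog slowly until a grace period expires) might be expressed as a small decision function. This is only a sketch: struct startup_policy, its fields, and may_start_job() are hypothetical names, and the spacing between backlog launches (gap) is an assumption, since the message does not say how slowly "slowly" should be.

```c
#include <time.h>

/* Hypothetical policy state; nothing here is taken from queued.c. */
struct startup_policy {
    time_t   started;      /* when the daemon came up */
    time_t   last_launch;  /* when a backlog job was last started, 0 if never */
    unsigned grace;        /* grace period after startup, in seconds */
    unsigned gap;          /* assumed minimum spacing between launches, in seconds */
    int      backlog;      /* jobs already spooled when the daemon started */
};

/* Returns 1 if a job may be started at time `now`, 0 if it should wait. */
int may_start_job(const struct startup_policy *p, time_t now)
{
    if (p->backlog == 0)
        return 1;                       /* empty queue: nothing to flood with */
    if (now - p->started >= (time_t)p->grace)
        return 1;                       /* grace period is over */
    if (p->last_launch == 0)
        return 1;                       /* first backlog job may go right away */
    return (now - p->last_launch) >= (time_t)p->gap;
}
```

With this shape, a freshly started daemon with an empty spool accepts a "now" job immediately, which is the case Gert ran into, while a daemon restarting after a crash with spooled jobs still ramps up gradually.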
From: QingLong <qin...@Bo...> - 2001-03-06 17:14:32
On Tue, Mar 06, 2001 at 02:06:39PM +0100, Gert Van den Eynde wrote:
>
> > Does it change anything? If it does, then the problem matter is alarms
>
> I'm afraid not:
>
> qlib.c Queue_net_connect(): connect()ing to 192.168.1.1:1423 ...
> qlib.c Queue_net_connect(): connect()ed to 192.168.1.1:1423 on socket 7.
> wakeup.c getrldavg(): close(7).
> wakeup.c getrldavg(): fermi returned load 1.15e+00.
> qlib.c Queue_net_connect(): connect()ing to 192.168.1.3:1423 ...
> qlib.c Queue_net_connect(): connect()ed to 192.168.1.3:1423 on socket 7.
> wakeup.c getrldavg(): close(7).
> wakeup.c getrldavg(): bohr returned load 1.08e+00.
> qlib.c Queue_net_connect(): connect()ing to 192.168.1.4:1423 ...
> qlib.c Queue_net_connect(): connect()ed to 192.168.1.4:1423 on socket 7.
> qlib.c Queue_nonblocking_rw(): failed to select() on fd 7:
> select(): Interrupted system call
> qlib.c Queue_net_rw(): failed to get 1 4-byte items on fd 7; got 0 bytes.
>

Maybe, maybe. Nevertheless, I added the workaround I mentioned in my
previous message, as I think it is the right thing to do (here, on the
not-so-right way). Would you be so kind as to give it a try?
Maybe that will reveal something interesting that will help me
gain insight into the problem.

Thank you.

QingLong.