Re: [SSI-devel] cluster hang with top of the kernel tree
From: John B. <joh...@hp...> - 2005-04-20 19:15:59
SAMPATHKUMAR KISHORE KANIYAR wrote:
> > With the top of the tree i am hitting the below condition with error
> > -11 (EAGAIN). The cluster hang after that.
> >
> > file: ics/ics_svr_mgmt.c
> >
> > 876 if (error && error != preverror) {
> > 877         printk(KERN_WARNING
> > 878                "icssvr_nanny: spawn_daemon_proc"
> > 879                "error=%d, will be retried",
> > 880                error);
> > 881         preverror = error;
> > 882 }
>
> I was looking through the related code.
>
> spawn_daemon_proc() invokes kernel_thread() which invokes __do_fork().
> In __do_fork(), GET_PID() is failing (with a value less than zero)
> which results in "-EAGAIN" being returned by __do_fork().
>
> In the current implementation, spawn_daemon_proc() will loop forever
> as long as it still needed to create more ics daemon's. In the current
> scenario, this seems to be the case due to kernel_thread() failure.
>
> I could not find a reason why GET_PID() would fail and return a value
> less than zero! Probably, someone will be able to answer this.
>
> NOTE: I hope CONFIG_VPROC etc are all enabled suitably while configuring
>       the kernel and that this is not a problem due to any oversight
>       related to configuring. Just checking.
>
> - Kishore
>

The initial error report wasn't very informative. Some indication of
what the cluster is doing or the amount of time required to hit this
condition would be interesting, because my cluster isn't doing this for
me at the moment.

There are several ways do_fork() can return EAGAIN. The most likely of
them would be associated with large numbers of processes. Does kdb's
"ps A" show large numbers of, perhaps unreaped, processes?

What do print_icscli and print_icssvr show going on with ICS? There
have been cases in the past where ICS gets into a loop going back and
forth between two nodes until things blow up.

John
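
For illustration only, the retry behaviour described in the quoted analysis can be sketched in user space. This is not the OpenSSI code: fake_kernel_thread(), failures_left and spawn_with_bounded_retry() are hypothetical names, and the attempt limit is an assumption added here so the loop cannot spin forever when every spawn returns -EAGAIN. The warning is printed only when the error value changes, as in the quoted printk() fragment.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-in for kernel_thread(): fails with -EAGAIN for the
 * next `failures_left` calls, then returns a made-up positive pid. */
static int failures_left;

static int fake_kernel_thread(void)
{
    if (failures_left > 0) {
        failures_left--;
        return -EAGAIN;
    }
    return 1234; /* made-up pid, for illustration */
}

/* Try to spawn one daemon, giving up after max_attempts failures.
 * A warning is emitted only when the error differs from the previous
 * one, and *warnings counts how many were emitted.
 * Returns 0 on success or the last negative errno on failure. */
static int spawn_with_bounded_retry(int max_attempts, int *warnings)
{
    int preverror = 0;
    int error = 0;

    assert(max_attempts > 0);
    *warnings = 0;

    for (int attempt = 0; attempt < max_attempts; attempt++) {
        int pid = fake_kernel_thread();

        error = (pid < 0) ? pid : 0;
        if (error == 0)
            return 0; /* spawn succeeded */

        if (error != preverror) {
            fprintf(stderr,
                    "spawn failed: error=%d, will be retried\n", error);
            (*warnings)++;
            preverror = error;
        }
    }
    return error; /* still failing after max_attempts tries */
}
```

With failures_left set to 3 and a limit of 10 attempts, the call succeeds after a single warning. With more pending failures than allowed attempts, it returns -EAGAIN instead of looping indefinitely.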