Re: [SSI-devel] cluster hang with top of the kernel tree

SAMPATHKUMAR KISHORE KANIYAR wrote:
>  > With the top of the tree I am hitting the below condition with error
>  > -11 (EAGAIN).  The cluster hangs after that.
>  > 
>  > file: ics/ics_svr_mgmt.c  
>  > 
>  > 876                         if (error && error != preverror) {
>  > 877                                 printk(KERN_WARNING
>  > 878                                        "icssvr_nanny: spawn_daemon_proc"
>  > 879                                        "error=%d, will be retried",
>  > 880                                        error);
>  > 881                                 preverror = error;
>  > 882                         }
>  
>  I was looking through the related code.
>  
>  spawn_daemon_proc() invokes kernel_thread() which invokes __do_fork().
>  In __do_fork(), GET_PID() is failing (with a value less than zero)
>  which results in "-EAGAIN" being returned by __do_fork().
>  
>  In the current implementation, spawn_daemon_proc() will loop forever
>  as long as it still needs to create more ICS daemons. In the current
>  scenario, this seems to be the case due to the kernel_thread() failure.
>  
>  I could not find a reason why GET_PID() would fail and return a value
>  less than zero! Probably, someone will be able to answer this.
>  
>  NOTE: I hope CONFIG_VPROC etc are all enabled suitably while configuring
>        the kernel and that this is not a problem due to any oversight
>        related to configuring. Just checking.
>  
>  - Kishore
> 
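
For readers following along, the retry behaviour Kishore describes can be
sketched in user space roughly as below. This is only an illustration, not
the code in ics/ics_svr_mgmt.c: fake_spawn_daemon(), nanny_spawn(),
failures_left and the max_attempts cap are all hypothetical names invented
here (the real loop, as described above, has no such cap). The sketch keeps
the "warn only when the error changes" logic from the quoted excerpt, which
would explain why only one warning shows up before the hang.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-in for spawn_daemon_proc(): fails with -EAGAIN
 * for the next `failures_left` calls, then succeeds. */
static int failures_left;

static int fake_spawn_daemon(void)
{
    if (failures_left > 0) {
        failures_left--;
        return -EAGAIN;
    }
    return 0;
}

/* Try to start `wanted` daemons, retrying on failure.
 * Unlike the loop described above, this sketch gives up after
 * `max_attempts` spawn calls so that it always terminates.
 * Returns the number of daemons started; *warnings receives the
 * number of warnings that were printed. */
static int nanny_spawn(int wanted, int max_attempts, int *warnings)
{
    int started = 0;
    int attempts = 0;
    int preverror = 0;

    assert(wanted >= 0 && max_attempts >= 0 && warnings != NULL);
    *warnings = 0;

    while (started < wanted && attempts < max_attempts) {
        int error = fake_spawn_daemon();

        attempts++;
        /* Same idea as the quoted excerpt: log only a new error value. */
        if (error && error != preverror) {
            fprintf(stderr, "nanny: spawn error=%d, will be retried\n",
                    error);
            (*warnings)++;
            preverror = error;
        }
        if (!error)
            started++;
    }
    return started;
}
```

With failures_left set to a large value, nanny_spawn() makes no progress and
prints a single warning, which matches the symptom in the report; without
the attempt cap it would spin indefinitely.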

The initial error report wasn't very informative. Some indication of 
what the cluster is doing or the amount of time required to hit this 
condition would be interesting, because my cluster isn't doing this for 
me at the moment.

There are several ways do_fork() can return EAGAIN. The most likely of 
them would be associated with large numbers of processes. Does kdb's "ps 
A" show large numbers of processes, perhaps unreaped ones? What do 
print_icscli and print_icssvr show going on with ICS? There have been 
cases in the past where ICS gets into a loop going back and forth 
between two nodes until things blow up.
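
If a node is still responsive enough to run a user-space program, a rough
count of unreaped (zombie) processes can also be taken from /proc instead of
kdb. The sketch below assumes a standard Linux /proc/<pid>/stat layout, where
the state letter follows the parenthesised command name; stat_state() and
count_zombies() are hypothetical helper names, and whether /proc on an SSI
node shows cluster-wide or node-local processes is not something this sketch
addresses.

```c
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Return the state letter from one /proc/<pid>/stat line, or 0 if the
 * line cannot be parsed. The command name may itself contain spaces
 * and parentheses, so search for the LAST ')' in the line. */
static char stat_state(const char *line)
{
    const char *p = strrchr(line, ')');

    if (p == NULL || p[1] != ' ' || p[2] == '\0')
        return 0;
    return p[2];
}

/* Count processes whose state is 'Z' (zombie, i.e. exited but not yet
 * reaped). Returns -1 if /proc cannot be opened. */
static int count_zombies(void)
{
    DIR *d = opendir("/proc");
    struct dirent *e;
    int zombies = 0;

    if (d == NULL)
        return -1;

    while ((e = readdir(d)) != NULL) {
        char path[288];
        char line[512];
        FILE *f;

        /* Only numeric directory names are process entries. */
        if (!isdigit((unsigned char)e->d_name[0]))
            continue;

        snprintf(path, sizeof path, "/proc/%s/stat", e->d_name);
        f = fopen(path, "r");
        if (f == NULL)
            continue;   /* process may have exited in the meantime */
        if (fgets(line, sizeof line, f) != NULL && stat_state(line) == 'Z')
            zombies++;
        fclose(f);
    }
    closedir(d);
    return zombies;
}
```

A count that keeps growing between runs would point toward the "large
numbers of processes" explanation for the EAGAIN.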

John