From: Gordan B. <go...@bo...> - 2009-12-28 01:02:39
Gordan Bobic wrote:
> Marc Grimme wrote:
>
>>> And just to confirm, I used the same binary on another machine
>>> (standalone, no OSR or clustering), and it works exactly as expected
>>> (prints out what processes it is killing). That means that whatever
>>> causes killall5 to go away and never return is specific to glfs+OSR
>>> (since killall5 works fine on my gfs+OSR clusters). I'm not sure where
>>> to even begin debugging this, though, so any ideas would be welcome.
>
>> You might want to try to start it with strace. I recall something that
>> under some environments the browsing through /proc which is done by
>> killall5 freezes. And I think this is done before killing. Somehow what
>> does not work is a stat call on some /proc files within /proc/<pid>. I
>> don't recall exactly but I have something like this in mind.
>>
>> If you have found the pid that causes the problem perhaps we get some
>> new ideas on how to handle this behaviour.
>
> OK, I have straced killall5, and the last few things it does are to
> stat /proc/version (twice, it seems) and set up the SIGTERM, SIGSTOP
> and SIGKILL signals. This appears to correspond to lines 682-692 in
> killall5.c:
>
> mount_proc();
> ...
> signal(SIGTERM, SIG_IGN);
> signal(SIGSTOP, SIG_IGN);
> signal(SIGKILL, SIG_IGN);
>
> The last thing strace reports is:
>
> kill(4294967295,SIGSTOP
>
> (note - no closing bracket)
>
> which seems to correspond to line 695:
>
> if (TEST == 0) kill(-1, SIGSTOP);
>
> Reading what "man 2 kill" says:
> POSIX.1-2001 requires that kill(-1,sig) send sig to all processes that
> the current process may send signals to, except possibly for some
> implementation-defined system processes.
>
> I have a suspicion that this may well be the cause of the problems.
> killall5 doesn't iterate through all the processes to kill! According
> to this, "kill(-1, <signal>)" sends the signal to all the processes
> that we have permission to signal, without explicitly specifying the
> processes! Since killall5 is running as root at this point, this means
> all processes, with the possible exception of "some
> implementation-defined system processes". Right now my bet would be on
> this hitting glusterfsd (which is in fact running in userspace, and
> thus is extremely unlikely to be exempt).
>
> This brings up another issue - it sounds like the -x option may be
> ineffective, too, even on the normal GFS related processes. If the
> signals get sent to all processes, then this would include the
> processes specified by -x, regardless. This leads me to suspect that
> unless these processes are explicitly excluded in the kernel
> implementation, they are not spared the killing at this stage. Looking
> at the ps output - fenced, groupd, aisexec and ccsd, for example, don't
> show up in square brackets, which implies they aren't running in kernel
> space (although that isn't really definitive, only indicative, AFAIK).
> So, they may be affected by the bug, too - but this may not be obvious
> because once they die, the node will get fenced by the other nodes,
> which will end up doing something similar. Or maybe these processes
> simply catch and ignore the signals if they are being used (e.g. if gfs
> is mounted), or something like that. Anyway, that is just a hypothesis
> at this point, but it's probably worth checking if you have a suitable
> test environment handy (I don't have a non-production gfs cluster handy
> at the moment).
>
> Anyway, I'm going to comment out line 695 and see how that goes. In
> theory, this seems superfluous anyway, since the iteration through
> /proc for processes to kill should catch everything anyway, and in
> fact, it is this iteration that -x relies on for its functionality!
> Otherwise kill(-1) will just blow everything away and preempt anything
> -x might do in the first place!
>
> Am I missing something obvious here? Is there a flaw in my analysis?

Sorry, small amendment - line 695 only sends SIGSTOP. Since killall5
resumes the processes afterwards, this may not break every process,
e.g. those required by gfs. But if it sends a stop to glusterfsd, it's
almost certain that rootfs will in fact block, so it is definitely an
issue for that. Since SIGSTOP cannot be caught or ignored by the
process itself, killall5 will have to be explicitly modified to do this
differently, e.g. using a double pass through /proc, specifically
without including glusterfsd in the list of processes to signal.

Gordan
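
To make the double-pass idea a bit more concrete, here is a rough sketch of what I have in mind. This is not the actual killall5.c code and not a patch against it; the function names (parse_comm, should_signal, collect_targets, signal_targets) are made up for illustration. It assumes a Linux /proc where the command name sits in parentheses in /proc/<pid>/stat (truncated by the kernel to 15 characters, which is enough for "glusterfsd"). Pass 1 builds the PID list without signalling anything; pass 2 signals each PID individually instead of using kill(-1, ...).

```c
#include <dirent.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: copy the command name out of a /proc/<pid>/stat
 * line, i.e. the text between the first '(' and the last ')'.
 * Returns 0 on success, -1 if the line has no such pair. */
int parse_comm(const char *stat_line, char *out, size_t outsz)
{
    const char *open = strchr(stat_line, '(');
    const char *close = strrchr(stat_line, ')');
    size_t len;

    if (!open || !close || close < open || outsz == 0)
        return -1;
    len = (size_t)(close - open - 1);
    if (len >= outsz)
        len = outsz - 1;            /* truncate rather than overflow */
    memcpy(out, open + 1, len);
    out[len] = '\0';
    return 0;
}

/* Hypothetical policy: never signal ourselves, init (PID 1), or the
 * excluded command (e.g. "glusterfsd"). Returns 1 to signal, 0 to skip. */
int should_signal(pid_t pid, pid_t self, const char *comm,
                  const char *exclude)
{
    if (pid <= 1 || pid == self)
        return 0;
    return strcmp(comm, exclude) != 0;
}

/* Pass 1: walk /proc and collect the PIDs that pass should_signal().
 * Nothing is signalled here. Returns the number of PIDs stored. */
size_t collect_targets(const char *exclude, pid_t *pids, size_t max)
{
    DIR *proc = opendir("/proc");
    struct dirent *de;
    pid_t self = getpid();
    size_t n = 0;

    if (!proc)
        return 0;
    while (n < max && (de = readdir(proc)) != NULL) {
        char path[64], line[512], comm[64];
        char *end;
        long pid = strtol(de->d_name, &end, 10);
        FILE *fp;

        if (*end != '\0' || pid <= 0)
            continue;               /* not a numeric PID directory */
        snprintf(path, sizeof path, "/proc/%ld/stat", pid);
        if ((fp = fopen(path, "r")) == NULL)
            continue;               /* process went away meanwhile */
        if (fgets(line, sizeof line, fp) != NULL
            && parse_comm(line, comm, sizeof comm) == 0
            && should_signal((pid_t)pid, self, comm, exclude))
            pids[n++] = (pid_t)pid;
        fclose(fp);
    }
    closedir(proc);
    return n;
}

/* Pass 2: signal each collected PID individually, so the excluded
 * process is never touched (unlike kill(-1, sig)). */
void signal_targets(const pid_t *pids, size_t n, int sig)
{
    size_t i;

    for (i = 0; i < n; i++)
        kill(pids[i], sig);
}
```

The intended use would be something like collect_targets("glusterfsd", pids, max) followed by signal_targets(pids, n, SIGSTOP), then the usual kill loop, then signal_targets(pids, n, SIGCONT). Matching on the command name alone is crude (a real fix would presumably hook into whatever list -x already builds), and kernel threads are not filtered out here, so treat it as a sketch of the approach only.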