From: Thomas C. <cal...@gm...> - 2015-06-01 16:10:46
OK, I am investigating further. I have some more hints that it might be
related to the Linux PID namespace implementation.
I am trying to reproduce the issue outside OCaml/Ocamlnet.

I will keep you posted.

Thanks

On Mon, Jun 1, 2015 at 6:05 PM, Gerd Stolpmann <in...@ge...> wrote:

> Just a guess: There is a Unix.getpid call in Netplex_mp. This call
> returns the PID in the new PID namespace, and this PID is different
> from the PID returned by fork() (look into the sources of Netplex_mp,
> where both are done). I do not remember what the PID is used for, but
> it is probably a key in a management data structure. Then (and this is
> the unverified part of my guess), some lookup fails that normally
> cannot fail, and the controller gets confused.
>
> Note that you do not see the getpid() calls in the strace because it is
> not a real syscall (afaik the kernel just writes the PID into some
> memory location after fork/clone, where glibc expects it).
>
> I don't know whether this is really the problem, but if so, the fix is
> probably not trivial. The controller would have to tell the container
> via the control socket what the PID from the view of the controller is;
> or via the pipe that is used inside Netplex_mp for synchronization.
>
> The restart_syscall thing you observed is just a poll waiting for an
> event. strace just doesn't print it cleanly.
>
> Gerd
>
> On Monday, 01.06.2015, at 13:03 +0200, Thomas Calderon wrote:
> > Hello Gerd,
> >
> > I do not think I reach the limit of the maximum number of processes
> > since I have at most 3 defunct processes.
> > I would also be likely to see some other message indicating I reached
> > this limit (GrSecurity would leave a trace).
> >
> > When attaching to stalled instances, the controller and worker
> > instances (except one) are blocked on:
> > restart_syscall(<... resuming interrupted call ...>
> >
> > As mentioned, one of the worker processes is blocked on:
> > futex(0x...., FUTEX_WAIT_PRIVATE, 2, NULL
> >
> > You will find the strace -f as an attachment.
> >
> > Cheers,
> >
> > Thomas
> >
> > On Mon, Jun 1, 2015 at 12:13 PM, Gerd Stolpmann
> > <in...@ge...> wrote:
> >     On Monday, 01.06.2015, at 11:06 +0200, Thomas Calderon wrote:
> >     > Hi,
> >     >
> >     > We are observing an issue when using OCamlnet netplex in
> >     > combination with VServer PID namespaces.
> >     > We are using Netplex in the multi-process mode.
> >     >
> >     > Here is what we are doing:
> >     > - start our netplex controller
> >     > - use the post_add_hook to enter a new PID namespace
> >     > - use dynamic workload manager to spawn child workers
> >     > - configured with conn_limit=1
> >     >
> >     > - launch a loop of client connections
> >     > - this spawns a new worker process for each connection
> >     >
> >     > After several successful connections of the loop, clients
> >     > cannot connect anymore.
> >     > We observe some worker processes in a defunct/zombie state.
> >     > The controller and running worker processes seem deadlocked
> >     > in some condition.
> >     >
> >     > When we do not use the post_add_hook to enter a new PID
> >     > namespace, the problem cannot be triggered anymore.
> >     >
> >     > Do you have any hint on this?
> >
> >     The controller of course runs waitpid() on the terminated
> >     processes to un-zombie these, and obviously this does not work.
> >     I guess you then reach the maximum number of processes after
> >     some time.
> >
> >     You say "vServer" but there are several such technologies
> >     (Linux containers, Virtuozzo, maybe some derived products). I
> >     also don't know much about this corner of the OS.
> >
> >     What would definitely help is an strace -f of the server.
> >
> >     Gerd
> >
> >     > Many thanks.
> >     >
> >     > Thomas
> >
> > --
> > ------------------------------------------------------------
> > Gerd Stolpmann, Darmstadt, Germany ge...@ge...
> > My OCaml site: http://www.camlcity.org
> > Contact details: http://www.camlcity.org/contact.html
> > Company homepage: http://www.gerd-stolpmann.de
> > ------------------------------------------------------------
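
The guess in the quoted reply (a table keyed by the PID from fork(), a worker that sees a different value from Unix.getpid, and a possible fix where the controller tells the worker its PID over a pipe) can be sketched with the plain Unix library. This is only an illustration under assumptions: the `workers` table, `send`, and `spawn` are hypothetical names and not Netplex_mp's actual code, and the program does not enter a PID namespace itself, so in an ordinary run all three values agree. Inside a new PID namespace the getpid() value would be expected to differ while the PID sent over the pipe would still match the table.

```ocaml
(* Build with the unix library, e.g.:
   ocamlfind ocamlopt -package unix -linkpkg pid_view.ml -o pid_view
   Uses Unix._exit, which needs OCaml >= 4.12. *)

(* Hypothetical controller-side table: worker name keyed by the PID that
   fork() returned to the controller. Not Netplex's real data structure. *)
let workers : (int, string) Hashtbl.t = Hashtbl.create 16

(* Write a short string to a descriptor; messages this small fit in a
   single write on a pipe. *)
let send fd s = ignore (Unix.write_substring fd s 0 (String.length s))

(* Fork one worker and return
   (pid returned by fork, worker's own getpid, pid the worker was told). *)
let spawn name =
  let (to_worker_rd, to_worker_wr) = Unix.pipe () in
  let (to_ctrl_rd, to_ctrl_wr) = Unix.pipe () in
  match Unix.fork () with
  | 0 ->
      (* Worker: learn the controller's view of our PID from the pipe,
         then report it next to what getpid() says. *)
      Unix.close to_worker_wr;
      Unix.close to_ctrl_rd;
      let ic = Unix.in_channel_of_descr to_worker_rd in
      let told = input_line ic in
      send to_ctrl_wr (Printf.sprintf "%d %s\n" (Unix.getpid ()) told);
      (* Leave without running at_exit handlers or flushing inherited buffers. *)
      Unix._exit 0
  | pid ->
      (* Controller: register the worker under the fork() PID and send
         that same PID to the worker. *)
      Unix.close to_worker_rd;
      Unix.close to_ctrl_wr;
      Hashtbl.replace workers pid name;
      send to_worker_wr (string_of_int pid ^ "\n");
      Unix.close to_worker_wr;
      let ic = Unix.in_channel_of_descr to_ctrl_rd in
      let line = input_line ic in
      close_in ic;
      (* Reap the worker so it does not stay defunct. *)
      ignore (Unix.waitpid [] pid);
      Scanf.sscanf line "%d %d" (fun self told -> (pid, self, told))

let () =
  let (forked, self, told) = spawn "worker-1" in
  Printf.printf "fork()=%d getpid()=%d told=%d\n" forked self told;
  Printf.printf "lookup by getpid(): %b, lookup by told PID: %b\n"
    (Hashtbl.mem workers self) (Hashtbl.mem workers told)
```

If the lookup by getpid() printed false while the lookup by the told PID printed true, that would be consistent with the guess above; it would not by itself prove that this is what confuses the controller.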