From: Gerd S. <in...@ge...> - 2015-06-01 16:05:49
Just a guess: There is a Unix.getpid call in Netplex_mp. This call returns the PID in the new PID space, and this PID is different from the PID returned by fork() (look into the sources of Netplex_mp, where both are done). I do not remember what the PID is used for, but it is probably a key in a management data structure. Then (and this is the unverified part of my guess), some lookup fails that normally cannot fail, and the controller gets confused.

Note that you do not see the getpid() calls in the strace because no real syscall is made for it (afaik the kernel just writes the PID into some memory location after fork/clone, where glibc expects it).

I don't know whether this is really the problem, but if so, the fix is probably not trivial. The controller would have to tell the container what the PID is from the controller's point of view, either via the control socket or via the pipe that is used inside Netplex_mp for synchronization.

The restart_syscall thing you observed is just a poll waiting for an event. strace just doesn't print it cleanly.

Gerd

On Monday, 01.06.2015, at 13:03 +0200, Thomas Calderon wrote:
> Hello Gerd,
>
> I do not think I reach the limit of the maximum number of processes
> since I have at most 3 defunct processes.
> I would also be likely to see some other message indicating I reached
> this limit (GrSecurity would leave a trace).
>
> When attaching to stalled instances, the controller and worker
> instances (except one) are blocked on:
> restart_syscall(<... resuming interrupted call ...>
>
> As mentioned, one of the worker processes is blocked on:
> futex(0x...., FUTEX_WAIT_PRIVATE, 2, NULL
>
> You will find the strace -f as an attachment.
>
> Cheers,
>
> Thomas
>
> On Mon, Jun 1, 2015 at 12:13 PM, Gerd Stolpmann
> <in...@ge...> wrote:
>         On Monday, 01.06.2015, at 11:06 +0200, Thomas Calderon wrote:
>         > Hi,
>         >
>         > We are observing an issue when using OCamlnet netplex in
>         > combination with VServer PID namespaces.
>         > We are using Netplex in the multi-process mode.
>         >
>         > Here is what we are doing:
>         > - start our netplex controller
>         > - use the post_add_hook to enter a new PID namespace
>         > - use dynamic workload manager to spawn child workers
>         > - configured with conn_limit=1
>         >
>         > - launch a loop of client connections
>         > - this spawns a new worker process for each connection
>         >
>         > After several successful connections of the loop, clients
>         > cannot connect anymore.
>         > We observe some worker processes in a defunct/zombie state.
>         > The controller and running worker processes seem deadlocked
>         > in some condition.
>         >
>         > When we do not use the post_add_hook to enter a new PID
>         > namespace, the problem cannot be triggered anymore.
>         >
>         > Do you have any hint on this?
>
>         The controller runs of course waitpid() on the terminated
>         processes to un-zombie these, and obviously this does not
>         work. I guess you reach then the maximum number of processes
>         after some time.
>
>         You say "vServer" but there are several such technologies
>         (Linux containers, Virtuozzo, maybe some derived products).
>         I also don't know much about this corner of the OS.
>
>         What would definitely help is an strace -f of the server.
>
>         Gerd
>
>         > Many thanks.
>         >
>         > Thomas
>
>         --
>         ------------------------------------------------------------
>         Gerd Stolpmann, Darmstadt, Germany    ge...@ge...
>         My OCaml site:          http://www.camlcity.org
>         Contact details:        http://www.camlcity.org/contact.html
>         Company homepage:       http://www.gerd-stolpmann.de
>         ------------------------------------------------------------

--
------------------------------------------------------------
Gerd Stolpmann, Darmstadt, Germany    ge...@ge...
My OCaml site:          http://www.camlcity.org
Contact details:        http://www.camlcity.org/contact.html
Company homepage:       http://www.gerd-stolpmann.de
------------------------------------------------------------
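
A minimal sketch of the idea in the reply at the top of this message, where the controller tells the child over a pipe which PID fork() returned. This is not code from Netplex_mp: the function name fork_and_compare and the line-based pipe protocol are made up for illustration, and the sketch does not enter a PID namespace itself. It only shows where the two PID values come from and how the controller's view could be handed to the child.

```ocaml
(* Hypothetical sketch, not Netplex_mp code. Needs the unix library, e.g.
   ocamlfind ocamlopt -package unix -linkpkg pid_sketch.ml -o pid_sketch
   Unix._exit requires OCaml 4.12 or later. *)

(* Fork a child and tell it, over a pipe, the PID that fork() returned in
   the parent. The child reports back the PID it was told together with its
   own Unix.getpid ().
   Returns (pid_from_fork, pid_told_to_child, child_getpid). *)
let fork_and_compare () =
  let to_child_r, to_child_w = Unix.pipe () in
  let to_parent_r, to_parent_w = Unix.pipe () in
  match Unix.fork () with
  | 0 ->
      (* Child ("container"): keep only the pipe ends it needs. *)
      Unix.close to_child_w;
      Unix.close to_parent_r;
      let ic = Unix.in_channel_of_descr to_child_r in
      let oc = Unix.out_channel_of_descr to_parent_w in
      (* The PID as the parent sees it, received instead of guessed. *)
      let told = int_of_string (input_line ic) in
      Printf.fprintf oc "%d %d\n" told (Unix.getpid ());
      close_out oc;
      (* _exit avoids flushing buffers inherited from the parent. *)
      Unix._exit 0
  | child_pid ->
      (* Parent ("controller"): child_pid is the value fork() returned. *)
      Unix.close to_child_r;
      Unix.close to_parent_w;
      let oc = Unix.out_channel_of_descr to_child_w in
      let ic = Unix.in_channel_of_descr to_parent_r in
      Printf.fprintf oc "%d\n" child_pid;
      close_out oc;
      let line = input_line ic in
      close_in ic;
      ignore (Unix.waitpid [] child_pid);
      Scanf.sscanf line "%d %d" (fun told own -> (child_pid, told, own))

let () =
  let from_fork, told, own = fork_and_compare () in
  Printf.printf
    "fork() in parent: %d, told to child: %d, getpid() in child: %d\n"
    from_fork told own
```

Run without any PID namespace involved, all three numbers are the same. If the guess above is right, the last one would differ in the setup from this thread, and the value received over the pipe is what the child would have to use wherever the controller keys on the PID.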