From: Thomas C. <cal...@gm...> - 2015-06-02 12:23:49
Hi,

After further investigation, I can reproduce it on a standard Linux kernel. Here is the gdb backtrace when the controller is stuck (it seems fork() fails at a low level):

(gdb) bt
#0  __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
#1  0x00007f8e37e91eeb in _L_lock_13840 () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007f8e37e8ffb8 in __GI___libc_realloc (oldmem=0x15b2260, bytes=574) at malloc.c:3025
#3  0x00007f8e37e7f2db in _IO_vasprintf (result_ptr=0x7fff5a19b2f0, format=<optimized out>, args=args@entry=0x7fff5a19b1c8) at vasprintf.c:84
#4  0x00007f8e37e61657 in ___asprintf (string_ptr=string_ptr@entry=0x7fff5a19b2f0, format=format@entry=0x7f8e37f8d830 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n") at asprintf.c:35
#5  0x00007f8e37e3cae2 in __assert_fail_base (fmt=0x7f8e37f8d830 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x7f8e37f90a38 "({ __typeof (self->tid) __value; if (sizeof (__value) == 1) asm volatile (\"movb %%fs:%P2,%b0\" : \"=q\" (__value) : \"0\" (0), \"i\" (__builtin_offsetof (struct pthread, tid))); else if (sizeof (__value) == "..., file=file@entry=0x7f8e37f90a00 "../nptl/sysdeps/unix/sysv/linux/x86_64/../fork.c", line=line@entry=141, function=function@entry=0x7f8e37f8b38d <__PRETTY_FUNCTION__.11207> "__libc_fork") at assert.c:57
#6  0x00007f8e37e3cc32 in __GI___assert_fail (assertion=0x7f8e37f90a38 "({ __typeof (self->tid) __value; if (sizeof (__value) == 1) asm volatile (\"movb %%fs:%P2,%b0\" : \"=q\" (__value) : \"0\" (0), \"i\" (__builtin_offsetof (struct pthread, tid))); else if (sizeof (__value) == "..., file=0x7f8e37f90a00 "../nptl/sysdeps/unix/sysv/linux/x86_64/../fork.c", line=141, function=0x7f8e37f8b38d <__PRETTY_FUNCTION__.11207> "__libc_fork") at assert.c:101
#7  0x00007f8e37ece252 in __libc_fork () at ../nptl/sysdeps/unix/sysv/linux/x86_64/../fork.c:141
#8  0x000000000062b7a6 in unix_fork ()
#9  0x00000000004b88b4 in camlNetplex_mp__fun_1577 () at netplex_mp.ml:80
#10 0x00000000004d54ef in camlNetplex_controller__fun_3861 () at netplex_controller.ml:359
#11 0x00000000004d5dd8 in camlNetplex_controller__fun_3830 () at netplex_controller.ml:265
#12 0x00000000004c8355 in camlNetplex_workload__fun_2015 () at netplex_workload.ml:332
#13 0x00000000004c8b06 in camlNetplex_workload__fun_1982 () at netplex_workload.ml:230
#14 0x00000000004d445e in camlNetplex_controller__fun_4158 () at netplex_controller.ml:665
#15 0x00000000004d339d in camlNetplex_controller__fun_4275 () at netplex_controller.ml:896
#16 0x000000000050fbdc in camlRpc_server__protect_1582 () at rpc_server.ml:504
#17 0x000000000050fbdc in camlRpc_server__protect_1582 () at rpc_server.ml:504
#18 0x00000000005144fa in camlRpc_server__handle_incoming_message_1710 () at rpc_server.ml:889
#19 0x00000000005392ba in camlUq_multiplex__anyway_1042 () at uq_multiplex.ml:20
#20 0x0000000000531abe in camlUq_multiplex__fun_3594 () at uq_multiplex.ml:464
#21 0x000000000051b33e in camlUnixqueue_pollset__forward_event_to_1567 () at unixqueue_pollset.ml:768
#22 0x0000000000517f88 in camlEqueue__fun_1262 () at equeue.ml:166
#23 0x00000000005eaea9 in camlQueue__iter_1048 () at queue.ml:135
#24 0x0000000000518a02 in camlEqueue__run_1070 () at equeue.ml:159
#25 0x000000000051cdc9 in camlUnixqueue_pollset__fun_3391 () at unixqueue_pollset.ml:999
#26 0x00000000004e54a4 in camlNetplex_main__run_controller_1077 () at netplex_main.ml:130
#27 0x00000000004e4348 in camlNetplex_main__fun_1295 () at netplex_main.ml:312
#28 0x00000000004e574d in camlNetplex_main__redirect_logger_1094 () at netplex_main.ml:187
#29 0x00000000004e4a20 in camlNetplex_main__fun_1287 () at netplex_main.ml:294
#30 0x000000000043c2f4 in camlServer__entry ()
#31 0x00000000004063e9 in caml_program ()
#32 0x0000000000642934 in caml_start_program ()
#33 0x0000000000630a0a in caml_main ()
#34 0x0000000000630a4c in main ()

Reading line 141 of glibc's fork.c, the failing assert appears to be:

    assert (THREAD_GETMEM (self, tid) != ppid);

It is triggered when the PID of a worker process inside the namespace reaches the PID that the forking ancestor has on the host.
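For reference, the standalone test I am playing with looks roughly like the sketch below. This is my own reconstruction, not verified code: it needs root for unshare(CLONE_NEWPID), and whether the assert really fires at the PID collision is exactly what I am trying to confirm.

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    /* glibc caches this PID in the thread descriptor of the process. */
    pid_t host_pid = getpid();
    fprintf(stderr, "host pid: %ld\n", (long) host_pid);

    /* From here on, children are created in a fresh PID namespace,
       while this process keeps its host PID. Needs CAP_SYS_ADMIN. */
    if (unshare(CLONE_NEWPID) != 0) {
        perror("unshare(CLONE_NEWPID)");
        return 1;
    }

    /* The first child becomes PID 1 of the new namespace; keep it
       alive, because the namespace refuses new processes once its
       init exits. */
    pid_t ns_init = fork();
    if (ns_init == 0) {
        pause();
        _exit(0);
    }

    /* Subsequent children get namespace PIDs 2, 3, 4, ...  When one
       of them reaches host_pid, glibc's fork-time assert
       (THREAD_GETMEM (self, tid) != ppid) should fail in that child,
       as in the backtrace above. */
    for (;;) {
        pid_t child = fork();
        if (child == 0)
            _exit(0);
        if (child < 0) {
            perror("fork");
            break;
        }
        waitpid(child, NULL, 0);
    }

    kill(ns_init, SIGKILL);
    return 0;
}

If the assert does fire, the affected child never exits and the parent blocks in waitpid() forever, which would match the futex(..., FUTEX_WAIT_PRIVATE, ...) hang seen in the strace.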
Cheers,

Thomas

On Mon, Jun 1, 2015 at 6:10 PM, Thomas Calderon <cal...@gm...> wrote:
> OK,
>
> I am investigating further; I have some more hints that it might be
> related to the Linux PID namespace implementation.
> I am trying to reproduce the issue outside OCaml/Ocamlnet.
>
> I will keep you posted.
>
> Thanks
>
> On Mon, Jun 1, 2015 at 6:05 PM, Gerd Stolpmann <in...@ge...> wrote:
>
>> Just a guess: There is a Unix.getpid call in Netplex_mp. This call
>> returns the PID in the new PID space, and this PID is different from
>> the PID returned by fork() (look into the sources of Netplex_mp,
>> where both are done). I do not remember what the PID is used for,
>> but it is probably a key in a management data structure. Then (and
>> this is the unverified part of my guess), some lookup fails that
>> normally cannot fail, and the controller gets confused.
>>
>> Note that you do not see the getpid() calls in the strace because it
>> is not a real syscall (afaik the kernel just writes the PID into
>> some memory location after fork/clone, where glibc expects it).
>>
>> I don't know whether this is really the problem, but if so, the fix
>> is probably not trivial. The controller would have to tell the
>> container what the PID is from the view of the controller, either
>> via the control socket or via the pipe that is used inside
>> Netplex_mp for synchronization.
>>
>> The restart_syscall thing you observed is just a poll waiting for an
>> event. strace just doesn't print it cleanly.
>>
>> Gerd
>>
>> On Monday, 2015-06-01 at 13:03 +0200, Thomas Calderon wrote:
>> > Hello Gerd,
>> >
>> > I do not think I reached the limit on the maximum number of
>> > processes, since I have at most 3 defunct processes.
>> > I would also be likely to see some other message indicating I
>> > reached this limit (GrSecurity would leave a trace).
>> >
>> > When attaching to stalled instances, the controller and worker
>> > instances (except one) are blocked on:
>> > restart_syscall(<... resuming interrupted call ...>
>> >
>> > As mentioned, one of the worker processes is blocked on:
>> > futex(0x...., FUTEX_WAIT_PRIVATE, 2, NULL
>> >
>> > You will find the strace -f as an attachment.
>> >
>> > Cheers,
>> >
>> > Thomas
>> >
>> > On Mon, Jun 1, 2015 at 12:13 PM, Gerd Stolpmann
>> > <in...@ge...> wrote:
>> > On Monday, 2015-06-01 at 11:06 +0200, Thomas Calderon wrote:
>> > > Hi,
>> > >
>> > > We are observing an issue when using OCamlnet netplex in
>> > > combination with VServer PID namespaces.
>> > > We are using Netplex in the multi-process mode.
>> > >
>> > > Here is what we are doing:
>> > > - start our netplex controller
>> > > - use the post_add_hook to enter a new PID namespace
>> > > - use the dynamic workload manager to spawn child workers,
>> > >   configured with conn_limit=1
>> > > - launch a loop of client connections; this spawns a new worker
>> > >   process for each connection
>> > >
>> > > After several successful connections of the loop, clients cannot
>> > > connect anymore.
>> > > We observe some worker processes in a defunct/zombie state.
>> > > The controller and the running worker processes seem to be
>> > > deadlocked.
>> > >
>> > > When we do not use the post_add_hook to enter a new PID
>> > > namespace, the problem cannot be triggered anymore.
>> > >
>> > > Do you have any hint on this?
>> >
>> > The controller of course runs waitpid() on the terminated
>> > processes to un-zombie them, and obviously this does not work. I
>> > guess you then reach the maximum number of processes after some
>> > time.
>> >
>> > You say "vServer", but there are several such technologies (Linux
>> > containers, Virtuozzo, maybe some derived products). I also don't
>> > know much about this corner of the OS.
>> >
>> > What would definitely help is an strace -f of the server.
>> >
>> > Gerd
>> >
>> > > Many thanks.
>> > >
>> > > Thomas
>> >
>> > --
>> > ------------------------------------------------------------
>> > Gerd Stolpmann, Darmstadt, Germany ge...@ge...
>> > My OCaml site: http://www.camlcity.org
>> > Contact details: http://www.camlcity.org/contact.html
>> > Company homepage: http://www.gerd-stolpmann.de
>> > ------------------------------------------------------------
>>
>> --
>> ------------------------------------------------------------
>> Gerd Stolpmann, Darmstadt, Germany ge...@ge...
>> My OCaml site: http://www.camlcity.org
>> Contact details: http://www.camlcity.org/contact.html
>> Company homepage: http://www.gerd-stolpmann.de
>> ------------------------------------------------------------
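As a footnote to Gerd's guess in the quoted message above: the disagreement between the PID that fork() returns to the controller and the PID the child sees via Unix.getpid can be demonstrated in isolation with a few lines of C. This is a minimal hypothetical sketch, not code from the thread; it assumes root and a kernel with CLONE_NEWPID support.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    /* Children created after this call live in a new PID namespace. */
    if (unshare(CLONE_NEWPID) != 0) {
        perror("unshare(CLONE_NEWPID)");   /* needs CAP_SYS_ADMIN */
        return 1;
    }

    pid_t child = fork();
    if (child == 0) {
        /* The child is PID 1 of the new namespace: this prints 1. */
        printf("child:  getpid() = %ld\n", (long) getpid());
        return 0;
    }

    /* fork() reports the child's PID as seen from the parent's (host)
       namespace, so the two printed numbers differ. */
    printf("parent: fork() returned %ld\n", (long) child);
    waitpid(child, NULL, 0);
    return 0;
}

If Netplex_mp keys its management data structure by the fork() result while the container identifies itself with Unix.getpid, the lookup mismatch Gerd suspects would follow directly.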