From: Jiajun C. <ji...@cc...> - 2017-08-01 16:37:26
Hi Rob,

I know what's going on here after browsing some of the BLAS source code. As Rohan mentioned, thread 42 is behaving normally. What looks more interesting is the call stack of thread 1, the main thread, especially the bottom 5 frames.

It seems that the BLAS library registers a handler to be run at fork, using pthread_atfork():

https://github.com/xianyi/OpenBLAS/blob/e70a6b92bf9f69f113e198e3ded558ed0675b038/driver/others/memory.c

The handler tries to join a bunch of other BLAS threads, in this case including thread 41. The problem is that DMTCP has its own internal synchronization semantics. When a thread exits, it tries to grab a lock to prevent a checkpoint from happening at that moment: one internal critical section. On the other hand, when the application calls fork(), DMTCP also grabs that lock before fork() returns: another internal critical section.

That's why we see a deadlock here:

  thread 1:  has acquired the lock and is waiting for thread 41 to finish
  thread 41: is trying to acquire the lock before it exits
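The pattern reduces to a few lines of C. The sketch below is only illustrative (a plain mutex stands in for DMTCP's wrapper-execution lock, which is really more elaborate, and all names are made up):

    /* Sketch of the fork/atfork deadlock; build with: cc -pthread sketch.c
     * In the real case the BLAS worker already exists and merely happens
     * to be inside DMTCP's locked section when fork() runs; taking the
     * lock before creating the worker makes the race deterministic here. */
    #include <pthread.h>
    #include <unistd.h>

    static pthread_mutex_t wrapper_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_t worker;

    /* Stand-in for a BLAS worker thread: like thread 41 above, it must
     * pass through DMTCP's wrapper lock on its way in and out. */
    static void *blas_worker(void *arg) {
        (void)arg;
        pthread_mutex_lock(&wrapper_lock);    /* blocks: fork() holds it */
        pthread_mutex_unlock(&wrapper_lock);
        return NULL;
    }

    /* Stand-in for OpenBLAS's blas_thread_shutdown_(), registered with
     * pthread_atfork(): it joins the workers before fork() may proceed. */
    static void prepare(void) {
        pthread_join(worker, NULL);           /* never returns */
    }

    int main(void) {
        pthread_atfork(prepare, NULL, NULL);

        /* DMTCP's fork() wrapper grabs the lock, then calls the real
         * fork(), which runs the atfork prepare handler above. */
        pthread_mutex_lock(&wrapper_lock);
        pthread_create(&worker, NULL, blas_worker, NULL);
        fork();                               /* deadlocks here */
        pthread_mutex_unlock(&wrapper_lock);
        return 0;
    }

Each side holds what the other needs: fork() cannot return until the join finishes, and the join cannot finish while the lock is held.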
Best,
Jiajun

On Mon, Jul 31, 2017 at 10:15:15PM -0700, Rohan Garg wrote:
> Hi Rob,
>
> Thread 42 is DMTCP's checkpoint thread. DMTCP injects that into every
> user process launched under DMTCP. It seems to be simply waiting for
> a message from the coordinator, which is the normal state of things.
> It's not stuck, nor is it the cause of the deadlock here.
>
> The deadlock is definitely because of one or more user threads. It
> seems like the program also forks and execs some child processes.
> We have seen some corner cases around that, but it's difficult to
> say what it is with the information below.
>
> The fact that even `--disable-all-plugins` didn't work is quite
> interesting.
>
> I have a few other suggestions:
>
> - Is it possible to share perhaps the top 20-25 frames from each of
>   the threads? (You can also send them to us/me directly at:
>   dm...@cc.../ro...@cc...)
>
>   Also, after you save the backtraces from all the threads, detach
>   from gdb, wait for a few seconds, re-attach, and save a second
>   set of backtraces. It'll help us figure out which threads are
>   really stuck.
>
> - Is it possible to run with other Python versions (not 3.4) as a
>   quick test?
>
> - What Numpy/OpenBLAS versions are you using? I'll try to run
>   a few experiments locally.
>
> - Is it possible to cut your Python script down to the bare minimum,
>   to isolate the issue? Perhaps start with just a few imports and
>   then add code incrementally. But if it turns out to be more work
>   than it's worth, then we should skip this suggestion.
>
> - Finally, if it's possible, we could arrange a screen-sharing
>   session. I think it'll be more efficient for debugging this
>   interactively.
>
> -Rohan
>
> On Mon, Jul 31, 2017 at 03:30:55PM -0700, Rob Egan wrote:
> > Hi Rohan,
> >
> > Thanks for the quick reply. So with version 2.4.8 and the
> > --disable-alloc-plugin and --disable-dl-plugin options, the behavior
> > is the same. I also tried --disable-all-plugins and it still did not
> > work, so I think this may be different from the numpy bug you
> > identified, though as you will see, the stack trace does show one of
> > the threads in the numpy / openblas library.
> >
> > Though we are running at NERSC, I am not running this on the Cray
> > systems (yet) and not through slurm (yet).
> >
> > When I attach to the running process, unfortunately there are a ton
> > of threads and a deep Python stack. Apparently, even though I only
> > requested 1 thread, the Python library allocates a ton of them. I
> > tried to remove duplicates manually for you. It looks like threads 42
> > and 1 are the interesting ones; threads 2-41 are all the same. It
> > looks to me like dmtcp_coordinator is interfering with the blas
> > library's pthread_join calls.
> >
> > Thread 42 (Thread 0x7f266738e700 (LWP 14136)):
> > #0 0x00007f26690ed14d in read () from /lib/libpthread.so.0
> > #1 0x00007f2669804224 in _real_read (fd=<optimized out>, buf=<optimized out>, count=<optimized out>) at syscallsreal.c:504
> > #2 0x00007f26697eae80 in dmtcp::Util::readAll (fd=821, buf=0x7f266738bc00, count=136) at util_misc.cpp:279
> > #3 0x00007f26697c303b in operator>><dmtcp::DmtcpMessage> (t=..., this=0x7f2669c33408) at ../jalib/jsocket.h:109
> > #4 dmtcp::CoordinatorAPI::recvMsgFromCoordinator (this=0x7f2669c33408, msg=0x7f266738bc00, extraData=0x0) at coordinatorapi.cpp:422
> > #5 0x00007f26697bb9fc in dmtcp::DmtcpWorker::waitForCoordinatorMsg (msgStr=..., type=dmtcp::DMT_DO_SUSPEND) at dmtcpworker.cpp:469
> > #6 0x00007f26697bc58b in dmtcp::DmtcpWorker::waitForStage1Suspend () at dmtcpworker.cpp:513
> > #7 0x00007f26697ccd1a in dmtcp::callbackSleepBetweenCheckpoint (sec=<optimized out>) at mtcpinterface.cpp:61
> > #8 0x00007f26697d85e9 in checkpointhread (dummy=<optimized out>) at threadlist.cpp:359
> > #9 0x00007f26697cfb9e in pthread_start (arg=<optimized out>) at threadwrappers.cpp:159
> > #10 0x00007f26690e58ca in start_thread () from /lib/libpthread.so.0
> > #11 0x00007f26697cf9c9 in clone_start (arg=0x7f2669c15008) at threadwrappers.cpp:68
> > #12 0x00007f26687c3abd in clone () from /lib/libc.so.6
> > #13 0x0000000000000000 in ?? ()
> >
> > Thread 41 (Thread 0x7f2660f99700 (LWP 14137)):
> > #0 0x00007f26690ed86d in nanosleep () from /lib/libpthread.so.0
> > #1 0x00007f26697bf599 in dmtcp::ThreadSync::wrapperExecutionLockLock () at threadsync.cpp:392
> > #2 0x00007f26697cfba6 in pthread_start (arg=<optimized out>) at threadwrappers.cpp:161
> > #3 0x00007f26690e58ca in start_thread () from /lib/libpthread.so.0
> > #4 0x00007f26697cf9c9 in clone_start (arg=0x7f2669c24008) at threadwrappers.cpp:68
> > #5 0x00007f26687c3abd in clone () from /lib/libc.so.6
> > #6 0x0000000000000000 in ?? ()
> >
> > Thread 40 (Thread 0x7f2660798700 (LWP 14138)):
> > ...
> >
> > Thread 1 (Thread 0x7f2669c26740 (LWP 14131)):
> > #0 0x00007f26690ece69 in __lll_timedwait_tid () from /lib/libpthread.so.0
> > #1 0x00007f26690e6ebf in pthread_timedjoin_np () from /lib/libpthread.so.0
> > #2 0x00007f26697d0048 in pthread_join (thread=139802812454656, retval=0x0) at threadwrappers.cpp:274
> > #3 0x00007f26615cd2d7 in blas_thread_shutdown_ () from /global/homes/r/regan/.local/lib/python3.4/site-packages/numpy/core/../.libs/libopenblasp-r0-39a31c03.2.18.so
> > #4 0x00007f26687936d5 in fork () from /lib/libc.so.6
> > #5 0x00007f26697c9724 in fork () at execwrappers.cpp:186
> > #6 0x00007f2665b9ed7f in subprocess_fork_exec (self=<optimized out>, args=<optimized out>) at /global/common/genepool/usg/languages/python/build/Python-3.4.3/Modules/_posixsubprocess.c:649
> > #7 0x00007f2669427225 in call_function (oparg=<optimized out>, pp_stack=0x7fff86d8d690) at Python/ceval.c:4237
> > #8 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:2838
> > #9 0x00007f2669428cec in PyEval_EvalCodeEx (_co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=19, kws=0x7f25f4014ad8, kwcount=0, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0) at Python/ceval.c:3588
> > #10 0x00007f2669427ca4 in fast_function (nk=<optimized out>, na=<optimized out>, n=<optimized out>, pp_stack=0x7fff86d8d880, func=0x7f2669a60ea0) at Python/ceval.c:4344
> > #11 call_function (oparg=<optimized out>, pp_stack=0x7fff86d8d880) at Python/ceval.c:4262
> > #12 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:2838
> > #13 0x00007f2669428cec in PyEval_EvalCodeEx (_co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=2, kws=0x7f25fa0dda70, kwcount=4, defs=0x7f2669a55110, defcount=16, kwdefs=0x0, closure=0x0) at Python/ceval.c:3588
> > #14 0x00007f266938a2bc in function_call (func=0x7f2669a60840, arg=0x7f25f9f2c148, kw=0x7f25f9f46208) at Objects/funcobject.c:632
> > #15 0x00007f266935e8be in PyObject_Call (func=0x7f2669a60840, arg=<optimized out>, kw=<optimized out>) at Objects/abstract.c:2040
> > #16 0x00007f2669375354 in method_call (func=0x7f2669a60840, arg=0x7f25f9f2c148, kw=0x7f25f9f46208) at Objects/classobject.c:347
> > #17 0x00007f266935e8be in PyObject_Call (func=0x7f2663ff4148, arg=<optimized out>, kw=<optimized out>) at Objects/abstract.c:2040
> > #18 0x00007f26693c15f7 in slot_tp_init (self=0x7f25f9f45198, args=0x7f25f9f45128, kwds=0x7f25f9f46208) at Objects/typeobject.c:6177
> > #19 0x00007f26693bd17f in type_call (type=<optimized out>, args=0x7f25f9f45128, kwds=0x7f25f9f46208) at Objects/typeobject.c:898
> > #20 0x00007f266935e8be in PyObject_Call (func=0x1b8c0a8, arg=<optimized out>, kw=<optimized out>) at Objects/abstract.c:2040
> > #21 0x00007f2669425a06 in do_call (nk=<optimized out>, na=<optimized out>, pp_stack=0x7fff86d8dc80, func=0x1b8c0a8) at Python/ceval.c:4466
> > #22 call_function (oparg=<optimized out>, pp_stack=0x7fff86d8dc80) at Python/ceval.c:4264
> > #23 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:2838
> > #24 0x00007f2669427d68 in fast_function (nk=<optimized out>, na=<optimized out>, n=1, pp_stack=0x7fff86d8ddc0, func=0x7f25f9e96bf8) at Python/ceval.c:4334
> > #25 call_function (oparg=<optimized out>, pp_stack=0x7fff86d8ddc0) at Python/ceval.c:4262
> > #26 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:2838
> > #27 0x00007f2669427d68 in fast_function (nk=<optimized out>, na=<optimized out>, n=1, pp_stack=0x7fff86d8df00, func=0x7f25f9e96c80) at Python/ceval.c:4334
> > #28 call_function (oparg=<optimized out>, pp_stack=0x7fff86d8df00) at Python/ceval.c:4262
> > #29 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:2838
> > #30 0x00007f2669428cec in PyEval_EvalCodeEx (_co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=0, kws=0x0, kwcount=0, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0) at Python/ceval.c:3588
> > #31 0x00007f2669428dbb in PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, locals=<optimized out>) at Python/ceval.c:775
> > #32 0x00007f266941d3d8 in builtin_exec (self=<optimized out>, args=<optimized out>) at Python/bltinmodule.c:882
> > #33 0x00007f2669426247 in ext_do_call (nk=0, na=0, flags=<optimized out>, pp_stack=0x7fff86d8e1f0, func=0x7f2669c05788) at Python/ceval.c:4558
> > #34 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:2878
> > #35 0x00007f2669428cec in PyEval_EvalCodeEx (_co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=3, kws=0x7f2663d13938, kwcount=0, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0) at Python/ceval.c:3588
> > #36 0x00007f2669427ca4 in fast_function (nk=<optimized out>, na=<optimized out>, n=<optimized out>, pp_stack=0x7fff86d8e3e0, func=0x7f2669bd3b70) at Python/ceval.c:4344
> > #37 call_function (oparg=<optimized out>, pp_stack=0x7fff86d8e3e0) at Python/ceval.c:4262
> > #38 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:2838
> > #39 0x00007f2669427d68 in fast_function (nk=<optimized out>, na=<optimized out>, n=2, pp_stack=0x7fff86d8e520, func=0x7f2669b5d158) at Python/ceval.c:4334
> > #40 call_function (oparg=<optimized out>, pp_stack=0x7fff86d8e520) at Python/ceval.c:4262
> > #41 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:2838
> > #42 0x00007f2669427d68 in fast_function (nk=<optimized out>, na=<optimized out>, n=2, pp_stack=0x7fff86d8e660, func=0x7f2669b56ea0) at Python/ceval.c:4334
> > #43 call_function (oparg=<optimized out>, pp_stack=0x7fff86d8e660) at Python/ceval.c:4262
> > #44 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:2838
> > #45 0x00007f2669427d68 in fast_function (nk=<optimized out>, na=<optimized out>, n=1, pp_stack=0x7fff86d8e7a0, func=0x7f2669b590d0) at Python/ceval.c:4334
> > #46 call_function (oparg=<optimized out>, pp_stack=0x7fff86d8e7a0) at Python/ceval.c:4262
> > #47 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:2838
> > #48 0x00007f2669427d68 in fast_function (nk=<optimized out>, na=<optimized out>, n=2, pp_stack=0x7fff86d8e8e0, func=0x7f2669b61620) at Python/ceval.c:4334
> > #49 call_function (oparg=<optimized out>, pp_stack=0x7fff86d8e8e0) at Python/ceval.c:4262
> > #50 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:2838
> > #51 0x00007f2669428cec in PyEval_EvalCodeEx (_co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=2, kws=0x0, kwcount=0, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0) at Python/ceval.c:3588
> > #52 0x00007f266938a1b5 in function_call (func=0x7f2669b616a8, arg=0x7f25f9f2cf48, kw=0x0) at Objects/funcobject.c:632
> > #53 0x00007f266935e8be in PyObject_Call (func=0x7f2669b616a8, arg=<optimized out>, kw=<optimized out>) at Objects/abstract.c:2040
> > #54 0x00007f266935f2cd in _PyObject_CallMethodIdObjArgs (callable=0x7f2669b616a8, name=<optimized out>) at Objects/abstract.c:2310
> > #55 0x00007f26694427bc in PyImport_ImportModuleLevelObject (name=0x7f25fb28aa78, given_globals=<optimized out>, locals=<optimized out>, given_fromlist=<optimized out>, level=0) at Python/import.c:1636
> > #56 0x00007f266941dcaf in builtin___import__ (self=<optimized out>, args=<optimized out>, kwds=<optimized out>) at Python/bltinmodule.c:210
> > #57 0x00007f266935e8be in PyObject_Call (func=0x7f2669c053c8, arg=<optimized out>, kw=<optimized out>) at Objects/abstract.c:2040
> > #58 0x00007f2669420227 in PyEval_CallObjectWithKeywords (func=0x7f2669c053c8, arg=0x7f25fa0dfc50, kw=<optimized out>) at Python/ceval.c:4114
> > #59 0x00007f2669421907 in PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:2478
> > #60 0x00007f2669428cec in PyEval_EvalCodeEx (_co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=2, kws=0x7f25f9f3e3d0, kwcount=2, defs=0x7f25fa5ed360, defcount=2, kwdefs=0x0, closure=0x0) at Python/ceval.c:3588
> > #61 0x00007f2669427ca4 in fast_function (nk=<optimized out>, na=<optimized out>, n=<optimized out>, pp_stack=0x7fff86d8eff0, func=0x7f25fa5efe18) at Python/ceval.c:4344
> > #62 call_function (oparg=<optimized out>, pp_stack=0x7fff86d8eff0) at Python/ceval.c:4262
> > #63 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:2838
> > #64 0x00007f2669428cec in PyEval_EvalCodeEx (_co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=3, kws=0x7f2669bd5060, kwcount=0, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0) at Python/ceval.c:3588
> > #65 0x00007f266938a2bc in function_call (func=0x7f25fa5efd08, arg=0x7f25fa0f4ab0, kw=0x7f25f9f2cdc8) at Objects/funcobject.c:632
> > #66 0x00007f25fc7ae500 in __Pyx_PyObject_Call (kw=0x7f25f9f2cdc8, arg=<optimized out>, func=<optimized out>) at /global/scratch2/sd/gpinterp/pip-build-5fy4tlgm/h5py/h5py/_objects.c:7420
> > #67 __pyx_pf_4h5py_8_objects_9with_phil_wrapper (__pyx_v_kwds=0x7f25f9f2cdc8, __pyx_v_args=0x7f25fa0f4ab0, __pyx_self=<optimized out>) at /global/scratch2/sd/gpinterp/pip-build-5fy4tlgm/h5py/h5py/_objects.c:2429
> > #68 __pyx_pw_4h5py_8_objects_9with_phil_1wrapper (__pyx_self=<optimized out>, __pyx_args=0x7f25fa0f4ab0, __pyx_kwds=<optimized out>) at /global/scratch2/sd/gpinterp/pip-build-5fy4tlgm/h5py/h5py/_objects.c:2345
> > #69 0x00007f266935e8be in PyObject_Call (func=0x7f25fa5ec498, arg=<optimized out>, kw=<optimized out>) at Objects/abstract.c:2040
> > #70 0x00007f2669375354 in method_call (func=0x7f25fa5ec498, arg=0x7f25fa0f4ab0, kw=0x0) at Objects/classobject.c:347
> > #71 0x00007f266935e8be in PyObject_Call (func=0x7f2663d3a048, arg=<optimized out>, kw=<optimized out>) at Objects/abstract.c:2040
> > #72 0x00007f26693be3ec in call_method (o=<optimized out>, nameid=0x7f266974d0d0 <PyId___setitem__>, format=0x7f26694e7609 "(OO)") at Objects/typeobject.c:1429
> > #73 0x00007f26693be541 in slot_mp_ass_subscript (self=<optimized out>, key=<optimized out>, value=<optimized out>) at Objects/typeobject.c:5751
> > #74 0x00007f2669424d0d in PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:1831
> > #75 0x00007f2669428cec in PyEval_EvalCodeEx (_co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=3, kws=0x7f25f9f27e68, kwcount=0, defs=0x7f25fb259300, defcount=1, kwdefs=0x0, closure=0x0) at Python/ceval.c:3588
> > #76 0x00007f2669427ca4 in fast_function (nk=<optimized out>, na=<optimized out>, n=<optimized out>, pp_stack=0x7fff86d8f700, func=0x7f25fa0db730) at Python/ceval.c:4344
> > #77 call_function (oparg=<optimized out>, pp_stack=0x7fff86d8f700) at Python/ceval.c:4262
> > #78 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:2838
> > #79 0x00007f2669427d68 in fast_function (nk=<optimized out>, na=<optimized out>, n=3, pp_stack=0x7fff86d8f840, func=0x7f25fa0db6a8) at Python/ceval.c:4334
> > #80 call_function (oparg=<optimized out>, pp_stack=0x7fff86d8f840) at Python/ceval.c:4262
> > #81 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:2838
> > #82 0x00007f2669428cec in PyEval_EvalCodeEx (_co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=5, kws=0x7f25f424f280, kwcount=0, defs=0x7f25fe2d1f08, defcount=1, kwdefs=0x0, closure=0x0) at Python/ceval.c:3588
> > #83 0x00007f2669427ca4 in fast_function (nk=<optimized out>, na=<optimized out>, n=<optimized out>, pp_stack=0x7fff86d8fa30, func=0x7f25fa0d8e18) at Python/ceval.c:4344
> > #84 call_function (oparg=<optimized out>, pp_stack=0x7fff86d8fa30) at Python/ceval.c:4262
> > #85 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:2838
> > #86 0x00007f2669427d68 in fast_function (nk=<optimized out>, na=<optimized out>, n=3, pp_stack=0x7fff86d8fb70, func=0x7f25fa0e1620) at Python/ceval.c:4334
> > #87 call_function (oparg=<optimized out>, pp_stack=0x7fff86d8fb70) at Python/ceval.c:4262
> > #88 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:2838
> > #89 0x00007f2669427d68 in fast_function (nk=<optimized out>, na=<optimized out>, n=3, pp_stack=0x7fff86d8fcb0, func=0x7f25fa0fab70) at Python/ceval.c:4334
> > #90 call_function (oparg=<optimized out>, pp_stack=0x7fff86d8fcb0) at Python/ceval.c:4262
> > #91 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:2838
> > #92 0x00007f2669427d68 in fast_function (nk=<optimized out>, na=<optimized out>, n=3, pp_stack=0x7fff86d8fdf0, func=0x7f25fe2c16a8) at Python/ceval.c:4334
> > #93 call_function (oparg=<optimized out>, pp_stack=0x7fff86d8fdf0) at Python/ceval.c:4262
> > #94 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:2838
> > #95 0x00007f2669427d68 in fast_function (nk=<optimized out>, na=<optimized out>, n=1, pp_stack=0x7fff86d8ff30, func=0x7f25fe2c1620) at Python/ceval.c:4334
> > #96 call_function (oparg=<optimized out>, pp_stack=0x7fff86d8ff30) at Python/ceval.c:4262
> > #97 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:2838
> > #98 0x00007f2669427d68 in fast_function (nk=<optimized out>, na=<optimized out>, n=4, pp_stack=0x7fff86d90070, func=0x7f25fa0fc048) at Python/ceval.c:4334
> > #99 call_function (oparg=<optimized out>, pp_stack=0x7fff86d90070) at Python/ceval.c:4262
> > #100 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:2838
> > #101 0x00007f2669427d68 in fast_function (nk=<optimized out>, na=<optimized out>, n=1, pp_stack=0x7fff86d901b0, func=0x7f25fe2c19d8) at Python/ceval.c:4334
> > #102 call_function (oparg=<optimized out>, pp_stack=0x7fff86d901b0) at Python/ceval.c:4262
> > #103 PyEval_EvalFrameEx (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:2838
> > #104 0x00007f2669428cec in PyEval_EvalCodeEx (_co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=0, kws=0x0, kwcount=0, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0) at Python/ceval.c:3588
> > #105 0x00007f2669428dbb in PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, locals=<optimized out>) at Python/ceval.c:775
> > #106 0x00007f266944f770 in run_mod (arena=0x1a62b70, flags=0x7fff86d904e0, locals=0x7f2669b7a448, globals=0x7f2669b7a448, filename=0x7f2669ab7b20, mod=<optimized out>) at Python/pythonrun.c:2180
> > #107 PyRun_FileExFlags (fp=0x1ab3c20, filename_str=<optimized out>, start=<optimized out>, globals=0x7f2669b7a448, locals=0x7f2669b7a448, closeit=1, flags=0x7fff86d904e0) at Python/pythonrun.c:2133
> > #108 0x00007f26694504b9 in PyRun_SimpleFileExFlags (fp=0x1ab3c20, filename=<optimized out>, closeit=1, flags=0x7fff86d904e0) at Python/pythonrun.c:1606
> > #109 0x00007f2669468231 in run_file (p_cf=0x7fff86d904e0, filename=0x19f5830 L"/global/homes/r/regan/.local/bin/read_fast5_basecaller.py", fp=0x1ab3c20) at Modules/main.c:319
> > #110 Py_Main (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:751
> > #111 0x0000000000400ae3 in main (argc=14, argv=0x7fff86d90658) at ./Modules/python.c:69
> >
> > On Sat, Jul 29, 2017 at 7:21 AM, Rohan Garg <ro...@cc...> wrote:
> >
> > > I just realized that there's an open issue related to Python and
> > > Numpy. See here: https://github.com/dmtcp/dmtcp/issues/599.
> > > Unfortunately, I haven't been able to spend enough time to get a
> > > fix out.
> > >
> > > Could you please check whether this is what's going on? Even if it's
> > > not the exact same libraries, I think the bug, as reported on the
> > > GitHub issue, is more general and could manifest through different
> > > libraries, although with a similar backtrace signature.
> > >
> > > On Fri, Jul 28, 2017 at 09:53:40PM -0400, Rohan Garg wrote:
> > > > Hi Rob,
> > > >
> > > > Thanks for trying out DMTCP and reporting this issue. We'd be
> > > > happy to help resolve it.
> > > >
> > > > I'll take a look at the strace, but probably the most useful thing
> > > > here would be to figure out which process is hanging and to look
> > > > at its backtrace. The backtrace from the parent could also be
> > > > useful. You should be able to attach gdb to the interesting
> > > > processes and look at the backtrace:
> > > >
> > > > $ gdb attach <pid>
> > > > ...
> > > > (gdb) thread apply all bt
> > > >
> > > > This can hopefully tell us the reason for the deadlock.
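> > > > If it's easier, the same information can be captured
> > > > non-interactively with gdb's standard batch options (the output
> > > > file name here is just an example):
> > > >
> > > > $ gdb -p <pid> -batch -ex 'thread apply all bt' > backtrace-1.txt
> > > >
> > > > Running it again a few seconds later and comparing the two files
> > > > shows which threads are actually stuck.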
> > > > Also, as a quick test, could you please try running your program
> > > > with the following DMTCP options?
> > > >
> > > > $ dmtcp_launch --disable-alloc-plugin --disable-dl-plugin <program-and-program-arguments>
> > > >
> > > > Also, I'd recommend working with the 2.4.x/2.5.x versions; we are
> > > > not maintaining the 1.x branch any longer.
> > > >
> > > > (I do have a NERSC account and have recently diagnosed some issues
> > > > with the Cray toolchain/srun and DMTCP there. But it seems like
> > > > that's not an issue here.)
> > > >
> > > > Thanks,
> > > > Rohan
> > > >
> > > > On Fri, Jul 28, 2017 at 06:28:03PM -0700, Rob Egan wrote:
> > > > > Hello,
> > > > >
> > > > > I am trying to adopt DMTCP at our institution and ran into a
> > > > > problem with the first script that I tried. The script seems to
> > > > > start fine and performs a few operations, but then stalls
> > > > > indefinitely, consuming 0 CPU, in what looks like a futex wait
> > > > > loop after the process was cloned. It is a multi-threaded program
> > > > > and the behavior is the same whether one or more threads are
> > > > > requested.
> > > > >
> > > > > This is a third-party Python script, so unfortunately I can't
> > > > > share the code, but I uploaded the strace from a single-threaded
> > > > > execution:
> > > > >
> > > > > http://portal.nersc.gov/dna/RD/Adv-Seq/ONT/dmtcp-stall-read_fast5_basecaller.py-strace.log.gz
> > > > >
> > > > > This is how I executed dmtcp & strace & the script:
> > > > >
> > > > > dmtcp_launch strace ~/.local/bin/read_fast5_basecaller.py -i uploaded/0/ -o fastq,fast5 -s basecalls6 -t 1 --flowcell FLO-MIN107 --kit SQK-LSK308
> > > > >
> > > > > And here is the output from the dmtcp_coordinator (which I
> > > > > commanded to kill after a few minutes of inactivity):
> > > > >
> > > > > regan@gpint200:/global/projectb/scratch/regan/nanopore/runs/X0124-onebatch2$ dmtcp_coordinator
> > > > > dmtcp_coordinator starting...
> > > > >     Host: gpint200 (127.0.0.2)
> > > > >     Port: 7779
> > > > >     Checkpoint Interval: 18000
> > > > >     Exit on last client: 0
> > > > > Type '?' for help.
> > > > >
> > > > > [10273] NOTE at dmtcp_coordinator.cpp:1675 in updateCheckpointInterval; REASON='CheckpointInterval updated (for this computation only)'
> > > > >      oldInterval = 18000
> > > > >      theCheckpointInterval = 18000
> > > > > [10273] NOTE at dmtcp_coordinator.cpp:1081 in onConnect; REASON='worker connected'
> > > > >      hello_remote.from = 704891f6494a6304-10275-597be341
> > > > > [10273] NOTE at dmtcp_coordinator.cpp:869 in onData; REASON='Updating process Information after exec()'
> > > > >      progname = strace
> > > > >      msg.from = 704891f6494a6304-40000-597be341
> > > > >      client->identity() = 704891f6494a6304-10275-597be341
> > > > > [10273] NOTE at dmtcp_coordinator.cpp:1081 in onConnect; REASON='worker connected'
> > > > >      hello_remote.from = 704891f6494a6304-40000-597be341
> > > > > [10273] NOTE at dmtcp_coordinator.cpp:860 in onData; REASON='Updating process Information after fork()'
> > > > >      client->hostname() = gpint200
> > > > >      client->progname() = strace_(forked)
> > > > >      msg.from = 704891f6494a6304-41000-597be341
> > > > >      client->identity() = 704891f6494a6304-40000-597be341
> > > > > [10273] NOTE at dmtcp_coordinator.cpp:869 in onData; REASON='Updating process Information after exec()'
> > > > > [10273] NOTE at dmtcp_coordinator.cpp:1081 in onConnect; REASON='worker connected'
> > > > >      hello_remote.from = 704891f6494a6304-40000-597be341
> > > > > [10273] NOTE at dmtcp_coordinator.cpp:860 in onData; REASON='Updating process Information after fork()'
> > > > >      client->hostname() = gpint200
> > > > >      client->progname() = strace_(forked)
> > > > >      msg.from = 704891f6494a6304-41000-597be341
> > > > >      client->identity() = 704891f6494a6304-40000-597be341
> > > > > [10273] NOTE at dmtcp_coordinator.cpp:869 in onData; REASON='Updating process Information after exec()'
> > > > >      progname = python3.4
> > > > >      msg.from = 704891f6494a6304-41000-597be341
> > > > >      client->identity() = 704891f6494a6304-41000-597be341
> > > > > [10273] NOTE at dmtcp_coordinator.cpp:1081 in onConnect; REASON='worker connected'
> > > > >      hello_remote.from = 704891f6494a6304-41000-597be341
> > > > > k
> > > > > [10273] NOTE at dmtcp_coordinator.cpp:603 in handleUserCommand; REASON='Killing all connected Peers...'
> > > > > [10273] NOTE at dmtcp_coordinator.cpp:919 in onDisconnect; REASON='client disconnected'
> > > > >      client->identity() = 704891f6494a6304-40000-597be341
> > > > >      client->progname() = strace
> > > > > [10273] NOTE at dmtcp_coordinator.cpp:919 in onDisconnect; REASON='client disconnected'
> > > > >      client->identity() = 704891f6494a6304-41000-597be341
> > > > >      client->progname() = python3.4_(forked)
> > > > > [10273] NOTE at dmtcp_coordinator.cpp:919 in onDisconnect; REASON='client disconnected'
> > > > >      client->identity() = 704891f6494a6304-41000-597be341
> > > > >      client->progname() = python3.4
> > > > > l
> > > > > Client List:
> > > > > #, PROG[virtPID:realPID]@HOST, DMTCP-UNIQUEPID, STATE
> > > > > q
> > > > > [10273] NOTE at dmtcp_coordinator.cpp:590 in handleUserCommand; REASON='killing all connected peers and quitting ...'
> > > > > DMTCP coordinator exiting... (per request)
> > > > >
> > > > > This is a process that, on large data sets, can take a few days
> > > > > on a 32-core node, so it is something that we would really like
> > > > > to be able to checkpoint on our cluster.
> > > > >
> > > > > Please let me know how I can help debug this with you. I've
> > > > > tried versions 1.2.5, 2.4.8, and 2.5, and they all have the same
> > > > > problem.
> > > > >
> > > > > Thanks,
> > > > > Rob Egan