From: Ashley P. <as...@qu...> - 2005-06-16 09:59:32
On Wed, 2005-06-15 at 12:24 -0700, Jeremy Fitzhardinge wrote:
> Ashley Pittman wrote:
>
> >I'd be surprised if many programs actually call elan3_detach() though;
> >there are no hooks from MPI_Finalize through to it, so it probably never
> >gets called.
> >
> >
> So it's probably the result of an explicit close()?
Probably the result of an implicit close(). Very few programs call
detach or close, so it will come from the fd being closed on program
teardown.
> >Going back to the original questions, the thread should implicitly be
> >woken and then die when the parent thread terminates, hence the deadlock
> >if the parent thread isn't exiting. How does V work WRT any other
> >blocking syscall being in progress at program exit?
> >
> If a thread calls exit_group(), Valgrind hits any thread blocked in a
> syscall with a signal to get it out of the kernel, and tells all threads
> to terminate; once they're all dead the process exits. Normally this
> happens more or less instantaneously, but if a thread refuses to come
> out of the kernel for some reason it will hold things up. That's the
> 2.6/NPTL thread model.
This should work: any signal will cause it to return to userspace
briefly, and if it's SIGTERM then the thread should exit while it's
there.
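As a minimal sketch of the "signal gets it out of the kernel" behaviour,
assuming an interruptible syscall and a handler installed without
SA_RESTART: the function below blocks in read() on an empty pipe and lets
a SIGALRM timer interrupt it. The demo_* names and the 50 ms interval are
made up for illustration, and read() only stands in for the elan call,
which may well behave differently inside the driver.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

static volatile sig_atomic_t demo_got_signal;   /* hypothetical name */

/* Handler only records that the signal arrived. */
static void demo_on_alarm(int signo)
{
    (void) signo;
    demo_got_signal = 1;
}

/* Returns 1 if a blocked read() came back with EINTR because of SIGALRM,
   0 if it ended some other way, -1 if the setup failed. */
static int demo_read_interrupted_by_signal(void)
{
    struct sigaction sa, old_sa;
    struct itimerval timer;
    int fds[2];
    char byte;
    ssize_t n;
    int saved_errno;

    if (pipe(fds) == -1)
        return -1;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = demo_on_alarm;      /* sa_flags == 0, so no SA_RESTART */
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, &old_sa) == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    /* Periodic timer, so a tick that fires before read() is entered
       is followed by another one that does interrupt it. */
    memset(&timer, 0, sizeof timer);
    timer.it_value.tv_usec = 50000;
    timer.it_interval.tv_usec = 50000;
    demo_got_signal = 0;
    setitimer(ITIMER_REAL, &timer, NULL);

    n = read(fds[0], &byte, 1);         /* blocks: nobody writes to the pipe */
    saved_errno = errno;

    memset(&timer, 0, sizeof timer);    /* disarm before restoring the handler */
    setitimer(ITIMER_REAL, &timer, NULL);
    sigaction(SIGALRM, &old_sa, NULL);
    close(fds[0]);
    close(fds[1]);

    assert(n == -1);                    /* no writer, so no data can arrive */
    return saved_errno == EINTR && demo_got_signal;
}
```

If the handler were installed with SA_RESTART, the kernel would restart
read() after the handler ran and the thread would go straight back to
sleep, which is the kind of "refuses to come out" case described above.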
> In the 2.4/LinuxThreads case, the threads library coordinates the
> process termination by getting each thread to explicitly call exit().
> There are some tricky edge cases depending on whether the manager thread
> or the initial thread is the last to exit. Again, Valgrind only exits
> once all threads have terminated.
This isn't going to happen until what I guess you are referring to as
the initial thread has exited; couldn't deadlock also happen this way?
How exactly does it do this coordination?
> Now, your elan thread is created by a native clone() rather than via
> pthread_create, right? Are you creating the thread in the same thread
> group as the rest of the program, or in a separate thread group?
I'm not familiar enough with low-level threads to tell. I assume it's in
the same thread group, as we don't do anything special to request its
own. Here is the code in question:
    if ((res = __clone (elan3_lwp, stack + ELANLWP_STACK_SIZE,
                        CLONE_VM | CLONE_FS | CLONE_FILES |
                        CLONE_SIGHAND,
                        (void *) ctx)) == -1)
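One way to check the thread group question rather than assume: per
clone(2), a child joins the caller's thread group only when CLONE_THREAD
is passed, and the flags above don't include it. The sketch below clones
a child with the same four flags plus whatever extra flag is passed in,
and compares the thread group id that each side sees. All demo_* names
and the stack size are hypothetical; this is not the elan code.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define DEMO_STACK_SIZE (64 * 1024)     /* stands in for ELANLWP_STACK_SIZE */

static volatile pid_t demo_child_tgid;  /* written by the child (shared VM) */

/* Child entry point: record the thread group id the kernel reports. */
static int demo_record_tgid(void *arg)
{
    (void) arg;
    demo_child_tgid = (pid_t) syscall(SYS_getpid);
    return 0;
}

/* Returns 1 if the child shares the caller's thread group, 0 if it does
   not, -1 if the clone could not be set up. */
static int demo_same_thread_group(int extra_flags)
{
    int flags = CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | extra_flags;
    volatile pid_t tid = 0;
    char *stack = malloc(DEMO_STACK_SIZE);
    pid_t pid;

    if (stack == NULL)
        return -1;

    demo_child_tgid = 0;
    /* CLONE_PARENT_SETTID stores the child's tid before clone() returns;
       CLONE_CHILD_CLEARTID has the kernel zero it once the child has exited. */
    pid = clone(demo_record_tgid, stack + DEMO_STACK_SIZE,
                flags | CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID,
                NULL, (pid_t *) &tid, NULL, (pid_t *) &tid);
    if (pid == -1) {
        free(stack);
        return -1;
    }

    while (tid != 0)                    /* wait until the child is gone */
        usleep(1000);
    waitpid(pid, NULL, __WALL);         /* reaps a non-CLONE_THREAD child;
                                           fails harmlessly for a thread */
    free(stack);

    assert(demo_child_tgid != 0);       /* the child must have run */
    return demo_child_tgid == (pid_t) syscall(SYS_getpid);
}
```

On Linux, demo_same_thread_group(0) should return 0 and
demo_same_thread_group(CLONE_THREAD) should return 1, so as far as I can
tell the thread created by the code above ends up in its own thread
group.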
> If the
> main program terminates with exit_group, but the elan thread is not in
> the thread group, then Valgrind will not attempt to kill it, but will
> still wait around for it to exit; if the elan thread is waiting for the
> Valgrind thread to exit, then we're in a deadlock. I guess that's
> what's happening. There are two fairly easy solutions:
>
> 1. change the elan driver to create the thread in the same thread
> group as the rest of the process, so exit_group() does the
> expected thing, or
> 2. hack exit_group() so it just kills all threads in the process
> rather than just the thread group
>
> Option 2 might be preferred. It isn't strictly correct, but using
> multiple thread groups within a process is pretty rare, except in the
> degenerate case where every thread is in its own group (as you get with
> LinuxThreads).
I can certainly try option 1 if required, but option 2 would be
preferred. Generally people with sizable clusters value stability
above all else, and lead times for pushing out new software releases
can be extensive.
> You could make it a --weird-hack
> (exit-nukes-everything?) specifically for this case.
There is already a command line option to allow this thread to be
created, so extending it to cover this shouldn't cause any problems.
I assume Valgrind itself runs in the same thread as (and shares fds
with) the main thread of the host program?
> Or perhaps the alternative is to explicitly get the elan thread to
> terminate as part of the program's cleanup/shutdown actions (ie, do the
> appropriate call in MPI_Finalize).
This isn't a very good solution, as it would rely on programs being
"well-behaved" for V to run correctly. There is no equivalent for shmem
(the Cray API), and the elan_fini() function is only used in a
scattering of cases.
Ashley,