From: Jeremy F. <je...@go...> - 2005-06-15 20:14:49
Ashley Pittman wrote:
>I'd be surprised if many programs actually call elan3_detach() though,
>there are no hooks from MPI_Finalize though, so it probably never gets
>called.
>
>
So it's probably the result of an explicit close()?
>>In the 2.6 NPTL thread model, exit_group() terminates all threads in the
>>thread group atomically, so there's no waiting around for things to
>>terminate (or dependence on termination order). Is this running in a
>>2.4 thread model, or a 2.6 one? It sounds like the container machinery
>>has an atomic group termination property similar to exit_group().
>>
>>
>
>It does sound similar, it works across child programs though, not just
>thread groups. Probably not relevant to this bug however.
>
>Going back to the original questions, the thread should implicitly be
>woken and then die when the parent thread terminates, hence the deadlock
>if the parent thread isn't exiting. How does V work WRT any other
>blocking syscall being in progress at program exit?
>
If a thread calls exit_group(), Valgrind hits any thread blocked in a
syscall with a signal to get it out of the kernel, and tells all threads
to terminate; once they're all dead the process exits. Normally this
happens more or less instantaneously, but if a thread refuses to come
out of the kernel for some reason it will hold things up. That's the
2.6/NPTL thread model.
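
The kernel-side behaviour described above can be seen without Valgrind. The following is a minimal sketch, assuming Linux with glibc: it uses raw clone() with CLONE_THREAD to put a second thread in the caller's thread group, leaves that thread blocked in read() on an empty pipe, and then calls exit_group(). The names exit_group_demo and blocker are made up for this illustration, and the whole experiment runs in a forked child so the caller survives.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stack for the cloned thread (hypothetical names throughout). */
static char blocker_stack[64 * 1024] __attribute__((aligned(16)));
static int block_fd;    /* read end of a pipe that never receives data */

/* Thread body: sit in the kernel forever, blocked in read(). */
static int blocker(void *arg)
{
    char c;
    (void)arg;
    syscall(SYS_read, block_fd, &c, 1);
    return 0;
}

/* Returns the exit status of the forked child, or -1 on error. */
int exit_group_demo(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        int fds[2];
        if (pipe(fds) != 0)
            _exit(1);
        block_fd = fds[0];

        /* CLONE_THREAD puts the new thread in our thread group; it
         * requires CLONE_SIGHAND, which in turn requires CLONE_VM. */
        int flags = CLONE_VM | CLONE_FS | CLONE_FILES |
                    CLONE_SIGHAND | CLONE_THREAD;
        if (clone(blocker, blocker_stack + sizeof blocker_stack,
                  flags, NULL) < 0)
            _exit(2);

        usleep(50 * 1000);            /* let the thread block in read() */
        syscall(SYS_exit_group, 7);   /* kills the whole thread group */
        _exit(3);                     /* not reached */
    }

    /* If the blocked thread survived exit_group(), this would hang. */
    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling exit_group_demo() from a main() should return 7 promptly: the thread blocked in read() is torn down along with the rest of its group, with no cooperation needed from it.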

In the 2.4/LinuxThreads case, the threads library coordinates the
process termination by getting each thread to explicitly call exit().
There are some tricky edge cases depending on whether the manager thread
or the initial thread is the last to exit. Again, Valgrind only exits
once all threads have terminated.

Now, your elan thread is created by a native clone() rather than via
pthread_create, right? Are you creating the thread in the same thread
group as the rest of the program, or in a separate thread group? If the
main program terminates with exit_group, but the elan thread is not in
the thread group, then Valgrind will not attempt to kill it, but will
still wait around for it to exit; if the elan thread is waiting for the
Valgrind thread to exit, then we're in a deadlock. I guess that's
what's happening. There are two fairly easy solutions:

   1. change the elan driver to create the thread in the same thread
      group as the rest of the process, so exit_group() does the
      expected thing, or
   2. hack exit_group() so it just kills all threads in the process
      rather than just the thread group

Option 2 might be preferred. It isn't strictly correct, but using
multiple thread groups within a process is pretty rare, except in the
degenerate case where every thread is in its own group (as you get with
LinuxThreads). You could make it a --weird-hack
(exit-nukes-everything?) specifically for this case.
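
Whether a clone()d thread lands in the caller's thread group comes down to the CLONE_THREAD flag, which is what option 1 would change. Here is a rough sketch, assuming Linux with glibc, that clones a short-lived task with and without CLONE_THREAD and compares the thread group id each one sees (getpid() returns the tgid). I don't know what flags the elan driver actually passes, so nothing here is taken from it; probe and clone_shares_thread_group are invented names.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static char probe_stack[64 * 1024] __attribute__((aligned(16)));
static volatile long probe_tgid;   /* thread group id seen by the clone */
static volatile pid_t probe_tid;   /* set at clone(), zeroed at its exit */

/* Body of the cloned task: record which thread group it belongs to. */
static int probe(void *arg)
{
    (void)arg;
    probe_tgid = syscall(SYS_getpid);   /* raw syscall, no libc caching */
    return 0;
}

/* Returns 1 if the clone shares the caller's thread group, 0 if it
 * ended up in a group of its own, -1 on error. */
int clone_shares_thread_group(int use_clone_thread)
{
    int flags = CLONE_VM | CLONE_FS | CLONE_FILES |
                CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID;
    if (use_clone_thread)
        flags |= CLONE_SIGHAND | CLONE_THREAD;
    else
        flags |= SIGCHLD;   /* lets us reap it with a plain waitpid() */

    /* The kernel stores the new tid in probe_tid before clone() returns
     * and writes 0 there once the task has finished with user space. */
    pid_t tid = clone(probe, probe_stack + sizeof probe_stack, flags,
                      NULL, (pid_t *)&probe_tid, NULL, (pid_t *)&probe_tid);
    if (tid < 0)
        return -1;

    while (probe_tid != 0)
        sched_yield();
    __sync_synchronize();

    if (!use_clone_thread && waitpid(tid, NULL, 0) != tid)
        return -1;

    return probe_tgid == syscall(SYS_getpid);
}
```

clone_shares_thread_group(1) should return 1 and clone_shares_thread_group(0) should return 0. A task of the second kind is the one that exit_group() in the main program leaves running.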

Or perhaps the alternative is to explicitly get the elan thread to
terminate as part of the program's cleanup/shutdown actions (i.e., do the
appropriate call in MPI_Finalize).
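
As a generic illustration of that last approach (not the elan API, which I'm not quoting here), the usual pattern is to give the helper thread something it can be woken through and to join it during shutdown. The sketch below uses pthreads and a pipe as a stand-in for the blocking driver call; struct helper, helper_start and helper_shutdown are hypothetical names, and it may need -pthread when linking against an older glibc.

```c
#include <assert.h>
#include <pthread.h>
#include <unistd.h>

struct helper {
    pthread_t thread;
    int wake_fds[2];      /* [0] read by the helper, [1] written at shutdown */
    int exited_cleanly;   /* set by the helper when it is asked to stop */
};

/* Helper thread: blocks in read() until shutdown writes a byte. */
static void *helper_main(void *arg)
{
    struct helper *h = arg;
    char c;
    if (read(h->wake_fds[0], &c, 1) == 1)
        h->exited_cleanly = 1;
    return NULL;
}

/* Start the helper; returns 0 on success, -1 on failure. */
int helper_start(struct helper *h)
{
    h->exited_cleanly = 0;
    if (pipe(h->wake_fds) != 0)
        return -1;
    if (pthread_create(&h->thread, NULL, helper_main, h) != 0) {
        close(h->wake_fds[0]);
        close(h->wake_fds[1]);
        return -1;
    }
    return 0;
}

/* Call from the program's cleanup path: wake the helper, then wait for
 * it to finish so no thread is left blocked in the kernel at exit. */
int helper_shutdown(struct helper *h)
{
    char c = 'q';
    if (write(h->wake_fds[1], &c, 1) != 1)
        return -1;
    if (pthread_join(h->thread, NULL) != 0)
        return -1;
    close(h->wake_fds[0]);
    close(h->wake_fds[1]);
    return h->exited_cleanly ? 0 : -1;
}
```

With this shape, the thread is already gone by the time the process exits, so the outcome no longer depends on thread group membership or on termination order.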
J