From: Julian S. <js...@ac...> - 2005-06-16 11:04:15
Jeremy, Ashley,

I appreciate you both looking into this. I'm unclear as to whether
you grokked that I changed the exit semantics in the 3 line a couple
of weeks back to use the "last-one-out-turn-out-the-lights" semantics.
As a result (following some further GDT-copying entertainment) the
Elan3 driver now runs fine on Valgrind, and nothing else appears to
be broken as a result.

J

On Thursday 16 June 2005 10:56, Ashley Pittman wrote:
> On Wed, 2005-06-15 at 12:24 -0700, Jeremy Fitzhardinge wrote:
> > Ashley Pittman wrote:
> > >I'd be surprised if many programs actually call elan3_detach() though,
> > >there are no hooks from MPI_Finalize through so it probably never gets
> > >called.
> >
> > So it's probably the result of an explicit close()?
>
> Probably the result of an implicit close(). Very few programs call
> detach or close so it will come from the fd being closed on program
> teardown.
>
> > >Going back to the original questions, the thread should implicitly be
> > >woken and then die when the parent thread terminates, hence the deadlock
> > >if the parent thread isn't exiting. How does V work WRT any other
> > >blocking syscall being in progress at program exit?
> >
> > If a thread calls exit_group(), Valgrind hits any thread blocked in a
> > syscall with a signal to get it out of the kernel, and tells all threads
> > to terminate; once they're all dead the process exits. Normally this
> > happens more or less instantaneously, but if a thread refuses to come
> > out of the kernel for some reason it will hold things up. That's the
> > 2.6/NPTL thread model.
>
> This should work, any signal will cause it to return to userspace
> briefly and if it's sigterm then the thread should exit whilst it's
> there.
>
> > In the 2.4/LinuxThreads case, the threads library coordinates the
> > process termination by getting each thread to explicitly call exit().
> > There are some tricky edge cases depending on whether the manager thread
> > or the initial thread is the last to exit. Again, Valgrind only exits
> > once all threads have terminated.
>
> This isn't going to happen until what I guess you are referring to as
> the initial thread has exited, couldn't deadlock also happen this way?
> How exactly does it do this coordination?
>
> > Now, your elan thread is created by a native clone() rather than via
> > pthread_create, right? Are you creating the thread in the same thread
> > group as the rest of the program, or in a separate thread group?
>
> I'm not familiar enough with low-level threads to tell, I assume it's in
> the same thread group as we don't do anything special to request its
> own. Here is the code in question:
>
>     if ((res = __clone (elan3_lwp, stack + ELANLWP_STACK_SIZE,
>                         CLONE_VM | CLONE_FS | CLONE_FILES |
>                         CLONE_SIGHAND,
>                         (void *) ctx)) == -1)
>
> > If the
> > main program terminates with exit_group, but the elan thread is not in
> > the thread group, then Valgrind will not attempt to kill it, but will
> > still wait around for it to exit; if the elan thread is waiting for the
> > Valgrind thread to exit, then we're in a deadlock. I guess that's
> > what's happening. There are two fairly easy solutions:
> >
> >    1. change the elan driver to create the thread in the same thread
> >       group as the rest of the process, so exit_group() does the
> >       expected thing, or
> >    2. hack exit_group() so it just kills all threads in the process
> >       rather than just the thread group
> >
> > Option 2 might be preferred. It isn't strictly correct, but using
> > multiple thread groups within a process is pretty rare, except in the
> > degenerate case where every thread is in its own group (as you get with
> > LinuxThreads).
>
> I can certainly try option 1 if required but option 2 would be
> preferred. Generally people with sizable clusters regard stability
> above all else and lead times for pushing out new software releases can
> be extensive.
>
> > You could make it a --weird-hack
> > (exit-nukes-everything?) specifically for this case.
>
> There is already a command line option to allow this thread to be
> created so extending it to cover this shouldn't cause any problems.
>
> I assume Valgrind itself runs in the same thread as (and shares fds
> with) the main thread of the host program?
>
> > Or perhaps the alternative is to explicitly get the elan thread to
> > terminate as part of the program's cleanup/shutdown actions (ie, do the
> > appropriate call in MPI_Finalize).
>
> This isn't a very good solution, it would rely on programs being
> "well-behaved" for V to run correctly. There is no equivalent for shmem
> (the CRAY api) and the elan_fini() function is only used in a
> scattering of cases.
>
> Ashley,
>
> _______________________________________________
> Valgrind-developers mailing list
> Val...@li...
> https://lists.sourceforge.net/lists/listinfo/valgrind-developers
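
For reference, a minimal C sketch of the thread-group question raised in the quoted exchange. It assumes a 2.6-or-later Linux kernel and glibc's clone() wrapper. The helper names (demo_child, tgid_of_clone_child, demo_stack) are hypothetical and are not part of the elan library or Valgrind; only the four CLONE_* flags are taken from the quoted code. The idea is that a child cloned with those flags alone should report its own thread group id, whereas adding CLONE_THREAD (roughly what option 1 amounts to) should place it in the caller's thread group, which is the set of threads exit_group() acts on.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#define DEMO_STACK_SIZE (64 * 1024)

/* Hypothetical demo state, visible to the child because of CLONE_VM. */
static char demo_stack[DEMO_STACK_SIZE] __attribute__((aligned(16)));
static atomic_int demo_child_tgid;

/* Child body: record the thread group id (the value getpid() reports). */
static int demo_child(void *arg)
{
    (void) arg;
    atomic_store(&demo_child_tgid, (int) syscall(SYS_getpid));
    return 0;
}

/* Clone a child with the flags from the quoted elan code plus extra_flags.
 * Returns the thread group id the child observed, or -1 if clone() failed. */
pid_t tgid_of_clone_child(int extra_flags)
{
    atomic_store(&demo_child_tgid, 0);

    if (clone(demo_child, demo_stack + DEMO_STACK_SIZE,
              CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | extra_flags,
              NULL) == -1)
        return -1;

    /* Poll instead of waitpid(): a CLONE_THREAD child cannot be waited for. */
    while (atomic_load(&demo_child_tgid) == 0)
        usleep(1000);

    /* Crude grace period so the child is off the stack before it is reused. */
    usleep(10000);

    return (pid_t) atomic_load(&demo_child_tgid);
}
```

Under these assumptions, tgid_of_clone_child(0) should differ from the caller's getpid(), while tgid_of_clone_child(CLONE_THREAD) should equal it. The sketch ignores TLS setup and child reaping, so it is only suitable for checking the flag behaviour, not as a template for the driver.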