From: Jeremy F. <je...@go...> - 2005-03-19 23:31:54
|
Julian Seward wrote:
>>It takes care of
>>program termination... it is always the last thread to exit in
>>LinuxThreads programs?
>>
>>
>
>I don't know (I think so); but I also don't think this is relevant.
>
>Reason why not is that presumably it would be possible to construct
>a test case which deadlocks under the current behaviour even when
>running under NPTL. I don't think it's particularly specific to
>LinuxThreads.
>
>
This isn't something you could do from application code. The getppid()
library call will always return the main thread's parent from every
thread (likewise, getpid() always returns the same pid in all threads).
We're talking about the behaviour of the getppid() system call, which is
only directly used by library code. So this is specific to
LinuxThreads, because it's the only thread library we know of to use the
getppid syscall in this way.
The attached patch fixes getppid(), BTW, but it does expose another
problem: LinuxThreads is using SIGKILL to kill off the other threads,
which means that they don't get a chance to clean up or notify anyone of
their death. This would cause problems with either shutdown-and-report
method. I'm thinking about a fix.
This is one of the cases where the requirements of Valgrind-as-emulation
environment conflict with the requirements of
Valgrind-as-test/debug-tool. Perfect emulation would require the
last-one-out model, but that makes the debug tool less useful, since the
reasonable user expectation that all results will be reported before
exit will be violated. Given that this is a very small corner case, and
is reasonably easy to fix, I would prefer to retain the most useful
behaviour for a test/debug tool rather than for an emulation tool, and fix
problems with that on a case-by-case basis. The alternative means that
users can never expect synchronous results from MT programs, which makes
every single invocation of Valgrind more difficult; this doesn't seem
like a good tradeoff.
J
|
|
From: Julian S. <js...@ac...> - 2005-03-19 18:26:53
|
> - The whole issue is caused by the fact that Valgrind must do some extra
> stuff (do its final output) after all the client threads have finished.
> The question is, which thread should do this extra stuff?
Yes.
> - The current "master thread" approach is to anoint a particular thread as
> special, and have it hang around until termination to do this extra stuff.
> This artificial life-extension of the master thread can cause deadlocks.
> This is the OOo problem.
Yes.
> - The alternative "last one out turn off the lights" approach is to get
> the last-exiting thread to do the extra stuff. This can cause programs
> that use wait() to not wait until the Valgrind'd client has done its extra
> stuff.
Yes.
> - Another possibility is for Valgrind to have a manager thread, but that
> complicates signals and stuff horribly.
Yes.
> - "Parent" and "children" threads come about via clone(), ie. a parent
> thread calls clone(), producing a child thread
Yes.
> - The "main thread" is an OS/pthreads(?) concept. It's thread 1, I think.
> It's the one that main() executes in, and it is the parent (or
> grandparent, or great-grandparent...) of every other thread in the client.
> Other than that, it's not so special; other threads can outlive it, for
> example.
I believe that to all be correct, but I'm not sure.
> (Or not? Jeremy just said pthreads is defined so that all
> threads die when main() exits... I'm confused.
My understanding is that POSIX requires: when main() exits then any child
pthreads simply disappear. But I think that's kinda not relevant here.
> - The "master thread" is a Valgrind concept. Currently, Valgrind keeps it
> alive in order to do Valgrind's final processing.
Yes.
> It starts off as thread 1,
Yes.
> but when fork() is called it can change in the children.
That may or may not be true, but I don't think it's relevant.
> The life-extension can change the order in which threads terminate.
Yes. And hence cause deadlock.
> - The "manager thread" is present in LinuxThreads.
Yes.
> It takes care of
> program termination... it is always the last thread to exit in
> LinuxThreads programs?
I don't know (I think so); but I also don't think this is relevant.
Reason why not is that presumably it would be possible to construct a
test case which deadlocks under the current behaviour even when running
under NPTL. I don't think it's particularly specific to LinuxThreads.
> - wait() is a big complication -- if we're not careful, programs using
> wait() will screw up. I'm not at all clear about which thread wait()
> watches -- the main thread? I guess it's whichever pid it is passed, and
> the trickiness here is that under Linux a pid is really a tid?
>
> (Doesn't this mean that it's true that for any program, wait() would
> return before the program is finished if any thread outlived the
> watched-by-wait() thread?)
Correct. But that's not a big deal, because that's what would happen
natively. The problem arises -- in the last_one_out scheme -- when one
assumes that observation of the watched-by-wait thread exiting reliably
means that V's final summary is done.
> Is there any way to fix the wait() problem while keeping
> last-one-out-turn-off-the-lights?
Fundamentally, no.
J
|
|
From: Jeremy F. <je...@go...> - 2005-03-19 18:15:42
|
Nicholas Nethercote wrote:
> I'd like to step back for a second in order to understand this problem
> better, because I've found this discussion confusing. I've stated
> what are the salient facts here, as I understand them. Please correct
> any that are wrong, or add any important missing ones.
>
> - The whole issue is caused by the fact that Valgrind must do some
> extra stuff (do its final output) after all the client threads have
> finished.
> The question is, which thread should do this extra stuff?
Yes.
> - The current "master thread" approach is to anoint a particular
> thread as special, and have it hang around until termination to do
> this extra stuff.
> This artificial life-extension of the master thread can cause deadlocks.
> This is the OOo problem.
Yes (though it's a generic older-LinuxThreads issue).
> - The alternative "last one out turn off the lights" approach is to
> get the last-exiting thread to do the extra stuff. This can cause
> programs that use wait() to not wait until the Valgrind'd client has
> done its extra stuff.
Yes.
> - Another possibility is for Valgrind to have a manager thread, but
> that complicates signals and stuff horribly.
Yes; it wouldn't be viable.
> - "Parent" and "children" threads come about via clone(), ie. a parent
> thread calls clone(), producing a child thread
Mostly true. NPTL-style clone, with CLONE_PID, means that threads don't
really have parent-child relationships. But in the LinuxThreads use of
clone(), they do. NPTL threading is a non-issue for this conversation,
because it uses exit_group() to atomically terminate all threads in the
process, and so the "last thread" problem doesn't arise.
> - The "main thread" is an OS/pthreads(?) concept. It's thread 1, I
> think. It's the one that main() executes in, and it is the parent (or
> grandparent, or great-grandparent...) of every other thread in the
> client. Other than that, it's not so special; other threads can
> outlive it, for example. (Or not? Jeremy just said pthreads is
> defined so that all threads die when main() exits... I'm confused.
> Perhaps we're being imprecise with our usage of words like "dying" and
> "exiting"? Or is it that what one sees at the application level can
> be different to what is actually happening inside the pthreads library?))
Valgrind deals with kernel-level threads; they're the only entities it
can see being created or destroyed. The kernel doesn't impose any
particular constraints on the lifetime of one thread with respect to
other threads.
The application has a threads library to do threading; most likely it is
pthreads, but since we don't really know or care what the threads
library is, it could be something else. pthreads says that everything
dies when the main thread terminates, but that's a pthreads policy
rather than a kernel one.
Julian is maintaining that it is an application bug if they leave
sub-threads running after the main thread terminates. My counter is
that 1) it isn't really a bug, and 2) the application/programmer can
only use their threads API to control this; they have no control over
how the implementation of that API manages the underlying kernel
resources, including their lifetime. It's therefore unfair to blame the
programmer for inconvenient exit-ordering of the kernel threads.
> - The "master thread" is a Valgrind concept. Currently, Valgrind
> keeps it alive in order to do Valgrind's final processing. It starts
> off as thread 1, but when fork() is called it can change in the
> children. The life-extension can change the order in which threads
> terminate.
Well, in particular, we keep the kernel thread around while pretending
it has died. This confuses a particular LinuxThreads implementation,
because it can see via getppid() that the kernel-level thread still
exists even though it has logically terminated.
> - The "manager thread" is present in LinuxThreads. It takes care of
> program termination... it is always the last thread to exit in
> LinuxThreads programs?
No. That's part of the bug in this case. The manager is trying to work
out whether it is last or not; if the parent (main thread) still exists,
it's expecting it to shut everything down.
> - wait() is a big complication -- if we're not careful, programs using
> wait() will screw up. I'm not at all clear about which thread wait()
> watches -- the main thread? I guess it's whichever pid it is passed,
> and the trickiness here is that under Linux a pid is really a tid?
The main thread. The program which started the threaded one doesn't
know it will be threaded; it just has a single pid to look at.
> (Doesn't this mean that it's true that for any program, wait() would
> return before the program is finished if any thread outlived the
> watched-by-wait() thread?)
Yes, but that only matters if the other threads do something (have
side-effects) after the main thread terminates. Under Valgrind, threads
which would have otherwise done nothing would be responsible for
emitting output, and therefore do something.
> Is there any way to fix the wait() problem while keeping
> last-one-out-turn-off-the-lights?
I don't believe so. The other point to note is that fixing the current
implementation of getppid to conform to LinuxThreads' expectations is
~10 lines of code.
J
|
|
From: Nicholas N. <nj...@cs...> - 2005-03-19 17:19:04
|
On Sat, 19 Mar 2005, Julian Seward wrote:
>> 2. produce output from whichever is the last thread to exit
>
>> 3. make the main thread wait for all others to exit before producing
>> output
>
> On further contemplation ...
I'd like to step back for a second in order to understand this problem
better, because I've found this discussion confusing. I've stated what
are the salient facts here, as I understand them. Please correct any
that are wrong, or add any important missing ones.
- The whole issue is caused by the fact that Valgrind must do some extra
stuff (do its final output) after all the client threads have finished.
The question is, which thread should do this extra stuff?
- The current "master thread" approach is to anoint a particular thread
as special, and have it hang around until termination to do this extra
stuff. This artificial life-extension of the master thread can cause
deadlocks. This is the OOo problem.
- The alternative "last one out turn off the lights" approach is to get
the last-exiting thread to do the extra stuff. This can cause programs
that use wait() to not wait until the Valgrind'd client has done its
extra stuff.
- Another possibility is for Valgrind to have a manager thread, but that
complicates signals and stuff horribly.
- "Parent" and "children" threads come about via clone(), ie. a parent
thread calls clone(), producing a child thread
- The "main thread" is an OS/pthreads(?) concept. It's thread 1, I
think. It's the one that main() executes in, and it is the parent (or
grandparent, or great-grandparent...) of every other thread in the
client. Other than that, it's not so special; other threads can outlive
it, for example. (Or not? Jeremy just said pthreads is defined so that
all threads die when main() exits... I'm confused. Perhaps we're being
imprecise with our usage of words like "dying" and "exiting"? Or is it
that what one sees at the application level can be different to what is
actually happening inside the pthreads library?)
- The "master thread" is a Valgrind concept. Currently, Valgrind keeps
it alive in order to do Valgrind's final processing. It starts off as
thread 1, but when fork() is called it can change in the children. The
life-extension can change the order in which threads terminate.
- The "manager thread" is present in LinuxThreads. It takes care of
program termination... it is always the last thread to exit in
LinuxThreads programs?
- wait() is a big complication -- if we're not careful, programs using
wait() will screw up. I'm not at all clear about which thread wait()
watches -- the main thread? I guess it's whichever pid it is passed, and
the trickiness here is that under Linux a pid is really a tid?
(Doesn't this mean that it's true that for any program, wait() would
return before the program is finished if any thread outlived the
watched-by-wait() thread?)
----
Is there any way to fix the wait() problem while keeping
last-one-out-turn-off-the-lights?
N
|
|
From: Jeremy F. <je...@go...> - 2005-03-19 17:18:27
|
Julian Seward wrote:
> We need 2 because 3 potentially deadlocks, and de-deadlocking it
> by kludging syscall handlers doesn't have a good feel to it.
>
>
BTW, getppid() used in the way that LinuxThreads does is the only
generally safe/correct way to poll for the existence of a process. Any
program which does so with kill(pid, 0), or ps, or anything else, is
inherently buggy because once a thread/process exits, there's nothing to
stop that pid from being reused. So I think getppid() is the only case
we need concern ourselves with.
J
|
|
From: Jeremy F. <je...@go...> - 2005-03-19 17:06:16
|
Julian Seward wrote:
>On further contemplation ...
>
>Maybe we need both 2 and 3, with 2 being the default and 3 selected
>by a flag:
>
> We need 3 because automated test suites (including ours) require
> that to work reliably.
>
> We need 2 because 3 potentially deadlocks, and de-deadlocking it
> by kludging syscall handlers doesn't have a good feel to it.
>
>
A big user-visible mode switch like this would be pretty complex to
implement and document, as well as being an ongoing maintenance burden.
Having a mode which is only used for regtests means that the other mode
will not be tested, despite being the mode we expect people to use
regularly. By comparison, the code needed to make getppid() do what old
LinuxThreads libraries expect is pretty simple.
>I thought also about restructuring the system so that there is
>one manager thread, which doesn't run any client code, and N
>worker threads, which do. The manager thread is the one that
>exists when V is started. It can wait for all the worker threads
>to exit before printing the final summary, but it does not delay
>any worker thread from exiting.
>
>That would solve the problem cleanly. But then it complicates
>signal routing; signals directed from outside to the manager
>thread would have to be rerouted to the first worker thread.
>
>
Yes, the signal routing problems are tough to handle.
It wouldn't solve this particular problem either, really. If the
application's main thread does a getppid(), then it would expect to see
its parent pid, not the manager pid; we would need to change getppid to
fake it (and of course getpid() would need changing too). We would need
to handle and redirect all the cases where a pid is passed to a syscall
(particularly where pids have a special role in relation to terminal
management and job control). Any kind of redirecting/virtualizing pids
is very complex.
There's also the problem of not knowing how to actually create a thread
on a given system; the ProxyLWP code had some hacks, but the current
code solves it cleanly by never trying to create its own threads; it
just does what the app asks it to do (and there's a corresponding
problem of not knowing how to wait for thread termination, since that
depends on how it was created).
J
|
|
From: Jeremy F. <je...@go...> - 2005-03-19 16:31:01
|
Julian Seward wrote:
>True, but only for poorly designed threaded programs in which the
>initial thread does not wait for its children with pthread_join
>before exiting. I'm happy to tell users that if they want to do this
>then results will appear asynchronously relative to main-thread exit.
>
>
No, it isn't something which is under the application's control.
pthreads is defined so that all threads die when main() returns anyway;
the library is responsible for cleaning up all the threads, but there
are no guarantees about what order they exit in.
Even if the program calls pthread_join() on all its threads, that still
doesn't control the exact order the kernel-level threads exit
(pthread_join returning just means the thread is no longer running
application code). The extreme - but quite common - case is where the
threads library keeps its kernel threads in a pool for recycling; in
this case the application thread lifetime has no relation to the
underlying thread lifetime.
>Besides, with (2) it's easy to emit a warning message saying "main
>thread has exited but children are alive; results will be delayed
>until the last child exits (and/or your program is ill-structured;
>fix it)".
>
>
I think that would be pushing a burden out to our users which is really
Valgrind's to bear. There's no inherent bug in programs in which the
main thread doesn't exit last.
J
|
|
From: Julian S. <js...@ac...> - 2005-03-19 12:10:57
|
> 2. produce output from whichever is the last thread to exit
> 3. make the main thread wait for all others to exit before producing
> output
On further contemplation ...
Maybe we need both 2 and 3, with 2 being the default and 3 selected
by a flag:
  We need 3 because automated test suites (including ours) require
  that to work reliably.
  We need 2 because 3 potentially deadlocks, and de-deadlocking it
  by kludging syscall handlers doesn't have a good feel to it.
-------
I thought also about restructuring the system so that there is
one manager thread, which doesn't run any client code, and N
worker threads, which do. The manager thread is the one that
exists when V is started. It can wait for all the worker threads
to exit before printing the final summary, but it does not delay
any worker thread from exiting.
That would solve the problem cleanly. But then it complicates
signal routing; signals directed from outside to the manager
thread would have to be rerouted to the first worker thread.
Basically it breaks the assumption that 'the pid of the subprocess
I just started is the same as the pid of its initial thread'. So
it's not really a solution. I guess there is value in mimicking
exactly the thread structure of programs running natively.
J
|
|
From: Julian S. <js...@ac...> - 2005-03-19 11:40:07
|
> >So the only solution is, once a thread asks to exit, we must let it
> >exit in finite time and without further reference to the state of any
> >other thread. Doing otherwise breaks liveness. And so that puts the
> >tin hat on the master-thread scheme.
>
> That's all true, as far as it goes, but it does open another problem:
> the output will appear at some arbitrary time after the main thread
> exits.
Yes, that's clear.
> This means that any external wait()er will think the process has
> completed before the output has appeared. In the simple case, your
> shell prompt may return before the Valgrind results have been output;
> more generally, any test infrastructure which incorporates valgrind
> (including make regtest) can't reliably know when the full results are
> available.
True, but only for poorly designed threaded programs in which the
initial thread does not wait for its children with pthread_join
before exiting. I'm happy to tell users that if they want to do this
then results will appear asynchronously relative to main-thread exit.
> 2 avoids this kind of internal deadlock, but we only know of one
> example, and it can be fixed by finessing the behaviour of getppid. The
> other likely mechanism for detecting a thread's existence is
> [t]kill(pid, 0)
These solutions add special-case checks that in the end increase our
maintenance and portability burden. We should move to option (2) and
see what happens. If the user base protests loudly enough we can then
consider a workaround, but we shouldn't dismiss (2) until we know how
it plays in practice.
Besides, with (2) it's easy to emit a warning message saying "main
thread has exited but children are alive; results will be delayed
until the last child exits (and/or your program is ill-structured;
fix it)".
J
|
|
From: Jeremy F. <je...@go...> - 2005-03-19 04:53:59
|
Julian Seward wrote:
>>OK, I think I've tracked it down. It seems to be a bug in getppid(), of
>>all things. In SuSE LinuxThreads, the manager thread seems to be using
>>getppid() to see if the main thread has exited already or not; if it has
>>getppid() will return 1, otherwise it will return a non-1 pid.
>>
>>
>
>On contemplation, I think the only proper fix to this is to move to a
>last-one-out-turn-off-the-lights scheme.
>
>Consider threads t1 and t2, where t1 is the master thread. t1 asks
>to exit, but is kept alive because t2 is still alive. t2 is waiting
>to observe t1's disappearance before proceeding. So we are
>deadlocked.
>
>No amount of messing with getppid will fix this properly. Reason is
>that t2 has an arbitrary number of ways of detecting whether t1 is
>still alive; getppid is just one of them. For example, t2 could be
>running 'ps' on the side and reading the results. So we can't possibly
>hope to intercept all those methods of observation.
>
>So the only solution is, once a thread asks to exit, we must let it
>exit in finite time and without further reference to the state of any
>other thread. Doing otherwise breaks liveness. And so that puts the
>tin hat on the master-thread scheme.
>
>
That's all true, as far as it goes, but it does open another problem:
the output will appear at some arbitrary time after the main thread exits.
This means that any external wait()er will think the process has
completed before the output has appeared. In the simple case, your
shell prompt may return before the Valgrind results have been output;
more generally, any test infrastructure which incorporates valgrind
(including make regtest) can't reliably know when the full results are
available.
So, my design choices were:
1. generate whatever output was available synchronously with thread
1's exit
2. produce output from whichever is the last thread to exit
3. make the main thread wait for all others to exit before producing
output
1 is an obvious non-starter.
So, between 2 and 3, 3 seemed like the most useful behaviour, and least
surprising to users.
2 avoids this kind of internal deadlock, but we only know of one
example, and it can be fixed by finessing the behaviour of getppid. The
other likely mechanism for detecting a thread's existence is
[t]kill(pid, 0), which sends no signal but returns an error if the
target pid doesn't exist. We could handle this too, if it were a
problem. Using ps to poll, etc, are all possible, but pretty unlikely;
there are lots of analogous cases where programs can pierce the
emulation layer by poking around in /proc, but in practice we've only
had to handle a couple of these cases.
J
|
|
From: Jeremy F. <je...@go...> - 2005-03-19 04:21:33
|
Julian Seward wrote:
>>OK, I think I've tracked it down. It seems to be a bug in getppid(), of
>>all things. In SuSE LinuxThreads, the manager thread seems to be using
>>getppid() to see if the main thread has exited already or not; if it has
>>getppid() will return 1, otherwise it will return a non-1 pid. In the
>>Valgrind case, the parent thread exists even after it has exited from
>>running application code, so the manager thinks it still exists. I'm
>>not quite sure what the sequence of events is, but I guess the manager
>>thinks that the main thread is going to shut everything down rather than
>>doing it itself. Or something.
>>
>>
>
>Well found. That sounds like a problem induced by the fact that
>under V the threads finish in a different order than they would normally
>due to the master thread staying alive until all its children have
>exited. Perhaps it would be more robust to use the scheme that the
>last thread to exit should produce all the output?
>
>
That's what I'm thinking about. The disadvantage is that if someone
else is wait()ing on the process, they might see the process terminate
before it has actually produced all its output (think of the regtest
machinery).
The other more hacky approach is to simply have a per-thread "I have
exited" flag; if getppid is about to return the pid of one of our
threads and it has that flag set, return 1 instead. I don't much like
that either.
>btw, the hang also happens on R H 7.3. So I guess it's not
>specific to SuSE's LinuxThreads.
>
>
Seems reasonable.
>>I haven't found a fix that I'm willing to apply at this point; I'd
>>rather ship 2.4.0 with this bug than apply a patch of more than a couple
>>of lines.
>>
>>
>
>OK. Unless anyone wildly objects, I'll make the final tarball, post
>announcements etc and we can ship a fix to this in 2.4.1.
>
I have a couple of things I haven't checked in yet. Well, I just
checked in the .spec file URL update, but I have a stabs parser fix
which I want to exercise a little first. Tomorrow (ie, in about 24 hours)?
J
|
|
From: Jeremy F. <je...@go...> - 2005-03-19 04:16:44
|
CVS commit by fitzhardinge: Update URL in rpm spec file.

M +1 -1 valgrind.spec.in 1.25

--- valgrind/valgrind.spec.in #1.24:1.25
@@ -5,5 +5,5 @@
 Epoch: 1
 License: GPL
-URL: http://valgrind.kde.org
+URL: http://www.valgrind.org/
 Group: Development/Debuggers
 Packager: Jeremy Fitzhardinge <je...@go...>
|
|
From: <js...@ac...> - 2005-03-19 04:03:05
|
Nightly build on phoenix (SuSE 9.1) started at 2005-03-19 03:50:00 GMT
Checking out source tree ... done
Configuring ... done
Building ... done
Running regression tests ... done
Last 20 lines of log.verbose follow
insn_mmx: valgrind ./insn_mmx
insn_mmxext: (skipping, prereq failed: ../../../tests/cputest x86-mmxext)
insn_sse: valgrind ./insn_sse
insn_sse2: (skipping, prereq failed: ../../../tests/cputest x86-sse2)
int: valgrind ./int
pushpopseg: valgrind ./pushpopseg
rcl_assert: valgrind ./rcl_assert
seg_override: valgrind ./seg_override
-- Finished tests in none/tests/x86 ------------------------------------
yield: valgrind ./yield
-- Finished tests in none/tests ----------------------------------------
== 201 tests, 5 stderr failures, 0 stdout failures =================
memcheck/tests/pth_once (stderr)
memcheck/tests/scalar (stderr)
memcheck/tests/threadederrno (stderr)
memcheck/tests/writev (stderr)
corecheck/tests/fdleak_fcntl (stderr)
make: *** [regtest] Error 1
|
|
From: Julian S. <js...@ac...> - 2005-03-19 03:35:40
|
> OK, I think I've tracked it down. It seems to be a bug in getppid(), of
> all things. In SuSE LinuxThreads, the manager thread seems to be using
> getppid() to see if the main thread has exited already or not; if it has
> getppid() will return 1, otherwise it will return a non-1 pid.
On contemplation, I think the only proper fix to this is to move to a
last-one-out-turn-off-the-lights scheme.
Consider threads t1 and t2, where t1 is the master thread. t1 asks to
exit, but is kept alive because t2 is still alive. t2 is waiting to
observe t1's disappearance before proceeding. So we are deadlocked.
No amount of messing with getppid will fix this properly. Reason is that
t2 has an arbitrary number of ways of detecting whether t1 is still
alive; getppid is just one of them. For example, t2 could be running
'ps' on the side and reading the results. So we can't possibly hope to
intercept all those methods of observation.
So the only solution is, once a thread asks to exit, we must let it exit
in finite time and without further reference to the state of any other
thread. Doing otherwise breaks liveness. And so that puts the tin hat on
the master-thread scheme.
J
|
|
From: Tom H. <to...@co...> - 2005-03-19 03:28:43
|
Nightly build on dunsmere ( Fedora Core 3 ) started at 2005-03-19 03:20:03 GMT
Checking out source tree ... done
Configuring ... done
Building ... done
Running regression tests ... done
Last 20 lines of log.verbose follow
insn_cmov: valgrind ./insn_cmov
insn_fpu: valgrind ./insn_fpu
insn_mmx: valgrind ./insn_mmx
insn_mmxext: valgrind ./insn_mmxext
insn_sse: valgrind ./insn_sse
insn_sse2: (skipping, prereq failed: ../../../tests/cputest x86-sse2)
int: valgrind ./int
sh: line 1: 15330 Segmentation fault VALGRINDLIB=/tmp/valgrind.22607/valgrind/.in_place /tmp/valgrind.22607/valgrind/./coregrind/valgrind --command-line-only=yes --memcheck:leak-check=no --addrcheck:leak-check=no --tool=none ./int >int.stdout.out 2>int.stderr.out
pushpopseg: valgrind ./pushpopseg
rcl_assert: valgrind ./rcl_assert
seg_override: valgrind ./seg_override
-- Finished tests in none/tests/x86 ------------------------------------
yield: valgrind ./yield
-- Finished tests in none/tests ----------------------------------------
== 207 tests, 2 stderr failures, 0 stdout failures =================
memcheck/tests/scalar (stderr)
memcheck/tests/scalar_supp (stderr)
make: *** [regtest] Error 1
|
From: Tom H. <th...@cy...> - 2005-03-19 03:24:12
|
Nightly build on audi ( Red Hat 9 ) started at 2005-03-19 03:15:02 GMT
Checking out source tree ... done
Configuring ... done
Building ... done
Running regression tests ... done
Last 20 lines of log.verbose follow
cpuid: valgrind ./cpuid
dastest: valgrind ./dastest
fpu_lazy_eflags: valgrind ./fpu_lazy_eflags
insn_basic: valgrind ./insn_basic
insn_cmov: valgrind ./insn_cmov
insn_fpu: valgrind ./insn_fpu
insn_mmx: valgrind ./insn_mmx
insn_mmxext: valgrind ./insn_mmxext
insn_sse: valgrind ./insn_sse
insn_sse2: (skipping, prereq failed: ../../../tests/cputest x86-sse2)
int: valgrind ./int
pushpopseg: valgrind ./pushpopseg
rcl_assert: valgrind ./rcl_assert
seg_override: valgrind ./seg_override
-- Finished tests in none/tests/x86 ------------------------------------
yield: valgrind ./yield
-- Finished tests in none/tests ----------------------------------------
== 206 tests, 0 stderr failures, 0 stdout failures =================
|
From: Tom H. <th...@cy...> - 2005-03-19 03:16:39
|
Nightly build on ginetta ( Red Hat 8.0 ) started at 2005-03-19 03:10:02 GMT
Checking out source tree ... done
Configuring ... done
Building ... done
Running regression tests ... done
Last 20 lines of log.verbose follow
insn_basic: valgrind ./insn_basic
insn_cmov: valgrind ./insn_cmov
insn_fpu: valgrind ./insn_fpu
insn_mmx: valgrind ./insn_mmx
insn_mmxext: valgrind ./insn_mmxext
insn_sse: valgrind ./insn_sse
insn_sse2: (skipping, prereq failed: ../../../tests/cputest x86-sse2)
int: valgrind ./int
pushpopseg: valgrind ./pushpopseg
rcl_assert: valgrind ./rcl_assert
seg_override: valgrind ./seg_override
-- Finished tests in none/tests/x86 ------------------------------------
yield: valgrind ./yield
-- Finished tests in none/tests ----------------------------------------
== 205 tests, 2 stderr failures, 0 stdout failures =================
memcheck/tests/pth_once (stderr)
memcheck/tests/threadederrno (stderr)
make: *** [regtest] Error 1
|
From: Tom H. <th...@cy...> - 2005-03-19 03:15:33
|
Nightly build on standard ( Red Hat 7.2 ) started at 2005-03-19 03:00:02 GMT
Checking out source tree ... done
Configuring ... done
Building ... done
Running regression tests ... done
Last 20 lines of log.verbose follow
insn_mmx: valgrind ./insn_mmx
insn_mmxext: valgrind ./insn_mmxext
insn_sse: valgrind ./insn_sse
insn_sse2: (skipping, prereq failed: ../../../tests/cputest x86-sse2)
int: valgrind ./int
pushpopseg: valgrind ./pushpopseg
rcl_assert: valgrind ./rcl_assert
seg_override: valgrind ./seg_override
-- Finished tests in none/tests/x86 ------------------------------------
yield: valgrind ./yield
-- Finished tests in none/tests ----------------------------------------
== 205 tests, 5 stderr failures, 0 stdout failures =================
memcheck/tests/leak-tree (stderr)
memcheck/tests/pth_once (stderr)
memcheck/tests/threadederrno (stderr)
memcheck/tests/vgtest_ume (stderr)
addrcheck/tests/leak-tree (stderr)
make: *** [regtest] Error 1
|
From: Tom H. <th...@cy...> - 2005-03-19 03:11:37
|
Nightly build on alvis ( Red Hat 7.3 ) started at 2005-03-19 03:05:02 GMT
Checking out source tree ... done
Configuring ... done
Building ... done
Running regression tests ... done
Last 20 lines of log.verbose follow
== 205 tests, 16 stderr failures, 0 stdout failures =================
memcheck/tests/addressable (stderr)
memcheck/tests/describe-block (stderr)
memcheck/tests/distinguished-writes (stderr)
memcheck/tests/leak-0 (stderr)
memcheck/tests/leak-cycle (stderr)
memcheck/tests/leak-regroot (stderr)
memcheck/tests/leak-tree (stderr)
memcheck/tests/match-overrun (stderr)
memcheck/tests/pointer-trace (stderr)
memcheck/tests/pth_once (stderr)
memcheck/tests/threadederrno (stderr)
memcheck/tests/vgtest_ume (stderr)
addrcheck/tests/leak-0 (stderr)
addrcheck/tests/leak-cycle (stderr)
addrcheck/tests/leak-regroot (stderr)
addrcheck/tests/leak-tree (stderr)
make: *** [regtest] Error 1
|
From: Julian S. <js...@ac...> - 2005-03-19 02:00:30
|
> OK, I think I've tracked it down. It seems to be a bug in getppid(), of
> all things. In SuSE LinuxThreads, the manager thread seems to be using
> getppid() to see if the main thread has exited already or not; if it has
> getppid() will return 1, otherwise it will return a non-1 pid. In the
> Valgrind case, the parent thread exists even after it has exited from
> running application code, so the manager thinks it still exists. I'm
> not quite sure what the sequence of events is, but I guess the manager
> thinks that the main thread is going to shut everything down rather than
> doing it itself. Or something.

Well found. That sounds like a problem induced by the fact that under V
the threads finish in a different order than they would normally, due to
the master thread staying alive until all its children have exited.
Perhaps it would be more robust to use the scheme that the last thread
to exit should produce all the output?

btw, the hang also happens on Red Hat 7.3, so I guess it's not specific
to SuSE's LinuxThreads.

> I haven't found a fix that I'm willing to apply at this point; I'd
> rather ship 2.4.0 with this bug than apply a patch of more than a couple
> of lines.

OK. Unless anyone wildly objects, I'll make the final tarball, post
announcements etc. and we can ship a fix to this in 2.4.1.

J
|
From: Jeremy F. <je...@go...> - 2005-03-19 00:12:56
|
Julian Seward wrote:
> Great, thanks.
>
OK, I think I've tracked it down. It seems to be a bug in getppid(), of
all things. In SuSE LinuxThreads, the manager thread seems to be using
getppid() to see if the main thread has exited already or not; if it has
getppid() will return 1, otherwise it will return a non-1 pid. In the
Valgrind case, the parent thread exists even after it has exited from
running application code, so the manager thinks it still exists. I'm
not quite sure what the sequence of events is, but I guess the manager
thinks that the main thread is going to shut everything down rather than
doing it itself. Or something.
I haven't found a fix that I'm willing to apply at this point; I'd
rather ship 2.4.0 with this bug than apply a patch of more than a couple
of lines.
J
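[Editorial note] The getppid()-based liveness check Jeremy describes can be sketched as a minimal, hypothetical Python demo (POSIX-only; the names `watch` and `demo` are invented for illustration and are not Valgrind or LinuxThreads code). A watcher process polls getppid() to decide whether its parent has exited; once the parent dies the watcher is reparented, so getppid() stops returning the old parent's pid.

```python
import os
import time

def watch(parent_pid, write_fd):
    """'Manager' stand-in: poll getppid() to detect the parent's exit.

    While the parent lives, getppid() returns its pid; once it dies the
    watcher is reparented (classically to init, pid 1), so getppid()
    returns something else. Under Valgrind 2.x the parent never appeared
    to die, so a loop like this would spin forever -- the reported hang.
    """
    while os.getppid() == parent_pid:
        time.sleep(0.01)
    os.write(write_fd, b"parent-exited")
    os._exit(0)

def demo():
    r, w = os.pipe()
    a = os.fork()
    if a == 0:
        # "Main thread" stand-in: spawn the watcher, then exit at once.
        me = os.getpid()
        if os.fork() == 0:
            watch(me, w)
        os._exit(0)
    os.close(w)                # keep only the read end in the observer
    os.waitpid(a, 0)           # reap the intermediate process
    return os.read(r, 64).decode()

if __name__ == "__main__":
    print(demo())
```

The sketch checks only that getppid() stops returning the old parent's pid, rather than asserting the new value is 1, since under a subreaper or PID namespace the orphan may be reparented to a process other than init.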
|