From: Jeremy F. <je...@go...> - 2003-05-26 16:28:01
On Mon, 2003-05-26 at 02:42, Josef Weidendorfer wrote:
> currently I'm thinking a little bit about what would be needed to allow
> applications running under Valgrind to use processors in parallel. The main
> goal would be to speed up cache simulation for multithreaded applications,
> more specifically to first let OpenMP apps (number crunching) run
> simultaneously. I'm not at all convinced that there will be any
> benefit/speedup at all on multiple processors, because of a possible need
> for additional fine-grained communication among the threads.
I've been thinking about this too. I think the real difficulty is in
the skins rather than the core. The core only has occasional, relatively
large chunks of work to do (i.e., translate a basic block). It would be
fairly easy to make a translation hash miss do the right thing (you'd
probably do it locklessly, so that the hash update is an atomic
operation; if two threads happen to want the same basic block at the
same time, they'd both translate it, but one would win and the other's
translation would be thrown away).
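That lockless "both translate, one wins" scheme can be sketched roughly
like this (a minimal illustration using C11 atomics; the slot layout and
function names are invented for the example, not Valgrind's actual
translation-table code):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* One translation-cache slot for illustration: NULL until some thread
   installs a translated-code pointer for the basic block. */
static _Atomic(void *) tt_slot;

/* Both racing threads translate independently; only one CAS wins, and
   the loser frees its redundant translation and uses the winner's. */
static void *install_translation(void *fresh)
{
    void *expected = NULL;
    if (atomic_compare_exchange_strong(&tt_slot, &expected, fresh))
        return fresh;        /* we won the race */
    free(fresh);             /* someone beat us; discard our copy */
    return expected;         /* CAS left the winner's pointer here */
}
```

No lock is ever taken; the only cost of the race is one wasted
translation, which is rare and cheap relative to a mutex on every miss.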
The real problem is that the skins want to do data structure updates on
an instruction by instruction level. Nick mentioned memcheck; helgrind
is an even more extreme example, since it actually cares a lot about the
program's precise thread and lock behaviour, and what threads touch
which memory in what order.
The only reasonable way I can see to implement it would be to generate
inline atomic operations, rather than using mutexes. Unfortunately I
think that would still have an extreme amount of overhead; probably
enough to overwhelm any possible performance benefit of multiple CPUs.
There would also be some memory overhead, since you'd have to include
space for locks in the data, though you could choose the lock density as
a tradeoff between memory use and concurrency.
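As a rough sketch of what inline atomic metadata updates would look like
(hypothetical names; a real skin like memcheck keeps far richer per-byte
state than a single validity bit):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical per-byte "defined" bitmap, one bit per guest byte,
   updated with inline atomic RMW instead of mutex/modify/unlock. */
#define MAP_BYTES 1024
static _Atomic uint8_t vbits[MAP_BYTES / 8];

static void mark_defined(unsigned addr)
{
    /* an atomic OR per simulated store -- cheap per operation, but
       done on every instruction it adds up fast */
    atomic_fetch_or(&vbits[addr >> 3], (uint8_t)(1u << (addr & 7)));
}

static int is_defined(unsigned addr)
{
    return (atomic_load(&vbits[addr >> 3]) >> (addr & 7)) & 1;
}
```

Even this lock-free version pays for a bus-locked operation on every
simulated memory access, which is the overhead worry above.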
A much more complex, but perhaps more efficient, way to do it would be to
make all skins which care about keeping per-byte memory metadata behave
more like helgrind. That is, have the skin classify heap memory as
being "per-thread" or "shared". Per-thread memory+metadata could be
handled without any locking. As soon as another thread touches it, you
would convert it to "shared", which requires locked access. Handling
this transition would be tricky, as would handling the codegen issues
(would you generate all memory accesses as if they could be shared, or
would you regenerate those memory accesses which turn out to be
shared?). The problem with this approach, like so many other possible
"optimisations" for Valgrind, is that the overhead of all the
bookkeeping could easily remove any benefit.
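The exclusive-to-shared transition might look something like this (a
hedged sketch with invented names; the real codegen issues and race
handling would be much hairier than one CAS):

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical ownership word per heap block: holds the owning thread
   id while EXCLUSIVE, or SHARED once a second thread touches it. */
enum { SHARED = -1 };

typedef struct { _Atomic int owner; } BlockState;

/* Returns 0 if the access can use the fast unlocked path, 1 if it
   needs the locked (shared) path. */
static int note_access(BlockState *b, int tid)
{
    int cur = atomic_load(&b->owner);
    if (cur == tid)    return 0;   /* still exclusive to us: no locking */
    if (cur == SHARED) return 1;   /* already demoted: locked access */
    /* a second thread has arrived: try to demote to SHARED */
    if (atomic_compare_exchange_strong(&b->owner, &cur, SHARED))
        return 1;
    return cur != tid;  /* lost a race; cur now holds the real state */
}
```

The hard part the sketch hides is migrating the metadata itself (and
possibly the generated code) safely at the moment of demotion.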
Of course, skins which don't keep per-byte metadata about memory
(cachegrind, vgprof, etc) can just keep everything per-thread and
reconcile at the end.
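For those skins the scheme really is trivial (illustrative sketch,
invented names):

```c
#include <assert.h>

#define MAX_THREADS 4

/* Hypothetical per-thread cache-miss counters: each simulated thread
   bumps its own slot with no locking at all; the totals are summed
   exactly once, at program exit. */
static unsigned long misses[MAX_THREADS];

static void count_miss(int tid)
{
    misses[tid]++;              /* private slot: no atomics needed */
}

static unsigned long reconcile(void)
{
    unsigned long total = 0;
    for (int i = 0; i < MAX_THREADS; i++)
        total += misses[i];
    return total;
}
```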
The other killer is that it would make writing both Valgrind itself and
its skins a lot more complex. Valgrind is hard enough to get right as it is;
adding concurrency would simply make the tool itself a lot less
trustworthy.
> * Signal handling?
Signal handling+threads = <shudder>
> * What's with Valgrinds version of the pthread library? Do you think that it's
> a big task to make this reentrant-safe? Or perhaps we even could get rid of
> our own implementation?
I think if we do go down this path, then we can take a step back. Rather
than emulating threading at the pthread level, we can emulate it at the
clone system call level, and thereby allow any user-space pthreads
implementation. We may still want to intercept pthreads library calls
so that we have a better idea of what the program is actually trying to
achieve.
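To illustrate the idea (a toy wrapper, not Valgrind code): on Linux
every thread library must eventually issue clone() to create a kernel
thread, so a tool interposing at that level observes all thread
creation regardless of which pthreads implementation sits above it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>

static int threads_seen;   /* hypothetical tool-side bookkeeping */

static int child_fn(void *arg) { (void)arg; return 0; }

/* Illustrative interposition point: record the creation, then forward
   to the real clone() system call unchanged. */
static int traced_clone(int (*fn)(void *), void *stack_top,
                        int flags, void *arg)
{
    threads_seen++;
    return clone(fn, stack_top, flags, arg);
}
```

The pthreads-level interception mentioned above would then be purely
informational, not load-bearing.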
J