From: Philippe W. <phi...@sk...> - 2012-06-17 13:44:01
On Sun, 2012-06-17 at 13:08 +0200, Julian Seward wrote:
> > > * Memcheck.  I suspect we will need to redesign and reimplement
> > > the shadow memory stuff (mc_main.c).  It has many optimisations
> > > both to increase speed and reduce memory consumption
> > > (http://valgrind.org/docs/shadow-memory2007.pdf) and
> > > I believe not all of them will work in a MT environment.
> > >
> > > The IR-level instrumentation stuff (mc_translate.c) should be
> > > pretty much unchanged.
> > >
> > > IOW, Memcheck's helper functions are (very) racey, but the
> > > generated instrumentation code is not (I think).
> > >
> > > One comment is (as you speculate) that we cannot afford to
> > > do atomic operations for each helperc_* call -- those are
> > > called once per memory reference of the original program,
> > > and this would kill performance.
> >
> > On this, I am wondering if we really have a huge problem
> > for the "fast memcheck case".
> > Basically, as long as the memory is < 32 Gb and the bytes
> > are fully initialised, V and A bits can (I believe) be maintained
> > without any locking/protection: if the guest process has no
> > race condition, then modifying V and A bits will also not be
> > racey (is this really true ?
>
> I think this is probably true, not 100% sure though.  For sure we
> can never do 100% accurate simulation of compare-and-swap since
> the shadow-CAS and real-CAS are necessarily non-atomic.  But maybe
> this doesn't matter much.
>
> > 1 byte shadows 4 bytes.  So
> > 4 threads modifying each one byte might not be racey, but
> > the underlying V/A byte will be shared and racey).
>
> This is the main problem I identified, when looking at it.  Yes.
>
> > If we cannot use atomic instruction and the above is not ok,
> > I suspect we will have to go to a "one byte shadows one byte".
>
> Problem with that is, the current scheme is a big performance win
> because it generates fewer cache misses.  /me really doesn't want
> to return to a 1:1 scheme.
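To make the shared-shadow-byte problem concrete, here is a minimal sketch of the compressed "fast case" encoding being discussed: one shadow byte holds the 2-bit VA state of 4 adjacent guest bytes. The names (set_vabits2, VA_BITS2_*) are illustrative approximations of mc_main.c, not the exact identifiers. The point is that updating one guest byte's state is a read-modify-write of a byte shared with 3 neighbouring guest bytes, which is exactly where two race-free guest threads can still race on the shadow:

```c
#include <stdint.h>

/* Sketch (assumed encoding): 2 VA bits per guest byte,
   4 guest bytes packed into one shadow byte. */
#define VA_BITS2_NOACCESS 0x0
#define VA_BITS2_DEFINED  0x2

static uint8_t shadow[1];   /* covers guest addresses 0..3 */

static void set_vabits2(uintptr_t a, uint8_t vabits2)
{
    unsigned shift = (unsigned)(a & 3) * 2;   /* slot within the byte  */
    uint8_t b = shadow[a >> 2];               /* read                  */
    b &= (uint8_t)~(0x3u << shift);           /* modify: clear 2 bits  */
    b |= (uint8_t)(vabits2 << shift);
    shadow[a >> 2] = b;                       /* write: a concurrent
                                                 update to a sibling
                                                 slot can be lost here */
}
```

So even if 4 threads each write only their own guest byte (no guest-level race), all 4 helperc calls do unsynchronised read-modify-writes of the same shadow byte.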
I experimented a little bit with memcheck. Finding a solution for the VA
bits will be a real challenge. I did several trials (for the LOAD/STORE
helperc only, as these are the critical ones, and only for the fast cases):

* mechanically take/release a mutex in every LOAD/STORE helperc
  (this was just for fun / for reference / to make ourselves despair :).
  The perf is (of course) horrible:
     bz2 is about 4 times slower
     big_parallel_sleepers is xxx times slower (due to contention)
  Totally unusable, as expected.

* just have an atomic instruction before/after the LOAD/STORE
  (so basically inlining the mutex fast path):
     bz2 is more than 2 times slower (did not measure
     big_parallel_sleepers, but its perf is for sure horrible, see below).

* just one atomic instruction per LOAD/STORE (so assuming we can have
  a lock-less algorithm with one atomic instruction):
     bz2 is 55% slower

* just one atomic instruction per STORE (assuming LOADs do not need
  atomicity if the guest process has no race condition; this looks like
  a reasonable assumption):
     bz2 is 4% slower
     big_parallel_sleepers is still horribly slower
     (3 times in real time, 10 times in cpu)
  It is unclear to me why bz2 is not slowed down more: can the cpu
  detect that no other thread can access the memory and so do "less
  atomic work" for bz2? Or is it because big_parallel_sleepers is a
  horrible test case? (it is just the bigcode1 perf test, but running
  4 bigcode1 in parallel).

* just one atomic instruction per STORE8 (assuming the other
  STORE16/32/64 do not need atomicity if the guest process is race free):
     bz2 is about 3% slower
     big_parallel_sleepers speed is acceptable
     (15% more cpu, real time divided by 2)
  I am however not at all convinced that only protecting STORE8 will
  avoid false positives (we could maybe tolerate a small rate of false
  negatives when running in parallel).
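For reference, the "one atomic instruction per STORE" variant can be sketched with C11 atomics as a CAS loop on the shared shadow byte: in the uncontended case this costs one lock-prefixed instruction per store, and it only retries if another thread raced us on a sibling slot. This is my reading of the experiment, not the actual patch; names are illustrative:

```c
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint8_t shadow_atomic[1];  /* covers guest addresses 0..3 */

/* Lock-less update of one 2-bit slot: typically a single atomic
   instruction per STORE; the loop retries only under contention. */
static void set_vabits2_atomic(uintptr_t a, uint8_t vabits2)
{
    unsigned shift = (unsigned)(a & 3) * 2;
    _Atomic uint8_t *p = &shadow_atomic[a >> 2];
    uint8_t old = atomic_load_explicit(p, memory_order_relaxed);
    uint8_t desired;
    do {
        desired = (uint8_t)((old & ~(0x3u << shift)) | (vabits2 << shift));
    } while (!atomic_compare_exchange_weak_explicit(
                 p, &old, desired,
                 memory_order_relaxed, memory_order_relaxed));
}
```

Unlike the plain read-modify-write, concurrent updates to sibling slots of the same shadow byte cannot be lost here; the cost is the lock-prefixed CAS on every store, which is what shows up in the bz2 / big_parallel_sleepers numbers above.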
Even if protecting the STORE8 is good enough, any parallel algorithm
that works a lot with single bytes might be slowed down by a factor of
10 or similar. So, at this stage, it is unclear to me that there is any
approach other than going for a 1-1 mapping for a parallel memcheck.

Now, similarly to the "with origin" / "without origin" split, we might
have "with_1_1_mapping" / "without_1_1_mapping" variants of the
STORE/LOAD helperc (not necessarily linked to the scheduler policy: it
might be that a 1-1 mapping could be faster in some cases than the
current scheme even for the "serial" scheduler policy).

I start to believe that tt_fast and the transtab are "relatively easy"
to parallelise (the prototype is already more or less ok; with a little
bit of fine tuning, it should be acceptable), but that the real
challenges are:
* memcheck VA bits is the real performance challenge
* proper testing of all that

Philippe
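For completeness, the attraction of the 1-1 mapping is that it removes the sharing entirely: with one shadow byte per guest byte, a guest STORE8 to address a updates exactly one shadow byte that no other race-free guest thread can touch, so a plain store suffices and no atomic instruction is needed. A minimal sketch (hypothetical helper name, toy 4-byte address space):

```c
#include <stdint.h>

/* 1-1 mapping sketch: one shadow byte per guest byte. */
static uint8_t shadow_1to1[4];

/* A guest STORE8 to address a updates only shadow_1to1[a]; if the
   guest program is race-free, these shadow writes are race-free too,
   so a plain (non-atomic) store is enough. */
static void store8_helperc(uintptr_t a, int defined)
{
    shadow_1to1[a] = defined ? 0xFF : 0x00;
}
```

The trade-off is the one Julian points out: 4x the shadow memory of the compressed scheme and correspondingly more cache misses, which is why it might make sense only as a separate helperc variant rather than a wholesale replacement.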