From: Julian S. <js...@ac...> - 2012-06-17 11:08:59
On Friday, June 15, 2012, Philippe Waroquiers wrote:

> The validation and subsequent reg-test is effectively critical.
> On that side, we have two aspects:
> * are the outer tools (helgrind/drd) ok ?
>   From what I can see, these tools are working quite well
>   on this problem. There are a few quirks (I will commit today
>   a small change to help outer helgrind doing early stack traces).
>   At the moment, two main things to look at:
>   1. how to ensure the outer tool is suppressing the errors
>      of the guest run by the inner
>      (e.g. a race condition in glibc "printf").
>      (at least, we need a way to filter the races which seem
>      to originate from the guest process). Otherwise, we
>      have hundreds of incomprehensible races when running the
>      none tests.

glibc printf doesn't really have races. The problem is that glibc
uses its own inline locking for files, and hg/drd can't see that,
so it looks like there is a race.

> Maybe we need the inner to inform the outer of the relationship
> between JIT code and the guest code, so as to let the outer
> apply suppression rules on the JITted guest.

Yes, we have a patch for that. Maybe we should clean it up and
apply it.

> 2. are the tools properly understanding the inner behaviour ?
>    E.g. it might be that the clone syscall will have to be
>    understood by the outer helgrind.

Yes, we need to make this work properly. We will have to be a bit
careful (when explaining to the tools that sys_clone creates a h-b
edge between parent and child), because apps that create threads
using pthread_create already get an edge in the pthread_create
wrapper; pthread_create calls clone (obviously), so if we naively
annotate clone we will end up with 2 mutually redundant h-b edges.
Maybe it doesn't matter. There is a flag "--sim-hints=enable-outer"
that we could perhaps use to select the required behaviour at run
time.

> * Assuming the outer tools are ok : do we have enough regression
>   tests to find the race conditions ?
No we don't -- as you say, most of the tests are non-threaded. Many
of the tests exercise low-frequency (error-case etc) paths in the
code base. The danger is that we need MT (hg/drd) coverage of these,
but almost all of them are single threaded.

> On this side, I vaguely thought of the possibility to automatically
> compile all the tests as a .so (transforming the main into a .so
> entry point with a "sed"), and then have a way to run a bunch of
> these .so together in parallel.
> These tests would not be ok as functional tests (they would very
> probably fail) but they might be useful in an outer helgrind/inner
> setup to find race conditions.

This could be a good thing -- probably some version of it is
necessary. However, imagine a test that exercises some low-frequency
path, and no other test checks the same path. Then the
big-.so-on-Helgrind approach will not detect races in that path
(IIUC). Another option might be to make each test individually
multithreaded (by running N copies of main() in parallel). This
would give us a much better chance of hitting these low-frequency
paths.

> On the TT/TC, I think we have two remaining problems:
>
> 1. race condition on the tt_fast. For this, I hope the "xor"
>    technique can be shown (proved?) as correct.

I should have commented on the xor thing before. Basically I didn't
understand it. My problem is that (G xor H, H) contains neither less
nor more information than (G, H) -- given any 2 of G, H, G xor H, we
can compute the third one. So I don't see how it helps. Also, I
didn't see any mention of memory barriers/fences in the description,
so it must be incomplete (?), since we can't specify lockless
algorithms without also saying where the fences must be.

> > Maybe Kim Hazelwood's PhD thesis contains some useful clues on
> > this? http://www.cs.virginia.edu/kim/docs/phd-thesis.pdf
>
> Quickly looked at it.
> On the thread side, the main reasoning
> is that the threads are not sharing a lot of code, and so
> a shared cache is not such a good idea. To me, it looks
> somewhat counter-intuitive that there is very little sharing
> between threads.

Yes, I agree. They are going to share libc stuff, plus supporting
library code (X, Qt, etc).

> > * Memcheck. I suspect we will need to redesign and reimplement
> >   the shadow memory stuff (mc_main.c). It has many optimisations
> >   both to increase speed and reduce memory consumption
> >   (http://valgrind.org/docs/shadow-memory2007.pdf) and
> >   I believe not all of them will work in a MT environment.
> >
> >   The IR-level instrumentation stuff (mc_translate.c) should be
> >   pretty much unchanged.
> >
> >   IOW, Memcheck's helper functions are (very) racey, but the
> >   generated instrumentation code is not (I think).
> >
> >   One comment is (as you speculate) that we cannot afford to
> >   do atomic operations for each helperc_* call -- those are
> >   called once per memory reference of the original program,
> >   and this would kill performance.
>
> On this, I am wondering if we really have a huge problem
> for the "fast memcheck case".
> Basically, as long as the memory is < 32 Gb and the bytes
> are fully initialised, V and A bits can (I believe) be maintained
> without any locking/protection: if the guest process has no
> race condition, then modifying V and A bits will also not be
> racey (is this really true ?

I think this is probably true, though not 100% sure. For sure we can
never do a 100% accurate simulation of compare-and-swap, since the
shadow-CAS and the real CAS are necessarily non-atomic. But maybe
this doesn't matter much.

> 1 byte shadows 4 bytes. So
> 4 threads each modifying one byte might not be racey, but
> the underlying V/A byte will be shared and racey).

This is the main problem I identified when looking at it. Yes.
> If we cannot use atomic instructions and the above is not ok,
> I suspect we will have to go to a "one byte shadows one byte".

Problem with that is that the current scheme is a big performance
win because it generates fewer cache misses. /me really doesn't
want to return to a 1:1 scheme.

> The pointers to the secondary maps could be maintained with
> a reader-lockless approach (RCU, hazard pointers, ...).
> (Reader-lockless just means the pointers themselves are not
> often changed. The pointed-to data (the SecMap) is of course
> changed a lot.)

ok. (if you say so!)

J