From: Philippe W. <phi...@sk...> - 2012-06-13 18:38:40
Below is a text describing the prototype/problems/questions/... on how to make Valgrind run threads in parallel. The text and the related patches have been attached to bug https://bugs.kde.org/show_bug.cgi?id=301830 The idea is that very soon this text will be put in SVN docs/internals. It is initially inlined in this mail to trigger discussions/comments. Updates of the text will then be tracked under SVN.

Philippe

----------------------------------------------------------------------------

Valgrind (3.8.0) supports multi-threaded applications, but schedules only one thread at a time. In other words, a multi-threaded application running under Valgrind will not benefit from multiple CPUs.

A prototype has been developed of a "really" multi-threaded Valgrind (mtV). The prototype is made of ugly and/or inefficient kludges/trials/... Now, you have been warned that you will encounter horrors. The only objective of this prototype is to understand the problems of doing an mtV, and to see what the possible approaches and possible performance gains are. The current state of the patches and/or test results is maintained in bugzilla, in bug https://bugs.kde.org/show_bug.cgi?id=301830

The document starts with a description of the current behaviour and structure of Valgrind. It then contains a description of the prototype developed, the techniques used, and the problems still open.

The document below has several sections. Section titles start with a #. The first section is some general information, which might be skipped by most V developers.
# The 3 Valgrind "layers":
==========================

 ------------------------------------------------------------------------
 |                                  |                                   |
 |       TOOL "runtime code"        |       "instrument function"       |
 ------------------------------------------------------------------------
          |                                  |
          |          GUEST JIT code          |
          |   (generated/instrumented code)  |
 --------------------------------------------------------------------
 |                                                                  |
 |                           CORE layer                             |
 --------------------------------------------------------------------

CORE layer: this contains the V framework & modules used to support the TOOL code.

TOOL layer: made of two big parts.
1. The instrument function is called by the CORE layer. It instruments the code of the executable running under V, small blocks at a time, producing "translated and instrumented" small pieces of JIT code. These pieces of JIT code are called by the scheduler contained in the CORE layer.
2. The TOOL "runtime code" (C helper functions and data structures needed by the GUEST JIT code). For example, for memcheck, it contains the data structures to track the allocated and freed memory.

GUEST layer: contains the GUEST JIT: this is the process guest code translated and instrumented by the TOOL instrument function.

The typical control flow is:
1. CORE layer calls the "instrument function", producing instrumented GUEST JIT code.
2. CORE layer calls the generated GUEST JIT code.
3. The GUEST JIT code runs. It will typically do many calls to the tool runtime code.

Note: the none tool is special: its "instrument function" does not do any code transformation, and the none tool has no runtime code.

The tool runtime code can itself do calls to the CORE layer. Example: the guest process code is:

   char *s;
   s = malloc (10);
   strcpy(s, "abc");
   (*fnptr)();

The following sequence will happen:
1. CORE will decode the guest process code.
2. CORE will call the TOOL function to instrument this decoded code. For memcheck, the instrumented code will be made of:
   * a call to the memcheck malloc replacement function (this replacement function is part of memcheck runtime code; it a.o. tracks the allocated memory to allow leak search).
   * an assignment of the result of the malloc replacement to s
   * a call to a tool C helper to indicate that s (the ptr) is now initialized
   * a call to the function strcpy
   * a (dynamic) call to the function pointed to by fnptr.
3. CORE will translate the instrumented code to executable code (JIT) and will store this code in a data structure (the translation table). Basically, the translation table is a mapping between a guest code address and the corresponding translated JIT code.
4. The CORE scheduler will call the JIT code (the just translated piece of code).
5. The JIT code does the following:
   * call the tool malloc replacement function, which does:
     * call the CORE code to allocate a block of memory (10 bytes + some admin overhead)
     * call the CORE code to compute (and store) a stack trace of the guest code call stack
     * call the CORE code to store in a hash table the address of the allocated block + the stack trace reference
     * maintain in a memcheck data structure that the 10 bytes are accessible but not initialized.
   * the JIT code should normally call strcpy. Because strcpy is not translated yet, the call cannot jump directly to translated code. So, the JIT code executes a return, giving back the control to the CORE scheduler.
6. The scheduler will translate the strcpy function (generating a new piece of instrumented JIT code) and add it in the translation table. Possibly, the table is full; then old translations have to be removed from the table.
7. The scheduler then calls the new JIT code. The JIT code will execute the strcpy instrumented JIT code (this code will typically do multiple calls to the tool runtime code to indicate that the first 4 bytes pointed to by s are now initialized).
8. The JIT code will search in the translation table for the translated piece of code corresponding to *fnptr. If it exists, it is called. Otherwise, control is given back to the scheduler for instrumentation.

For a typical tool and guest application being executed, most of the time is spent in GUEST JIT code, in the TOOL runtime code and in the CORE code. It is deemed that the time spent in the TOOL instrument function is not a major part: usually, a translated piece of code is executed often.

If we have an application containing multiple threads, there is no (to be more precise, almost no) parallel execution: at any moment in time, there is at most one thread which "really" executes some code. This thread can be busy either in the CORE layer (possibly calling the TOOL instrument function), or in GUEST JIT code, or in the TOOL runtime code.

This "single thread busy" model is needed because neither the CORE layer nor the TOOL layer (runtime or instrument) is re-entrant. These layers contain global variables/data structures/... which can only be used by one thread at a time. To ensure thread-safety of this non-reentrant code, V uses the very simple approach of "single thread busy".

The "single thread busy" model is ensured by the "Big Lock". A thread must acquire the Big Lock before it executes CORE or GUEST or TOOL code. When a thread acquires the Big Lock, it becomes THE running thread. The running thread releases the Big Lock after it has executed a certain quantity of GUEST code. Releasing the Big Lock will allow another thread to run during some time. The other threads (ready to run) are all blocked, trying to acquire the Big Lock. One of these threads will acquire the Big Lock and start to execute some code. Typically, it will start to execute GUEST JIT code. But it might also start to execute CORE code because some guest code has to be instrumented.

If the running thread executing GUEST JIT code has to execute a syscall (e.g.
reading data on a socket), the running thread must release the Big Lock: if this thread did not release the Big Lock before doing a syscall, the whole application would be blocked waiting for this thread to finish its syscall.

In other words, the only real parallelism provided by Valgrind is between one thread doing some CPU work in CORE/GUEST/TOOL and all other threads doing a system call in GUEST JIT code. This is a very low level of parallelism. If an application makes intensive use of threads, the typical slowdown factor of the V tool (e.g. 5 to 20 for memcheck) will be multiplied by the serial execution of the CORE/GUEST/TOOL code.

The objective of the prototype is to determine how to best increase the parallelism of Valgrind.

# the non-thread safe code
==========================

To increase the performance, we want to have multiple threads running in parallel (e.g. executing the code of the example). For this, the Big Lock must be removed and replaced by other techniques: in all the above steps, global data structures are used that will be corrupted by parallel access. Almost all the steps are not thread safe if the Big Lock is removed.

Typically, for the translation part, the following are not thread safe:
* the VEX lib (used by the CORE and tool instrument function)
* the tool instrument function itself
* the CORE translation framework
* ...

At runtime (executing JIT code), e.g. the following are not thread safe:
* the CORE scheduler
* the memcheck C helpers
* the memcheck malloc replacement
* the memcheck V and A bits C helpers
* the CORE malloc/free
* the CORE aspacemgr (used e.g. by the CORE malloc/free)
* ...

At many places in the code, we have statistical counters to help understand the behaviour/performance of V (e.g. CORE malloc counts how many bytes have been malloc-ed). These counters are all non-thread safe.

The guest code might or might not be thread safe (e.g. it might contain race conditions).
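The Big-Lock-around-syscalls discipline described above amounts to the following pattern. This is a minimal stand-alone sketch, not Valgrind code: pthread_mutex_t stands in for the pipe-based Big Lock, and all names (thread_main, run_threads, n_done) are hypothetical.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <unistd.h>

/* Stand-in for Valgrind's pipe-based Big Lock. */
static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;
static int n_done = 0;

static void *thread_main(void *arg)
{
   pthread_mutex_lock(&big_lock);     /* become THE running thread       */
   /* ... execute a quantum of CORE / GUEST JIT / TOOL code ... */
   pthread_mutex_unlock(&big_lock);   /* about to block in a syscall:    */
   usleep(1000);                      /* run it without the lock, so     */
                                      /* other threads can progress      */
   pthread_mutex_lock(&big_lock);     /* reacquire before touching any   */
   n_done++;                          /* CORE/TOOL state again           */
   pthread_mutex_unlock(&big_lock);
   return NULL;
}

int run_threads(int n)
{
   pthread_t t[16];
   for (int i = 0; i < n && i < 16; i++)
      pthread_create(&t[i], NULL, thread_main, NULL);
   for (int i = 0; i < n && i < 16; i++)
      pthread_join(t[i], NULL);
   return n_done;
}
```

The point of the pattern is that while one thread sleeps in the kernel, another can take the lock; this is the only parallelism available today, and the prototype's whole purpose is to allow more than one thread inside the locked region.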
Such non-thread-safe guest code is not a problem for our objective, as such non-thread-safety will only corrupt the guest process data structures (the developer of the guest code might use helgrind to search for such non-thread-safe code). However, currently, even thread-safe guest code can result in non-thread-safe JIT code. For example, two threads executing at the same time a call to a not yet translated function (e.g. strcpy) will corrupt the JIT code (t-chaining).

Making all the above thread safe can be done using various techniques:
* using thread local storage instead of global variables
* using mutex locks
* using read/write locks
* using atomic instructions (e.g. for statistical counters)
* using lock-less algorithms (e.g. to maintain data structures without using locks; such lock-less algorithms are based on atomic instructions)

Note: a mutex can be implemented using atomic instructions and OS syscalls (e.g. based on futex). If there is no contention, the performance of a mutex based approach will very probably be similar to (or maybe even better than?) a lock-less algorithm. Lock-less algorithms might however help either to increase the throughput or to avoid difficulties such as deadlocks.

The prototype has explored some of the above techniques. Some "testing" (ha ha) has been done. See # testing below. At this stage, only the thread-safety of the CORE and JIT layers was looked at for the prototype. Nothing has been done to look at the tool runtime layer.

# prototype
===========

The very first version of the prototype was based on the following idea (looking somewhat wrong a posteriori): when a thread executes guest code, it has to use some data structures (e.g. the translation table, the JIT code itself, ...). These structures can't be protected by a mutex, otherwise we are back to stage 1: no parallelism. So, let's transform the Big Lock into a rwlock. Before executing guest code, a thread must acquire the Big Lock in read mode.
Before doing any modifications to any data structure, the thread must acquire the Big Lock in write mode.

So, the Big Lock was transformed into a rwlock by using low level "semaphores". These semaphores are the same as the current trunk scheduler "big lock", i.e. implemented using a pipe. !!! The fair scheduler has not been looked at. The rwlock implemented on top of these semaphores is not efficient, as it implies several read/write syscalls to acquire/release the lock.

With this, a bunch of changes were needed to avoid assertions being raised. A.o.:
* the concept of THE lock owner disappears, so this cannot be checked anymore
* the concept of THE running thread disappears similarly
* the concept of a global 'in generated code' disappears
* a bunch of failing asserts have been #ifdef-ed or commented out.

After fixing the above, the 4 threads were able to run in parallel. One observes the following on the "big_parallel_sleepers" test:
* during +- 40 seconds, about 125% CPU is used
* after that, a jump to 400%.
Interpretation of this: making the translations and storing them in the transtab is still very much serialized by the Big Lock. Once all the code is translated, full speed is possible. See also the "duplicate t-chaining" below.

helgrind (trunk) was then used in a setup "outer trunk helgrind/inner none tool prototype" to find race conditions. A bunch of such race conditions were found. Typically:
* race conditions in statistical counters
* ... need to retrieve and document more of these changes done.

There are some tricky races, as some actions look like they are just reading some data, but in fact they do modifications to global variables. E.g. all the functions which maintain a small static local cache will write to the local cache. Typical examples are in the aspacemgr or in the transtab. The transtab (VG_(tt_fast)) is the worst case because:
1. it is used a lot
2. it is used by the asm dispatch code.
=> using a rw lock for this might not be an ideal solution.
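The races on statistical counters mentioned above are the easy case: an atomic increment fixes them without any lock. A minimal stand-alone sketch (the counter name is hypothetical; gcc __sync builtins stand in for hand-written atomics):

```c
#include <stdint.h>
#include <pthread.h>

/* Hypothetical statistical counter, updated by several threads. */
static uint64_t n_mallocd_bytes = 0;

static void *bump(void *arg)
{
   for (int i = 0; i < 100000; i++)
      __sync_fetch_and_add(&n_mallocd_bytes, 1);  /* atomic increment */
   return NULL;
}

uint64_t run_counter_demo(void)
{
   pthread_t t[4];
   for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, bump, NULL);
   for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
   /* with a plain ++ instead of the atomic add, increments would be lost */
   return n_mallocd_bytes;
}
```

Whether a 64 bit atomic add like this is actually available on all 32 bit targets is a separate question, discussed below.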
See # tt_fast below for another suggested approach. Currently the VG_(tt_fast) race condition is not fixed in the prototype.

A first attempt protected the other data structures related to translation tables the following way: before searching or modifying the translation table, the Big Lock has to be acquired in W mode. So, if a thread had the Big Lock in R mode, it was reacquiring it in W mode (by releasing and re-acquiring). Really not efficient; the effect of this was really bad: the max CPU usage dropped from 400% to 140%. Worse, the total elapsed time to run the test was bigger than with the V trunk. The conclusion of this was: a (reasonable) mtV cannot be based purely on a RW Big Lock.

* So, started to make a pub_tool_lock.h and m_lock.c module (currently it only contains a mutex, based on atomic instructions and the futex syscall). It should at least be completed with a rwlock. Maybe also spin locks? One constraint for this module: it should be initialisable very early in the startup sequence (as e.g. aspacemgr will need it; even DebugLog might need locks).

The m_lock.c is derived from (lgpl 2.1) NPTL glibc 2.13 code. Atomic instructions for x86 and amd64 were also copied from the NPTL pthread lib and slightly transformed. Basically, the 'catomic*' kind of actions were removed; these are optimisations which reference a global pthread NPTL var to "skip" the LOCK prefix when there is only one thread.

!!! There is one asm statement in priv_atomic_x86.h giving a compile/assembly error. Replaced by __sync_add_and_fetch temporarily. This should be fixed.
!!! The fair scheduler should be rewritten based on these atomic instructions rather than the __sync.. and __builtin ones, so that it will be available everywhere.
!!! Need to see if all 32 bit platforms support an atomic increment of a 64 bit counter. It might either not be supported or cause problems like the x86 compile error above. Then these will have to be emulated.
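To make the shape of such a module concrete, here is a minimal futex-based mutex in the style m_lock.c aims at. This is a sketch of the classic three-state design from Drepper's "Futexes Are Tricky", Linux-only; gcc __sync builtins stand in for the hand-copied NPTL atomics, and all names (vg_mutex_t, etc.) are hypothetical, not the actual m_lock.c interface:

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <pthread.h>

typedef struct { int state; } vg_mutex_t;   /* 0 free, 1 locked, 2 contended */

static long sys_futex(int *uaddr, int op, int val)
{
   return syscall(SYS_futex, uaddr, op, val, 0, 0, 0);
}

void vg_mutex_lock(vg_mutex_t *m)
{
   int c = __sync_val_compare_and_swap(&m->state, 0, 1);
   while (c != 0) {
      /* Advertise a waiter, then sleep until the holder wakes us. */
      if (c == 2 || __sync_val_compare_and_swap(&m->state, 1, 2) != 0)
         sys_futex(&m->state, FUTEX_WAIT, 2);
      c = __sync_val_compare_and_swap(&m->state, 0, 2);
   }
}

void vg_mutex_unlock(vg_mutex_t *m)
{
   if (__sync_fetch_and_sub(&m->state, 1) != 1) {
      m->state = 0;                          /* there were waiters */
      sys_futex(&m->state, FUTEX_WAKE, 1);
   }
}

/* Demo: 4 threads incrementing a shared counter under the mutex. */
static vg_mutex_t demo_lock;
static int demo_counter;

static void *demo_thread(void *arg)
{
   for (int i = 0; i < 50000; i++) {
      vg_mutex_lock(&demo_lock);
      demo_counter++;
      vg_mutex_unlock(&demo_lock);
   }
   return 0;
}

int run_mutex_demo(void)
{
   pthread_t t[4];
   for (int i = 0; i < 4; i++) pthread_create(&t[i], 0, demo_thread, 0);
   for (int i = 0; i < 4; i++) pthread_join(t[i], 0);
   return demo_counter;
}
```

The attraction for V is that an uncontended lock/unlock is two atomic instructions and zero syscalls; the futex syscall is only paid when threads actually collide.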
Such emulation might be costly, so we might have to avoid some of the 64 bit statistical counters.

* The above efficient futex based mutex was used to protect the translation table. With this, many race conditions disappeared. CPU usage was back to 400%.

* VG_(unknown_SP_update) is a performance critical function. This function contains a piece of non-thread-safe code for detecting stack changes. This code has been disabled, waiting for a proper solution. A possible approach is to use TLS: the current_stack global variable would become a 'per thread' variable. See # TLS below for the trial of thread local storage.

* aspacemgr locking: a race condition was detected e.g. between VG_(am_find_nsegment) and a mmap syscall calling VG_(am_notify_client_mmap). => a mutex was added in aspacemgr, protecting at least the race between these 2 functions. However, it is far from clear that the aspacemgr is "safe" with that. Effectively, VG_(am_find_nsegment) returns a pointer to an element of the segment array. So, if another thread modifies that entry in the array once the first thread has got access to it, then what? It looks however that VG_(am_find_nsegment) is used for "small durations". But still, it is not clear this is a "clean thread safe interface". It is unclear exactly what should be protected in the aspacemgr, and how. Protecting all the public interface of aspacemgr could be done (taking care to avoid recursive locking, as pub_tool_lock.h does not detect that). However, it is unclear if this is good enough (or not).

* When doing the trial of protecting aspacemgr globally, a deadlock was obtained due to recursive locking. => one might need to make "safe" locks (currently, m_lock.c does not verify non-recursive locking). Also, if we do plenty of fine grained locking everywhere, deadlocks might be difficult to avoid. Lock-free algorithms might help to avoid these. See # Lock free algorithms below.

* Need to avoid "duplicate t-chaining".
Got an assert failure in VEX, as the place to t-chain did not contain the expected bytes. This was created by two threads detecting at the same time that a t-chaining has to be done. Then, even if Write Locked one after the other, the 2nd thread would assert, as the expected bytes to replace for the t-chain are not there anymore. So, in the t-chaining protected critical section, there is now a verification that the t-chaining has not been done in the meantime.

This causes a really bizarre perf impact (search for really bizarre in # Performance measurements): running 150 iterations with 4 parallel threads takes in total less cpu than in serial. But running only 2 iterations is significantly slower in parallel than serial. The most probable explanation: I suppose there is some useless work done e.g. during translation which consumes CPU. A thread detects it must do a t-chaining: it exits to the scheduler, takes the write big lock, detects that this t-chaining has already been done by another thread, and so has done all this work for nothing. This hypothesis is confirmed when giving -d, which produces plenty of traces like:

   --22919:1:transtab host code 0x40537438F already chained => no chaining redone

But by which miracle are the parallel threads "recuperating" this cpu burned for nothing later on, when doing more iterations?

Whatever: a possible solution might be to have VEX provide an efficient way to check that the current ip of the thread has already been chained. Then an "already done t-chaining" would only cost an exit to the C scheduler and a few "VEX" ifs. This condition looks easy to implement in VEX. Is it safe? I believe yes: at least VEX can detect that the t-chaining has been done already (or rather that the place is not t-chainable anymore). I suppose just a "!" on this condition would do it: if a place has to be chained, then it contains (for x86)

   BA <4 bytes value == disp_cp_chain_me_EXPECTED> FF D2

otherwise it contains something different.

???
is it so clear that this will avoid burning cpu: when a thread exits to C, it will try to acquire the big lock to do the t-chaining, but will not be able to until all threads go out of the JIT code (either due to a need of t-chaining or due to QUANTUM expired). If all exit due to QUANTUM expired, our thread will get the big lock, do the t-chaining, and all is well. But if the other threads have to do the same t-chaining, they will all exit, see that the t-chaining is not done, and then queue to acquire the write big lock. One will get it and do the t-chaining; then all the others will one by one get the write lock and see the t-chaining is done. Threads have quite some chance of getting to the same not-yet-done t-chaining if they all execute the same code.

??? or even: isn't it that the translation table and the rd/wr big lock protecting it are just not scalable? Whenever some threads are executing JITted code not yet translated and/or t-chained, there is a high level of contention, causing a lot of lock/unlock, consuming user and sys cpu?

# tt_fast "xor" approach
========================

(The tt_fast race condition has not been solved yet. Here is a suggested, very elegant approach. But is this really working?)

tt_fast is used (read only) by the asm dispatcher. tt_fast is accessed for every 'dynamic' call/jump/... (the 'static' calls/jumps/... are resolved using translation chaining). tt_fast is modified either when a new translation is done or when an already translated piece of code is found in the translation cache. In other words, one sees that even if tt_fast+translation table is logically only read, it is also modified by a search.

http://www.cis.uab.edu/hyatt/hashing.html describes a technique which (I believe) would allow to use tt_fast without locking and without atomic instructions, with very few changes in the asm dispatcher.
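Applied to tt_fast, the lock-free lookup (walked through in detail just below) might be sketched like this. The entry layout, sizes and function names here are hypothetical, not the actual dispatcher code:

```c
#include <stdint.h>
#include <stddef.h>

#define TT_FAST_SIZE 4096                 /* hypothetical; a power of two */

/* Each entry stores (guest xor host, host) instead of (guest, host). */
typedef struct { uintptr_t guest_xor_host; uintptr_t host; } FastEntry;
static FastEntry tt_fast[TT_FAST_SIZE];

static size_t tt_hash(uintptr_t guest)
{
   return (guest >> 2) & (TT_FAST_SIZE - 1);
}

/* Writer: called when a translation is made or found by a full search. */
void tt_fast_set(uintptr_t guest, uintptr_t host)
{
   FastEntry *e = &tt_fast[tt_hash(guest)];
   e->host = host;
   e->guest_xor_host = guest ^ host;
}

/* Reader (what the asm dispatcher would do): returns the host address,
   or 0 meaning "fall back to a full translation-table search".  A torn
   pair like (G1 xor H1, H2) fails the xor check, so the bad cases are
   detected without locks and without atomic instructions. */
uintptr_t tt_fast_get(uintptr_t guest)
{
   FastEntry *e = &tt_fast[tt_hash(guest)];
   uintptr_t h = e->host;
   if ((e->guest_xor_host ^ h) == guest)
      return h;
   return 0;
}
```

A miss (empty slot, a colliding entry, or a torn entry) is indistinguishable from "not translated yet", which is exactly the safe fallback: exit the dispatcher and do the slow search.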
Basically, the idea of the paper applied to tt_fast would be: tt_fast is an array of pairs (G, H), where G is a guest code address and H is the address of the JIT translation of G. (G, H) is stored at a position in tt_fast obtained by "hashing" G (basically, shifting and masking some bits of G).

If we have a pair (G1,H1) and a pair (G2,H2) which have the same hash value, and these pairs are inserted (or modified) in parallel, a third thread reading this table might get one of the following 4 pairs:

   (G1,H1)  (good)
   (G2,H2)  (good)
   (G1,H2)  (bad)
   (G2,H1)  (bad)

The idea is that the asm dispatcher would detect the bad cases, and then just fall back to the normal search (exiting the asm dispatcher to do a full search in the translation table). To differentiate the good from the bad (without an ugly lock :), the idea is to store in tt_fast the following:

   (G xor H, H)

Then when searching for G1, one does the following:

   k = hash(G1)
   g_xor_h = tt_fast(k).g
   h       = tt_fast(k).h
   if (g_xor_h xor h == G1)
   then h is H1
   else a full search in the translation table is needed, as either
        we have (G2,H2) (a good case, but not ok for us) or one of
        the (temporary) bad cases.

A bad case will be "repaired" by the next full search which goes to the same hash bucket. I believe that even without memory __sync_synchronize, this should all work: either the pair is consistent and usable, or it is inconsistent and not usable.

# remaining race conditions
===========================

Currently, there are (for the none tool) at least the following race conditions to fix:

* these are counters used for sanity checking or gdbserver polling; probably to be fixed either by TLS or by atomic instructions:
   ==23743== Location 0x284d2880 is 0 bytes inside local var "slow_check_interval"
   ==23743== Location 0x284d2884 is 0 bytes inside local var "next_slow_check_at"
   ==23743== Location 0x28670dac is 0 bytes inside local var "sanity_slow_count"
   ==23743== Location 0x28670e30 is 0 bytes inside local var "vgdb_next_poll"
   ==23743== Location 0x2866aafc is 0 bytes inside local var "busy"

* to be analysed: many races reported on none/tests/pth_once (and other similar tests). It looks like the clone syscall needs to lock somewhat more things. Maybe it needs to lock rw? Or maybe helgrind needs a hb relationship on the clone syscall (helgrind does a hb relationship at the pthread level, but not at the os level)?

   ==29332== Location 0x284fd2c8 is 0 bytes inside local var "nsegments_used"
   ==29332== Location 0x28c70010 is 0 bytes inside vgPlain_threads[2].exitreason,
   ==29332== Location 0x28c70020 is 0 bytes inside vgPlain_threads[2].arch.vex.host_EvC_FAILADDR,
   ==29332== Location 0x28c70028 is 0 bytes inside vgPlain_threads[2].arch.vex.host_EvC_COUNTER,
   ==29332== Location 0x28c71b00 is 0 bytes inside vgPlain_threads[2].sig_mask.sig[0],
   ==29332== Location 0x28c71b44 is 0 bytes inside vgPlain_threads[2].os_state.threadgroup,
   .... too many races to all put here ...

* VG_(tt_fast) xor technique to be implemented:

   ==23743== Location 0x28fe1350 is 0 bytes inside vgPlain_tt_fast[*].guest,
   ==23743== Location 0x28fe1358 is 0 bytes inside vgPlain_tt_fast[*].host,

# debuglog
==========

Output can currently be done by multiple threads in parallel (e.g. in case of a crash/inconsistency, each thread might detect this and report the state of the scheduler at crash time). This causes output to be mixed/unreadable. We might maybe have a mutex so that each debugLog is a single syscall, and have a flush for each write. But some messages (e.g. a stack trace) are made of several debuglog calls?

# testing
=========

At this stage, very little testing has been done.
The "testing" (ha ha) of the prototype was done mostly using parallel_sleepers (a slightly modified version of gdbserver_tests/sleepers.c, removing the setaffinity to a single CPU) or big_parallel_sleepers (sleepers.c more heavily modified to have multiple threads executing in parallel the code of perf/bigcode1). These two test programs have command line arguments telling the 4 threads to do a mix of cpu burning or syscalls (sleeping during some milliseconds). When told to only burn CPU, for a perfectly parallel V, the 4 threads should consume 400% of cpu (on a >= 4 CPU system).

There will be a whole additional bunch of tests needed before we could have reasonable trust in the race-free state of a V. For example, we will need test programs doing mmap/munmap and similar with multiple threads in parallel.

9 June: none tests run successfully on f12/x86 and deb6/amd64 (does not mean much: we for sure still have race conditions).

12 June: ran the none tests in an outer helgrind / inner none config. A bunch of race conditions to analyse (a lot seem caused by the clone syscall, which might not be understood by helgrind). Also, the outer.log file contains some null bytes which are written. Need to see why these are produced (might be just the outer/inner setup: to be verified with an untouched inner trunk).

# TLS
=====

It is highly probable that mtV will need an efficient way to retrieve "per thread" values. This looks needed e.g. for VG_(unknown_SP_update). But it will also very probably be needed for the tool runtime code (to e.g. retrieve the current thread which is calling the tool runtime code, or retrieve per thread tool data structures).

There exist multiple ways to manage TLS. For example, a TLS variable defined in a shared lib is managed differently than one in the statically linked part of the executable. The runtime requirements are limited for what is called the ELF 'local-exec' model.
Basically, the only need is to have for each thread (including the main thread) a static zone of memory big enough for all the TLS variables in the statically linked part of the executable. TLS variables in shared libs are not supported (it is suspected that shared libs specified at link time might work, but this was not tested; in any case, V does not use such linked shared libs).

This static zone of memory has been added in the VG_(threads) array. For the main thread, very early in the V startup sequence, the thread register is modified to point at this zone. There are small differences depending on the platform. All these differences are hidden in m_tls.c. For the non main threads, the clone syscall must specify the TLS area of the thread. Some operations might be needed before and after the clone. Again, the specifics of these operations are inside m_tls.c.

Currently, m_tls.c works on linux x86/amd64/ppc32/ppc64. It is unclear how to implement TLS on MacOS (missing documentation). Arm (Android or Linux) is also still to be looked at. Android TLS seems to be emulated (at least with the current gcc NDK), so TLS will not be ultra fast. s390x is also still to be looked at (probably easy).

* TLS storage is not necessarily properly/efficiently supported in all environments, depending e.g. on the OS version and/or the gcc version. Assuming we cannot make the __thread keyword properly supported in all the environments, here is a sketch of how such TLS could be replaced by something less comfortable, but still providing the same 'per thread variable'. The assumption is that inside the CORE code, we (most of the time) have either the tid or the thread state pointer. With the thread state pointer, one can efficiently reference an offset in the thread state. So, thread local variables can be defined by an offset (obtained at module init time). Then the address of a local thread variable is obtained by e.g.
   int *my_thread_specific_counter
      = GETTLS(tst, my_thread_specific_counter_offset);

and then *my_thread_specific_counter can be safely used as a TLS pointer variable. The JIT code must then pass the thread state to all the C helpers which need access to TLS. The thread state pointer is normally efficiently available in a register. This is less comfortable than __thread attributes, and will oblige us to put the thread state as an argument to many C helpers. If TLS storage is really needed, this might be better than nothing.

# outer helgrind difficulties
=============================

Running an outer helgrind on an inner (parallel) V is a very easy way to find (some?) race conditions. However, several traps were encountered, and are documented here.

1. Encountered a crash in the outer helgrind when adding some hg client requests in the inner. This crash seems to be linked with the stack of a new thread (or main thread?) not being registered early enough. Then, if the outer helgrind has to do a stacktrace at this early stage, the stack trace code was crashing. This has been bypassed by the following patch in the outer V. It is not very clear what to do. A proper (earlier) registration of the stack might solve the problem. Maybe the hack below is not such a horrible hack in the end: if stack_limits cannot find the stack of the stack pointer, then it looks unwise to allow unwinding without taking reasonable stack limits into account.

   ===================================================================
   --- coregrind/m_stacktrace.c (revision 12593)
   +++ coregrind/m_stacktrace.c (working copy)
   @@ -801,6 +801,8 @@
       /* See if we can get a better idea of the stack limits */
       VG_(stack_limits)( (Addr)startRegs.r_sp, &stack_lowest_word, &stack_highest_word );
   +   // Hack to avoid crash in outer Valgrind:
   +   if (stack_lowest_word == 0) stack_highest_word = 0; // Hack
       /* Take into account the first_ip_delta. */
       startRegs.r_pc += (Long)(Word)first_ip_delta;

2.
Non-thread-safe aspects of the guest process are detected by the outer helgrind. However, this is detected in the JIT code generated by the inner V. So, the outer V does not understand the origin of such a race condition. This means the suppressions (for e.g. the normal libc race conditions) do not work. Adding --read-var-info=yes helped to determine these races, as the outer V was then able to point at global variables in libc. The assumption taken was that any stack trace that the outer V cannot properly associate with some inner code is an irrelevant race condition.

3. mutex order: the Big Lock was transformed into a rwlock using two "lower level" locks (more exactly, token passing). I believe the rwlock code is correct, but helgrind still detects problems in it, e.g. because one thread releases a lock that was taken by another thread (not abnormal, as the low level "locks" are in fact a token). The helgrind marking code in the low level locks was removed. Only the Big Lock itself was marked with RW lock annotations.

4. helgrind does not understand that the clone syscall introduces a happens-before relationship between the actions in the parent thread before the clone and the actions in the child thread after the clone. A trial was done to mark this relationship, but it did not help much (probably because the race condition is detected in the asm just after the clone syscall; the HG annotations cannot be put in asm code, and so might not be placed precisely enough to avoid hg reporting the error).

# how to schedule the changes
=============================

Obtaining an mtV might imply some significant changes in the code. It might be more difficult to implement on macOS or on android than on the other Linux platforms. Making an mtV which works better for multi-threaded apps might make the (maybe more important?) case of a non-threaded application slower. Also, each tool will have to be updated to be really multi-threaded.
So, here are a bunch of questions:

* Should we try to have a core library "single threaded" and another one
  "multi-threaded" ? Each tool would link with the appropriate lib. Also,
  one might imagine that a tool could work both in a single threaded setup
  and in a multi-threaded version. This all might make the transition and/or
  testing easier/less risky.

* Alternatively, one could imagine that the core would switch from "null"
  locking primitives to "real" locking primitives just before the first
  clone syscall is done. E.g. we would have a struct containing the
  addresses of the lock/unlock primitives to call. Initially, they would
  point at empty procedures. When a first clone syscall is to be executed,
  the struct would be set to point to real locking primitives.
  Depending on the tool, one might then have a Bool singleV (or mtV) which
  serialises the threads for the not yet (or not to be) parallelised tool.

The scheduler policy (currently --fair-sched) could be replaced by
--sched=[a list of policies to try]. Then, depending on the tool and the
OS/platform support, the first working policy would be selected. We would
then have something like:
  --sched=parallel,fair,generic   takes the first working in the order p/f/g
  --sched=fair                    fails if the fair scheduler is not available
The tool would have to agree on the policy. So, a tool not yet mtV ready
would only agree on fair and generic. Core would then ensure that the calls
to the tool layer are serialised. This allows a gradual migration of (some
of) the tools to an mtV.
Note: this looks to be a good balance between having 2 different coregrind
libraries (mtV and serial) and obliging all tools either to all be migrated
to mtV or to have a lock/unlock for each helper call.

# parallelising memcheck
========================
The prototype shows that we can relatively easily parallelise the none tool,
and have good scalability. (There is however very probably poor scalability
when translating and t-chaining. Unclear if this is a real life problem, or
only a problem for the big_parallel_sleeper test program.)
We need at least one useful tool made parallel to be convinced that mtV can
be useful. memcheck (the most used tool) is a very good candidate for that.

A simple approach is to have a "big tool lock" (BTL), similar to the core BL.
(Possibly the BTL could be a rwlock.) Whenever a C helper is called, it
takes the BTL. This then makes the tool code itself thread-safe. It is
assumed (but is this really the case ?) that the tool instrument function
generates JIT code which is itself thread-safe; in other words, that only
the C helpers might need protection.
Note: this is not necessarily the case. E.g. if memcheck generates IR which
reads/writes a global data structure directly, then the JIT code itself is
not thread-safe. Such non-thread-safe code was OK with a big lock allowing
only one thread to execute JITted code at a time, but that approach is dead
for an mtV tool. If this is the case for memcheck, then it looks like the IR
will have to be changed.

The BTL will also automatically protect the data structures of the core
which are only used by the tool (for example, the error list mgr). However,
there are many core data structures which can be used both directly by the
core and (indirectly or directly) by the tool runtime code. For example,
VG_(malloc) can be called by the core code and by the tool runtime code. The
aspacemgr data structure is another similar example. Probably we need
specific locks for these core/tool shared data structures ?

However, using a BTL has a big performance drawback (at least for most
tools, which have C helpers called very frequently): acquiring and releasing
the BTL for each C helper will very probably be a perf disaster. Even
without contention, each C helper call will imply (for a simple mutex) two
atomic instructions.
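The "switch from null to real locking primitives at the first clone" idea
above could be sketched roughly as follows. All names here are hypothetical
illustrations, not existing Valgrind APIs, and pthread stands in for
whatever low level lock the core would really use:

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical sketch: the core calls through a struct of function
   pointers which initially point at no-op procedures; when the first
   clone syscall is about to be executed, the struct is repointed at
   real locking primitives.  A single-threaded run thus never pays for
   locking. */
typedef struct { void (*lock)(void); void (*unlock)(void); } LockOps;

static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;

static void nop_lock(void)    { }
static void nop_unlock(void)  { }
static void real_lock(void)   { pthread_mutex_lock(&big_lock); }
static void real_unlock(void) { pthread_mutex_unlock(&big_lock); }

/* Single-threaded startup: zero-cost "locking". */
static LockOps lock_ops = { nop_lock, nop_unlock };

/* To be called just before executing the first clone syscall. */
static void switch_to_real_locks(void) {
   lock_ops.lock   = real_lock;
   lock_ops.unlock = real_unlock;
}

/* A C helper would then bracket its critical work like this: */
static int counter = 0;
static void example_helper(void) {
   lock_ops.lock();
   counter++;            /* tool/core state needing protection */
   lock_ops.unlock();
}
```

Note that even the indirect call costs something per helper; the design
question discussed here is whether that is acceptable compared with two
separate core libraries.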
But with such a simple approach, it is probable that there will be very high
contention on the BTL (are there statistics about the ratio of CPU time
spent in core versus JITted code versus tool runtime ?). In case of
contention, the price for each C helper call will be a few atomic
instructions and two syscalls.

So, a BTL might be useful, but for operations done very frequently (e.g. V
and A bits related C helpers), something better will be needed. Options to
look at:
1. Have "smaller grain" locking for some data structures: e.g. rather than
   using the BTL in some C helpers, have a different lock for each 64 KB of
   memory tracked by the memcheck memory map. This will reduce the
   contention, assuming that most threads do not often work with the same
   64 KB of memory.
2. Use lock-less algorithms. See the section "# lock free algorithms" below.

Not much done yet on the aspect of parallelising memcheck. Need to find
relevant ways to attack the problem.

# lock free algorithms
======================
Searched the internet to try to find lock free algorithms or code.
http://www.concurrencykit.org seems an attractive candidate, providing:
* predefined atomic primitives for a bunch of archs;
* a fallback on gcc builtins if an arch is not supported natively;
* multiple types of locks;
* relatively good documentation;
* several lock free data structures that might be useful;
* no dependency on pthread, libc, and similar; very few usages of standard
  headers.
Contact was taken with the main developer, who was positive about doing
the changes needed to allow usage of ck inside Valgrind.

# Performance measurements
==========================
Mostly done on gcc20, amd64. Most recent measurements at the top.

time ./vg-in-place --tool=none ../small_programs/big_parallel_sleepers \
     150 1 300000 BSBSBSBS

9 June: Prototype V2 (with futex rwlock, rather than 3 low level sema)
           real 2m49.817s   user 10m8.686s   sys 0m6.332s
        Trunk
           real 10m22.792s  user 10m20.171s  sys 0m5.212s

9 June, first measurements with an *UNPROTECTED* memcheck:
        Prototype V2
           real 9m23.239s   user 35m53.615s  sys 0m3.056s
        Trunk
           real 36m31.915s  user 36m26.805s  sys 0m7.952s

7 June: Prototype V2
           real 2m52.839s   user 10m17.147s  sys 0m8.469s
        Trunk
           real 10m40.479s  user 10m36.788s  sys 0m6.808s

7 June, same test but replacing 150 by 2:
time ./vg-in-place --tool=none ../small_programs/big_parallel_sleepers \
     2 1 300000 BSBSBSBS
?????? really bizarre. See t-chaining above.
        Prototype V2
           real 0m23.689s   user 0m25.042s   sys 0m6.428s
        Trunk
           real 0m13.583s   user 0m13.477s   sys 0m0.124s

6 June: Prototype V1
           real 3m7.894s    user 10m49.673s  sys 0m10.561s
        Trunk
           real 11m26.408s  user 11m22.483s  sys 0m7.600s
------------------------------------------------------------------------
|
|
From: Philippe W. <phi...@sk...> - 2012-06-14 22:06:01
|
On Thu, 2012-06-14 at 22:43 +0200, Julian Seward wrote:
> My big observations are:
>
> * we need to solve some of the problems to do with Helgrind + DRD
> as "outers". They are the best way we have to verify that mtV is
> race-free. That is important not only for this initial
> development, but also for regression testing -- IOW, we now
> depend on Helgrind/DRD to ensure the system works properly.
The validation and subsequent reg-testing are effectively critical.
On that side, we have two aspects:
* are the outer tools (helgrind/drd) ok ?
From what I can see, these tools are working quite well
on this problem. There are a few quirks (I will commit today
a small change to help outer helgrind doing early stack traces).
At the moment, two main things to look at:
1. how to ensure the outer tool is suppressing the errors
of the guest run by the inner.
(e.g. a race condition in glibc "printf").
(at least, we need a way to filter the races which seem
to originate from the guest process). Otherwise, we
have hundreds of incomprehensible races when running the
none tests.
Maybe we need the inner to inform the outer of the relationship
between JIT code and the guest code, so as to let the outer
apply suppression rules on the JITted guest.
2. are the tools properly understanding the inner behaviour ?
E.g. it might be that the clone syscall will have to be understood
by the outer helgrind.
* Assuming the outer tools are ok : do we have enough regression tests
to find the race conditions ?
I suspect that now a lot of regression tests are single threaded.
For example, all the mmap reg tests do not use threads.
Without "threaded tests" for these things, races in the V core
handling of e.g. mmap will be very difficult to detect.
On this side, I vaguely thought of the possibility to automatically
compile all the tests as a .so (transforming with a "sed" the main
into a .so entry point), and then have a way to run together
a bunch of these .so in parallel.
These tests would not be ok as functional tests (they would very
probably fail) but they might be useful in an outer helgrind/inner
setup to find race conditions.
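A driver for that ".so entry point" idea could look roughly like the sketch
below. This is a hedged illustration: the dlopen()/dlsym() step, the entry
point name produced by the sed transformation, and all helper names are
assumptions, not existing Valgrind test infrastructure. To keep the sketch
self-contained, a local stand-in function replaces a real dlsym'd entry:

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical driver: each regression test's main() is renamed (by
   sed) to an entry point, and a bunch of them are run together in
   parallel threads so an outer helgrind can hunt for races in the
   inner V.  A real driver would dlopen() each test .so and dlsym()
   the entry point. */
typedef int (*test_entry_fn)(void);

static void *run_one(void *arg) {
   test_entry_fn entry = *(test_entry_fn *)arg;
   (void) entry();   /* functional result ignored: we only hunt races */
   return 0;
}

/* Run all given entry points concurrently; returns the number run. */
static int run_all_parallel(test_entry_fn *entries, int n) {
   pthread_t tid[64];
   assert(n <= 64);
   for (int i = 0; i < n; i++)
      pthread_create(&tid[i], 0, run_one, &entries[i]);
   for (int i = 0; i < n; i++)
      pthread_join(tid[i], 0);
   return n;
}

/* Stand-in for a sed-transformed test main(). */
static int fake_test_main(void) { return 0; }
```

As noted above, such a run would not be a functional test (the tests would
very probably fail when interleaved); its only purpose is race detection.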
>
> * You discuss in some detail, problems with multithreading TT/TC
> and in particular the handling of chaining. My question:
> instead of trying to incrementally "fix" the existing design,
> do we maybe need to think about fundamentally re-architecting
> it somehow?
>
> For example, an extreme solution (which probably isn't
> desirable, but which would solve all the problems) is for each
> thread to have its own TT,TC,cache,etc. Perhaps there are
> intermediate states which can help? I am just saying, let us
> not assume that the eventual solution to this is an all-shared
> approach.
On the TT/TC, I think we have two remaining problems:
1. race condition on the tt_fast. For this, I hope the "xor" technique
can be shown (proved?) as correct.
"memory model" knowledge useful here
(Bart, are you reading this ? :)
2. the possible performance problem implied by a lock/unlock
for each t-chaining or translation.
The 2nd problem might effectively imply to go to a per-thread TT/TC
This might however re-create performance problems due to memory pressure
and/or the need to translate the same code for each thread.
It might also make the gradual migration more difficult if the
data structures are very different between a "serial" and a "parallel"
tool.
Maybe we will need some kind of adaptive locking : a thread would
keep a write lock even when executing JIT code if it detects that it
t-chains or translates "a lot" ? After a "QUANTUM" of t-chain or
translation, it would switch to read lock again.
In other words, V scheduler would switch between "serial" and
"parallel" behaviour depending on the "t-chaining+translation
demand rate".
For sure, a difficult aspect.
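For what it is worth, here is one possible reading of the "xor" technique
for tt_fast, sketched with C11 atomics. This is only my interpretation, not
the prototype's actual code; all names are invented, and the memory-order
choices are exactly the kind of "memory model knowledge" assumption that
would need proving. The idea is a torn-read check: each entry stores
(guest ^ host, host) instead of (guest, host), so a reader that observes a
mismatched pair computes a guest address that fails the comparison and
harmlessly falls back to the slow path, rather than jumping to a stale host
address:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Invented fast-cache entry: (guest ^ host, host) instead of
   (guest, host).  A torn read (old xor with new host, or vice versa)
   fails the consistency check below, so the reader takes the slow
   path instead of a wrong direct jump.  Note this is only
   probabilistically safe: a coincidental (xor, host) pair matching a
   different guest address is not impossible, hence "shown (proved?)
   as correct" above. */
typedef struct {
   _Atomic uintptr_t guest_xor_host;
   _Atomic uintptr_t host;
} FastCacheEntry;

static void fce_set(FastCacheEntry *e, uintptr_t guest, uintptr_t host) {
   atomic_store_explicit(&e->guest_xor_host, guest ^ host,
                         memory_order_relaxed);
   /* Release: a reader acquiring the new host also sees the new xor. */
   atomic_store_explicit(&e->host, host, memory_order_release);
}

/* Returns the host addr on a consistent hit, 0 meaning "slow path"
   (so a real implementation must never cache a host address of 0). */
static uintptr_t fce_lookup(const FastCacheEntry *e, uintptr_t guest) {
   uintptr_t h = atomic_load_explicit(&e->host, memory_order_acquire);
   uintptr_t x = atomic_load_explicit(&e->guest_xor_host,
                                      memory_order_relaxed);
   return ((x ^ h) == guest) ? h : 0;
}
```

A racing reader that sees an inconsistent pair simply gets a failed check
and re-does the lookup the slow (locked) way; correctness then rests on the
fence placement and on the coincidence probability being acceptable.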
>
> Maybe Kim Hazelwood's PhD thesis contains some useful clues on
> this? http://www.cs.virginia.edu/kim/docs/phd-thesis.pdf
Quickly looked at it. On the thread side, the main reasoning
is that the threads are not sharing a lot of code, and so
a shared cache is not such a good idea. To me, it seems
somewhat counter-intuitive that there would be so little sharing
between threads.
> * Memcheck. I suspect we will need to redesign and reimplement
> the shadow memory stuff (mc_main.c). It has many optimisations
> both to increase speed and reduce memory consumption
> (http://valgrind.org/docs/shadow-memory2007.pdf) and
> I believe not all of them will work in a MT environment.
>
> The IR-level instrumentation stuff (mc_translate.c) should be
> pretty much unchanged.
>
> IOW, Memcheck's helper functions are (very) racey, but the
> generated instrumentation code is not (I think).
>
> One comment is (as you speculate) that we cannot afford to
> do atomic operations for each helperc_* call -- those are
> called once per memory reference of the original program,
> and this would kill performance.
On this, I am wondering if we really have a huge problem
for the "fast memcheck case".
Basically, as long as the memory is < 32 Gb and the bytes
are fully initialised, V and A bits can (I believe) be maintained
without any locking/protection: if the guest process has no
race condition, then modifying V and A bits will also not be
racey (is this really true ? 1 byte shadows 4 bytes. So
4 threads modifying each one byte might not be racey, but
the underlying V/A byte will be shared and racey).
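That "1 byte shadows 4 bytes" worry can be made concrete with a sketch.
With 2 shadow bits per guest byte (as in memcheck's compressed
representation), updating the shadow of one guest byte is a non-atomic
read-modify-write of a shadow byte shared with 3 neighbouring guest bytes.
The helper names below are invented simplifications, not the real
mc_main.c code:

```c
#include <assert.h>
#include <stdint.h>

/* 2 "vabits" per guest byte => 1 shadow byte covers 4 guest bytes. */
static uint8_t shadow[1024];   /* covers guest addresses 0..4095 */

static void set_vabits2(uintptr_t a, uint8_t bits2) {
   uint8_t *sh  = &shadow[a >> 2];
   int     shft = (int)(a & 3) * 2;
   /* Non-atomic read-modify-write: two threads updating the shadow of
      *different* guest bytes in the same 4-byte group race on *sh and
      one update can be lost -- even if the guest itself is race-free.
      This is the core problem for a parallel memcheck. */
   *sh = (uint8_t)((*sh & ~(3 << shft)) | (bits2 << shft));
}

static uint8_t get_vabits2(uintptr_t a) {
   return (shadow[a >> 2] >> ((a & 3) * 2)) & 3;
}
```

Guest bytes 100 and 101 land in the same shadow byte, so race-free guest
writes to them still produce a racy shadow update.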
If we cannot use atomic instruction and the above is not ok,
I suspect we will have to go to a "one byte shadows one byte".
The pointers to the secondary maps could be maintained with
a reader lock less approach (RCU, hazard pointer, ...).
(Reader lock-less just means the pointers themselves are not
often changed. The pointed-to data (the SecMap) is of course
changed a lot.)
As V has somewhat a control over which thread is doing what,
maybe a simple schema is possible for lock less pointer management.
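One such "simple schema" could be a compare-and-swap publish of the
secondary-map pointers: readers only do a plain atomic load, writers race
to install a fresh map, and the loser discards its copy. The sketch below
uses invented, heavily simplified names (the real primary map is far
bigger, and real memcheck would not use libc calloc/free), assuming C11
atomics:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Invented, simplified secondary-map management: the primary map
   holds atomic pointers; get_secmap_writable() lazily creates a
   SecMap and publishes it with a compare-exchange.  If two threads
   race, exactly one "wins"; the "loser" frees its copy and uses the
   winner's.  Readers never need a lock: the pointers themselves
   change rarely, only the pointed-to data changes a lot. */
typedef struct { unsigned char vabits[16384]; } SecMap;

#define N_PRIMARY 64
static _Atomic(SecMap *) primary_map[N_PRIMARY];

static SecMap *get_secmap_writable(unsigned pm_off) {
   SecMap *sm = atomic_load_explicit(&primary_map[pm_off],
                                     memory_order_acquire);
   if (sm != NULL)
      return sm;                       /* common, lock-free fast path */
   SecMap *fresh = calloc(1, sizeof(SecMap));
   SecMap *expected = NULL;
   if (atomic_compare_exchange_strong(&primary_map[pm_off],
                                      &expected, fresh))
      return fresh;                    /* we won the install race */
   free(fresh);                        /* lost: use the winner's map */
   return expected;
}
```

Note this only covers installing new SecMaps; reclaiming or replacing one
while other threads may still hold its pointer is precisely where RCU or
hazard pointers would be needed.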
Volunteers to help welcome, as I suspect that my single threaded
brain during my evenings will not be sufficient :).
Philippe
|
|
From: Julian S. <js...@ac...> - 2012-06-14 20:44:10
|
On Wednesday, June 13, 2012, Philippe Waroquiers wrote:
> Below is a text describing the prototype/problems/questions/...
> on how to make Valgrind running threads in parallel.

Excellent work on a difficult problem, let me first say.

My big observations are:

* we need to solve some of the problems to do with Helgrind + DRD
  as "outers". They are the best way we have to verify that mtV is
  race-free. That is important not only for this initial
  development, but also for regression testing -- IOW, we now
  depend on Helgrind/DRD to ensure the system works properly.

* You discuss in some detail, problems with multithreading TT/TC
  and in particular the handling of chaining. My question:
  instead of trying to incrementally "fix" the existing design,
  do we maybe need to think about fundamentally re-architecting
  it somehow?

  For example, an extreme solution (which probably isn't
  desirable, but which would solve all the problems) is for each
  thread to have its own TT,TC,cache,etc. Perhaps there are
  intermediate states which can help? I am just saying, let us
  not assume that the eventual solution to this is an all-shared
  approach.

  Maybe Kim Hazelwood's PhD thesis contains some useful clues on
  this? http://www.cs.virginia.edu/kim/docs/phd-thesis.pdf

* Basic infrastructure (TLS support, m_locks.c) is obviously
  important and we should do a good job on that.

* We need a good tool-migration story since some tools actually
  rely on single-threadedness (Helgrind, for one) and cannot
  easily be multithreaded. Your "gradual migration" scheme sounds
  pretty good.

* Memcheck. I suspect we will need to redesign and reimplement
  the shadow memory stuff (mc_main.c). It has many optimisations
  both to increase speed and reduce memory consumption
  (http://valgrind.org/docs/shadow-memory2007.pdf) and
  I believe not all of them will work in a MT environment.

  The IR-level instrumentation stuff (mc_translate.c) should be
  pretty much unchanged.

  IOW, Memcheck's helper functions are (very) racey, but the
  generated instrumentation code is not (I think).

  One comment is (as you speculate) that we cannot afford to
  do atomic operations for each helperc_* call -- those are
  called once per memory reference of the original program,
  and this would kill performance.

Overall, I think the two big remaining problems are TT-TC and
Memcheck. Once we have a plausible solution to them, we'll be
better prepared to assemble a prototype implementation.

J
|
|
From: Philippe W. <phi...@sk...> - 2012-06-17 13:44:01
|
On Sun, 2012-06-17 at 13:08 +0200, Julian Seward wrote:
> > > * Memcheck. I suspect we will need to redesign and reimplement
> > >   the shadow memory stuff (mc_main.c). It has many optimisations
> > >   both to increase speed and reduce memory consumption
> > >   (http://valgrind.org/docs/shadow-memory2007.pdf) and
> > >   I believe not all of them will work in a MT environment.
> > >
> > >   The IR-level instrumentation stuff (mc_translate.c) should be
> > >   pretty much unchanged.
> > >
> > >   IOW, Memcheck's helper functions are (very) racey, but the
> > >   generated instrumentation code is not (I think).
> > >
> > >   One comment is (as you speculate) that we cannot afford to
> > >   do atomic operations for each helperc_* call -- those are
> > >   called once per memory reference of the original program,
> > >   and this would kill performance.
> >
> > On this, I am wondering if we really have a huge problem
> > for the "fast memcheck case".
> > Basically, as long as the memory is < 32 Gb and the bytes
> > are fully initialised, V and A bits can (I believe) be maintained
> > without any locking/protection: if the guest process has no
> > race condition, then modifying V and A bits will also not be
> > racey (is this really true ?
>
> I think this is probably true, not 100% sure though. For sure we
> can never do 100% accurate simulation of compare-and-swap since
> the shadow-CAS and real-CAS are necessarily non-atomic. But maybe
> this doesn't matter much.
>
> > 1 byte shadows 4 bytes. So
> > 4 threads modifying each one byte might not be racey, but
> > the underlying V/A byte will be shared and racey).
>
> This is the main problem I identified, when looking at it. Yes.
>
> > If we cannot use atomic instruction and the above is not ok,
> > I suspect we will have to go to a "one byte shadows one byte".
>
> Problem with that is, the current scheme is a big performance win
> because it generates fewer cache misses. /me really doesn't want
> to return to a 1:1 scheme.
I experimented a little bit with memcheck. Finding a solution for
the VA bits will be a real challenge. I did several trials (for the
LOAD/STORE helperc only, as these are the critical ones, and only
for the fast cases):

* mechanically use mutex lock/unlock in all LOAD/STORE helperc
  (this was just for the fun/for reference/for making ourselves
  despair :). The perf is (of course) horrible:
     bz2 is about 4 times slower
     big_parallel_sleepers is xxx times slower (due to contention)
  Totally unusable, as expected.

* just have an atomic instruction before/after the LOAD/STORE
  (so basically inlining the mutex fast case):
     bz2 is more than 2 times slower
  (did not measure big_parallel_sleepers, but for sure its perf is
  horrible, see below).

* when having just one atomic instruction per LOAD/STORE (so
  assuming we can have a lock-less algorithm with one atomic
  instruction):
     bz2 is 55% slower

* when having just one atomic instruction per STORE (assuming LOADs
  do not need atomicity if the guest process has no race condition;
  this looks a reasonable assumption):
     bz2 is 4% slower
     big_parallel_sleepers is still horribly slower (3 times in
     real time, 10 times in cpu)
  Unclear to me why bz2 is not slowed down more: can the cpu detect
  that no other thread can access the memory, and so do "less atomic
  work" for bz2 ? Or is it because big_parallel_sleepers is a
  horrible test case ? (It is just the bigcode1 perf test, but
  running 4 bigcode1 in parallel.)

* when having just one atomic instruction per STORE8 (assuming the
  other STORE16/32/64 do not need atomicity if the guest process is
  race free):
     bz2 is about 3% slower
     big_parallel_sleepers speed is acceptable (15% more cpu, real
     time divided by 2)
  I am however not at all convinced that only protecting the STORE8
  will avoid false positives (we could maybe tolerate a small rate
  of false negatives when running in parallel).

Even if protecting the STORE8 is good enough, any parallel algorithm
which works a lot with single bytes might be slowed down by a factor
of 10 or similar.

So, at this stage, it looks unclear to me that there is any approach
other than going for a 1-1 mapping for a parallel memcheck.
Now, similarly to the "with origin" / "without origin" variants, we
might have "with_1_1_mapping" / "without_1_1_mapping" STORE/LOAD
helperc variants (not necessarily linked to the scheduler policy; it
might be that a 1-1 mapping could even be faster in some cases than
the current scheme for the "serial" scheduler policy).

I start to believe that tt_fast and the transtab are "relatively
easy" to parallelise (the prototype is already more or less ok;
with a little bit of fine tuning, it should be acceptable) but that
the real challenges are:
   memcheck VA bits : the real performance challenge;
   proper testing of all that.

Philippe
|
|
From: Philippe W. <phi...@sk...> - 2012-06-15 21:19:48
|
On Thu, 2012-06-14 at 22:43 +0200, Julian Seward wrote:
> * You discuss in some detail, problems with multithreading TT/TC
>   and in particular the handling of chaining. My question:
>   instead of trying to incrementally "fix" the existing design,
>   do we maybe need to think about fundamentally re-architecting
>   it somehow?

On that fundamental re-architecting (not related specifically to
TT/TC): I am also somewhat afraid that the mtV prototype might be
uselessly complex. Maybe there is a simple(r) approach to make the
core multi threaded which I am not seeing: I am far from mastering
all the aspects of Valgrind, and I do not have big experience in
multi-threading.
Any alternate proposal is welcome (or any comments on the current
approach).

Philippe
|
|
From: Eliot M. <mo...@cs...> - 2012-06-15 21:51:46
|
On 6/15/2012 5:20 PM, Philippe Waroquiers wrote:
> On Thu, 2012-06-14 at 22:43 +0200, Julian Seward wrote:
>
>> * You discuss in some detail, problems with multithreading TT/TC
>>   and in particular the handling of chaining. My question:
>>   instead of trying to incrementally "fix" the existing design,
>>   do we maybe need to think about fundamentally re-architecting
>>   it somehow?

An approach I was thinking you might consider is:

- When a thread needs to JIT code, it JITs into memory space that
  it "owns" for the purpose, i.e., there are per-thread JIT buffers,
  possibly acquired (from time to time, in large enough chunks to
  mitigate overhead of acquiring them) from a global pool. Such
  acquisition would have to be synchronized (i.e., use a lock or
  something) but would be infrequent, with most work happening
  into local buffers.

- When a thread is done JITing a chunk, it installs it into a global
  lookup table atomically. It might be possible for two threads to
  try to JIT the same chunk at the same time. They might each
  generate code then both try to install, with one "winning" and one
  "losing". The "loser" would simply discard its translation and use
  the other one.

- Obviously this sometimes wastes work. If such waste is common, you
  could add an atomic operation that notes "thread xyz is working on
  translating this". Other threads would wait until xyz finishes.

I've not dug around in the code much, but it wouldn't surprise me if
significant data structures would have to become per-thread instead
of global/static. This could be painful to "fix", requiring more
arguments to calls, or lookups for per-thread data, etc.

Regards -- Eliot Moss
|
|
From: Philippe W. <phi...@sk...> - 2012-06-15 22:15:48
|
On Fri, 2012-06-15 at 17:51 -0400, Eliot Moss wrote:
> An approach I was thinking you might consider is:
>
> - When a thread needs to JIT code, it JITs into memory space that
> it "owns" for the purpose, i.e., there are per-thread JIT buffers,
> possibly acquired (from time to time, in large enough chunks to
> mitigate overhead of acquiring them) from a global pool. Such
> acquisition would have to be synchronized (i.e., use a lock or
> something) but would be infrequent, with most work happening
> into local buffers.
>
> - When a thread is done JITing a chunk, it installs it into a global
> lookup table atomically. It might be possible for two threads to
> try to JIT the same chunk at the same time. They might each generate
> code then both try to install, with one "winning" and one "losing".
> The "loser" would simply discard its translation and use the other
> one.
>
> - Obviously this sometimes wastes work. If such waste is common, you
> could add an atomic operation that notes "thread xyz is working on
> translating this". Other threads would wait until xyz finishes.
If I understand well, the idea is to have a (limited) nr of JITted
translations in a 'per thread' cache. Then from time to time,
a thread would push these JITted pieces to the common cache.
Effectively, having 'per thread' data is a very good way to avoid
contention.
Two difficulties however:
1. the VEX library itself is not thread safe. So, a mutex is needed
there in any case (or will need to make VEX thread safe, not
looked at all till now).
2. not too clear how to handle the t-chaining between
thread local JITted pieces and the global JITted pieces.
e.g. if we have a code A that calls B; we possibly have
the following chainings (where the L indicates it is in
the local thread cache of thread L) :
AL -> BL
AL -> B
A -> B
A -> BL
removing translation A or B in the global cache might e.g. imply
to touch the local caches (to remove the links). We cannot have
the thread L busy executing AL or BL during that time.
Similarly, if AL or BL are discarded, it implies to touch A or B
in the global cache.
Philippe
|
|
From: Julian S. <js...@ac...> - 2012-06-17 11:21:10
|
On Friday, June 15, 2012, Eliot Moss wrote:
I think Philippe already commented on this, but to add my 2 euro-cents ..
> An approach I was thinking you might consider is:
>
> - When a thread needs to JIT code, it JITs into memory space that
> it "owns" for the purpose, i.e., there are per-thread JIT buffers,
> possibly acquired (from time to time, in large enough chunks to
> mitigate overhead of acquiring them) from a global pool. Such
> acquisition would have to be synchronized (i.e., use a lock or
> something) but would be infrequent, with most work happening
> into local buffers.
>
> - When a thread is done JITing a chunk, it installs it into a global
> lookup table atomically. It might be possible for two threads to
> try to JIT the same chunk at the same time. They might each generate
> code then both try to install, with one "winning" and one "losing".
> The "loser" would simply discard its translation and use the other
> one.
>
> - Obviously this sometimes wastes work. If such waste is common, you
> could add an atomic operation that notes "thread xyz is working on
> translating this". Other threads would wait until xyz finishes.
This is a plausible scheme, as far as it goes. The drawback is that the
problem is more complex than merely JITting code, in 2 ways:
* at some arbitrary point after JITting (but often very soon), the new
code has to be patched ("translation chained") so it jumps directly to
downstream translations. So we want to avoid races during patching,
or perhaps use some glorified CAS scheme, since it's an atomic replacement
of a handful of insn bytes.
* even later on, code will need to be discarded. This happens when libraries
are unmapped, and in other situations. So the mapping tables then need to
be updated. Also, this is much more complex in the presence of chaining,
since we first need to un-chain any translations that jump to (iow, have
been chained to) the just about to be deleted translation(s).
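That un-chain-before-delete step implies per-translation bookkeeping: each
translation must remember its in-edges so the patch sites can be re-pointed
at the slow path before the translation is deleted. A minimal sketch of
that bookkeeping, with invented names (the real TT/TC code is of course far
more involved, and real patching would be an atomic rewrite of instruction
bytes rather than a plain store):

```c
#include <assert.h>
#include <stdlib.h>

/* Invented, simplified model of translation chaining: each
   translation keeps a list of "in-edges" (patch sites in other
   translations that jump directly to it).  Discarding a translation
   first walks this list and re-points every patch site at the slow
   path -- the "un-chain before delete" step described above. */
typedef struct InEdge {
   long *patch_site;        /* location holding the direct-jump target */
   struct InEdge *next;
} InEdge;

typedef struct {
   long entry;              /* stands in for the host code address */
   InEdge *in_edges;
} Translation;

enum { SLOW_PATH = 0 };     /* "jump back to the dispatcher" marker */

/* Chain 'from_site' (a patch site in some translation) to 'to'. */
static void chain(Translation *to, long *from_site) {
   *from_site = to->entry;
   InEdge *e = malloc(sizeof(InEdge));
   e->patch_site = from_site;
   e->next = to->in_edges;
   to->in_edges = e;
}

/* Un-chain everything pointing at 't', so it can be safely deleted. */
static void discard(Translation *t) {
   for (InEdge *e = t->in_edges; e != NULL; ) {
      InEdge *next = e->next;
      *e->patch_site = SLOW_PATH;   /* fall back to the dispatcher */
      free(e);
      e = next;
   }
   t->in_edges = NULL;
}
```

In an mtV the extra difficulty is that this walk must not race with another
thread executing through one of those patch sites, which is exactly why the
discard path is harder than the install path.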
> I've not dug around in the code much, but it wouldn't surprise me if
> significant data structures would have to become per-thread instead
> of global/static. This could be painful to "fix", requiring more
> arguments to calls, or lookups for per-thread data, etc.
Yeah, I also wouldn't be surprised to find that.
J
|
|
From: Julian S. <js...@ac...> - 2012-06-17 11:08:59
|
On Friday, June 15, 2012, Philippe Waroquiers wrote:
> The validation and subsequent reg-test is effectively critical.
> On that side, we have two aspects:
> * are the outer tools (helgrind/drd) ok ?
>   From what I can see, these tools are working quite well
>   on this problem. There are a few quirks (I will commit today
>   a small change to help outer helgrind doing early stack traces).
>   At the moment, two main things to look at:
>   1. how to ensure the outer tool is suppressing the errors
>      of the guest run by the inner.
>      (e.g. a race condition in glibc "printf").
>      (at least, we need a way to filter the races which seem
>      to originate from the guest process). Otherwise, we
>      have hundreds of incomprehensible races when running the
>      none tests.

glibc printf doesn't really have races. Problem is that glibc uses
its own in-line locking for files and hg/drd can't see that. So it
looks like it has a race.

> Maybe we need the inner to inform the outer of the relationship
> between JIT code and the guest code, so as to let the outer
> apply suppression rules on the JITted guest.

Yes, so we have a patch for that. Maybe we should clean it up and
apply it.

> 2. are the tools properly understanding the inner behaviour ?
>    E.g. it might be that the clone syscall will have to be
>    understood by the outer helgrind.

Yes, we need to make this work properly. We will have to be a bit
careful (when explaining to the tools that sys_clone creates a h-b
edge between parent and child) because apps that create threads
using pthread_create create an edge in the pthread_create wrapper.
pthread_create calls clone (obviously), so if we naively annotate
clone then we will have 2 mutually redundant h-b edges. Maybe it
doesn't matter.

There is a flag "--sim-hints=enable-outer" that we could maybe use
for this, to select at run-time the required behaviour.

> * Assuming the outer tools are ok : do we have enough regression
>   tests to find the race conditions ?
No we don't -- as you say, most of the tests are non-threaded. Many
of the tests test low-frequency (error case etc) paths in the code
base. The danger we have is that we need MT (hg/drd) coverage of
these, but almost all of them are single threaded.

> On this side, I vaguely thought of the possibility to
> automatically compile all the tests as a .so (transforming with a
> "sed" the main into a .so entry point), and then have a way to run
> together a bunch of these .so in parallel.
> These tests would not be ok as functional tests (they would very
> probably fail) but they might be useful in an outer helgrind/inner
> setup to find race conditions.

This could be a good thing -- probably some version of it is
necessary. However, imagine a test that tests some low frequency
path, and no other test checks the same path. Then the
big-so-on-Helgrind approach will not detect races in the path
(IIUC). Another option might be to make each test individually
multithreaded (by running N versions of main() in parallel). This
would give us a much better chance at these low-freq paths.

> On the TT/TC, I think we have two remaining problems:
>
> 1. race condition on the tt_fast. For this, I hope the "xor"
>    technique can be shown (proved?) as correct.

I should have commented on the xor thing before. Basically I didn't
understand it. My problem is that (G xor H, H) contains neither less
nor more information than (G, H) -- given any 2 of G, H, G xor H, we
can compute the third one. So I don't see how it helps.

Also, I didn't see any mention of memory barriers/fences in the
description, so it must be incomplete (?) since we can't specify
lockless algorithms without also saying where the fences must be.

> > Maybe Kim Hazelwood's PhD thesis contains some useful clues on
> > this? http://www.cs.virginia.edu/kim/docs/phd-thesis.pdf
>
> Quickly looked at it.
> On the thread side, the main reasoning
> is that the threads are not sharing a lot of code, and so
> a shared cache is not such a good idea. To me, it looks
> somewhat not intuitive that there is very little sharing between
> threads.

Yes, I agree. They are going to share libc stuff, + supporting
library code (X, Qt, etc).

> > * Memcheck. I suspect we will need to redesign and reimplement
> >   the shadow memory stuff (mc_main.c). It has many optimisations
> >   both to increase speed and reduce memory consumption
> >   (http://valgrind.org/docs/shadow-memory2007.pdf) and
> >   I believe not all of them will work in a MT environment.
> >
> >   The IR-level instrumentation stuff (mc_translate.c) should be
> >   pretty much unchanged.
> >
> >   IOW, Memcheck's helper functions are (very) racey, but the
> >   generated instrumentation code is not (I think).
> >
> >   One comment is (as you speculate) that we cannot afford to
> >   do atomic operations for each helperc_* call -- those are
> >   called once per memory reference of the original program,
> >   and this would kill performance.
>
> On this, I am wondering if we really have a huge problem
> for the "fast memcheck case".
> Basically, as long as the memory is < 32 Gb and the bytes
> are fully initialised, V and A bits can (I believe) be maintained
> without any locking/protection: if the guest process has no
> race condition, then modifying V and A bits will also not be
> racey (is this really true ?

I think this is probably true, not 100% sure though. For sure we
can never do 100% accurate simulation of compare-and-swap since
the shadow-CAS and real-CAS are necessarily non-atomic. But maybe
this doesn't matter much.

> 1 byte shadows 4 bytes. So
> 4 threads modifying each one byte might not be racey, but
> the underlying V/A byte will be shared and racey).

This is the main problem I identified, when looking at it. Yes.
> If we cannot use atomic instruction and the above is not ok, > I suspect we will have to go to a "one byte shadows one byte". Problem with that is, the current scheme is a big performance win because it generates fewer cache misses. /me really doesn't want to return to a 1:1 scheme. > The pointers to the secondary maps could be maintained with > a reader lock less approach (RCU, hazard pointer, ...). > (reader lock less just mean the pointer themselves are not > often changed. The pointed to data (the SecMap) is of course > changed a lot. ok. (if you say so!) J |
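To make the "1 byte shadows 4 bytes" collision concrete, here is a toy sketch (illustrative encoding, sizes and names only -- not mc_main.c's real vabits layout): two guest addresses that no guest-level race ever touches together still share one shadow byte, so the shadow update is a racy read-modify-write even when the guest program itself is race-free.

```c
#include <stdint.h>

/* Toy secondary map: one shadow byte covers 4 guest bytes,
   2 "vabits" per guest byte (illustrative only). */
static uint8_t shadow[1 << 16];

static void set_vabits2(uintptr_t addr, uint8_t vabits2) {
   uint8_t* sh    = &shadow[(addr >> 2) & 0xFFFF];
   unsigned shift = (unsigned)(addr & 3) * 2;
   /* Read-modify-write of a byte shared by 4 guest addresses.
      Two threads writing guest bytes 0x100 and 0x101 -- not a race
      at guest level -- both rewrite shadow[0x40] here. */
   *sh = (uint8_t)((*sh & ~(3u << shift)) | ((vabits2 & 3u) << shift));
}
```

With a "one byte shadows one byte" scheme the two updates would hit distinct shadow bytes, which is exactly the (cache-unfriendly) trade-off discussed above.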
|
From: Philippe W. <phi...@sk...> - 2012-06-17 13:11:47
|
(will reply in 2 other mails for the xor and for the memcheck aspects).

On Sun, 2012-06-17 at 13:08 +0200, Julian Seward wrote:
> > Maybe we need the inner to inform the outer of the relationship
> > between JIT code and the guest code, so as to let the outer
> > apply suppression rules on the JITted guest.
>
> Yes, so we have a patch for that. Maybe we should clean it up and
> apply it.

Yes, having the outer be able to understand the JITted code should
allow writing suppression files for an outer tool detecting (false or
true) errors in the guest process run by the inner (for an outer
helgrind, suppress races; for an outer memcheck, suppress undefined
jumps etc).

> if we naively annotate clone then we will have 2 mutually redundant
> h-b edges. Maybe it doesn't matter.
>
> There is a flag "--sim-hints=enable-outer" that we could maybe use
> for this, to select at run-time the required behaviour.

Effectively, if two redundant h-b edges are problematic, we could
activate this only with the enable-outer hint (or maybe helgrind could
recognise that the clone is due to a pthread call being done, using
e.g. an internal client request in the pthread_create intercept ?).

> This could be a good thing -- probably some version of it is
> necessary. However, imagine a test that tests some low frequency
> path, and no other test checks the same path. Then the
> big-so-on-Helgrind approach will not detect races in the path
> (IIUC). Another option might be to make each test individually
> multithreaded (by running N versions of main() in parallel). This
> would give us a much better chance at these low-freq paths.

Yes, having (many more) specific multi-threaded tests will be needed.
I hope that a ".so loader/test driver" will be able to automatically
parallelise many of the current regression tests (i.e. the main
becomes an entry point, the test driver loads this .so (and/or some
others) and then starts 1 or more threads executing these .so in
parallel (the same .so can be executed more than once)).
It for sure implies reworking some of the test cases (e.g. all global
variables in the tests might have to be marked with the __thread
attribute). And some hard-coded values (e.g. mmap addresses) might
have to be made thread specific.

> > Quickly looked at it. On the thread side, the main reasoning
> > is that the threads are not sharing a lot of code, and so
> > a shared cache is not such a good idea. To me, it looks
> > somewhat not intuitive that there is very little sharing between
> > threads.
>
> Yes, I agree. They are going to share libc stuff, + supporting
> library code (X, Qt, etc).

For sure, that is the case for any kind of "general" application using
threads. For applications which are using threads to parallelise
algorithms, each thread will run exactly the same bunch of code.
I suspect this case is becoming more and more frequent.

Philippe
|
|
From: Philippe W. <phi...@sk...> - 2012-06-17 13:30:21
|
On Sun, 2012-06-17 at 13:08 +0200, Julian Seward wrote:
> > On the TT/TC, I think we have two remaining problems:
> >
> > 1. race condition on the tt_fast. For this, I hope the "xor" technique
> > can be shown (proved?) as correct.
>
> I should have commented on the xor thing before. Basically I didn't
> understand it. My problem is that (G xor H, H) contains neither less
> nor more information than (G, H) -- given any 2 of G, H, G xor H, we
> can compute the third one. So I don't see how it helps. Also, I didn't
> see any mention of memory barriers/fences in the description, so it
> must be incomplete (?) since we can't specify lockless algorithms
> without also saying where the fences must be.
I understand the idea is:
We have an address G1, and we want to find H1.
G1 is (using shift and mask) used to index in the tt_fast array.
Today (trunk), the array contains pairs (G, H).
After indexing, the pair is checked to be valid by comparing
G1 and G: if equal, the pair can be used; otherwise
the pair cannot be used: it has been replaced by e.g. (G2,H2)
or never corresponded to (G1,H1).
With the xor technique:
Similarly to the above, G1 is used to index tt_fast.
However, the array contains pairs (G_xor_H, H).
The pair found from G1 can be:
  a valid pair for G1, i.e. (G1_xor_H1, H1)
  a valid pair but not for G1, e.g. one replaced by (G2_xor_H2, H2)
  any invalid pair such as (Gi_xor_Hi, Hj), (G1_xor_H1, Hj) or
  (Gi_xor_Hi, H1) (where i and j are whatever numbers != 1).
A pair can be invalid for whatever reason (e.g. in the middle of being
updated by another thread, or because the cpu has only "forwarded" half
of the stores because we have not used fence and atomic instructions).
The idea is that, in all of the above valid or invalid cases,
the xor check allows verifying that the pair is valid and corresponds
to G1, so that the H (H1) can be used.
Otherwise, the pair cannot be used, the asm dispatcher goes back
to C code, where the full transtab is searched, and tt_fast is
subsequently updated.
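[A minimal sketch of this lookup/validation scheme. The names, sizes and
index function are illustrative only, not Valgrind's actual m_transtab
structures, and no fences are shown -- which is exactly the open issue.]

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical fast-cache entry: stores (G xor H, H) rather than
   (G, H), so that validity can be re-derived from the entry itself. */
typedef struct {
   uintptr_t guest_xor_host;  /* G xor H */
   uintptr_t host;            /* H */
} FastCacheEntry;

#define N_FAST 1024  /* power of two, for shift-and-mask indexing */
static FastCacheEntry tt_fast[N_FAST];

static size_t fast_index(uintptr_t guest) {
   return (guest >> 2) & (N_FAST - 1);  /* shift and mask, as above */
}

/* Install a (G,H) pair: store (G xor H, H). */
static void fast_insert(uintptr_t guest, uintptr_t host) {
   FastCacheEntry* e = &tt_fast[fast_index(guest)];
   e->host = host;
   e->guest_xor_host = guest ^ host;
}

/* Lookup: recompute G from the stored pair and compare with G1.
   Returns H on a hit, 0 on a miss (fall back to the C slow path,
   which searches the full transtab and refreshes tt_fast). */
static uintptr_t fast_lookup(uintptr_t guest) {
   FastCacheEntry* e = &tt_fast[fast_index(guest)];
   if ((e->guest_xor_host ^ e->host) == guest)
      return e->host;
   return 0;
}
```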
The question is: can
Gi_xor_Hi xor Hj give G1
for any other combination than i = 1 and j = 1 ?
(we assume that all Gx are unique and all Hx are unique,
but that looks a sane assumption).
Philippe
|
|
From: Philippe W. <phi...@sk...> - 2012-06-18 06:17:14
|
On Sun, 2012-06-17 at 15:30 +0200, Philippe Waroquiers wrote:
> On Sun, 2012-06-17 at 13:08 +0200, Julian Seward wrote:
> > > On the TT/TC, I think we have two remaining problems:
> > >
> > > 1. race condition on the tt_fast. For this, I hope the "xor" technique
> > > can be shown (proved?) as correct.
> >
> > I should have commented on the xor thing before. Basically I didn't
> > understand it. My problem is that (G xor H, H) contains neither less
> > nor more information than (G, H) -- given any 2 of G, H, G xor H, we
> > can compute the third one. So I don't see how it helps. Also, I didn't
> > see any mention of memory barriers/fences in the description, so it
> > must be incomplete (?) since we can't specify lockless algorithms
> > without also saying where the fences must be.

Looking more in depth at that, I think this xor technique does not work
(it just decreases the probability of having an undetected race
condition, or I missed something in the paper describing this
technique).

So, for tt_fast, back to square 1.
Looks like my brain is not made to think about parallelism and race
conditions :(.

Philippe
|
|
From: John R. <jr...@bi...> - 2012-06-17 15:13:57
|
On 06/17/2012, Philippe Waroquiers wrote:

> I am however not at all convinced that only protecting the STORE8
> will avoid false positive (we could maybe tolerate a small rate
> of false negative when running in parallel).
> Even if protecting the STORE8 is good enough, then any parallel
> algorithm which works a lot with single bytes might be slowed
> down by a factor 10 or similar.

Use LoadLocked and StoreConditional instructions, as on MIPS.
These sense the state of the cache line, and implement a "greedy"
solution. Your code provides the fixup when greedy fails
(usually: try again, after re-fetch and re-modify.)

--
|
|
From: Philippe W. <phi...@sk...> - 2012-06-17 16:46:52
|
On Sun, 2012-06-17 at 08:14 -0700, John Reiser wrote:
> On 06/17/2012, Philippe Waroquiers wrote:
>
> > I am however not at all convinced that only protecting the STORE8
> > will avoid false positive (we could maybe tolerate a small rate
> > of false negative when running in parallel).
> > Even if protecting the STORE8 is good enough, then any parallel
> > algorithm which works a lot with single bytes might be slowed
> > down by a factor 10 or similar.
>
> Use LoadLocked and StoreConditional instructions, as on MIPS.
> These sense the state of the cache line, and implement a "greedy"
> solution. Your code provides the fixup when greedy fails
> (usually: try again, after re-fetch and re-modify.)

Are the LL/SC instructions not suffering from the same performance
degradation as the x86/amd64 atomic instructions ?
(cfr the unacceptable performance degradation in memcheck when adding
one atomic instruction in the STORE helperc).

I have very bad knowledge of all these atomic things and similar,
but I am guessing that there is no order-of-magnitude difference
of performance between these approaches.
From what I understood, on x86/amd64, we would need such operations
to be faster by two orders of magnitude to be usable for memcheck.

Philippe
|
|
From: Bart V. A. <bva...@ac...> - 2012-06-17 17:34:06
|
On 06/17/12 16:47, Philippe Waroquiers wrote:
> On Sun, 2012-06-17 at 08:14 -0700, John Reiser wrote:
>> On 06/17/2012, Philippe Waroquiers wrote:
>>
>>> I am however not at all convinced that only protecting the STORE8
>>> will avoid false positive (we could maybe tolerate a small rate
>>> of false negative when running in parallel).
>>> Even if protecting the STORE8 is good enough, then any parallel
>>> algorithm which works a lot with single bytes might be slowed
>>> down by a factor 10 or similar.
>>
>> Use LoadLocked and StoreConditional instructions, as on MIPS.
>> These sense the state of the cache line, and implement a "greedy"
>> solution. Your code provides the fixup when greedy fails
>> (usually: try again, after re-fetch and re-modify.)
>
> Are the LL/SC instructions not suffering from the same performance
> degradation as the x86/amd64 atomic instructions ?
> (cfr the unacceptable performance degration in memcheck when
> adding one atomic instruction in the STORE helperc).
>
> I have very bad knowledge of all these atomic things and similar,
> but I am guessing that there is no order of magnitude difference
> of performance between these approaches.
> From what I understood, on x86/amd64, we would need such operations
> to be faster by two order of magnitude to be usable for memcheck.

It looks like a good idea to me to read more about existing
multithreaded dynamic analysis tools first. An example:

Paul Sack et al., "Accurate and efficient filtering for the Intel
thread checker race detector", ASID '06: Proceedings of the 1st
workshop on Architectural and system support for improving software
dependability, pages 34-41, 2006.

Bart.
|
|
From: Julian S. <js...@ac...> - 2012-06-19 16:51:12
|
On Sunday, June 17, 2012, Philippe Waroquiers wrote:
> > Use LoadLocked and StoreConditional instructions, as on MIPS.
> > These sense the state of the cache line, and implement a "greedy"
> > solution. Your code provides the fixup when greedy fails
> > (usually: try again, after re-fetch and re-modify.)
>
> Are the LL/SC instructions not suffering from the same performance
> degradation as the x86/amd64 atomic instructions ?
> (cfr the unacceptable performance degration in memcheck when
> adding one atomic instruction in the STORE helperc).
>
> I have very bad knowledge of all these atomic things and similar,
> but I am guessing that there is no order of magnitude difference
> of performance between these approaches.

I would agree with your guess. In both cases (LL/SC or LOCK), some
kind of global communication across all cores is implied, in the
worst case.

One thing I do know (from testing on an old Core 2), is that the cost
of LOCK depends on how far down the cache hierarchy the line in
question is. In the case where the line is in that core's D1, then the
cost was "only" about 10-15 cycles instead of the often-rumoured 100+
cycles.

But I am not claiming to understand anything much about this, really;
in particular I have no idea if such performance variations are
something that it would be safe for us to depend on in the general
case.

J
|
|
From: Josef W. <Jos...@gm...> - 2012-06-19 19:22:05
|
On 19.06.2012 18:50, Julian Seward wrote:
> On Sunday, June 17, 2012, Philippe Waroquiers wrote:
>
>>> Use LoadLocked and StoreConditional instructions, as on MIPS.
>>> These sense the state of the cache line, and implement a "greedy"
>>> solution. Your code provides the fixup when greedy fails
>>> (usually: try again, after re-fetch and re-modify.)
>>
>> Are the LL/SC instructions not suffering from the same performance
>> degradation as the x86/amd64 atomic instructions ?
>> (cfr the unacceptable performance degration in memcheck when
>> adding one atomic instruction in the STORE helperc).
>>
>> I have very bad knowledge of all these atomic things and similar,
>> but I am guessing that there is no order of magnitude difference
>> of performance between these approaches.
>
> I would agree with your guess. In both cases (LL/SC or LOCK), some
> kind of global communication across all cores is implied, in the
> worst case.

Isn't the important thing the common case? If the address of a lock or
a memory cell to observe (LL/SC) is already owned (= exclusive/modified
state) by a private L1 of a core, another modification by that core
should not trigger any transaction to the outside, if the L1 is
write-back.

Hmm. In the end, it probably does not change much. More interesting
would be to use HW transactional memory for this (only available with
Haswell processors next year. Or BG/Q).

Josef

> One thing I do know (from testing on an old Core 2), is that the cost
> of LOCK depends on how far down the cache hierarchy the line in question
> is. In the case where the line is in that core's D1, then the cost was
> "only" about 10-15 cycles instead of the often-rumoured 100+ cycles.
>
> But I am not claiming to understand anything much about this, really;
> in particular I have no idea if such performance variations are something
> that it would be safe for us to depend on in the general case.
> J
|
|
From: John R. <jr...@bi...> - 2012-06-20 14:59:20
|
On 06/19/2012 09:50 AM, Julian Seward wrote:
> On Sunday, June 17, 2012, Philippe Waroquiers wrote:
>
>>> Use LoadLocked and StoreConditional instructions, as on MIPS.
>>> These sense the state of the cache line, and implement a "greedy"
>>> solution. Your code provides the fixup when greedy fails
>>> (usually: try again, after re-fetch and re-modify.)
>>
>> Are the LL/SC instructions not suffering from the same performance
>> degradation as the x86/amd64 atomic instructions ?
>> (cfr the unacceptable performance degration in memcheck when
>> adding one atomic instruction in the STORE helperc).
>>
>> I have very bad knowledge of all these atomic things and similar,
>> but I am guessing that there is no order of magnitude difference
>> of performance between these approaches.
>
> I would agree with your guess. In both cases (LL/SC or LOCK), some
> kind of global communication across all cores is implied, in the
> worst case.

Yes, in the worst case some kind of global communication across all
cores is required. But if there is no contention then the cost is very
low, often zero. This is faster than an x86-style bus-locked atomic
operation, which always costs at least 10 to 15 cycles even when there
is no contention. Providing hardware support to separate the read from
the write, and integrating the inspection of cache state into the
LL+SC pair, are big advantages.

When there _is_ contention, then [a single attempt at] LL+SC does not
have to be any slower than the normal communication that is required
to maintain cache coherency. So again, LL+SC provides atomic update at
very low additional cost, except for the software retry that is
required when there is an actual collision. A bus-locked atomic
operation guarantees forward progress for each contender [as far as
hardware is concerned], but at higher cost. LL+SC provides much lower
cost per attempt, but at most one contender will make progress during
a collision, and repeated attempts can be necessary for the same
logical operation.

> One thing I do know (from testing on an old Core 2), is that the cost
> of LOCK depends on how far down the cache hierarchy the line in question
> is. In the case where the line is in that core's D1, then the cost was
> "only" about 10-15 cycles instead of the often-rumoured 100+ cycles.
>
> But I am not claiming to understand anything much about this, really;
> in particular I have no idea if such performance variations are something
> that it would be safe for us to depend on in the general case.

When there are only about 5 architectures, then using "general case"
can be misleading. "Cost proportional to cache depth" is true in every
actual instance that I know. However, the constant can vary a lot,
even from implementation to implementation within the same
architecture. For instance, typical Intel Core 2 parts have a large L2
and no L3, but Core i3/i5/i7 have a medium-sized L2 and a
medium-to-large L3. On average, Core 2 can be faster per cycle for
some workloads, even with the same or smaller total cache size.

--
|
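[The retry pattern John describes can be sketched with GCC's __atomic
builtins, which compile to a LOCK'd compare-and-swap on x86/amd64 and
to an LL/SC loop on architectures that have one. This is a sketch of
the general pattern only, not a proposal for memcheck's hot path.]

```c
#include <stdint.h>

/* Optimistic read-modify-write with a retry loop: the software shape
   of an LL/SC sequence.  If the CAS fails, another thread updated the
   byte first; 'old' is refreshed by the failed CAS and we try again. */
static void atomic_or_byte(uint8_t* p, uint8_t bits) {
   uint8_t old = __atomic_load_n(p, __ATOMIC_RELAXED);
   while (!__atomic_compare_exchange_n(p, &old, (uint8_t)(old | bits),
                                       1 /* weak: spurious failure ok */,
                                       __ATOMIC_ACQ_REL, __ATOMIC_RELAXED))
      ;  /* retry with the refreshed 'old' */
}
```

Uncontended, the loop body runs once; under contention at most one
contender makes progress per round, matching John's description.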
|
From: Julian S. <js...@ac...> - 2012-06-19 16:39:54
|
On Sunday, June 17, 2012, Philippe Waroquiers wrote:
> (will reply in 2 other mails for the xor and for the memcheck aspects).
>
> On Sun, 2012-06-17 at 13:08 +0200, Julian Seward wrote:
> > > Maybe we need the inner to inform the outer of the relationship
> > > between JIT code and the guest code, so as to let the outer
> > > apply suppression rules on the JITted guest.
> >
> > Yes, so we have patch for that. Maybe we should clean it up and
> > apply it.
>
> Yes, having the outer be able to understand the JITted code
> should allow to write suppression files for an outer tool detecting
> (false or true) errors in the guest process run by the inner
> (for an outer helgrind, suppress races.
> for an outer memcheck, suppress undefined jumps etc).

Are you OK to action this, or is it something I should chase?

> > if we naively annotate clone then we will have 2 mutually redundant
> > h-b edges. Maybe it doesn't matter.
> >
> > There is a flag "--sim-hints=enable-outer" that we could maybe use
> > for this, to select at run-time the required behaviour.
>
> Effectively, if two h-b redundant is problematic, we could activate
> this only with enable-outer hint (or maybe helgrind could recognise
> that the clone is due to a pthread call being done using e.g. internal
> client request in the pthread_create intercept ?).

I suppose having two h-b edges is not a problem. What makes me nervous
about Helgrind are the correctness aspects, though .. I would prefer to
have only one h-b edge (from pthread_create or clone, but not both) so
that if there is some other kind of implementation error in hg, it will
not be hidden by the redundant edge. (Call me paranoid. I don't care :)

> > This could be a good thing -- probably some version of it is
> > necessary. However, imagine a test that tests some low frequency
> > path, and no other test checks the same path. Then the
> > big-so-on-Helgrind approach will not detect races in the path
> > (IIUC). Another option might be to make each test individually
> > multithreaded (by running N versions of main() in parallel). This
> > would give us a much better chance at these low-freq paths.
>
> Yes, having (many more) specific multi-threaded tests will be needed.
> I hope that a ".so loader/test driver" will be able to automatically
> parallelise many of the current regression tests
> (i.e. the main becomes an entry point, the test driver loads this .so
> (and/or some others) and then starts 1 or more threads executing
> these .so in parallel (same .so can be executed more than once).

Would be good if you can make it work. Sounds like a lot of difficult
Makefile.am hackery to me.

> > > Quickly looked at it. On the thread side, the main reasoning
> > > is that the threads are not sharing a lot of code, and so
> > > a shared cache is not such a good idea. To me, it looks
> > > somewhat not intuitive that there is very little sharing between
> > > threads.
> >
> > Yes, I agree. They are going to share libc stuff, + supporting
> > library code (X, Qt, etc).
>
> For sure, that is the case for any kind of "general" application using
> threads.
> For applications which are using threads to parallelise algorithms,
> each thread will run exactly the same bunch of code.
> I suspect this case is becoming more and more frequent.

I completely agree.

J
|
|
From: Josef W. <Jos...@gm...> - 2012-06-19 19:08:31
|
On 19.06.2012 18:38, Julian Seward wrote:
>>> Yes, I agree. They are going to share libc stuff, + supporting library
>>> code (X, Qt, etc).
>>
>> For sure, that is the case for any kind of "general" application using
>> threads.
>> For applications which are using threads to parallelise algorithms,
>> each thread will run exactly the same bunch of code.
>> I suspect this case is becoming more and more frequent.
>
> I completely agree.

Yes, it probably is mostly the same, but the question is how much code
is shared in absolute numbers, as otherwise the time for redundant
JITting can probably mostly be neglected. I have the impression that
typically a programmer limits the ranges of code which really run in
parallel, so as not to introduce bugs. With OpenMP, you usually
parallelize your code incrementally, loop after loop, and each loop
usually is restricted to some algorithm. Perhaps this is just my
HPC-centric view...

Josef
|
|
From: Philippe W. <phi...@sk...> - 2012-06-19 21:18:42
|
On Tue, 2012-06-19 at 18:38 +0200, Julian Seward wrote:
> On Sunday, June 17, 2012, Philippe Waroquiers wrote:
> Are you OK to action this, or is it something I should chase?

It does not look very difficult, so I can put this on the list of
things to do (one day :).

> Would be good if you can make it work. Sounds like a lot of difficult
> Makefile.am hackery to me.

Not straightforward, probably. There are for the moment enough tests
to show that there are still race conditions to look at, and perf
aspects (memcheck) to look at. So, maybe a volunteer will show up in
the meantime ? :)

Philippe
|
|
From: Julian S. <js...@ac...> - 2012-06-19 16:46:58
|
On Sunday, June 17, 2012, Philippe Waroquiers wrote:
> The question is: can
> Gi_xor_Hi xor Hj give G1
> for any other combination than i = 1 and j = 1 ?
> (we assume that all Gx are unique and all Hx are unique,

My intuitive feeling is "yes", although I can't think at the moment
how to prove that.

J
|
|
From: Patrick J. L. <lop...@gm...> - 2012-06-19 21:04:13
|
On Tue, Jun 19, 2012 at 9:45 AM, Julian Seward <js...@ac...> wrote:
> On Sunday, June 17, 2012, Philippe Waroquiers wrote:
>
>> The question is: can
>> Gi_xor_Hi xor Hj give G1
>> for any other combination than i = 1 and j = 1 ?
>> (we assume that all Gx are unique and all Hx are unique,
>
> My intuitive feeling is "yes", although I can't think at the
> moment how to prove that.

G1 = 00
G2 = 11
H1 = 01
H2 = 10

i = 2, j = 1

Gi ^ Hi ^ Hj = G2 ^ H2 ^ H1 = 11 ^ 10 ^ 01 = 00 = G1

Or did I misunderstand the question?

 - Pat
|
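[Pat's 2-bit counterexample, checked mechanically -- a standalone
verification added for this archive, not part of the thread. The
helper below expresses the tt_fast acceptance test from the earlier
mail: an entry (g_xor_h, h) is accepted for guest address g iff
g_xor_h ^ h == g.]

```c
/* The xor-validity check on a tt_fast entry: accept (g_xor_h, h)
   for guest address g iff the xor reconstructs g. */
static int looks_valid(unsigned g_xor_h, unsigned h, unsigned g) {
   return (g_xor_h ^ h) == g;
}
```

With Pat's values, a torn entry mixing the stale half G2^H2 with the
wrong host half H1 passes the check for G1 exactly as a genuine entry
would, which is what makes the scheme suspect.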
|
From: Philippe W. <phi...@sk...> - 2012-06-19 22:12:39
|
On Tue, 2012-06-19 at 14:04 -0700, Patrick J. LoPresti wrote:
> On Tue, Jun 19, 2012 at 9:45 AM, Julian Seward <js...@ac...> wrote:
> > On Sunday, June 17, 2012, Philippe Waroquiers wrote:
> >
> >> The question is: can
> >> Gi_xor_Hi xor Hj give G1
> >> for any other combination than i = 1 and j = 1 ?
> >> (we assume that all Gx are unique and all Hx are unique,
> >
> > My intuitive feeling is "yes", although I can't think at the
> > moment how to prove that.
>
> G1 = 00
> G2 = 11
> H1 = 01
> H2 = 10
>
> i = 2, j = 1
>
> Gi ^ Hi ^ Hj = G2 ^ H2 ^ H1 = 11 ^ 10 ^ 01 = 00 = G1
>
> Or did I misunderstand the question?

No, you understand the question well, and your example shows that for
a hash table with only one bucket, there is no way to differentiate a
correct G1 => (G1_xor_H1, H1) = (01, 01) from the incorrect
G1 => (G2_xor_H2, H1) = (01, 01).

Not completely 100% persuaded it is also broken for a "real hash
table" of two entries. (If you have a counter example for a 2-entry
hash table, then I will be fully desperate that this miracle solution
does not work :).

For reference, this xor technique is described in:
http://www.cis.uab.edu/hyatt/hashing.html
and seems to be somewhat used in the world of chess programs.
The above ref. tells the following:
"The current scheme eliminates the locking overhead completely, and
guarantees that there will be no corrupted entries that actually get
used by the search."
(still have a 0.00000001% hope this works :)

Philippe
|
|
From: Julian S. <js...@ac...> - 2012-06-19 22:39:45
|
On Wednesday, June 20, 2012, Philippe Waroquiers wrote:
> For reference, this xor technique is described in:
> http://www.cis.uab.edu/hyatt/hashing.html

I skimmed this, but failed to understand (1) why it works and (2) how
it can work without memory fences.

J
|