Thread: Re: [Sablevm-developer] Threading support in SableVM
From: Chris P. <chr...@ma...> - 2004-02-20 19:33:00
Clark VERBRUGGE wrote:
>> uniprocessor case). If they are on separate processors, in order to
>> meet the requirements of the JMM, each time a lock is acquired by a
>
> I may have missed some earlier emails on this, but in case it wasn't
> mentioned already, Doug Lea's JSR-133 cookbook shows some specific
> instructions for several architectures, along with precisely where they
> need to be to ensure everything respects the (new) JMM:
>
> http://gee.cs.oswego.edu/dl/jmm/cookbook.html
>
> A simple/sub-optimal approach to see if it helps is not too hard (for
> Intel, it looks like MFENCE before/after each load/store of a Java
> volatile should do it, given the CMPXCHG in monitor ops). Inserting
> appropriate sync instructions *efficiently* would make an interesting
> project by itself...
>
> That would fix the Java heap. But are not internal SableVM operations
> more of the problem? Are all shared data accesses (reads or writes)
> already protected by mutexes?

Hi Clark,

Replying to your mail publicly to sablevm-developer.

Thanks for the reminder about Doug Lea's cookbook. It actually indicates
that MFENCE is needed only for unlocking a thin lock (the path with no
compare-and-swap), because compare-and-swap already includes a memory
barrier. From the IA-32 manual on LOCK:

    For the P6 family processors, locked operations serialize all
    outstanding load and store operations (that is, wait for them to
    complete). This rule is also true for the Pentium 4 and Intel Xeon
    processors, with one exception: load operations that reference
    weakly ordered memory types (such as the WC memory type) may not
    be serialized.

I missed this earlier on.

However, I don't know if or how SableVM handles volatile Java variables.
Is somebody able to comment on this? If there is no explicit treatment
of volatile variables in SableVM, perhaps it is possible to enforce the
JMM rules by inserting dummy method calls (using Soot) wherever a memory
barrier is needed. If these method calls are synchronized, then in the
absence of a VM that recognizes them and translates them into
VM-specific memory barrier instructions, they will still enforce a
barrier, although it will be considerably slower.

Cheers,
Chris
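The cookbook rule Chris and Clark are discussing can be sketched in C. This is a hypothetical illustration, not SableVM code: on x86, the recipe amounts to a StoreLoad barrier after each Java volatile store (monitor enter/exit already gets its barrier from the locked CMPXCHG), while volatile loads need no hardware fence on x86's strong memory model.

```c
#include <stdatomic.h>

/* Hypothetical sketch of the cookbook recipe; shared_field stands in
 * for a Java volatile field.  Not SableVM's actual implementation. */
static int shared_field;

static void volatile_store(int value)
{
    shared_field = value;
    /* StoreLoad barrier after the store: compiles to MFENCE or a
     * LOCK-prefixed instruction on x86, and to the appropriate sync
     * instruction on weaker architectures. */
    atomic_thread_fence(memory_order_seq_cst);
}

static int volatile_load(void)
{
    /* On x86 no hardware fence is needed for loads; a compiler
     * barrier would still be required in real code to prevent the
     * compiler itself from reordering or caching the access. */
    return shared_field;
}
```

The fence placement follows the cookbook's table of required barriers; an efficient implementation would emit them only around actual volatile accesses rather than unconditionally.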
From: Chris P. <chr...@ma...> - 2004-02-20 19:42:24
Clark VERBRUGGE wrote:
> That would fix the Java heap. But are not internal SableVM operations
> more of the problem? Are all shared data accesses (reads or writes)
> already protected by mutexes?

I didn't answer your question. The answer is: I think so, but I don't
know. I think certain things are declared as volatile so that explicit
mutexes aren't required, but that this may be broken. There might be
problems with both thin locks and the thread status variable, amongst
other things. We'll see ... I think the first step is to make the Java
heap work. After that we can try to break things again.

Cheers,
Chris
From: Chris P. <chr...@ma...> - 2004-02-22 22:35:08
Chris Pickett wrote:
> Clark VERBRUGGE wrote:
>
>> That would fix the Java heap. But are not internal SableVM operations
>> more of the problem? Are all shared data accesses (reads or writes)
>> already protected by mutexes?
>
> I didn't answer your question. The answer is: I think so, but I don't
> know. I think certain things are declared as volatile so that explicit
> mutexes aren't required, but that this may be broken. There might be
> problems with both thin locks and the thread status variable, amongst
> other things. We'll see ... I think the first step is to make the Java
> heap work. After that we can try to break things again.

Just an update on what I've found out, but first a summary for those who
don't want to read all this:

The problems seem to be in thread startup, and there might be something
wrong with splay_tree.c. I put barriers in what I think are the right
places, so I don't think the problems are JMM-related. I could be wrong
about this, especially if the splay_tree problem violates the JMM rules
about GC (see the "miscellany" at the end of the cookbook).

Details of the JMM / locking studies follow, and after that some
debugging information (limited, because debugging these intermittent
errors is frustrating). I don't really have any useful leads on what
needs to be fixed anymore, so I'm going to stop now. I think careful
study and/or unit testing of splay_tree.c might help -- I've done
neither of these things. One of the more repeatable errors involves an
"impossible control flow" in interpreter.c -- but I can't get an
instruction trace on this, so it seems useless.

As a side note, Etienne, it seems strange to me that under contention
you don't inflate the actual thin lock being released in your algorithm,
but you do inflate all of the other locks that the thread owns. I think
it means that if two threads were competing for just one lock, it would
never get inflated.
Cheers,
Chris

======================================================================

I implemented _svmf_store_load_barrier() in system.c. It only works for
i386 at the moment. Functions to implement StoreStore, LoadLoad, and
LoadStore barriers are necessary for other processors (these barriers
are no-ops on x86-PO, and the Athlon MP known as tofu.cs.mcgill.ca is an
x86-PO). See the JSR-133 cookbook for details. The MFENCE instruction is
illegal on this processor, so I'm using a locked add instead (as per the
JSR-133 cookbook) -- it should be portable across all x86-POs. Calling
pthread_mutex_lock()/unlock() also imposes a StoreLoad barrier.

These are the changes I made in my sandbox:

* Implemented _svmf_store_load_barrier() for i386 in system.c

* Commented _svmf_enter_object_monitor() after having gone through
  Bacon's, Onodera's, and Etienne's papers thoroughly. It should be easy
  to understand _svmf_exit_object_monitor() based on it.

* Inserted StoreLoad barriers on either side of setting the contention
  bit in _svmf_enter_object_monitor() -- recommended by Onodera

* Inserted StoreLoad barriers on either side of clearing the contention
  bit in _svmf_exit_object_monitor() -- recommended by Onodera

* Inserted StoreLoad barriers on either side of freeing a thin lock in
  _svmf_exit_object_monitor() -- recommended by Onodera

* Inserted a StoreLoad barrier AFTER decrementing the recursive count of
  a thin lock in _svmf_exit_object_monitor() -- recommended by JSR-133

This didn't fix the test cases I have.

I then looked at putting StoreLoad barriers after volatile stores.
Although information about volatile fields is available at method
preparation time, it is annoying to try to make volatile-safe versions
of all the (get|put)(field|static) instructions (especially since I
don't believe this to be the cleanest solution). So I just (temporarily)
put a StoreLoad barrier between *every* instruction in the main dispatch
loop. This still didn't fix things.
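A StoreLoad barrier of the kind Chris describes might look like the following. This is a hypothetical sketch, not the actual _svmf_store_load_barrier() from system.c: on pre-SSE2 x86 parts, where MFENCE is an illegal instruction, any LOCK-prefixed read-modify-write serializes outstanding loads and stores, so a locked add of zero to the top of the stack serves as a full barrier.

```c
/* Hypothetical sketch of a portable x86 StoreLoad barrier (the real
 * _svmf_store_load_barrier() in system.c may differ).  The "memory"
 * clobber also stops the compiler from reordering accesses across
 * the barrier. */
static inline void store_load_barrier(void)
{
#if defined(__i386__)
    /* Locked add of 0 to the top of the stack: a no-op with full
     * serializing semantics; legal on all x86-PO processors. */
    __asm__ __volatile__("lock; addl $0,0(%%esp)" : : : "memory");
#elif defined(__x86_64__)
    __asm__ __volatile__("lock; addl $0,0(%%rsp)" : : : "memory");
#else
    __sync_synchronize();   /* GCC full memory barrier elsewhere */
#endif
}
```

The locked add touches memory the thread already owns (its own stack), so it serializes without disturbing any shared cache line.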
I think ALL of the JNI functions are safe because they all call
_svmf_resuming_java(), which immediately calls _svmm_compare_and_swap()
(except for the ones in invoke_interface.c, but I don't think that's
where the problem lies).

The JSR-133 cookbook says Thread.start() and Thread.join() need barriers
... but ... we don't appear to have a Thread.join() method, and
Thread.start() calls pthread_create(). No errors looking for
Thread.join() either.

Finally, the last relevant points of the miscellany at the end of the
JSR-133 cookbook address synchronization with garbage collectors, but we
have a stop-the-world collector and I don't see where this could be
unsafe. I even tried disabling GC altogether. Note that I got a couple
of core dumps where things died in splay_tree.c.

The problems seem to be in thread startup, and there might be something
wrong with splay_tree.c.

1) sablevm-chris-switch-debug Incrementer 628

   sablevm: INTERNAL ERROR (source file "interpreter.c", line 330):
       impossible control flow
   500
   Aborted

   (gdb) bt
   #0  0x4012c781 in kill () from /lib/libc.so.6
   #1  0x400b2e5e in pthread_kill () from /lib/libpthread.so.0
   #2  0x400b3339 in raise () from /lib/libpthread.so.0
   #3  0x4012dbe1 in abort () from /lib/libc.so.6
   #4  0x4008ba78 in _svmh_fatal_error (filename=0x400954c1 "interpreter.c",
       linenumber=330, msg=0x4008e56a "impossible control flow") at fatal.c:29
   #5  0x4006f391 in _svmf_interpreter (_env=0x8066df0) at interpreter.c:330
   #6  0x400280cc in _svmh_invoke_static_virtualmachine_runthread (env=0x8066df0)
       at method_invoke.c:5065
   #7  0x4001f897 in _svmf_thread_start (_env=0x8066df0) at thread.c:1427
   #8  0x400b00ba in pthread_start_thread () from /lib/libpthread.so.0
   #9  0x401d2d6a in clone () from /lib/libc.so.6
   (gdb) q

2) sablevm-chris-switch-debug Incrementer (I don't have the
   stdout/stderr for this):

   (gdb) bt
   #0  0x40130781 in kill () from /lib/libc.so.6
   #1  0x400b6e5e in pthread_kill () from /lib/libpthread.so.0
   #2  0x400b7339 in raise () from /lib/libpthread.so.0
   #3  0x40131be1 in abort () from /lib/libc.so.6
   #4  0x4008e952 in _svmf_tree_splay_gc_map (proot=0x40098b01, node=0x14a)
       at splay_tree.c:408
   #5  0x40070fd1 in _svmf_interpreter (_env=0x8066e00)
       at instructions_preparation_switch_threaded.c:286
   #6  0x40028e2b in _svmh_invoke_static_virtualmachine_runthread (env=0x8066e00)
       at method_invoke.c:5065
   #7  0x4001f897 in _svmf_thread_start (_env=0x8066e00) at thread.c:1427
   #8  0x400b40ba in pthread_start_thread () from /lib/libpthread.so.0
   #9  0x401d6d6a in clone () from /lib/libc.so.6
   (gdb)

3) sablevm-chris-switch-debug ThreadStarter

   Exception in thread "Thread-1" java.lang.NullPointerException
      at IncrementRunnable.run (ThreadStarter.java:51)
      at java.lang.Thread.run (Thread.java:670)
      at java.lang.VMThread.callRun (VMThread.java:116)
      at java.lang.Thread.callRun (Thread.java:343)
      at java.lang.VirtualMachine.runThread (VirtualMachine.java:117)
From: Archie C. <ar...@de...> - 2004-02-23 01:56:57
Chris Pickett wrote:
> As a side note, Etienne, it seems strange to me that under contention,
> you don't inflate the actual thin lock being released in your algorithm,
> but you inflate all of the other locks that the thread owns. I think it
> means that if two threads were competing for just one lock, it would
> never get inflated.

It's not necessary to inflate the thin lock. If thread #1 owns it and
thread #2 contends for it, thread #2 sets the contention flag and puts
itself on thread #1's queue. Thread #1 will wake up thread #2 when it
releases the lock (after seeing it in the queue), and then thread #2
will acquire the thin lock. Only if thread #1 sees threads waiting on
*different* objects must thread #1 inflate the associated lock.

In effect, the condition variable associated with thread #1 itself
(env->contention.requester.cond) functions as the monitor in this case.
I.e., if you didn't care about efficiency, you could just never inflate
any locks, and instead rely on the owning thread's condition variable.
But then you'd have efficiency problems in cases where one thread owned
a bunch of different locks at the same time -- all other threads waiting
on any of the held locks would get woken up any time any one individual
lock was released by the thread. With the hybrid thin/fat lock scheme,
only the threads that should be woken up ever are.

The inflating of the "other" locks is to avoid this inefficiency. But
it's not necessary to do this for threads waiting on the actual lock
being released, because you do indeed want those threads to wake up.

(This is my understanding, Etienne please correct if necessary...)

-Archie

__________________________________________________________________________
Archie Cobbs  *  CTO, Awarix  *  http://www.awarix.com
From: Etienne G. <gag...@uq...> - 2004-02-23 04:07:53
Archie Cobbs wrote:
> (This is my understanding, Etienne please correct if necessary...)

Archie, you have given the same explanation I gave at my Ph.D. defense!
I don't have anything to add. :-)

Etienne

--
Etienne M. Gagnon, Ph.D.     http://www.info.uqam.ca/~egagnon/
SableVM:                     http://www.sablevm.org/
SableCC:                     http://www.sablecc.org/
From: Chris P. <chr...@ma...> - 2004-02-23 04:25:17
Archie Cobbs wrote:
> Chris Pickett wrote:
>
>> As a side note, Etienne, it seems strange to me that under contention,
>> you don't inflate the actual thin lock being released in your algorithm,
>> but you inflate all of the other locks that the thread owns. I think it
>> means that if two threads were competing for just one lock, it would
>> never get inflated.
>
> The inflating of the "other" locks is to avoid this inefficiency.
> But it's not necessary to do this for threads waiting on the actual
> lock being released, because you do indeed want those threads to
> wake up.

Hi Archie (and Etienne),

I understand and agree with most if not all of that, certainly
pertaining to the other locks. I guess I was wondering: wouldn't things
be *faster* if you inflated the actual released thin lock, because you
wouldn't have to take the (what I assumed/am assuming to be slow)
inflation path for that lock again when future contention arose?
(Because it's apparently likely to arise.) Or is there some reason why
the inflation path is faster than the fat lock path? That's what I still
don't understand. If you're telling me it simply doesn't make a
difference, then fair enough. Anyway, I suppose it's an experiment to
try.

Something else that occurred to me: one of the supposed benefits of
Bacon's/Onodera's/Etienne's thin lock algorithm is that you don't always
require an atomic operation on unlocking. But since we're introducing
barriers now anyway, maybe a different algorithm altogether would be
better, at least for multiprocessors.

So many details, so little time ... must ... not ... touch ...
multiprocessor issues ... anymore.

Cheers,
Chris
From: Archie C. <ar...@de...> - 2004-02-23 15:57:12
Chris Pickett wrote:
>> The inflating of the "other" locks is to avoid this inefficiency.
>> But it's not necessary to do this for threads waiting on the actual
>> lock being released, because you do indeed want those threads to
>> wake up.
>
> I understand and agree with most if not all of that, certainly
> pertaining to the other locks. I guess I was wondering, wouldn't things
> be *faster* if you inflated the actual released thin lock, because you
> wouldn't have to take the (what I assumed/am assuming to be slow)
> inflation path for that lock again when future contention arose?

A lock can only be inflated once, so it's never slower to delay this
inflation to a later point. But more importantly, the inflation does not
necessarily have to happen. When thread #1 releases the thin lock, it
wakes up thread #2. Then thread #2 retries its lock acquisition "from
the beginning". If no other thread beats it out, then thread #2 sees an
unlocked thin lock and acquires it using compare-and-swap. So no
inflation takes place in this case.

-Archie

__________________________________________________________________________
Archie Cobbs  *  CTO, Awarix  *  http://www.awarix.com
From: Chris P. <chr...@ma...> - 2004-02-23 20:38:19
Attachments:
Incrementer.java
Archie Cobbs wrote:
> Chris Pickett wrote:
>
>>> The inflating of the "other" locks is to avoid this inefficiency.
>>> But it's not necessary to do this for threads waiting on the actual
>>> lock being released, because you do indeed want those threads to
>>> wake up.
>>
>> I understand and agree with most if not all of that, certainly
>> pertaining to the other locks. I guess I was wondering, wouldn't things
>> be *faster* if you inflated the actual released thin lock, because you
>> wouldn't have to take the (what I assumed/am assuming to be slow)
>> inflation path for that lock again when future contention arose?
>
> A lock can only be inflated once. So it's never slower to delay this
> inflation to a later point.

(Not CCing anyone -- the list seems fairly stable these days -- please
let me know if you want to be CC'd, Archie, Etienne -- I will ask Clark
to subscribe to the list.)

I tried it, and have a test case where it's faster for me to inflate
than not (attached).

with inflation:

    real    2m31.746s
    user    2m13.740s
    sys     0m58.580s

without inflation:

    real    2m39.046s
    user    2m51.030s
    sys     0m57.200s

That's with the extra atomic operations on an MP system (timing
information is screwed up, but I think it's safe to say that it's still
faster). If I remove the barriers, and use the MP system again ...

with inflation:

    real    2m10.773s
    user    2m0.070s
    sys     0m33.280s

without inflation:

    real    2m41.173s
    user    2m51.180s
    sys     0m57.890s

which actually surprised me! I don't really want to investigate this
further right now, determine if it affects real-world multithreaded
benchmarks, try it on uniprocessors, etc., but I don't think we can
categorically say that not inflating is always faster. In fact, I think
that as a unit the inflation path is slower than the fat lock path (and
by not inflating we continually take the inflation path). At the same
time, I think *deflation* could be a nice win, and Onodera's original
algorithm actually supports this.

Cheers,
Chris
From: Archie C. <ar...@de...> - 2004-02-23 23:32:13
Chris Pickett wrote:
> I tried it, and have a test case where it's faster for me to inflate
> than not (attached).

Interesting! Numbers are always the final arbiter :-)

-Archie

__________________________________________________________________________
Archie Cobbs  *  CTO, Awarix  *  http://www.awarix.com
From: Etienne G. <gag...@uq...> - 2004-02-23 04:20:03
Commenting on the summary.

Chris Pickett wrote:
> The problems seem to be in thread startup, and there might be something
> wrong with splay_tree.c. I put barriers in what I think are the right
> places, so I don't think the problems are JMM-related. I could be wrong
> about this, especially if the splay_tree problem violates the JMM rules
> about GC (see the "miscellany" at the end of the cookbook).

A splay tree is an unbalanced binary tree with two neat properties:

1. The cost of n operations (additions, searches, deletions) is
   O(n log n); in other words, the *amortized* cost per operation is
   O(log n).

2. The tree has a "cache" property: every time a node is accessed, the
   tree is reorganized so that this node becomes the root. This leads
   to a tree where frequently accessed nodes are near the root.
   Empirical measurements show that this can result in much faster
   access than in traditional balanced binary trees (depending on the
   application, of course).

You probably noticed the important point above: the tree is modified on
all accesses. This means that, in a concurrent setting, a "read lock" is
not sufficient to access the data structure. Have you double-checked
that your problem is not related to a concurrent or unsynchronized
access to a splay tree somewhere in SableVM? Maybe I forgot to acquire a
lock somewhere?

Etienne

--
Etienne M. Gagnon, Ph.D.     http://www.info.uqam.ca/~egagnon/
SableVM:                     http://www.sablevm.org/
SableCC:                     http://www.sablecc.org/
From: Chris P. <chr...@ma...> - 2004-02-23 06:39:28
Etienne Gagnon wrote:
> Commenting on the summary.
>
> Chris Pickett wrote:
>
>> The problems seem to be in thread startup, and there might be
>> something wrong with splay_tree.c. I put barriers in what I think are
>> the right places, so I don't think the problems are JMM-related. I
>> could be wrong about this, especially if the splay_tree problem
>> violates the JMM rules about GC (see the "miscellany" at the end of
>> the cookbook).
>
> A splay tree is an unbalanced binary tree which has neat properties:
> 1- the cost of n operations (additions, searches, deletions) is
>    O(n log n), or in other words, the *amortized* cost per operation
>    is O(log n).
> 2- This tree has a "cache" property; every time a node is accessed, the
>    tree is reorganized such that this node becomes root, this leads to
>    a tree where frequently accessed nodes are near the root.
>
> You probably noticed the important point above: the tree is modified on
> all accesses. This means that, in a concurrent setting, a "read lock" is
> not sufficient to access the data structure. Have you double checked that
> your problem is not related to a concurrent or unsynchronized access to
> a splay tree, somewhere in SableVM? Maybe I forgot to acquire a lock
> somewhere?

Damn. I think this error is fake. I was looking at old core files, and
it seems gdb produces erroneous information if the binary has since
changed (understandable, but maybe gdb should compute a checksum on the
binary or something ...). The errors with runThread are not fake!

In any case, I added your explanation to splay_tree.m4.c, so not all is
wasted.

Cheers,
Chris
From: Etienne G. <gag...@uq...> - 2004-02-23 05:25:51
Chris Pickett wrote:
> ... One of the more repeatable errors involves an "impossible
> control flow" in interpreter.c -- but I can't get an instruction trace
> on this so it seems useless.

Can't you at least call _svmf_dump_stack_trace() [or something like
that, I don't remember the exact name]?

--
Etienne M. Gagnon, Ph.D.     http://www.info.uqam.ca/~egagnon/
SableVM:                     http://www.sablevm.org/
SableCC:                     http://www.sablecc.org/
From: Chris P. <chr...@ma...> - 2004-02-23 06:05:41
Attachments:
Incrementer.java
Etienne Gagnon wrote:
> Chris Pickett wrote:
>
>> ... One of the more repeatable errors involves an "impossible control
>> flow" in interpreter.c -- but I can't get an instruction trace on this
>> so it seems useless.
>
> Can't you at least call _svmf_dump_stack_trace() [or something like that,
> I don't remember the exact name]?

    (Thread.java:-1)           java/lang/Thread.run
    (VMThread.java:116)        java/lang/VMThread.callRun
    (Thread.java:343)          java/lang/Thread.callRun
    (VirtualMachine.java:117)  java/lang/VirtualMachine.runThread

    sablevm: INTERNAL ERROR (source file "interpreter.c", line 333):
        impossible control flow

    (gdb) bt
    #0  0x4012e781 in kill () from /lib/libc.so.6
    #1  0x400b4e5e in pthread_kill () from /lib/libpthread.so.0
    #2  0x400b5339 in raise () from /lib/libpthread.so.0
    #3  0x4012fbe1 in abort () from /lib/libc.so.6
    #4  0x4008cabd in _svmh_fatal_error (filename=0x40096c61 "interpreter.c",
        linenumber=333, msg=0x4008faca "impossible control flow") at fatal.c:29
    #5  0x400701c0 in _svmf_interpreter (_env=0x8066e00) at interpreter.c:333
    #6  0x40028e2b in _svmh_invoke_static_virtualmachine_runthread (env=0x8066e00)
        at method_invoke.c:5065
    #7  0x4001f897 in _svmf_thread_start (_env=0x8066e00) at thread.c:1427
    #8  0x400b20ba in pthread_start_thread () from /lib/libpthread.so.0
    #9  0x401d4d6a in clone () from /lib/libc.so.6
    (gdb)

So it looks like the same kind of error as for the NullPointerException
that gets thrown by the ThreadStarter.java test I sent earlier on.
(Test case attached for future reference.)