sablevm-developer Mailing List for SableVM (Page 19)
From: Archie C. <ar...@de...> - 2004-02-23 15:57:12
Chris Pickett wrote:
>> The inflating of the "other" locks is to avoid this inefficiency.
>> But it's not necessary to do this for threads waiting on the actual
>> lock being released, because you do indeed want those threads to
>> wake up.
>
> I understand and agree with most if not all of that, certainly
> pertaining to the other locks. I guess I was wondering, wouldn't things
> be *faster* if you inflated the actual released thin lock, because you
> wouldn't have to take the (what I assumed/am assuming to be slow)
> inflation path for that lock again when future contention arose?

A lock can only be inflated once, so it's never slower to delay this
inflation to a later point. But more importantly, the inflation does not
necessarily have to happen. When thread #1 releases the thin lock, it
wakes up thread #2. Then thread #2 retries its lock acquisition "from
the beginning". If no other thread beats it to the lock, then thread #2
sees an unlocked thin lock and acquires it using compare-and-swap. So no
inflation takes place in this case.

-Archie
__________________________________________________________________________
Archie Cobbs * CTO, Awarix * http://www.awarix.com
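The retry described above can be sketched with a compare-and-swap. This is a minimal illustration using C11 atomics, not SableVM's actual code; the lock-word layout and the function name are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical thin-lock word: 0 means unlocked, otherwise it holds the
 * owning thread's id.  (A real thin lock also packs a recursion count
 * and a contention/inflation bit; omitted here for clarity.) */
typedef _Atomic uintptr_t lockword_t;

/* One attempt of the retry "from the beginning": after being woken, the
 * waiter simply re-runs the compare-and-swap.  Returns 1 if the thin
 * lock was acquired, 0 if another thread won the race (in which case
 * the waiter parks itself again) -- no inflation is needed either way. */
static int try_acquire_thin_lock(lockword_t *word, uintptr_t my_id)
{
    uintptr_t expected = 0;  /* expect "unlocked" */
    return atomic_compare_exchange_strong(word, &expected, my_id);
}
```

If the CAS fails, the waiter simply re-parks and tries again after the next wakeup; only contention on *other* objects forces inflation, as discussed elsewhere in this thread.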
From: Chris P. <chr...@ma...> - 2004-02-23 06:39:28
Etienne Gagnon wrote:
> Commenting on the summary.
>
> Chris Pickett wrote:
>
>> The problems seem to be in thread startup, and there might be
>> something wrong with splay_tree.c. I put barriers in what I think are
>> the right places, so I don't think the problems are JMM-related. I
>> could be wrong about this, especially if the splay_tree problem
>> violates the JMM rules about GC (see the "miscellany" at the end of
>> the cookbook).
>
> A splay tree is an unbalanced binary tree which has neat properties:
> 1- The cost of n operations (additions, searches, deletions) is
>    O(n log n); in other words, the *amortized* cost per operation
>    is O(log n).
> 2- The tree has a "cache" property: every time a node is accessed, the
>    tree is reorganized so that this node becomes the root. This leads
>    to a tree where frequently accessed nodes are near the root.
>    Empirical measurements show that this can result in much faster
>    access than in traditional balanced binary trees (depending on the
>    application, of course).
>
> You probably noticed the important point above: the tree is modified on
> all accesses. This means that, in a concurrent setting, a "read lock" is
> not sufficient to access the data structure. Have you double-checked that
> your problem is not related to a concurrent or unsynchronized access to
> a splay tree, somewhere in SableVM? Maybe I forgot to acquire a lock
> somewhere?

Damn. I think this error is fake. I was looking at old core files, and
it seems gdb produces erroneous information if the binary has since
changed (understandable, but maybe gdb should compute a checksum on the
binary or something...). The errors with runThread are not fake! In any
case, I added your explanation to splay_tree.m4.c, so not all is wasted.

Cheers,
Chris
From: Chris P. <chr...@ma...> - 2004-02-23 06:05:41
Etienne Gagnon wrote:
> Chris Pickett wrote:
>
>> ... One of the more repeatable errors involves an "impossible control
>> flow" in interpreter.c -- but I can't get an instruction trace on this
>> so it seems useless.
>
> Can't you at least call _svmf_dump_stack_trace() [or something like
> that, I don't remember the exact name]?

(Thread.java:-1) java/lang/Thread.run
(VMThread.java:116) java/lang/VMThread.callRun
(Thread.java:343) java/lang/Thread.callRun
(VirtualMachine.java:117) java/lang/VirtualMachine.runThread

sablevm: INTERNAL ERROR (source file "interpreter.c", line 333):
impossible control flow

(gdb) bt
#0  0x4012e781 in kill () from /lib/libc.so.6
#1  0x400b4e5e in pthread_kill () from /lib/libpthread.so.0
#2  0x400b5339 in raise () from /lib/libpthread.so.0
#3  0x4012fbe1 in abort () from /lib/libc.so.6
#4  0x4008cabd in _svmh_fatal_error (filename=0x40096c61 "interpreter.c",
    linenumber=333, msg=0x4008faca "impossible control flow") at fatal.c:29
#5  0x400701c0 in _svmf_interpreter (_env=0x8066e00) at interpreter.c:333
#6  0x40028e2b in _svmh_invoke_static_virtualmachine_runthread (env=0x8066e00)
    at method_invoke.c:5065
#7  0x4001f897 in _svmf_thread_start (_env=0x8066e00) at thread.c:1427
#8  0x400b20ba in pthread_start_thread () from /lib/libpthread.so.0
#9  0x401d4d6a in clone () from /lib/libc.so.6
(gdb)

So it looks like the same kind of error as for the NullPointerException
that gets thrown by ThreadStarter.java that I sent earlier on. (Test
case attached for future reference.)
From: Etienne G. <gag...@uq...> - 2004-02-23 05:25:51
Chris Pickett wrote:
> ... One of the more repeatable errors involves an "impossible
> control flow" in interpreter.c -- but I can't get an instruction trace
> on this so it seems useless.

Can't you at least call _svmf_dump_stack_trace() [or something like
that, I don't remember the exact name]?

--
Etienne M. Gagnon, Ph.D.             http://www.info.uqam.ca/~egagnon/
SableVM:                                       http://www.sablevm.org/
SableCC:                                       http://www.sablecc.org/
From: Chris P. <chr...@ma...> - 2004-02-23 04:25:17
Archie Cobbs wrote:
> Chris Pickett wrote:
>
>> As a side note, Etienne, it seems strange to me that under contention,
>> you don't inflate the actual thin lock being released in your
>> algorithm, but you inflate all of the other locks that the thread
>> owns. I think it means that if two threads were competing for just one
>> lock, it would never get inflated.
>
> The inflating of the "other" locks is to avoid this inefficiency.
> But it's not necessary to do this for threads waiting on the actual
> lock being released, because you do indeed want those threads to
> wake up.

Hi Archie (and Etienne),

I understand and agree with most if not all of that, certainly
pertaining to the other locks. I guess I was wondering: wouldn't things
be *faster* if you inflated the actual released thin lock, because you
wouldn't have to take the (what I assumed/am assuming to be slow)
inflation path for that lock again when future contention arose?
(Because it's apparently likely to arise.) Or is there some reason why
the inflation path is faster than the fat lock path? That's what I still
don't understand. If you're telling me it simply doesn't make a
difference, then fair enough. Anyway, I suppose it's an experiment to
try.

Something else that occurred to me: one of the supposed benefits of
Bacon's/Onodera's/Etienne's thin lock algorithm is that you don't always
require an atomic operation on unlocking. But since we're introducing
barriers now anyway, maybe a different algorithm altogether would be
better, at least for multiprocessors. So many details, so little
time... must... not... touch... multiprocessor issues... anymore.

Cheers,
Chris
From: Etienne G. <gag...@uq...> - 2004-02-23 04:20:03
Commenting on the summary.

Chris Pickett wrote:
> The problems seem to be in thread startup, and there might be something
> wrong with splay_tree.c. I put barriers in what I think are the right
> places, so I don't think the problems are JMM-related. I could be wrong
> about this, especially if the splay_tree problem violates the JMM rules
> about GC (see the "miscellany" at the end of the cookbook).

A splay tree is an unbalanced binary tree which has neat properties:

1- The cost of n operations (additions, searches, deletions) is
   O(n log n); in other words, the *amortized* cost per operation is
   O(log n).
2- The tree has a "cache" property: every time a node is accessed, the
   tree is reorganized so that this node becomes the root. This leads to
   a tree where frequently accessed nodes are near the root. Empirical
   measurements show that this can result in much faster access than in
   traditional balanced binary trees (depending on the application, of
   course).

You probably noticed the important point above: the tree is modified on
all accesses. This means that, in a concurrent setting, a "read lock" is
not sufficient to access the data structure. Have you double-checked
that your problem is not related to a concurrent or unsynchronized
access to a splay tree, somewhere in SableVM? Maybe I forgot to acquire
a lock somewhere?

Etienne
--
Etienne M. Gagnon, Ph.D.             http://www.info.uqam.ca/~egagnon/
SableVM:                                       http://www.sablevm.org/
SableCC:                                       http://www.sablecc.org/
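To make the "modified on all accesses" point concrete, here is a toy move-to-root tree in C. This is an illustration only, not SableVM's splay_tree.m4.c, and it uses simple single rotations rather than the zig-zig/zig-zag steps that give real splay trees their O(log n) amortized bound -- but the side effect is the same: even a pure lookup rewrites the tree.

```c
#include <stddef.h>
#include <stdlib.h>

/* Illustrative node type; SableVM's actual tree differs in detail. */
typedef struct node {
    int key;
    struct node *left, *right;
} node;

static node *rotate_right(node *t)
{ node *l = t->left; t->left = l->right; l->right = t; return l; }

static node *rotate_left(node *t)
{ node *r = t->right; t->right = r->left; r->left = t; return r; }

/* Simplified "move-to-root" access: on the way back up from the
 * recursion, the sought node is rotated one level closer to the root,
 * so a successful lookup leaves it AT the root.  This is why a read
 * lock is not enough in a concurrent setting: lookups mutate. */
static node *access(node *t, int key)
{
    if (t == NULL || t->key == key)
        return t;
    if (key < t->key) {
        if (t->left == NULL)
            return t;                  /* key absent: stop here */
        t->left = access(t->left, key);
        return rotate_right(t);
    } else {
        if (t->right == NULL)
            return t;
        t->right = access(t->right, key);
        return rotate_left(t);
    }
}

/* Plain BST insert (no splaying), enough to build a test tree. */
static node *insert(node *t, int key)
{
    if (t == NULL) {
        node *n = malloc(sizeof *n);
        n->key = key;
        n->left = n->right = NULL;
        return n;
    }
    if (key < t->key)
        t->left = insert(t->left, key);
    else if (key > t->key)
        t->right = insert(t->right, key);
    return t;
}
```

After `t = access(t, 5);` the node holding 5 is the new root -- so two threads doing "read-only" lookups concurrently are in fact racing on writes to the same pointers.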
From: Etienne G. <gag...@uq...> - 2004-02-23 04:07:53
Archie Cobbs wrote:
> (This is my understanding, Etienne please correct if necessary...)

Archie, you have given the same explanation I gave at my Ph.D. defense!
I don't have anything to add. :-)

Etienne
--
Etienne M. Gagnon, Ph.D.             http://www.info.uqam.ca/~egagnon/
SableVM:                                       http://www.sablevm.org/
SableCC:                                       http://www.sablecc.org/
From: Archie C. <ar...@de...> - 2004-02-23 01:56:57
Chris Pickett wrote:
> As a side note, Etienne, it seems strange to me that under contention,
> you don't inflate the actual thin lock being released in your
> algorithm, but you inflate all of the other locks that the thread owns.
> I think it means that if two threads were competing for just one lock,
> it would never get inflated.

It's not necessary to inflate the thin lock. If thread #1 owns it and
thread #2 contends for it, thread #2 sets the contention flag and puts
itself on thread #1's queue. Thread #1 will wake up thread #2 when it
releases the lock (after seeing it in the queue), and then thread #2
will acquire the thin lock. Only if thread #1 sees threads waiting on
*different* objects must thread #1 inflate the associated locks.

In effect, the condition variable associated with thread #1 itself
(env->contention.requester.cond) functions as the monitor in this case.
I.e., if you didn't care about efficiency, you could just never inflate
any locks and instead rely on the owning thread's condition variable.
But then you'd have efficiency problems in cases where one thread owned
a bunch of different locks at the same time -- all other threads waiting
on any of the held locks would get woken up any time any one individual
lock was released by the thread. With the hybrid thin/fat lock scheme,
only the threads that should be woken up ever are.

The inflating of the "other" locks is to avoid this inefficiency. But
it's not necessary to do this for threads waiting on the actual lock
being released, because you do indeed want those threads to wake up.

(This is my understanding, Etienne please correct if necessary...)

-Archie
__________________________________________________________________________
Archie Cobbs * CTO, Awarix * http://www.awarix.com
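The wake-versus-inflate rule above can be modeled in a few lines. This is a toy, single-threaded model with hypothetical names, not SableVM's code: real code would signal pthread condition variables and manipulate lock words, where here we just set flags to show which decision is taken for each waiter.

```c
#include <stddef.h>

/* Each waiter parked on the owner's queue records which object it
 * wants.  On release of `obj`: waiters for `obj` are woken so they can
 * retry the thin lock; objects that OTHER waiters want are inflated to
 * fat locks, so those waiters will block on a per-object monitor
 * instead of the releasing thread's condition variable. */
struct object {
    int inflated;               /* set when the lock becomes fat */
};

struct waiter {
    struct object *wanted;      /* object this waiter contends for */
    int woken;                  /* set when we would signal it */
    struct waiter *next;
};

static void release_thin_lock(struct object *obj, struct waiter *queue)
{
    for (struct waiter *w = queue; w != NULL; w = w->next) {
        if (w->wanted == obj)
            w->woken = 1;       /* wake: it retries the CAS itself */
        else
            w->wanted->inflated = 1;  /* inflate the *other* lock */
    }
}
```

Note the asymmetry the thread discusses: the released lock itself stays thin (its waiters just retry), while locks of unrelated objects get inflated so their waiters stop piggybacking on this thread's condition variable.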
From: Chris P. <chr...@ma...> - 2004-02-22 22:35:08
Chris Pickett wrote:
> Clark VERBRUGGE wrote:
>
>> That would fix the Java heap. But are not internal SableVM operations
>> more of the problem? Are all shared data accesses (reads or writes)
>> already protected by mutexes?
>
> I didn't answer your question. The answer is: I think so, but I don't
> know. I think certain things are declared as volatile so that explicit
> mutexes aren't required, but that this may be broken. There might be
> problems with both thin locks and the thread status variable, amongst
> other things. We'll see... I think the first step is to make the Java
> heap work. After that we can try to break things again.

Just an update on what I've found out, but first a summary for those who
don't want to read all this:

The problems seem to be in thread startup, and there might be something
wrong with splay_tree.c. I put barriers in what I think are the right
places, so I don't think the problems are JMM-related. I could be wrong
about this, especially if the splay_tree problem violates the JMM rules
about GC (see the "miscellany" at the end of the cookbook). Details of
JMM / locking studies follow, and after that some debugging information
(limited, because debugging these intermittent errors is frustrating).
I don't really have any useful leads on what needs to be fixed anymore,
so I'm going to stop now. I think careful study and/or unit testing of
splay_tree.c might help -- I've done neither of these things. One of the
more repeatable errors involves an "impossible control flow" in
interpreter.c -- but I can't get an instruction trace on this, so it
seems useless.

As a side note, Etienne, it seems strange to me that under contention,
you don't inflate the actual thin lock being released in your algorithm,
but you inflate all of the other locks that the thread owns. I think it
means that if two threads were competing for just one lock, it would
never get inflated.

Cheers,
Chris

======================================================================

I implemented _svmf_store_load_barrier() in system.c. It only works for
i386 at the moment. Functions to implement StoreStore, LoadLoad, and
LoadStore barriers are necessary for other processors (these barriers
are no-ops on x86-PO, and the Athlon MP known as tofu.cs.mcgill.ca is an
x86-PO). See the JSR-133 cookbook for details. The MFENCE instruction is
illegal on this processor, so I'm using a locked add instead (as per the
JSR-133 cookbook) -- it should be portable across all x86-POs. Calling
pthread_mutex_lock()/unlock() also imposes a StoreLoad barrier.

These are the changes I made in my sandbox:

* Implemented _svmf_store_load_barrier() for i386 in system.c
* Commented _svmf_enter_object_monitor() after having gone through
  Bacon's, Onodera's, and Etienne's papers thoroughly. It should be easy
  to understand _svmf_exit_object_monitor() based on it.
* Inserted StoreLoad barriers on either side of setting the contention
  bit in _svmf_enter_object_monitor() -- recommended by Onodera
* Inserted StoreLoad barriers on either side of clearing the contention
  bit in _svmf_exit_object_monitor() -- recommended by Onodera
* Inserted StoreLoad barriers on either side of freeing a thin lock in
  _svmf_exit_object_monitor() -- recommended by Onodera
* Inserted a StoreLoad barrier AFTER decrementing the recursive count of
  a thin lock in _svmf_exit_object_monitor() -- recommended by JSR-133

This didn't fix the test cases I have. I then looked at putting
StoreLoad barriers after volatile stores. Although information about
volatile fields is available at method preparation time, it is annoying
to try to make volatile-safe versions of all the (get|put)(field|static)
instructions (especially since I don't believe this to be the cleanest
solution). So I just (temporarily) put a StoreLoad barrier between
*every* instruction in the main dispatch loop. This still didn't fix
things.

I think ALL of the JNI functions are safe because they all call
_svmf_resuming_java(), which immediately calls _svmm_compare_and_swap()
(except for the ones in invoke_interface.c, but I don't think that's
where the problem lies). The JSR-133 cookbook says Thread.start() and
Thread.join() need barriers... but we don't appear to have a
Thread.join() method, and Thread.start() calls pthread_create(). No
errors looking for Thread.join() either. Finally, the last relevant
points of the miscellany at the end of the JSR-133 cookbook address
synchronization with garbage collectors, but we have a stop-the-world
collector and I don't see where this could be unsafe. I even tried
disabling GC altogether. Note that I got a couple of core dumps where
things died in splay_tree.c.

The problems seem to be in thread startup, and there might be something
wrong with splay_tree.c.

1) sablevm-chris-switch-debug Incrementer

628
sablevm: INTERNAL ERROR (source file "interpreter.c", line 330):
impossible control flow
500
Aborted

(gdb) bt
#0  0x4012c781 in kill () from /lib/libc.so.6
#1  0x400b2e5e in pthread_kill () from /lib/libpthread.so.0
#2  0x400b3339 in raise () from /lib/libpthread.so.0
#3  0x4012dbe1 in abort () from /lib/libc.so.6
#4  0x4008ba78 in _svmh_fatal_error (filename=0x400954c1 "interpreter.c",
    linenumber=330, msg=0x4008e56a "impossible control flow") at fatal.c:29
#5  0x4006f391 in _svmf_interpreter (_env=0x8066df0) at interpreter.c:330
#6  0x400280cc in _svmh_invoke_static_virtualmachine_runthread (env=0x8066df0)
    at method_invoke.c:5065
#7  0x4001f897 in _svmf_thread_start (_env=0x8066df0) at thread.c:1427
#8  0x400b00ba in pthread_start_thread () from /lib/libpthread.so.0
#9  0x401d2d6a in clone () from /lib/libc.so.6
(gdb) q

2) sablevm-chris-switch-debug Incrementer (I don't have the
stdout/stderr for this):

(gdb) bt
#0  0x40130781 in kill () from /lib/libc.so.6
#1  0x400b6e5e in pthread_kill () from /lib/libpthread.so.0
#2  0x400b7339 in raise () from /lib/libpthread.so.0
#3  0x40131be1 in abort () from /lib/libc.so.6
#4  0x4008e952 in _svmf_tree_splay_gc_map (proot=0x40098b01, node=0x14a)
    at splay_tree.c:408
#5  0x40070fd1 in _svmf_interpreter (_env=0x8066e00)
    at instructions_preparation_switch_threaded.c:286
#6  0x40028e2b in _svmh_invoke_static_virtualmachine_runthread (env=0x8066e00)
    at method_invoke.c:5065
#7  0x4001f897 in _svmf_thread_start (_env=0x8066e00) at thread.c:1427
#8  0x400b40ba in pthread_start_thread () from /lib/libpthread.so.0
#9  0x401d6d6a in clone () from /lib/libc.so.6
(gdb)

3) sablevm-chris-switch-debug ThreadStarter

Exception in thread "Thread-1" java.lang.NullPointerException
   at IncrementRunnable.run (ThreadStarter.java:51)
   at java.lang.Thread.run (Thread.java:670)
   at java.lang.VMThread.callRun (VMThread.java:116)
   at java.lang.Thread.callRun (Thread.java:343)
   at java.lang.VirtualMachine.runThread (VirtualMachine.java:117)
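For reference, a StoreLoad barrier of the kind the message describes (a LOCK-prefixed add in place of MFENCE) might look like the sketch below. This is not SableVM's actual _svmf_store_load_barrier(); the C11 fallback is anachronistic for a 2004 codebase but is the portable equivalent today.

```c
#include <stdatomic.h>

/* StoreLoad barrier sketch.  On x86, a LOCK-prefixed add to the stack
 * serializes outstanding loads and stores and works on processors that
 * lack MFENCE (the JSR-133 cookbook's suggestion).  Elsewhere, fall
 * back to a C11 sequentially consistent fence. */
static inline void store_load_barrier(void)
{
#if defined(__GNUC__) && defined(__i386__)
    __asm__ __volatile__ ("lock; addl $0,(%%esp)" : : : "memory", "cc");
#elif defined(__GNUC__) && defined(__x86_64__)
    __asm__ __volatile__ ("lock; addl $0,(%%rsp)" : : : "memory", "cc");
#else
    atomic_thread_fence(memory_order_seq_cst);
#endif
}
```

The "memory" clobber also stops the compiler itself from reordering accesses across the barrier, which matters just as much as the hardware fence.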
From: Chris P. <chr...@ma...> - 2004-02-20 19:42:24
Clark VERBRUGGE wrote:
> That would fix the Java heap. But are not internal SableVM operations
> more of the problem? Are all shared data accesses (reads or writes)
> already protected by mutexes?

I didn't answer your question. The answer is: I think so, but I don't
know. I think certain things are declared as volatile so that explicit
mutexes aren't required, but that this may be broken. There might be
problems with both thin locks and the thread status variable, amongst
other things. We'll see... I think the first step is to make the Java
heap work. After that we can try to break things again.

Cheers,
Chris
From: Chris P. <chr...@ma...> - 2004-02-20 19:33:00
Clark VERBRUGGE wrote:
>> uniprocessor case). If they are on separate processors, in order to
>> meet the requirements of the JMM, each time a lock is acquired by a
>
> I may have missed some earlier emails on this, but in case it wasn't
> mentioned already, Doug Lea's JSR-133 cookbook shows some specific
> instructions for several architectures, along with precisely where they
> need to be to ensure everything respects the (new) JMM:
>
> http://gee.cs.oswego.edu/dl/jmm/cookbook.html
>
> A simple/sub-optimal approach to see if it helps is not too hard (for
> Intel, it looks like MFENCE before/after each load/store of a Java
> volatile should do it, given the cmpxchg in monitor ops). Inserting
> appropriate sync instructions *efficiently* would make an interesting
> project by itself...
>
> That would fix the Java heap. But are not internal SableVM operations
> more of the problem? Are all shared data accesses (reads or writes)
> already protected by mutexes?

Hi Clark,

Replying to your mail publicly to sablevm-developer. Thanks for the
reminder about Doug Lea's cookbook. It actually indicates that MFENCE is
needed only for unlocking a thin lock (no C&S), because C&S already
includes a memory barrier. From the IA-32 manual on LOCK:

    For the P6 family processors, locked operations serialize all
    outstanding load and store operations (that is, wait for them to
    complete). This rule is also true for the Pentium 4 and Intel Xeon
    processors, with one exception: load operations that reference
    weakly ordered memory types (such as the WC memory type) may not be
    serialized.

I missed this earlier on. However, I don't know if or how SableVM
handles volatile Java variables. Is somebody able to comment on this?
If there is no explicit treatment of volatile variables in SableVM,
perhaps it is possible to enforce the JMM rules by inserting dummy
method calls (using Soot) wherever an MB is needed. If these method
calls are synchronized, then in the absence of a VM that recognizes them
and translates them into VM-specific MB instructions, they will still
enforce an MB, although it will be considerably slower.

Cheers,
Chris
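If SableVM were to treat volatile fields explicitly, the interpreter's (get|put)(field|static) paths could route through small helpers along these lines. The names are hypothetical, and C11 atomics stand in for whatever barrier macros the VM would actually use; on x86, the sequentially consistent store compiles to a locked instruction or an MFENCE, which is exactly the StoreLoad barrier the cookbook asks for after a volatile store.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical volatile-field accessors for a C interpreter, following
 * the JSR-133 cookbook's conservative rule: full ordering around
 * volatile stores and loads.  C11 seq_cst accesses provide this. */
static inline void putfield_volatile_int(_Atomic int32_t *field, int32_t v)
{
    atomic_store_explicit(field, v, memory_order_seq_cst);
}

static inline int32_t getfield_volatile_int(_Atomic int32_t *field)
{
    return atomic_load_explicit(field, memory_order_seq_cst);
}
```

The point of routing through helpers is that only fields marked volatile at method-preparation time pay the barrier cost, instead of fencing between every instruction in the dispatch loop.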
From: Greg A. <de...@ga...> - 2004-02-20 18:03:14
Package: sablevm
Version: 1.0.9+svn20040120-2
Severity: important
Tags: sid

http://packages.debian.org/unstable/virtual/java2-runtime states that
sablevm provides java2-runtime. I don't know how that page was
generated, but it appears to be incorrect. Is that page libelous, or
does sablevm provide a java2-runtime and just fail to advertise it?

From /var/lib/apt/lists/ftp.us.debian.org_debian_dists_unstable_main_binary-i386_Packages:

    Provides: java-virtual-machine, java1-runtime, java-runtime

Note that /usr/share/doc/debian-policy/virtual-package-names-list.txt.gz
from debian-policy 3.6.1.0 indicates that java-runtime is not a valid
virtual package:

    Java and virtual machines
    -------------------------
    java-compiler         a java compiler, for Java version 1
    java2-compiler        a java compiler, for Java version 2
    java-virtual-machine  a JAVA virtual machine
    java1-runtime         a Java runtime environment, Java version 1
    java2-runtime         a Java runtime environment, Java version 2

Does sablevm really implement a novel variety of Java that cannot be
described in the approved virtual packages? Or is this perhaps the
result of an accidental application of a debian-experimental change to
debian-unstable? Note that this makes all packages with Depends:
java2-runtime uninstallable, as sablevm is the only installation
candidate. If anyone cares to tell my ignorant self how to tell Debian
that I have installed java2-runtime on my own, I'd be much obliged. :)

Thanks,
- Greg

-- System Information:
Debian Release: 3.0
Architecture: i386
Kernel: Linux stallion 2.4.24 #4 Wed Feb 4 10:01:12 EST 2004 i686
Locale: LANG=C, LC_CTYPE=C

Versions of packages sablevm depends on:
ii  java-common   0.14                 Base of all Java packages
ii  libc6         2.3.2.ds1-11         GNU C Library: Shared libraries an
ii  libffi2       1:3.3.3-1            Foreign Function Interface library
ii  libltdl3      1.5.2-1              A system independent dlopen wrappe
ii  libpopt0      1.7-2                lib for parsing cmdline parameters
ii  libsablevm1   1.0.9+svn20040120-2  Free implementation of JVM second
ii  unzip         5.50-1               De-archiver for .zip files

-- no debconf information
From: Archie C. <ar...@de...> - 2004-02-20 15:09:19
Chris Pickett wrote:
> On an SMP machine, the "main memory" is the non-cache heap memory,
> visible to all processors. Threads may reside on the same processor,
> and if this is the case, their "working memories" are also visible to
> each other, and no problems arise (functionally identical to the
> uniprocessor case). If they are on separate processors, in order to
> meet the requirements of the JMM, each time a lock is acquired by a
> thread, ALL of the lines brought into the cache by the current thread
> as a result of reading values from the Java heap must be flushed: they
> are written back to main memory if they were touched by the thread,
> otherwise they are simply invalidated. When a lock is released, ONLY
> those lines associated with the thread that have been modified since
> acquiring the lock need flushing.
>
> So ... assuming that's now correct, I think there are three things
> that we might do to consider (some or all of which may be gibberish):
>
> 1) Flush the entire processor cache as part of each MONITORENTER and
>    MONITOREXIT, or when entering or leaving a synchronized method.
>    This would involve calling / executing one of:
>    a) WBINVD (not available in user mode),
>    b) CLFLUSH on the entire cache (one line at a time),
>    c) a kernel whole-cache flush routine,
>    d) flooding the cache by reading in a bunch of non-Java-heap data.
>
> 2) Keep track of which Java heap addresses are read / written by a
>    thread, and flush only the cache lines that match those addresses
>    as part of MONITORENTER / MONITOREXIT, or when entering or leaving
>    a synchronized method. This would involve calling / executing:
>    a) CLFLUSH for each line,
>    b) a line-specific kernel cache flush routine.
>
> 3) Use the memory barrier instructions:
>    a) MFENCE on each (Java only?) lock/unlock ensures that all loads
>       and stores occurring before the lock/unlock are globally visible
>       before any load or store that follows the MFENCE.
>    b) *** While it appears that SFENCE (identical to MFENCE except
>       only stores are serialized) might be appropriate for the unlock
>       operation, this would mean a load operation depending on a store
>       ordered before the MFENCE might occur out-of-order, which would
>       be bad. ***
>
> Finally: After I wrote this, I looked again at question #118 of the
> comp.programming.threads FAQ, and it seems to agree with what I've
> written, and also makes me think that method (3) is the best.

That is consistent with my understanding as well. I think #3 is best
too, and it's probably OK to "punt" and say that a processor-specific
instruction sequence will be required for the read and write barriers
(and therefore an additional porting task). FYI, the Linux kernel has
examples of asm() statements that create memory barriers for all its
supported architectures.

-Archie
__________________________________________________________________________
Archie Cobbs * CTO, Awarix * http://www.awarix.com
From: Chris P. <chr...@ma...> - 2004-02-20 08:12:59
Archie Cobbs wrote:
> Chris Pickett wrote:
>
>> "Locking any lock conceptually flushes all variables from a thread's
>> working memory, and unlocking any lock forces the writing out to main
>> memory of all variables that the thread has assigned."
>>
>> I'm pretty sure that when the spec says "working memory" it does not
>> mean "processor cache" but "thread-local heap". SableVM doesn't have
>> thread-local heaps (we discussed it the other day), and to me a large
>> part of the JMM appears to be unimportant for SableVM.
>
> I think that's mistaken... by "working memory" they just mean
> a conceptual "working memory" that only the one thread has access to.
> I.e., the processor cache metaphor is a good one here.
> Thread-local heaps are a different topic I think.

Okay, I think I finally understand. Thank you all for your patience...
and I apologize for all the long emails (they are coming to an end, as I
think a reasonable solution is close). I'll first explain my current
conception of things -- I'd be grateful for any comment as to whether
this sounds right or not.

In SableVM, all threads access the same heap, or "main memory". At the
hardware level, when a heap memory location is read into the cache, this
is the same operation as bringing the value into the Java thread's
"working memory". On a uniprocessor, there is only one cache, so in
effect all threads' "working memories" are visible to each other (and so
it is as if there are no working memories at all). Okay, actually the
cache might consist of L1 and L2, but that doesn't matter w.r.t.
visibility.

On an SMP machine, the "main memory" is the non-cache heap memory,
visible to all processors. Threads may reside on the same processor, and
if this is the case, their "working memories" are also visible to each
other, and no problems arise (functionally identical to the uniprocessor
case). If they are on separate processors, in order to meet the
requirements of the JMM, each time a lock is acquired by a thread, ALL
of the lines brought into the cache by the current thread as a result of
reading values from the Java heap must be flushed: they are written back
to main memory if they were touched by the thread, otherwise they are
simply invalidated. When a lock is released, ONLY those lines associated
with the thread that have been modified since acquiring the lock need
flushing.

So... assuming that's now correct, I think there are three things we
might consider doing (some or all of which may be gibberish):

1) Flush the entire processor cache as part of each MONITORENTER and
   MONITOREXIT, or when entering or leaving a synchronized method. This
   would involve calling / executing one of:
   a) WBINVD (not available in user mode),
   b) CLFLUSH on the entire cache (one line at a time),
   c) a kernel whole-cache flush routine,
   d) flooding the cache by reading in a bunch of non-Java-heap data.

2) Keep track of which Java heap addresses are read / written by a
   thread, and flush only the cache lines that match those addresses as
   part of MONITORENTER / MONITOREXIT, or when entering or leaving a
   synchronized method. This would involve calling / executing:
   a) CLFLUSH for each line,
   b) a line-specific kernel cache flush routine.

3) Use the memory barrier instructions:
   a) MFENCE on each (Java only?) lock/unlock ensures that all loads and
      stores occurring before the lock/unlock are globally visible
      before any load or store that follows the MFENCE.
   b) *** While it appears that SFENCE (identical to MFENCE except only
      stores are serialized) might be appropriate for the unlock
      operation, this would mean a load operation depending on a store
      ordered before the MFENCE might occur out-of-order, which would be
      bad. ***

Finally: after I wrote this, I looked again at question #118 of the
comp.programming.threads FAQ; it seems to agree with what I've written,
and also makes me think that method (3) is the best.

http://www.lambdacs.com/cpt/FAQ.html
(careful, it will probably kill Mozilla; lynx is better)

Cheers,
Chris
From: Archie C. <ar...@de...> - 2004-02-20 06:26:17
For anyone who may be interested... Announcing JC 1.0.0.

JC is a Java virtual machine implementation that converts class files
into C source files using the Soot Java bytecode analysis framework,
compiles them with GCC, and loads them using a built-in ELF object file
loader. JC utilizes the GNU Classpath class library and provides support
for ``most'' features you would expect, such as reflection, user class
loaders, etc.

More info available at: http://jcvm.sourceforge.net/

Cheers,
-Archie
__________________________________________________________________________
Archie Cobbs * CTO, Awarix * http://www.awarix.com
From: Archie C. <ar...@de...> - 2004-02-20 05:53:35
Chris Pickett wrote:
> "Locking any lock conceptually flushes all variables from a thread's
> working memory, and unlocking any lock forces the writing out to main
> memory of all variables that the thread has assigned."
>
> I'm pretty sure that when the spec says "working memory" it does not
> mean "processor cache" but "thread-local heap". SableVM doesn't have
> thread-local heaps (we discussed it the other day), and to me a large
> part of the JMM appears to be unimportant for SableVM.

I think that's mistaken... by "working memory" they just mean a
conceptual "working memory" that only the one thread has access to.
I.e., the processor cache metaphor is a good one here. Thread-local
heaps are a different topic, I think.

-Archie
__________________________________________________________________________
Archie Cobbs * CTO, Awarix * http://www.awarix.com
From: Chris P. <chr...@ma...> - 2004-02-20 04:48:16
Etienne Gagnon wrote:
> Hi All,
>
> First, I'll be away for a little over 2 weeks, so don't expect any
> reply from me.

Have fun!

> See comments below.
>
> Chris Pickett wrote:
>
>> ...
>> For the Pentium 4, Intel Xeon, and P6 family processors, if the area
>> of memory being locked during a LOCK operation is cached in the
>> processor that is performing the LOCK operation as write-back memory
>> and is completely contained in a cache line, the processor may not
>> assert the LOCK# signal on the bus....
>> ...
>> So that means it looks like the problem is elsewhere (e.g. writes to
>> "xxx.flag" that Etienne mentioned). ...
>
> Hmmm... I'm not convinced. Have you actually read the JVM spec for
> the JMM (memory model)? Acquiring a thin lock should cause all
> UNRELATED memory content to be gotten from main memory after the lock,
> and unlocking should do the reverse. SableVM does not do any of this
> currently.

You mean in Section 8.9, right?

"Locking any lock conceptually flushes all variables from a thread's
working memory, and unlocking any lock forces the writing out to main
memory of all variables that the thread has assigned."

I'm pretty sure that when the spec says "working memory" it does not
mean "processor cache" but "thread-local heap". SableVM doesn't have
thread-local heaps (we discussed it the other day), and to me a large
part of the JMM appears to be unimportant for SableVM. I think the
current problem is related to multithreading in SableVM at the
C / pthreads / MP level, but not at the JMM level.

Cheers,
Chris
From: Etienne G. <gag...@uq...> - 2004-02-20 03:11:05
|
Hi All,

First, I'll be away for a little over 2 weeks, so don't expect any
reply from me.

See comments below.

Chris Pickett wrote:
> ...
> For the Pentium 4, Intel Xeon, and P6 family processors, if the area of
> memory being locked during a LOCK operation is cached in the processor
> that is performing the LOCK operation as write-back memory and is
> completely contained in a cache line, the processor may not assert the
> LOCK# signal on the bus....
> ...
> So that means it looks like the problem is elsewhere (e.g. writes to
> "xxx.flag" that Etienne mentioned). ...

Hmmm...  I'm not convinced.  Have you actually read the JVM spec for
the JMM (memory model)?  Acquiring a thin lock should cause all UNRELATED
memory content to be gotten from main memory after the lock, and
unlocking should do the reverse.  SableVM does not do any of this
currently.

It's not a problem on uniprocessors, but I expect this to be quite a
problem on MPs.

My language, here, is quite fuzzy; it would be best explained in terms
of "read/write" barriers, yet I have to find a "reliable" definition of
"barriers" which would be consistent across MPs.

Etienne
--
Etienne M. Gagnon, Ph.D.             http://www.info.uqam.ca/~egagnon/
SableVM:                                       http://www.sablevm.org/
SableCC:                                       http://www.sablecc.org/
|
From: Chris P. <chr...@ma...> - 2004-02-20 01:11:04
|
David Bélanger wrote:
> On Thu, Feb 19, 2004 at 05:10:54PM -0500, Chris Pickett wrote:
>
>> The WBINVD instruction is a privileged instruction. When the processor
>> is running in protected mode, the CPL of a program or procedure must be
>> 0 to execute this instruction. This instruction is also a serializing
>> instruction (see "Serializing Instructions" in Chapter 8 of the IA-32
>> Intel Architecture Software Developer's Manual, Volume 3).
>>
>> I'm not sure if this is a problem, but if not, maybe all that's required
>> is WBINVD in the C&S for i386?
>
> Yes, it means it can be executed only in kernel mode, that is, the
> code must be located either in the kernel or a kernel module.
>
> You need to find the instructions for user mode...

I made a mistake.  Although the CMPXCHG instruction doesn't specify any
cache effects on its own, it can be preceded by a LOCK operation.  And
yes, the SableVM code has this LOCK:

__asm__ __volatile__ ("lock\n\t"
                      "cmpxchgl %3, %1\n\t"
                      "sete %0"
                      :"=q" (result), "=m" (*pword), "=a" (current_value)
                      :"r" (new_value), "m" (*pword), "a" (old_value)
                      :"memory");

From the IA-32 System Programming Guide:

7.1.4. Effects of a LOCK Operation on Internal Processor Caches

For the Intel486 and Pentium processors, the LOCK# signal is always
asserted on the bus during a LOCK operation, even if the area of memory
being locked is cached in the processor.

For the Pentium 4, Intel Xeon, and P6 family processors, if the area of
memory being locked during a LOCK operation is cached in the processor
that is performing the LOCK operation as write-back memory and is
completely contained in a cache line, the processor may not assert the
LOCK# signal on the bus.  Instead, it will modify the memory location
internally and allow its cache coherency mechanism to ensure that the
operation is carried out atomically.  This operation is called "cache
locking."  The cache coherency mechanism automatically prevents two or
more processors that have cached the same area of memory from
simultaneously modifying data in that area.

==========================

The manual then explains how snooping is used to maintain cache
coherency.  The only relevant user mode instruction is CLFLUSH, which
flushes a single cache line, but this is intended as an optimization
only.  So, AFAICT, the existing C&S is fine for i686, unless for some
reason either:

a) the processor is set to the startup CD=1, NW=1 mode (see Table 10-5
in the programming guide for more info), which does not maintain cache
coherency, or

b) there is some other device on the system bus that does not perform
cache snooping to maintain coherency.

... but then ... I think other multithreaded applications would crash.

So that means it looks like the problem is elsewhere (e.g. writes to
"xxx.flag" that Etienne mentioned).  Still, it's good to have made sure.
I'll investigate putting assembly locks around the unsynchronized pieces.

Cheers,
Chris

P.S. If any of you don't want the CC's on this thread anymore, let me
know; the SF server has been a bit unreliable semi-lately.
|
From: David <db...@cs...> - 2004-02-19 22:42:15
|
On Thu, Feb 19, 2004 at 05:10:54PM -0500, Chris Pickett wrote:
>
> The WBINVD instruction is a privileged instruction. When the processor
> is running in protected mode, the CPL of a program or procedure must be
> 0 to execute this instruction. This instruction is also a serializing
> instruction (see "Serializing Instructions" in Chapter 8 of the IA-32
> Intel Architecture Software Developer's Manual, Volume 3).
>
> I'm not sure if this is a problem, but if not, maybe all that's required
> is WBINVD in the C&S for i386?
>

Yes, it means it can be executed only in kernel mode, that is, the
code must be located either in the kernel or a kernel module.

You need to find the instructions for user mode...

David

---
David Bélanger
Graduate Student
School of Computer Science
McGill University
Office: MC226
Web page: http://www.cs.mcgill.ca/~dbelan2/
Public key: http://www.cs.mcgill.ca/~dbelan2/public_key.txt
|
From: Chris P. <chr...@ma...> - 2004-02-19 22:15:19
|
Archie Cobbs wrote:
> Chris Pickett wrote:
>
>> I started looking at the POSIX 1003.1c (pthreads) spec (it's
>> available online) and also at the comp.programming.threads FAQ (1 Mb
>> html file kills Mozilla on my machine, better to download) ... and
>> discovered a few interesting things:
>>
>> 1) The only way to ensure cache coherency in a portable manner is to use
>> the pthreads synchronization functions (e.g. lock and unlock).  So I
>> think that means there is no need for us to consider the Linux kernel
>> cache flush architecture, nor any processor-specific cache flush
>> instructions.
>
> On a related note: Java semantics imply a read barrier at MONITORENTER
> and a write barrier with MONITOREXIT.  With fat locks, you get this
> automatically because they are implemented using pthread mutexes.
> But with thin locks where there is no contention, technically SableVM
> is at fault because it doesn't explicitly impose the read/write barriers
> (does it?).

That's what I think :(  Etienne wrote about it here:

http://lists.debian.org/debian-ia64/2003/debian-ia64-200302/msg00035.html

(first hit if you google for "thin locks smp"!)

and the description of locks in SableVM is here:

http://www.usenix.org/publications/library/proceedings/jvm01/gagnon/gagnon_html/node14.html

After reading the comp.programming.threads FAQ stuff (just search the
document for "cache"), they say that although workable hacks exist, if
you want any portability or guarantees you need to use POSIX only, and
you should only use the hacks if you know /exactly/ what you're doing.
But at the same time, it sounds like strictly-POSIX thin locks don't
exist ... so it might be easier to try and introduce a cache flush
instruction or system cache flush call in places.

There are two solutions I can see:

1) Make the current thin locks optional

OR

2) Introduce explicit cache flushing where necessary

Personally, I would be happy enough with (1), since my speculative
multithreading work only needs to show relative speedup (and indeed,
the faster an "unmodified" SableVM is, the less that relative speedup
will be ...), but I'm actually just eager to take the path of least
resistance :)

> On i386 it works out anyway because I think the compare-and-swap
> sequence enforces a memory barrier.  But in general that's not true.

Well, SableVM doesn't work on an Athlon MP 2000+, which is i686.  But
I'm not sure if it's because of a broken C&S or not.  If it IS because
of a broken C&S, that's a good thing; however, if the C&S is /already/
imposing an MB, then that's bad because it means the problem is
elsewhere.  I think.

(more reading ensues)

I looked up the IA-32 instruction set reference (split in 2 parts):

http://developer.intel.com/design/pentium4/manuals/253666.htm
http://developer.intel.com/design/pentium4/manuals/253667.htm

CMPXCHG doesn't mention flushing the processor's cache.

INVD ignores cache contents and invalidates the cache.

WBINVD writes back cache contents and invalidates the cache, and signals
other processors to do the same.  However, the documentation says:

The WBINVD instruction is a privileged instruction. When the processor
is running in protected mode, the CPL of a program or procedure must be
0 to execute this instruction. This instruction is also a serializing
instruction (see "Serializing Instructions" in Chapter 8 of the IA-32
Intel Architecture Software Developer's Manual, Volume 3).

I'm not sure if this is a problem, but if not, maybe all that's required
is WBINVD in the C&S for i386?

It would also be nice if we didn't have to call WBINVD on a
uniprocessor ...

> I could be wrong about all this but this is what memory recalls.

Whether or not you are, thanks for discussing it, it's always helpful.

Cheers,
Chris
|
From: Archie C. <ar...@de...> - 2004-02-19 20:42:24
|
Chris Pickett wrote:
> I started looking at the POSIX 1003.1c (pthreads) spec (it's
> available online) and also at the comp.programming.threads FAQ (1 Mb
> html file kills Mozilla on my machine, better to download) ... and
> discovered a few interesting things:
>
> 1) The only way to ensure cache coherency in a portable manner is to use
> the pthreads synchronization functions (e.g. lock and unlock).  So I
> think that means there is no need for us to consider the Linux kernel
> cache flush architecture, nor any processor-specific cache flush
> instructions.

On a related note: Java semantics imply a read barrier at MONITORENTER
and a write barrier with MONITOREXIT.  With fat locks, you get this
automatically because they are implemented using pthread mutexes.
But with thin locks where there is no contention, technically SableVM
is at fault because it doesn't explicitly impose the read/write barriers
(does it?).

On i386 it works out anyway because I think the compare-and-swap
sequence enforces a memory barrier.  But in general that's not true.

I could be wrong about all this but this is what memory recalls.

-Archie

__________________________________________________________________________
Archie Cobbs * CTO, Awarix * http://www.awarix.com
|
From: Chris P. <chr...@ma...> - 2004-02-19 18:13:26
|
Etienne Gagnon wrote:
> Chris Pickett wrote:
>
>> However, I now have another test case (see attached), although I've only
>> seen this bug manifest itself on a multiprocessor.  It might affect
>> uniprocessors in more complex scenarios.
>
> Multi-processor: part of the critical sablevm code for locking does NOT
> take into account "cache issues".  For instance, "xxx.flag" is read and
> written assuming that any change is visible to other threads without
> synchronization, which is probably NOT the case on a multiprocessor.

You're right, it's not the case ... see below.

> So, unless you can actually get your "bug" to manifest itself on a
> uniprocessor, I wouldn't worry much about it.
>
> Now, if you really want to get things running on a multi-processor, you
> should start investigating cache issues. :-)  [WARNING: Not easy.  In
> fact, the Java Memory Model is broken on multi-processors...]

I started looking at the POSIX 1003.1c (pthreads) spec (it's
available online) and also at the comp.programming.threads FAQ (1 Mb
html file kills Mozilla on my machine, better to download) ... and
discovered a few interesting things:

1) The only way to ensure cache coherency in a portable manner is to use
the pthreads synchronization functions (e.g. lock and unlock).  So I
think that means there is no need for us to consider the Linux kernel
cache flush architecture, nor any processor-specific cache flush
instructions.

2) (a bit on "volatile" lifted verbatim):

You do NOT need volatile for threaded programming.  You do need it when
you share data between "main code" and signal handlers, or when sharing
hardware registers with a device.  In certain restricted situations, it
MIGHT help when sharing unsynchronized data between threads (but don't
count on it -- the semantics of "volatile" are too fuzzy).  If you need
volatile to share data, protected by POSIX synchronization objects,
between threads, then your implementation is busted.

3) No unsynchronized operation on shared data, even if it takes only one
assembly instruction, is truly safe, with the exception of
"one-shot-flags", where data changes only in one direction and it only
changes once, and the actual changed-to value doesn't matter.

Although it doesn't follow exactly the same semantics as a
"one-shot-flag", the preparation sequence REPLACE operation is safe by
similar logic.

Cheers,
Chris
|
From: Chris P. <chr...@ma...> - 2004-02-19 06:01:53
|
Chris Pickett wrote:
>> It's available at:
>>
>> http://gadek.debian.net/FOSDEM
>
> s/licenced/licensed/g/ (vanity edit)

-- no trailing slash ...
|
From: Chris P. <chr...@ma...> - 2004-02-19 05:58:20
|
Grzegorz B. Prokopski wrote:
> Hi all,
>
> I am leaving tomorrow afternoon.  I hope to have a bit of time around
> noon to refine some things in the presentation.  If you can - *please*
> read it, send me your opinions (may be privately if you want),
> corrections, ...  If you do that later - send them too, but I don't know
> whether I'll be able to do any changes to it before it's aired.
>
> It's available at:
>
> http://gadek.debian.net/FOSDEM
>
> (apparently xpdf doesn't handle transparent pngs, use OO.o if possible)

the sablevm logo is too pixelated in the top right corner of pages.

s/licenced/licensed/g/

the m4 macro screenshot is too blurry.

in pdf format using acrobat reader, the green concentric rectangles take
forever to display.  it's fine for your presentation, but not for
distribution, in my opinion.

it looks good!  although i think you make it sound like too much of a
"toy" vm, when you say "we especially need help with real benchmarks"
(or something similar) -- sablevm can run a lot of things.  more like,
we need help putting the vm through the wringer so that it never fails.

cheers, and good luck,
chris

p.s. i don't even have write access to staging at the moment (nor have
i requested it), so i think my contribution to finishing the release
will be testing tarballs.
|