From: SourceForge.net <no...@so...> - 2007-05-30 00:47:26
|
Bugs item #1725306, was opened at 2007-05-25 12:34 Message generated for change (Comment added) made by steveb-oss You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=712768&aid=1725306&group_id=128805 Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. Category: optimizing compiler Group: None Status: Closed Resolution: Fixed Priority: 8 Private: No Submitted By: Steve Blackburn (steveb-oss) Assigned to: Nobody/Anonymous (nobody) Summary: Header corruption seen in eclipse Initial Comment: When running eclipse with a BaseAdaptiveMarkSweep build and a 1G heap, eclipse will fairly often die with the stack trace below (fairly often being about 30% of runs). VM_Lock says we have 2500 locks, so either eclipse generates an insane amount of lock contention, or there is some other problem. To reproduce, do something like: for i in 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19; do echo $i; jikesrvm/dist/BaseAdaptiveMarkSweep_ia32/rvm -Xms1G -Xmx1G -X:processors=2 -X:gc:ignoreSystemGC=true -jar dacapo-2006-10-MR2.jar eclipse; done A variety of manifestations of the bug: 1. Cannot grow lock array greater than maximum possible index -- Stack -- Lorg/jikesrvm/VM; sysFail(Ljava/lang/String;)V at line 1988 Lorg/jikesrvm/scheduler/VM_Lock; growLocks()V at line 441 Lorg/jikesrvm/scheduler/VM_ThinLock; lock(Ljava/lang/Object;Lorg/vmmagic/unboxed/Offset;)V at line 131 Lorg/jikesrvm/scheduler/VM_ThinLock; inlineLock(Ljava/lang/Object;Lorg/vmmagic/unboxed/Offset;)V at line 59 Lorg/eclipse/jdt/internal/core/JavaModelManager; getPerProjectInfo(Lorg/eclipse/core/resources/IProject;Z)Lorg/eclipse/jdt/internal/core/JavaModelManager$PerProjectInfo; at line 1204 Lorg/eclipse/jdt/internal/core/JavaModelManager; getPerProjectInfoCheckExistence(Lorg/eclipse/core/resources/IProject;)Lorg/eclipse/jdt/internal/core/JavaModelManager$PerProjectInfo; at line 1220 2. Exception while running benchmarks!org.eclipse.core.internal.resources.ResourceException: Errors during build. org.eclipse.core.internal.resources.ResourceException: Errors during build. at org.eclipse.core.internal.events.BuildManager.build(BuildManager.java:305) <snip snip> at Harness.main(Harness.java:5) Exception in thread "Java indexing": java.lang.NullPointerException at org.jikesrvm.scheduler.VM_ThinLock.lock(VM_ThinLock.java:133) at org.jikesrvm.scheduler.VM_ThinLock.inlineLock(VM_ThinLock.java:59) at org.eclipse.jdt.internal.core.JavaModelManager.getPerProjectInfo(JavaModelManager.java:1204) at org.eclipse.jdt.internal.core.JavaModelManager.getPerProjectInfoCheckExistence(JavaModelManager.java:1220) at org.eclipse.jdt.internal.core.JavaProject.getPerProjectInfo(JavaProject.java:1792) 3. [GC 7 Start 54.28 s 476964KB -> 202432KB 2194.29 ms] at Harness.main(Harness.java:5) Exception in thread "Java indexing": java.lang.NullPointerException at org.jikesrvm.scheduler.VM_ThinLock.lock(VM_ThinLock.java:133) at org.jikesrvm.scheduler.VM_ThinLock.inlineLock(VM_ThinLock.java:59) at org.eclipse.jdt.internal.core.JavaModelManager.getPerProjectInfo(JavaModelManager.java:1204) at Harness.main(Harness.java:5) Exception in thread "Java indexing": java.lang.NullPointerException at org.jikesrvm.scheduler.VM_ThinLock.lock(VM_ThinLock.java:133) at org.jikesrvm.scheduler.VM_ThinLock.inlineLock(VM_ThinLock.java:59) at org.eclipse.jdt.internal.core.JavaModelManager.getPerProjectInfo(JavaModelManager.java:1204) 4. [GC 12 Start 89.81 s 318064KB -> 220072KB 2097.20 ms] Exception in thread "Java indexing": java.lang.NullPointerException at org.jikesrvm.scheduler.VM_ThinLock.lock(VM_ThinLock.java:133) at org.jikesrvm.scheduler.VM_ThinLock.inlineLock(VM_ThinLock.java:59) at org.eclipse.jdt.internal.core.JavaModelManager.getPerProjectInfo(JavaModelManager.java:1204) at org.eclipse.jdt.internal.core.JavaModelManager.getPerProjectInfoCheckExistence(JavaModelManager.java:1220) at org.eclipse.jdt.internal.core.JavaProject.getPerProjectInfo(JavaProject.java:1792) at org.eclipse.jdt.internal.core.JavaProject.getOptions(JavaProject.java:1560) ---------------------------------------------------------------------- >Comment By: Steve Blackburn (steveb-oss) Date: 2007-05-30 10:47 Message: Logged In: YES user_id=1215444 Originator: YES I am confident that the particular bug reported on this tracker item is fixed, and that you're reporting something different. Eclipse is large and very complicated and has exposed a large number of bugs in Jikes RVM (probably a dozen or more?). Daniel and I are aware of at least two outstanding bugs exposed by eclipse: 1. a problem with the PLOS 2. what appears to be a GC map problem. You seem to have seen 2. I take it that you're not suggesting that this is a new bug, but that removal of the other bugs exposes this one. Our plans are to narrow 1. & 2. down a little further then start new tracker items as appropriate. Any more info you have will be gladly received, but we should use a different tracker. Daniel started testing with the gcmap-sanity this morning, so he's probably seeing the same thing now. ---------------------------------------------------------------------- Comment By: Peter Donald (peter_donald) Date: 2007-05-30 10:39 Message: Logged In: YES user_id=1642927 Originator: NO After this change I now get the following output when running eclipse under gcmap-sanity/gcstress_Opt_1/dacapo configuration. So It looks like there is some sort of problem with locking code and it also exposes a gcmap problem of some sort. 1[14-daemon] VM_Lock: unlock error: thin lock word = 0x00000000 1[14-daemon] VM_Lock: unlock error: thin lock word = 0x9d8006bc validRef: TIB outside heap, ref = 0x9d8006bc tib = 0x00000000 Invalid ref reported while scanning stack --- METHOD (OPT) Lorg/jikesrvm/scheduler/VM_ThinLock;.unlock (Ljava/lang/Object;Lorg/vmmagic/unboxed/Offset;)V --- fp = 0x6b55ac24 code base = 0x5b925540 code offset = 0x00000129 0x6b55aba0:REF=0x9d8006bc TIB=0x00000000 STATUS=0x00000000 (INVALID TIB: CLASS NOT ACCESSIBLE) 0x6b55abf0 (0x00000000): 0x6b55ac24 0x6b55abf4 (0x00000004): 0x5b925669 0x6b55abf8 (0x00000008): 0x00000000 0x6b55abfc (0x0000000c): 0x6e03c74c 0x6b55ac00 (0x00000010): 0x6f1ea9b8 0x6b55ac04 (0x00000014): 0x57040260 0x6b55ac08 (0x00000018): 0x00000000 0x6b55ac0c (0x0000001c): 0x6b55ac3c 0x6b55ac10 (0x00000020): 0x6f380634 0x6b55ac14 (0x00000024): 0x00008d6e 0x6b55ac18 (0x00000028): 0x00380003 0x6b55ac1c (0x0000002c): 0x00000020 0x6b55ac20 (0x00000030): 0x000002a5 Dumping stack starting at frame with bad ref: -- Stack -- Lorg/jikesrvm/scheduler/VM_ThinLock; unlock(Ljava/lang/Object;Lorg/vmmagic/unboxed/Offset;)V at line 210 Lorg/jikesrvm/objectmodel/VM_JavaHeader; genericUnlock(Ljava/lang/Object;)V at line 580 Lorg/jikesrvm/objectmodel/VM_ObjectModel; genericUnlock(Ljava/lang/Object;)V at line 471 ... ---------------------------------------------------------------------- Comment By: Steve Blackburn (steveb-oss) Date: 2007-05-30 07:10 Message: Logged In: YES user_id=1215444 Originator: YES I just updated to the svn head and given it a good shake. The bug has gone thanks to Dave's commit (r12375). I went out on a limb and tested FastAdaptive too ;-) We can now run eclipse to completion with processors=2 with MS & SS, BaseBase, BaseAdaptive and FastAdaptive. I'm closing the bug. Great teamwork! ---------------------------------------------------------------------- Comment By: Dave Grove (dgrove-oss) Date: 2007-05-30 03:46 Message: Logged In: YES user_id=1215435 Originator: NO committed a probably fix in 12375. I haven't rebuilt my GNU classpath yet with the fix for the EOF bug, so not sure if this "really" fixes eclipse but it seems like it should do the right thing... ---------------------------------------------------------------------- Comment By: Dave Grove (dgrove-oss) Date: 2007-05-30 00:38 Message: Logged In: YES user_id=1215435 Originator: NO I think something like this is right. If the compile time type of the locked object is an interface, we can't in general do an inline lock sequence (although under current object model we could...), but I'd like to maintain the abstraction for the non-fixed lock offset object model. Very nice detective work. ---------------------------------------------------------------------- Comment By: Daniel Frampton (dframpton-oss) Date: 2007-05-30 00:31 Message: Logged In: YES user_id=1215450 Originator: NO This also sounds good. Basically anything that ensures we hit the slow path call the lock instead of inlineLock. Of course considering we have a constant offset for everything with our current object model we could perform an inlineLock.... But correctness comes first. If we want to add an option that allows us to take advantage of fixed lock offsets we can do it later ;). ---------------------------------------------------------------------- Comment By: Ian Rogers (captain5050) Date: 2007-05-30 00:25 Message: Logged In: YES user_id=308843 Originator: NO We probably want to set the thinLockOffset for interfaces to max (-1) when they are resolved to indicate that they don't have them. This will avoid using the inline lock code and use the regular lock method that's use by the baseline compiler. I will look at this with Dave shortly. Ian ---------------------------------------------------------------------- Comment By: Daniel Frampton (dframpton-oss) Date: 2007-05-30 00:10 Message: Logged In: YES user_id=1215450 Originator: NO I see that for unresolved types this is already done. Attaching patch for a possible fix to OPT_ExpandRuntimeServices. No time to test and commit tonight. Will look at doing this tomorrow if no action overnight. File Added: possible-fix.patch ---------------------------------------------------------------------- Comment By: Daniel Frampton (dframpton-oss) Date: 2007-05-29 23:52 Message: Logged In: YES user_id=1215450 Originator: NO The bug seems to be that in OPT_ExpandRuntimeServices when generating the MONITORENTER code (and I assume exit also) the lockOffset is taken from the type of the object. The type is actually an interface, which means that the thinLockOffset is set as 0, instead of -8 (status word). This of course starts us off on a strange sequence of events leading to disaster. Only the optimizing compiler does this optimization (which is incorrect). I assume it should do some other code path for interfaces (and unresolved types) as we can not guarantee in general that all objects with a specific interface have the same lockOffset. I tested my hypothesis by (incorrectly) setting the thinLockOffset for VM_Class objects representing interfaces. This is clearly not correct, but demonstrates what the bug is. Of course it may be more complicated than that. But this is a very important hint. Note also this seems to fail at even opt level O0. ---------------------------------------------------------------------- Comment By: Steve Blackburn (steveb-oss) Date: 2007-05-29 18:45 Message: Logged In: YES user_id=1215444 Originator: YES Oh, and Daniel and I figured we were unable to use OptTestHarness because of eclipse's classloader bizarreness, hence the patch. ---------------------------------------------------------------------- Comment By: Steve Blackburn (steveb-oss) Date: 2007-05-29 16:32 Message: Logged In: YES user_id=1215444 Originator: YES Things have become a great deal clearer: The problem will manifest 3 ways: a. In a GC crash b. In a hang c. In a corrupt lock word seen by the locking mechanism (initial observation) b & c seem to interchange semi-randomly. a will happen if the heap is small enough that the GC trips over the broken header before the runtime We know the types of the objects: the corrupted object is of type java/util/HashMap, the value with which it is overwritten is of type java/util/HashMap$2. Most importantly, we can now reliably reproduce the problem with the patch below by forcing a single eclipse method to be opt compiled. If a BaseAdaptive image from the svn head is run with -X:aos:enable_recompilation=false, eclipse will reliably run to completion unless the patch below is applied, in which case it will always fail (either assertion failure or lockup). Furthermore, the problem now shows up with -X:processors=1. Daniel guessed that in light of this, the number of processors was actually only indirectly relevant (when processors > 1, the lock in this method gets contended it gets "hot" and so it gets optimized. So the problem is now easily repeatable on a uniprocessor. Also, the lock-up shows (near) zero CPU. I've looked over the IR of the offending method (below) with Daniel, and there's lot of it. We have not yet spotted anything obvious. Index: rvm/src/org/jikesrvm/compilers/common/VM_RuntimeCompiler.java =================================================================== --- rvm/src/org/jikesrvm/compilers/common/VM_RuntimeCompiler.java (revision 12372) +++ rvm/src/org/jikesrvm/compilers/common/VM_RuntimeCompiler.java (working copy) @@ -655,7 +655,9 @@ compilationInProgress = false; } } - if (VM_Controller.options.optIRC()) { + boolean funnyMethod = (method.getName().toString().equals("getPerProjectInfo") && method.getDeclaringClass().toString().contains("JavaModelManager")); + if (funnyMethod) VM.sysWriteln("Compiling a 'funny' method."); + if (VM_Controller.options.optIRC() || funnyMethod) { if (// will only run once: don't bother optimizing method.isClassInitializer() || // exception in progress. can't use opt compiler: The method: public PerProjectInfo getPerProjectInfo(IProject project, boolean create) { synchronized(this.perProjectInfos) { // use the perProjectInfo collection as its own lock PerProjectInfo info= (PerProjectInfo) this.perProjectInfos.get(project); if (info == null && create) { info= new PerProjectInfo(project); this.perProjectInfos.put(project, info); } return info; } } ---------------------------------------------------------------------- Comment By: Steve Blackburn (steveb-oss) Date: 2007-05-29 12:38 Message: Logged In: YES user_id=1215444 Originator: YES A small update: the corruption is apparently due to a reference being written over the header word (ie the word values are always reasonable pointer values). As noted before, the problem is evident no matter what GC is used, but does not show up with BaseBase builds. ---------------------------------------------------------------------- Comment By: Dave Grove (dgrove-oss) Date: 2007-05-25 23:21 Message: Logged In: YES user_id=1215435 Originator: NO I agree. Lock index too big is one of the "classic" signs of memory corruption. The status word of an object get's smashed and it just happens that the locking code stumbles across it before anyone else. ---------------------------------------------------------------------- Comment By: Steve Blackburn (steveb-oss) Date: 2007-05-25 17:01 Message: Logged In: YES user_id=1215444 Originator: YES With a bit more verbosity, I see this on failure: [GC 8 Start 68.11 s 656596KB -> 182200KB 2194.25 ms] Lock index out of range! Word: 0x832092f8 index: 51236 locks: 4097 vm internal error at: -- Stack -- Lorg/jikesrvm/VM; sysFail(Ljava/lang/String;)V at line 1988 Lorg/jikesrvm/VM; _assertionFailure(Ljava/lang/String;Ljava/lang/String;)V at line 527 Lorg/jikesrvm/VM; _assert(ZLjava/lang/String;Ljava/lang/String;)V at line 510 Lorg/jikesrvm/VM; _assert(Z)V at line 490 Lorg/jikesrvm/scheduler/VM_ThinLock; getLockIndex(Lorg/vmmagic/unboxed/Word;)I at line 55 Lorg/jikesrvm/scheduler/VM_ThinLock; lock(Ljava/lang/Object;Lorg/vmmagic/unboxed/Offset;)V at line 154 Lorg/jikesrvm/scheduler/VM_ThinLock; inlineLock(Ljava/lang/Object;Lorg/vmmagic/unboxed/Offset;)V at line 84 Lorg/eclipse/jdt/internal/core/JavaModelManager; getPerProjectInfo(Lorg/eclipse/core/resources/IProject;Z)Lorg/eclipse/jdt/internal/core/JavaModelManager$PerProjectInfo; at line 1204 ---------------------------------------------------------------------- Comment By: Steve Blackburn (steveb-oss) Date: 2007-05-25 16:26 Message: Logged In: YES user_id=1215444 Originator: YES Refactoring and assertions added in r12330 seem to suggest some issue with corruption. We see now that the problem is that when getting the lock index, the index is sometimes out of range. However, this same assertion is checked at the time the value is stored and that assertion is never triggered... Forcing VM_ThinkLock.getLockIndex() to be NoOptCompile did not seem to help. Further BaseBase testing continues to run eclipse without failure (many hours now on a fast machine without failure). ---------------------------------------------------------------------- Comment By: Steve Blackburn (steveb-oss) Date: 2007-05-25 12:36 Message: Logged In: YES user_id=1215444 Originator: YES This bug does not appear under BaseBase builds (in fact I have seen no bugs with eclipse under BaseBase, despite a lot of continuous testing in a loop.) ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=712768&aid=1725306&group_id=128805 |