[Jikesrvm-issues] [ jikesrvm-Bugs-1725306 ] Header corruption seen in eclipse

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 454-5900

Bugs item #1725306, was opened at 2007-05-25 12:34
Message generated for change (Comment added) made by steveb-oss
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=712768&aid=1725306&group_id=128805

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: optimizing compiler
Group: None
Status: Closed
Resolution: Fixed
Priority: 8
Private: No
Submitted By: Steve Blackburn (steveb-oss)
Assigned to: Nobody/Anonymous (nobody)
Summary: Header corruption seen in eclipse

Initial Comment:
When running eclipse with a BaseAdaptiveMarkSweep build and a 1G heap, eclipse will fairly often die with the stack trace below (fairly often being about 30% of runs).

VM_Lock says we have 2500 locks, so either eclipse generates an insane amount of lock contention, or there is some other problem.

To reproduce, do something like:

for i in 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19; do echo $i; jikesrvm/dist/BaseAdaptiveMarkSweep_ia32/rvm -Xms1G -Xmx1G -X:processors=2 -X:gc:ignoreSystemGC=true -jar dacapo-2006-10-MR2.jar eclipse;
done

A variety of manifestations of the bug:


1.

Cannot grow lock array greater than maximum possible index
-- Stack --
   Lorg/jikesrvm/VM; sysFail(Ljava/lang/String;)V at line 1988
   Lorg/jikesrvm/scheduler/VM_Lock; growLocks()V at line 441
   Lorg/jikesrvm/scheduler/VM_ThinLock; lock(Ljava/lang/Object;Lorg/vmmagic/unboxed/Offset;)V at line 131
   Lorg/jikesrvm/scheduler/VM_ThinLock; inlineLock(Ljava/lang/Object;Lorg/vmmagic/unboxed/Offset;)V at line 59
   Lorg/eclipse/jdt/internal/core/JavaModelManager; getPerProjectInfo(Lorg/eclipse/core/resources/IProject;Z)Lorg/eclipse/jdt/internal/core/JavaModelManager$PerProjectInfo; at line 1204
   Lorg/eclipse/jdt/internal/core/JavaModelManager; getPerProjectInfoCheckExistence(Lorg/eclipse/core/resources/IProject;)Lorg/eclipse/jdt/internal/core/JavaModelManager$PerProjectInfo; at line 1220

2.

Exception while running benchmarks!org.eclipse.core.internal.resources.ResourceException: Errors during build.
org.eclipse.core.internal.resources.ResourceException: Errors during build.
        at org.eclipse.core.internal.events.BuildManager.build(BuildManager.java:305)
<snip snip>
        at Harness.main(Harness.java:5)
Exception in thread "Java indexing": java.lang.NullPointerException
        at org.jikesrvm.scheduler.VM_ThinLock.lock(VM_ThinLock.java:133)
        at org.jikesrvm.scheduler.VM_ThinLock.inlineLock(VM_ThinLock.java:59)
        at org.eclipse.jdt.internal.core.JavaModelManager.getPerProjectInfo(JavaModelManager.java:1204)
        at org.eclipse.jdt.internal.core.JavaModelManager.getPerProjectInfoCheckExistence(JavaModelManager.java:1220)
        at org.eclipse.jdt.internal.core.JavaProject.getPerProjectInfo(JavaProject.java:1792)

3.

[GC 7 Start 54.28 s   476964KB -> 202432KB   2194.29 ms]
        at Harness.main(Harness.java:5)
Exception in thread "Java indexing": java.lang.NullPointerException
        at org.jikesrvm.scheduler.VM_ThinLock.lock(VM_ThinLock.java:133)
        at org.jikesrvm.scheduler.VM_ThinLock.inlineLock(VM_ThinLock.java:59)
        at org.eclipse.jdt.internal.core.JavaModelManager.getPerProjectInfo(JavaModelManager.java:1204)
        at Harness.main(Harness.java:5)
Exception in thread "Java indexing": java.lang.NullPointerException
        at org.jikesrvm.scheduler.VM_ThinLock.lock(VM_ThinLock.java:133)
        at org.jikesrvm.scheduler.VM_ThinLock.inlineLock(VM_ThinLock.java:59)
        at org.eclipse.jdt.internal.core.JavaModelManager.getPerProjectInfo(JavaModelManager.java:1204)

4.

[GC 12 Start 89.81 s   318064KB -> 220072KB   2097.20 ms]
Exception in thread "Java indexing": java.lang.NullPointerException
        at org.jikesrvm.scheduler.VM_ThinLock.lock(VM_ThinLock.java:133)
        at org.jikesrvm.scheduler.VM_ThinLock.inlineLock(VM_ThinLock.java:59)
        at org.eclipse.jdt.internal.core.JavaModelManager.getPerProjectInfo(JavaModelManager.java:1204)
        at org.eclipse.jdt.internal.core.JavaModelManager.getPerProjectInfoCheckExistence(JavaModelManager.java:1220)
        at org.eclipse.jdt.internal.core.JavaProject.getPerProjectInfo(JavaProject.java:1792)
        at org.eclipse.jdt.internal.core.JavaProject.getOptions(JavaProject.java:1560)



----------------------------------------------------------------------

>Comment By: Steve Blackburn (steveb-oss)
Date: 2007-05-30 10:47

Message:
Logged In: YES 
user_id=1215444
Originator: YES

I am confident that the particular bug reported on this tracker item is
fixed, and that you're reporting something different.  Eclipse is large and
very complicated and has exposed a large number of bugs in Jikes RVM
(probably a dozen or more?).

Daniel and I are aware of at least two outstanding bugs exposed by
eclipse:

1. a problem with the PLOS
2. what appears to be a GC map problem.

You seem to have seen 2.  I take it that you're not suggesting that this
is a new bug, but that removal of the other bugs exposes this one.

Our plans are to narrow 1. & 2. down a little further then start new
tracker items as appropriate.

Any more info you have will be gladly received, but we should use a
different tracker.  Daniel started testing with the gcmap-sanity this
morning, so he's probably seeing the same thing now.

----------------------------------------------------------------------

Comment By: Peter Donald (peter_donald)
Date: 2007-05-30 10:39

Message:
Logged In: YES 
user_id=1642927
Originator: NO

After this change I now get the following output when running eclipse
under gcmap-sanity/gcstress_Opt_1/dacapo configuration. So It looks like
there  is some sort of problem with locking code and it also exposes a
gcmap problem of some sort.

1[14-daemon] VM_Lock: unlock error: thin lock word =  0x00000000
1[14-daemon] VM_Lock: unlock error: thin lock word =  0x9d8006bc
validRef: TIB outside heap, ref = 0x9d8006bc tib = 0x00000000

Invalid ref reported while scanning stack

--- METHOD (OPT) Lorg/jikesrvm/scheduler/VM_ThinLock;.unlock
(Ljava/lang/Object;Lorg/vmmagic/unboxed/Offset;)V
--- fp = 0x6b55ac24     code base = 0x5b925540     code offset =
0x00000129
0x6b55aba0:REF=0x9d8006bc TIB=0x00000000 STATUS=0x00000000 (INVALID TIB:
CLASS NOT ACCESSIBLE)
0x6b55abf0 (0x00000000):   0x6b55ac24 
0x6b55abf4 (0x00000004):   0x5b925669 
0x6b55abf8 (0x00000008):   0x00000000 
0x6b55abfc (0x0000000c):   0x6e03c74c 
0x6b55ac00 (0x00000010):   0x6f1ea9b8 
0x6b55ac04 (0x00000014):   0x57040260 
0x6b55ac08 (0x00000018):   0x00000000 
0x6b55ac0c (0x0000001c):   0x6b55ac3c 
0x6b55ac10 (0x00000020):   0x6f380634 
0x6b55ac14 (0x00000024):   0x00008d6e 
0x6b55ac18 (0x00000028):   0x00380003 
0x6b55ac1c (0x0000002c):   0x00000020 
0x6b55ac20 (0x00000030):   0x000002a5 


Dumping stack starting at frame with bad ref:
-- Stack --
   Lorg/jikesrvm/scheduler/VM_ThinLock;
unlock(Ljava/lang/Object;Lorg/vmmagic/unboxed/Offset;)V at line 210
   Lorg/jikesrvm/objectmodel/VM_JavaHeader;
genericUnlock(Ljava/lang/Object;)V at line 580
   Lorg/jikesrvm/objectmodel/VM_ObjectModel;
genericUnlock(Ljava/lang/Object;)V at line 471
...

----------------------------------------------------------------------

Comment By: Steve Blackburn (steveb-oss)
Date: 2007-05-30 07:10

Message:
Logged In: YES 
user_id=1215444
Originator: YES

I just updated to the svn head and given it a good shake. The bug has gone
thanks to Dave's commit (r12375).  I went out on a limb and tested
FastAdaptive too ;-) We can now run eclipse to completion with processors=2
with MS & SS, BaseBase, BaseAdaptive and FastAdaptive.

I'm closing the bug.

Great teamwork!

----------------------------------------------------------------------

Comment By: Dave Grove (dgrove-oss)
Date: 2007-05-30 03:46

Message:
Logged In: YES 
user_id=1215435
Originator: NO

committed a probably fix in 12375.

I haven't rebuilt my GNU classpath yet with the fix for the EOF bug, so
not sure if this "really" fixes eclipse
but it seems like it should do the right thing...



----------------------------------------------------------------------

Comment By: Dave Grove (dgrove-oss)
Date: 2007-05-30 00:38

Message:
Logged In: YES 
user_id=1215435
Originator: NO

I think something like this is right.  If the compile time type of the
locked object is an interface, we can't in general do an inline lock
sequence (although under current object model we could...), but I'd like to
maintain the abstraction for the non-fixed lock offset object model.

Very nice detective work.



----------------------------------------------------------------------

Comment By: Daniel Frampton (dframpton-oss)
Date: 2007-05-30 00:31

Message:
Logged In: YES 
user_id=1215450
Originator: NO

This also sounds good. Basically anything that ensures we hit the slow
path call the lock instead of inlineLock.

Of course considering we have a constant offset for everything with our
current object model we could perform an inlineLock.... But correctness
comes first. If we want to add an option that allows us to take advantage
of fixed lock offsets we can do it later ;).

----------------------------------------------------------------------

Comment By: Ian Rogers (captain5050)
Date: 2007-05-30 00:25

Message:
Logged In: YES 
user_id=308843
Originator: NO

We probably want to set the thinLockOffset for interfaces to max (-1) when
they are resolved to indicate that they don't have them. This will avoid
using the inline lock code and use the regular lock method that's use by
the baseline compiler. I will look at this with Dave shortly.
Ian

----------------------------------------------------------------------

Comment By: Daniel Frampton (dframpton-oss)
Date: 2007-05-30 00:10

Message:
Logged In: YES 
user_id=1215450
Originator: NO

I see that for unresolved types this is already done. Attaching patch for
a possible fix to OPT_ExpandRuntimeServices. No time to test and commit
tonight. Will look at doing this tomorrow if no action overnight.
File Added: possible-fix.patch

----------------------------------------------------------------------

Comment By: Daniel Frampton (dframpton-oss)
Date: 2007-05-29 23:52

Message:
Logged In: YES 
user_id=1215450
Originator: NO

The bug seems to be that in OPT_ExpandRuntimeServices when generating the
MONITORENTER code (and I assume exit also) the lockOffset is taken from the
type of the object. The type is actually an interface, which means that the
thinLockOffset is set as 0, instead of -8 (status word). This of course
starts us off on a strange sequence of events leading to disaster.

Only the optimizing compiler does this optimization (which is incorrect).
I assume it should do some other code path for interfaces (and unresolved
types) as we can not guarantee in general that all objects with a specific
interface have the same lockOffset.

I tested my hypothesis by (incorrectly) setting the thinLockOffset for
VM_Class objects representing interfaces. This is clearly not correct, but
demonstrates what the bug is.

Of course it may be more complicated than that. But this is a very
important hint. Note also this seems to fail at even opt level O0.

----------------------------------------------------------------------

Comment By: Steve Blackburn (steveb-oss)
Date: 2007-05-29 18:45

Message:
Logged In: YES 
user_id=1215444
Originator: YES

Oh, and Daniel and I figured we were unable to use OptTestHarness because
of eclipse's classloader bizarreness, hence the patch.

----------------------------------------------------------------------

Comment By: Steve Blackburn (steveb-oss)
Date: 2007-05-29 16:32

Message:
Logged In: YES 
user_id=1215444
Originator: YES

Things have become a great deal clearer:

The problem will manifest 3 ways:
  a. In a GC crash
  b. In a hang
  c. In a corrupt lock word seen by the locking mechanism (initial
observation)
b & c seem to interchange semi-randomly.  a will happen if the heap is
small enough that the GC trips over the broken header before the runtime

We know the types of the objects: the corrupted object is of type
java/util/HashMap, the value with which it is overwritten is of type
java/util/HashMap$2.

Most importantly, we can now reliably reproduce the problem with the patch
below by forcing a single eclipse method to be opt compiled. If a
BaseAdaptive image from the svn head is run with
-X:aos:enable_recompilation=false, eclipse will reliably run to completion
unless the patch below is applied, in which case it will always fail
(either assertion failure or lockup).

Furthermore, the problem now shows up with -X:processors=1.  Daniel
guessed that in light of this, the number of processors was actually only
indirectly relevant (when processors > 1, the lock in this method gets
contended it gets "hot" and so it gets optimized.  So the problem is now
easily repeatable on a uniprocessor.

Also, the lock-up shows (near) zero CPU.

I've looked over the IR of the offending method (below) with Daniel, and
there's lot of it.  We have not yet spotted anything obvious.

Index: rvm/src/org/jikesrvm/compilers/common/VM_RuntimeCompiler.java
===================================================================
--- rvm/src/org/jikesrvm/compilers/common/VM_RuntimeCompiler.java      
(revision 12372)
+++ rvm/src/org/jikesrvm/compilers/common/VM_RuntimeCompiler.java      
(working copy)
@@ -655,7 +655,9 @@
             compilationInProgress = false;
           }
         }
-        if (VM_Controller.options.optIRC()) {
+       boolean funnyMethod =
(method.getName().toString().equals("getPerProjectInfo") &&
method.getDeclaringClass().toString().contains("JavaModelManager"));
+       if (funnyMethod) VM.sysWriteln("Compiling a 'funny' method.");
+        if (VM_Controller.options.optIRC() || funnyMethod) {
           if (// will only run once: don't bother optimizing
               method.isClassInitializer() ||
               // exception in progress. can't use opt compiler:


The method:

	public PerProjectInfo getPerProjectInfo(IProject project, boolean create)
{
		synchronized(this.perProjectInfos) { // use the perProjectInfo
collection as its own lock
			PerProjectInfo info= (PerProjectInfo)
this.perProjectInfos.get(project);
			if (info == null && create) {
				info= new PerProjectInfo(project);
				this.perProjectInfos.put(project, info);
			}
			return info;
		}
	}	
	

----------------------------------------------------------------------

Comment By: Steve Blackburn (steveb-oss)
Date: 2007-05-29 12:38

Message:
Logged In: YES 
user_id=1215444
Originator: YES

A small update: the corruption is apparently due to a reference being
written over the header word (ie the word values are always reasonable
pointer values).  As noted before, the problem is evident no matter what GC
is used, but does not show up with BaseBase builds.

----------------------------------------------------------------------

Comment By: Dave Grove (dgrove-oss)
Date: 2007-05-25 23:21

Message:
Logged In: YES 
user_id=1215435
Originator: NO

I agree.  Lock index too big is one of the "classic" signs of memory
corruption.  The status word of an object get's smashed and it just happens
that the locking code stumbles across it before anyone else.



----------------------------------------------------------------------

Comment By: Steve Blackburn (steveb-oss)
Date: 2007-05-25 17:01

Message:
Logged In: YES 
user_id=1215444
Originator: YES

With a bit more verbosity, I see this on failure:

[GC 8 Start 68.11 s   656596KB -> 182200KB   2194.25 ms]
Lock index out of range! Word: 0x832092f8 index: 51236 locks: 4097
vm internal error at:
-- Stack --
   Lorg/jikesrvm/VM; sysFail(Ljava/lang/String;)V at line 1988
   Lorg/jikesrvm/VM;
_assertionFailure(Ljava/lang/String;Ljava/lang/String;)V at line 527
   Lorg/jikesrvm/VM; _assert(ZLjava/lang/String;Ljava/lang/String;)V at
line 510
   Lorg/jikesrvm/VM; _assert(Z)V at line 490
   Lorg/jikesrvm/scheduler/VM_ThinLock;
getLockIndex(Lorg/vmmagic/unboxed/Word;)I at line 55
   Lorg/jikesrvm/scheduler/VM_ThinLock;
lock(Ljava/lang/Object;Lorg/vmmagic/unboxed/Offset;)V at line 154
   Lorg/jikesrvm/scheduler/VM_ThinLock;
inlineLock(Ljava/lang/Object;Lorg/vmmagic/unboxed/Offset;)V at line 84
   Lorg/eclipse/jdt/internal/core/JavaModelManager;
getPerProjectInfo(Lorg/eclipse/core/resources/IProject;Z)Lorg/eclipse/jdt/internal/core/JavaModelManager$PerProjectInfo;
at line 1204


----------------------------------------------------------------------

Comment By: Steve Blackburn (steveb-oss)
Date: 2007-05-25 16:26

Message:
Logged In: YES 
user_id=1215444
Originator: YES

Refactoring and assertions added in r12330 seem to suggest some issue with
corruption.  We see now that the problem is that when getting the lock
index, the index is sometimes out of range.  However, this same assertion
is checked at the time the value is stored and that assertion is never
triggered...

Forcing VM_ThinkLock.getLockIndex() to be NoOptCompile did not seem to
help.

Further BaseBase testing continues to run eclipse without failure (many
hours now on a fast machine without failure).

----------------------------------------------------------------------

Comment By: Steve Blackburn (steveb-oss)
Date: 2007-05-25 12:36

Message:
Logged In: YES 
user_id=1215444
Originator: YES

This bug does not appear under BaseBase builds (in fact I have seen no
bugs with eclipse under BaseBase, despite a lot of continuous testing in a
loop.)

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=712768&aid=1725306&group_id=128805