[Bigdata-commit] SF.net SVN: bigdata:[4463] branches/QUADS_QUERY_BRANCH/bigdata/src

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 422-6466

Revision: 4463
          http://bigdata.svn.sourceforge.net/bigdata/?rev=4463&view=rev
Author:   thompsonbry
Date:     2011-05-08 14:15:50 +0000 (Sun, 08 May 2011)

Log Message:
-----------
Continued work on the htree. At this point I believe that I have all of the relevant bit math working with appropriate unit tests (except for maskOffLSB).  

This commit includes a refactoring of the internal interfaces for the B+Tree to support some of the differences between the BTree and the HTree.

The next steps on the HTree are to write the recursive descent and tuple management logic, followed by handling splits in the buddy buckets and buddy hash tables and the gradual deepening of the hash tree.

Some of the mechanisms for persistence have been incorporated into the hash tree, but there needs to be a reconcile between the HTree and the BTree at some point so they can share more code (abstract classes) and interfaces for generalized persistent indices.

Modified Paths:
--------------
    branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/btree/AbstractBTree.java
    branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/btree/AbstractNode.java
    branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/btree/BTree.java
    branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/btree/BytesUtil.java
    branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/btree/IndexSegmentBuilder.java
    branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/btree/UnisolatedReadWriteIndex.java
    branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/btree/data/IAbstractNodeData.java
    branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/btree/data/ILeafData.java
    branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/btree/data/INodeData.java
    branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/htree/HashBucket.java
    branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/htree/HashTree.java
    branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/htree/data/IBucketData.java
    branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/htree/data/IDirectoryData.java
    branches/QUADS_QUERY_BRANCH/bigdata/src/test/com/bigdata/btree/AbstractBTreeTestCase.java
    branches/QUADS_QUERY_BRANCH/bigdata/src/test/com/bigdata/btree/TestBytesUtil.java
    branches/QUADS_QUERY_BRANCH/bigdata/src/test/com/bigdata/htree/TestAll.java
    branches/QUADS_QUERY_BRANCH/bigdata/src/test/com/bigdata/htree/TestExtensibleHashing.java

Added Paths:
-----------
    branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/btree/data/IChildData.java
    branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/btree/data/IKeysData.java
    branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/btree/data/ISpannedTupleCountData.java
    branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/htree/AbstractHTree.java
    branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/htree/HTree.java
    branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/htree/HTreeUtil.java
    branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/htree/MutableDirectoryPageData.java
    branches/QUADS_QUERY_BRANCH/bigdata/src/test/com/bigdata/htree/TestHTree.java
    branches/QUADS_QUERY_BRANCH/bigdata/src/test/com/bigdata/htree/TestHTreeUtil.java

Modified: branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/btree/AbstractBTree.java
===================================================================

--- branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/btree/AbstractBTree.java	2011-05-07 15:44:27 UTC (rev 4462)
+++ branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/btree/AbstractBTree.java	2011-05-08 14:15:50 UTC (rev 4463)
@@ -867,21 +867,21 @@
         return counterSet;
 
     }
-    
-    /**
-     * @param store
-     *            The persistence store.
-     * @param nodeFactory
-     *            Object that provides a factory for node and leaf objects.
-     * @param addrSer
-     *            Object that knows how to (de-)serialize the child addresses in
-     *            an {@link INodeData}.
-     * @param readOnly
-     *            <code>true</code> IFF it is <em>known</em> that the
-     *            {@link AbstractBTree} is read-only.
-     * @param metadata
-     *            The {@link IndexMetadata} object for this B+Tree.
-     */
+
+	/**
+	 * @param store
+	 *            The persistence store.
+	 * @param nodeFactory
+	 *            Object that provides a factory for node and leaf objects.
+	 * @param readOnly
+	 *            <code>true</code> IFF it is <em>known</em> that the
+	 *            {@link AbstractBTree} is read-only.
+	 * @param metadata
+	 *            The {@link IndexMetadata} object for this B+Tree.
+	 * @param recordCompressorFactory
+	 *            Object that knows how to (de-)compress the serialized data
+	 *            records.
+	 */
     protected AbstractBTree(//
             final IRawStore store,//
             final INodeFactory nodeFactory,//
@@ -3761,40 +3761,40 @@
         if (addr == IRawStore.NULL)
             throw new IllegalArgumentException();
         
-        final Long addr2 = Long.valueOf(addr); 
-
-        if (storeCache != null) {
-
-            // test cache : will touch global LRU iff found.
-            final IAbstractNodeData data = (IAbstractNodeData) storeCache
-                    .get(addr);
-
-            if (data != null) {
-
-                // Node and Leaf MUST NOT make it into the global LRU or store
-                // cache!
-                assert !(data instanceof AbstractNode<?>);
-                
-                final AbstractNode<?> node;
-                
-                if (data.isLeaf()) {
-
-                    node = nodeSer.nodeFactory.allocLeaf(this, addr,
-                            (ILeafData) data);
-
-                } else {
-
-                    node = nodeSer.nodeFactory.allocNode(this, addr,
-                            (INodeData) data);
-
-                }
-
-                // cache hit.
-                return node;
-                
-            }
-            
-        }
+//        final Long addr2 = Long.valueOf(addr); 
+//
+//        if (storeCache != null) {
+//
+//            // test cache : will touch global LRU iff found.
+//            final IAbstractNodeData data = (IAbstractNodeData) storeCache
+//                    .get(addr);
+//
+//            if (data != null) {
+//
+//                // Node and Leaf MUST NOT make it into the global LRU or store
+//                // cache!
+//                assert !(data instanceof AbstractNode<?>);
+//                
+//                final AbstractNode<?> node;
+//                
+//                if (data.isLeaf()) {
+//
+//                    node = nodeSer.nodeFactory.allocLeaf(this, addr,
+//                            (ILeafData) data);
+//
+//                } else {
+//
+//                    node = nodeSer.nodeFactory.allocNode(this, addr,
+//                            (INodeData) data);
+//
+//                }
+//
+//                // cache hit.
+//                return node;
+//                
+//            }
+//            
+//        }
         
         final ByteBuffer tmp;
         {
@@ -3851,21 +3851,21 @@
 
             }
 
-            if (storeCache != null) {
-             
-                // update cache : will touch global LRU iff cache is modified.
-                final IAbstractNodeData data2 = (IAbstractNodeData) storeCache
-                        .putIfAbsent(addr2, data);
+//            if (storeCache != null) {
+//             
+//                // update cache : will touch global LRU iff cache is modified.
+//                final IAbstractNodeData data2 = (IAbstractNodeData) storeCache
+//                        .putIfAbsent(addr2, data);
+//
+//                if (data2 != null) {
+//
+//                    // concurrent insert, use winner's value.
+//                    data = data2;
+//
+//                } 
+//                
+//            }
 
-                if (data2 != null) {
-
-                    // concurrent insert, use winner's value.
-                    data = data2;
-
-                } 
-                
-            }
-
             // wrap as Node or Leaf.
             final AbstractNode<?> node = nodeSer.wrap(this, addr, data);
 

Modified: branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/btree/AbstractNode.java
===================================================================
--- branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/btree/AbstractNode.java	2011-05-07 15:44:27 UTC (rev 4462)
+++ branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/btree/AbstractNode.java	2011-05-08 14:15:50 UTC (rev 4463)
@@ -35,6 +35,8 @@
 import org.apache.log4j.Logger;
 
 import com.bigdata.btree.data.IAbstractNodeData;
+import com.bigdata.btree.data.IKeysData;
+import com.bigdata.btree.data.ISpannedTupleCountData;
 import com.bigdata.btree.filter.EmptyTupleIterator;
 import com.bigdata.btree.raba.IRaba;
 import com.bigdata.btree.raba.MutableKeyBuffer;
@@ -55,7 +57,7 @@
  * DO-NOT-USE-GENERIC-HERE. The compiler will fail under Linux (JDK 1.6.0_14,
  * _16).
  */
-> extends PO implements IAbstractNode, IAbstractNodeData {
+> extends PO implements IAbstractNode, IAbstractNodeData, IKeysData, ISpannedTupleCountData {
 
     /**
      * Log for node and leaf operations.

Modified: branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/btree/BTree.java
===================================================================
--- branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/btree/BTree.java	2011-05-07 15:44:27 UTC (rev 4462)
+++ branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/btree/BTree.java	2011-05-08 14:15:50 UTC (rev 4463)
@@ -1162,46 +1162,62 @@
         assertNotTransient();
         assertNotReadOnly();
 
-        /*
-         * Note: Acquiring this lock provides for atomicity of the checkpoint of
-         * the BTree during the commit protocol. Without this lock, users of the
-         * UnisolatedReadWriteIndex could be concurrently modifying the BTree
-         * while we are attempting to snapshot it for the commit.
-         * 
-         * @see https://sourceforge.net/apps/trac/bigdata/ticket/288
-         * @see https://sourceforge.net/apps/trac/bigdata/ticket/278
-         */
-        final Lock lock = new UnisolatedReadWriteIndex(this).writeLock();
-        try {
-                
-            if (/*autoCommit &&*/ needsCheckpoint()) {
+		/*
+		 * Note: Acquiring this lock provides for atomicity of the checkpoint of
+		 * the BTree during the commit protocol. Without this lock, users of the
+		 * UnisolatedReadWriteIndex could be concurrently modifying the BTree
+		 * while we are attempting to snapshot it for the commit.
+		 * 
+		 * Note: An alternative design would declare a global read/write lock
+		 * for mutation of the indices in addition to the per-BTree read/write
+		 * lock provided by UnisolatedReadWriteIndex. Rather than taking the
+		 * per-BTree write lock here, we would take the global write lock in the
+		 * AbstractJournal's commit protocol, e.g., commitNow(). The global read
+		 * lock would be taken by UnisolatedReadWriteIndex before taking the
+		 * per-BTree write lock. This is effectively a hierarchical locking
+		 * scheme and could provide a workaround if deadlocks are found to occur
+		 * due to lock ordering problems with the acquisition of the
+		 * UnisolatedReadWriteIndex lock (the absence of lock ordering problems
+		 * really hinges around UnisolatedReadWriteLocks not being taken for
+		 * more than one index at a time.)
+		 * 
+		 * @see https://sourceforge.net/apps/trac/bigdata/ticket/288
+		 * 
+		 * @see https://sourceforge.net/apps/trac/bigdata/ticket/278
+		 */
+		final Lock lock = new UnisolatedReadWriteIndex(this).writeLock();
+		try {
 
-                /*
-                 * Flush the btree, write a checkpoint record, and return the
-                 * address of that checkpoint record. The [checkpoint] reference
-                 * is also updated.
-                 */
+			if (/* autoCommit && */needsCheckpoint()) {
 
-                return writeCheckpoint();
+				/*
+				 * Flush the btree, write a checkpoint record, and return the
+				 * address of that checkpoint record. The [checkpoint] reference
+				 * is also updated.
+				 */
 
-            }
+				return writeCheckpoint();
 
-            /*
-             * There have not been any writes on this btree or auto-commit is
-             * disabled.
-             * 
-             * Note: if the application has explicitly invoked writeCheckpoint()
-             * then the returned address will be the address of that checkpoint
-             * record and the BTree will have a new checkpoint address made
-             * restart safe on the backing store.
-             */
-    
-            return checkpoint.addrCheckpoint;
+			}
 
-        } finally {
-            lock.unlock();
-        }
+			/*
+			 * There have not been any writes on this btree or auto-commit is
+			 * disabled.
+			 * 
+			 * Note: if the application has explicitly invoked writeCheckpoint()
+			 * then the returned address will be the address of that checkpoint
+			 * record and the BTree will have a new checkpoint address made
+			 * restart safe on the backing store.
+			 */
 
+			return checkpoint.addrCheckpoint;
+
+		} finally {
+
+			lock.unlock();
+
+		}
+
     }
     
     /**

Modified: branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/btree/BytesUtil.java
===================================================================
--- branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/btree/BytesUtil.java	2011-05-07 15:44:27 UTC (rev 4462)
+++ branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/btree/BytesUtil.java	2011-05-08 14:15:50 UTC (rev 4463)
@@ -1034,8 +1034,8 @@
 	 * @return The hash value considering only the MSB <i>nbits</i> and shifted
 	 *         down into an <i>nbits</i> integer.
 	 */
-	public static int maskOff(final int h, final int nbits) {
-	
+	public static int maskOffMSB(final int h, final int nbits) {
+		
 	    if (nbits < 0 || nbits > 32)
 	        throw new IllegalArgumentException();
 	
@@ -1048,6 +1048,29 @@
 	}
 
 	/**
+	 * Mask off all but the LSB <i>nbits</i> of the hash value.
+	 * 
+	 * @param h
+	 *            The hash value.
+	 * @param nbits
+	 *            The #of LSB bits to be retained.
+	 * 
+	 * @return The LSB <i>nbits</i>.
+	 * 
+	 * TODO unit test.
+	 */
+	public static int maskOffLSB(final int h, final int nbits) {
+		
+	    if (nbits < 0 || nbits > 32)
+	        throw new IllegalArgumentException();
+	
+	    final int v = h & ~masks32[nbits];
+	
+	    return v;
+	
+	}
+
+	/**
 	 * Return the n-bit integer corresponding to the inclusive bit range of the
 	 * byte[]. Bit ZERO (0) is the Most Significant Bit (MSB). Bit positions
 	 * increase from zero up to <code>a.length * 8 - 1</code>. The return value

Modified: branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/btree/IndexSegmentBuilder.java
===================================================================
--- branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/btree/IndexSegmentBuilder.java	2011-05-07 15:44:27 UTC (rev 4462)
+++ branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/btree/IndexSegmentBuilder.java	2011-05-08 14:15:50 UTC (rev 4463)
@@ -48,6 +48,7 @@
 import com.bigdata.btree.data.IAbstractNodeData;
 import com.bigdata.btree.data.ILeafData;
 import com.bigdata.btree.data.INodeData;
+import com.bigdata.btree.data.ISpannedTupleCountData;
 import com.bigdata.btree.raba.IRaba;
 import com.bigdata.btree.raba.MutableKeyBuffer;
 import com.bigdata.btree.raba.MutableValueBuffer;
@@ -3284,7 +3285,7 @@
      *         Thompson</a>
      */
     abstract protected static class AbstractSimpleNodeData implements
-            IAbstractNodeData {
+            IAbstractNodeData, ISpannedTupleCountData {
 
         /**
          * The level in the output tree for this node or leaf (origin zero). The

Modified: branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/btree/UnisolatedReadWriteIndex.java
===================================================================
--- branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/btree/UnisolatedReadWriteIndex.java	2011-05-07 15:44:27 UTC (rev 4462)
+++ branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/btree/UnisolatedReadWriteIndex.java	2011-05-08 14:15:50 UTC (rev 4463)
@@ -45,7 +45,6 @@
 import com.bigdata.btree.proc.IResultHandler;
 import com.bigdata.btree.proc.ISimpleIndexProcedure;
 import com.bigdata.btree.view.FusedView;
-import com.bigdata.concurrent.LockManager;
 import com.bigdata.counters.ICounterSet;
 import com.bigdata.journal.ConcurrencyManager;
 import com.bigdata.journal.IConcurrencyManager;
@@ -112,22 +111,19 @@
  * require a means to correctly interleave access to the unisolated index, which
  * is the purpose of this class.
  * <p>
- * While the {@link LockManager} could be modified to support Share vs Exclusive
- * locks and to use Share locks for readers and Exclusive locks for writers,
- * writers would still block until the next commit so the throughput (e.g., when
+ * While the lock manager could be modified to support Share vs Exclusive locks
+ * and to use Share locks for readers and Exclusive locks for writers, writers
+ * would still block until the next commit so the throughput (e.g., when
  * computing the fix point of a rule set) is significantly lower.
  * 
  * @author <a href="mailto:tho...@us...">Bryan Thompson</a>
- * @version $Id$
+ * @version $Id: UnisolatedReadWriteIndex.java 4054 2011-01-05 13:51:25Z
+ *          thompsonbry $
  */
 public class UnisolatedReadWriteIndex implements IIndex {
 
-    protected static final Logger log = Logger.getLogger(UnisolatedReadWriteIndex.class);
-
-//    protected static final boolean INFO = log.isInfoEnabled();
-
-    protected static final boolean DEBUG = log.isDebugEnabled();
-    
+    private static final Logger log = Logger.getLogger(UnisolatedReadWriteIndex.class);
+  
     /**
      * The #of milliseconds that the class will wait for a read or write lock. A
      * (wrapped) {@link InterruptedException} will be thrown if this timeout is
@@ -158,7 +154,7 @@
         
         try {
             
-            if(DEBUG) {
+            if(log.isDebugEnabled()) {
                 
                 log.debug(ndx.toString());
                 
@@ -198,7 +194,7 @@
 
         try {
 
-            if(DEBUG) {
+            if(log.isDebugEnabled()) {
                 
                 log.debug(ndx.toString()
 //                        , new RuntimeException()
@@ -233,7 +229,7 @@
      *            
      * @return The acquired lock.
      */
-    private Lock lock(IIndexProcedure proc) {
+    private Lock lock(final IIndexProcedure proc) {
      
         if (proc == null)
             throw new IllegalArgumentException();
@@ -248,11 +244,11 @@
         
     }
 
-    private void unlock(Lock lock) {
+    private void unlock(final Lock lock) {
         
         lock.unlock();
         
-        if(DEBUG) {
+        if(log.isDebugEnabled()) {
             
             log.debug(ndx.toString());
             
@@ -264,7 +260,7 @@
      * The unisolated index partition. This is either a {@link BTree} or a
      * {@link FusedView}.
      */
-    final private IIndex ndx;
+    final private BTree ndx;
     
     /**
      * The {@link ReadWriteLock} used to permit concurrent readers on an
@@ -343,23 +339,41 @@
 
         this.ndx = ndx;
 
-        synchronized (locks) {
+        this.readWriteLock = getReadWriteLock(ndx);
 
-            ReadWriteLock readWriteLock = locks.get(ndx);
+    }
 
-            if (readWriteLock == null) {
+	/**
+	 * Canonicalizing factory for the {@link ReadWriteLock} for a {@link BTree}.
+	 * 
+	 * @param btree
+	 *            The btree.
+	 * @return The lock.
+	 * 
+	 * @throws IllegalArgumentException
+	 *             if the argument is <code>null</code>.
+	 */
+	private ReadWriteLock getReadWriteLock(final BTree btree) {
 
-                readWriteLock = new ReentrantReadWriteLock(false/* fair */);
+		if (ndx == null)
+			throw new IllegalArgumentException();
+		
+		synchronized (locks) {
 
-                locks.put(ndx, readWriteLock);
+			ReadWriteLock readWriteLock = locks.get(ndx);
 
-            }
+			if (readWriteLock == null) {
 
-            this.readWriteLock = readWriteLock;
-            
-        }
-        
-    }
+				readWriteLock = new ReentrantReadWriteLock(false/* fair */);
+
+				locks.put(ndx, readWriteLock);
+
+			}
+
+			return readWriteLock;
+		}
+
+	}   
     
     public String toString() {
         

Modified: branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/btree/data/IAbstractNodeData.java
===================================================================
--- branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/btree/data/IAbstractNodeData.java	2011-05-07 15:44:27 UTC (rev 4462)
+++ branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/btree/data/IAbstractNodeData.java	2011-05-08 14:15:50 UTC (rev 4463)
@@ -27,7 +27,6 @@
 
 package com.bigdata.btree.data;
 
-import com.bigdata.btree.raba.IRaba;
 import com.bigdata.io.AbstractFixedByteArrayBuffer;
 import com.bigdata.io.IDataRecordAccess;
 
@@ -93,28 +92,28 @@
      */
     long getMaximumVersionTimestamp();
     
-    /**
-     * The #of tuples spanned by this node or leaf. For a leaf this is always
-     * the #of keys.
-     * 
-     * @see INodeData#getChildEntryCounts()
-     */
-    int getSpannedTupleCount();
+//    /**
+//     * The #of tuples spanned by this node or leaf. For a leaf this is always
+//     * the #of keys.
+//     * 
+//     * @see INodeData#getChildEntryCounts()
+//     */
+//    int getSpannedTupleCount();
 
-    /**
-     * Return the #of keys in the node or leaf. A node has <code>nkeys+1</code>
-     * children. A leaf has <code>nkeys</code> keys and values. The maximum #of
-     * keys for a node is one less than the branching factor of the B+Tree. The
-     * maximum #of keys for a leaf is the branching factor of the B+Tree. For a
-     * hash bucket, this is the #of entries in the bucket.
-     * 
-     * @return The #of defined keys.
-     */
-    int getKeyCount();
+//    /**
+//     * Return the #of keys in the node or leaf. A node has <code>nkeys+1</code>
+//     * children. A leaf has <code>nkeys</code> keys and values. The maximum #of
+//     * keys for a node is one less than the branching factor of the B+Tree. The
+//     * maximum #of keys for a leaf is the branching factor of the B+Tree. For a
+//     * hash bucket, this is the #of entries in the bucket.
+//     * 
+//     * @return The #of defined keys.
+//     */
+//    int getKeyCount();
+//
+//    /**
+//     * The object used to contain and manage the keys.
+//     */
+//    IRaba getKeys();
 
-    /**
-     * The object used to contain and manage the keys.
-     */
-    IRaba getKeys();
-
 }

Added: branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/btree/data/IChildData.java
===================================================================
--- branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/btree/data/IChildData.java	                        (rev 0)
+++ branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/btree/data/IChildData.java	2011-05-08 14:15:50 UTC (rev 4463)
@@ -0,0 +1,57 @@
+/**
+
+Copyright (C) SYSTAP, LLC 2006-2007.  All rights reserved.
+
+Contact:
+     SYSTAP, LLC
+     4501 Tower Road
+     Greensboro, NC 27410
+     lic...@bi...
+
+This program is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation; version 2 of the License.
+
+This program is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with this program; if not, write to the Free Software
+Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+*/
+/*
+ * Created on May 2nd, 2011
+ */
+package com.bigdata.btree.data;
+
+/**
+ * Interface for data access to children of an index node.
+ * 
+ * @author <a href="mailto:tho...@us...">Bryan Thompson</a>
+ * @version $Id: ILeafData.java 4388 2011-04-11 13:35:47Z thompsonbry $
+ */
+public interface IChildData {
+
+    /**
+     * The #of children of this node. Either all children will be nodes or all
+     * children will be leaves. The #of children of a node MUST be
+     * <code>{@link IAbstractNodeData#getKeyCount()}+1</code>
+     * 
+     * @return The #of children of this node.
+     */
+    public int getChildCount();
+
+    /**
+     * Return the persistent addresses of the specified child node.
+     * 
+     * @param index
+     *            The index of the child in [0:nkeys].
+     * 
+     * @return The persistent child address -or- zero(0L) if the child is not
+     *         persistent.
+     */
+    public long getChildAddr(int index);
+
+}

Added: branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/btree/data/IKeysData.java
===================================================================
--- branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/btree/data/IKeysData.java	                        (rev 0)
+++ branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/btree/data/IKeysData.java	2011-05-08 14:15:50 UTC (rev 4463)
@@ -0,0 +1,56 @@
+/**
+
+Copyright (C) SYSTAP, LLC 2006-2007.  All rights reserved.
+
+Contact:
+     SYSTAP, LLC
+     4501 Tower Road
+     Greensboro, NC 27410
+     lic...@bi...
+
+This program is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation; version 2 of the License.
+
+This program is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with this program; if not, write to the Free Software
+Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+*/
+/*
+ * Created on May 2nd, 2011
+ */
+package com.bigdata.btree.data;
+
+import com.bigdata.btree.raba.IRaba;
+
+/**
+ * Interface for access to the keys {@link IRaba} of a node or leaf in an index
+ * data structure. 
+ *  
+ * @author <a href="mailto:tho...@us...">Bryan Thompson</a>
+ * @version $Id: ILeafData.java 4388 2011-04-11 13:35:47Z thompsonbry $
+ */
+public interface IKeysData {
+
+    /**
+     * Return the #of keys in the node or leaf. A node has <code>nkeys+1</code>
+     * children. A leaf has <code>nkeys</code> keys and values. The maximum #of
+     * keys for a node is one less than the branching factor of the B+Tree. The
+     * maximum #of keys for a leaf is the branching factor of the B+Tree. For a
+     * hash bucket, this is the #of entries in the bucket.
+     * 
+     * @return The #of defined keys.
+     */
+    int getKeyCount();
+
+    /**
+     * The object used to contain and manage the keys.
+     */
+    IRaba getKeys();
+
+}

Modified: branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/btree/data/ILeafData.java
===================================================================
--- branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/btree/data/ILeafData.java	2011-05-07 15:44:27 UTC (rev 4462)
+++ branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/btree/data/ILeafData.java	2011-05-08 14:15:50 UTC (rev 4463)
@@ -37,7 +37,7 @@
  * @author <a href="mailto:tho...@us...">Bryan Thompson</a>
  * @version $Id$
  */
-public interface ILeafData extends IAbstractNodeData {
+public interface ILeafData extends IAbstractNodeData, IKeysData, ISpannedTupleCountData {
 
     /**
      * The #of values in the leaf (this MUST be equal to the #of keys for a

Modified: branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/btree/data/INodeData.java
===================================================================
--- branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/btree/data/INodeData.java	2011-05-07 15:44:27 UTC (rev 4462)
+++ branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/btree/data/INodeData.java	2011-05-08 14:15:50 UTC (rev 4463)
@@ -27,36 +27,35 @@
 
 package com.bigdata.btree.data;
 
-
 /**
  * Interface for low-level data access for the non-leaf nodes of a B+-Tree.
  * 
  * @author <a href="mailto:tho...@us...">Bryan Thompson</a>
  * @version $Id$
  */
-public interface INodeData extends IAbstractNodeData {
+public interface INodeData extends IAbstractNodeData, IKeysData, IChildData, ISpannedTupleCountData {
 
-    /**
-     * The #of children of this node. Either all children will be nodes or all
-     * children will be leaves. The #of children of a node MUST be
-     * <code>{@link IAbstractNodeData#getKeyCount()}+1</code>
-     * 
-     * @return The #of children of this node.
-     */
-    public int getChildCount();
+//    /**
+//     * The #of children of this node. Either all children will be nodes or all
+//     * children will be leaves. The #of children of a node MUST be
+//     * <code>{@link IAbstractNodeData#getKeyCount()}+1</code>
+//     * 
+//     * @return The #of children of this node.
+//     */
+//    public int getChildCount();
+//
+//    /**
+//     * Return the persistent addresses of the specified child node.
+//     * 
+//     * @param index
+//     *            The index of the child in [0:nkeys].
+//     * 
+//     * @return The persistent child address -or- zero(0L) if the child is not
+//     *         persistent.
+//     */
+//    public long getChildAddr(int index);
 
     /**
-     * Return the persistent addresses of the specified child node.
-     * 
-     * @param index
-     *            The index of the child in [0:nkeys].
-     * 
-     * @return The persistent child address -or- zero(0L) if the child is not
-     *         persistent.
-     */
-    public long getChildAddr(int index);
-
-    /**
      * Return the #of tuples spanned by the indicated child of this node. The
      * sum of the values returned by this method across the children of the node
      * should always equal the value returned by {@link #getSpannedTupleCount()}

Added: branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/btree/data/ISpannedTupleCountData.java
===================================================================
--- branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/btree/data/ISpannedTupleCountData.java	                        (rev 0)
+++ branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/btree/data/ISpannedTupleCountData.java	2011-05-08 14:15:50 UTC (rev 4463)
@@ -0,0 +1,47 @@
+/**
+
+Copyright (C) SYSTAP, LLC 2006-2007.  All rights reserved.
+
+Contact:
+     SYSTAP, LLC
+     4501 Tower Road
+     Greensboro, NC 27410
+     lic...@bi...
+
+This program is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation; version 2 of the License.
+
+This program is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with this program; if not, write to the Free Software
+Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+*/
+/*
+ * Created on Dec 15, 2006
+ */
+
+package com.bigdata.btree.data;
+
+/**
+ * Interface for low-level data access to the #of tuples spanned by a node or
+ * leaf of an index.
+ * 
+ * @author <a href="mailto:tho...@us...">Bryan Thompson</a>
+ * @version $Id: ILeafData.java 4388 2011-04-11 13:35:47Z thompsonbry $
+ */
+public interface ISpannedTupleCountData {
+
+    /**
+     * The #of tuples spanned by this node or leaf. For a leaf this is always
+     * the #of keys.
+     * 
+     * @see INodeData#getChildEntryCounts()
+     */
+    int getSpannedTupleCount();
+
+}

Added: branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/htree/AbstractHTree.java
===================================================================
--- branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/htree/AbstractHTree.java	                        (rev 0)
+++ branches/QUADS_QUERY_BRANCH/bigdata/src/java/com/bigdata/htree/AbstractHTree.java	2011-05-08 14:15:50 UTC (rev 4463)
@@ -0,0 +1,918 @@
+package com.bigdata.htree;
+
+import java.lang.ref.Reference;
+import java.lang.ref.SoftReference;
+import java.lang.ref.WeakReference;
+import java.nio.ByteBuffer;
+import java.util.HashMap;
+
+import org.apache.log4j.Logger;
+
+import com.bigdata.Banner;
+import com.bigdata.LRUNexus;
+import com.bigdata.btree.AbstractBTree;
+import com.bigdata.btree.AbstractNode;
+import com.bigdata.btree.BTree;
+import com.bigdata.btree.BTreeCounters;
+import com.bigdata.btree.DefaultEvictionListener;
+import com.bigdata.btree.IndexMetadata;
+import com.bigdata.btree.IndexSegment;
+import com.bigdata.btree.Leaf;
+import com.bigdata.btree.Node;
+import com.bigdata.btree.NodeSerializer;
+import com.bigdata.btree.PO;
+import com.bigdata.btree.data.IAbstractNodeData;
+import com.bigdata.btree.data.ILeafData;
+import com.bigdata.btree.data.INodeData;
+import com.bigdata.cache.HardReferenceQueue;
+import com.bigdata.cache.HardReferenceQueueWithBatchingUpdates;
+import com.bigdata.cache.IHardReferenceQueue;
+import com.bigdata.cache.RingBuffer;
+import com.bigdata.htree.HTree.AbstractPage;
+import com.bigdata.htree.HTree.DirectoryPage;
+import com.bigdata.rawstore.IRawStore;
+import com.bigdata.resources.IndexManager;
+import com.bigdata.service.DataService;
+
+/**
+ * Abstract base class for a persistence capable extensible hash tree.
+ * 
+ * @author <a href="mailto:tho...@us...">Bryan Thompson</a>
+ * @version $Id$
+ */
+abstract public class AbstractHTree {
+
+    /**
+     * The index is already closed.
+     */
+    protected static final String ERROR_CLOSED = "Closed";
+
+    /**
+     * A parameter was less than zero.
+     */
+    protected static final String ERROR_LESS_THAN_ZERO = "Less than zero";
+
+    /**
+     * A parameter was too large.
+     */
+    protected static final String ERROR_TOO_LARGE = "Too large";
+
+    /**
+     * The index is read-only but a mutation operation was requested.
+     */
+    final protected static String ERROR_READ_ONLY = "Read-only";
+    
+    /**
+     * The index is transient (not backed by persistent storage) but an
+     * operation that requires persistence was requested.
+     */
+    final protected static String ERROR_TRANSIENT = "Transient";
+    
+    private static final transient Logger log = Logger.getLogger(AbstractHTree.class);
+    
+    /**
+     * Counters tracking various aspects of the btree.
+     * <p>
+     * Note: This is <code>volatile</code> to avoid the need for
+     * synchronization in order for changes in the reference to be visible to
+     * threads.
+     */
+    private volatile BTreeCounters btreeCounters = new BTreeCounters();
+
+    /**
+     * Counters tracking various aspects of the btree.
+     */
+    final public BTreeCounters getBtreeCounters() {
+    
+        return btreeCounters;
+        
+    }
+
+    /**
+     * Replace the {@link BTreeCounters}.
+     * <p>
+     * Note: This is used by the {@link IndexManager} to ensure that an index
+     * loaded from its backing store uses the {@link BTreeCounters} associated
+     * with that index since the {@link DataService} was last (re-)started.
+     * 
+     * @param btreeCounters
+     *            The counters to be used.
+     * 
+     * @throws IllegalArgumentException
+     *             if the argument is <code>null</code>.
+     */
+    final public void setBTreeCounters(final BTreeCounters btreeCounters) {
+
+        if (btreeCounters == null)
+            throw new IllegalArgumentException();
+
+        /*
+         * Note: synchronized NOT required since reference is volatile.
+         */
+
+        this.btreeCounters = btreeCounters;
+
+    }
+    
+    /**
+     * The backing store.
+     */
+    protected final IRawStore store;
+
+	/**
+	 * The #of bits in the address space for a directory page (from the
+	 * constructor). This constant is specified when the hash tree is created. A
+	 * directory page has <code>2^addressBits</code> entries. Those entries are
+	 * divided up among one or more buddy hash tables on that page.
+	 */
+    protected final int addressBits;
+
+	/**
+	 * The #of entries in a directory bucket, which is 2^{@link #addressBits}
+	 * (aka <code>1<<addressBits</code>).
+	 */
+    protected final int branchingFactor;
+    
+    /**
+     * The root node.
+     */
+    protected volatile AbstractPage root;
+
+    /**
+     * Nodes (that is nodes or leaves) are added to a hard reference queue when
+     * they are created or read from the store. On eviction from the queue a
+     * dirty node is serialized by a listener against the {@link IRawStore}.
+     * The nodes and leaves refer to their parent with a {@link WeakReference}s.
+     * Likewise, nodes refer to their children with a {@link WeakReference}.
+     * The hard reference queue in combination with {@link #touch(AbstractNode)}
+     * and with hard references held on the stack ensures that the parent and/or
+     * children remain reachable during operations. Once the node is no longer
+     * strongly reachable weak references to that node may be cleared by the VM -
+     * in this manner the node will become unreachable by navigation from its
+     * ancestors in the btree.  The special role of the hard reference queue is
+     * to further ensure that dirty nodes remain dirty by defering persistence
+     * until the reference count for the node is zero during an eviction from
+     * the queue.
+     * <p>
+     * Note that nodes are evicted as new nodes are added to the hard reference
+     * queue. This occurs in two situations: (1) when a new node is created
+     * during a split of an existing node; and (2) when a node is read in from
+     * the store. Inserts on this hard reference queue always drive evictions.
+     * Incremental writes basically make it impossible for the commit set to get
+     * "too large" - the maximum #of nodes to be written is bounded by the size
+     * of the hard reference queue. This helps to ensure fast commit operations
+     * on the store.
+     * <p>
+     * The minimum capacity for the hard reference queue is two (2) so that a
+     * split may occur without forcing eviction of either node participating in
+     * the split.
+     * <p>
+     * Note: The code in {@link Node#postOrderNodeIterator(boolean, boolean)} and
+     * {@link DirtyChildIterator} MUST NOT touch the hard reference queue since
+     * those iterators are used when persisting a node using a post-order
+     * traversal. If a hard reference queue eviction drives the serialization of
+     * a node and we touch the hard reference queue during the post-order
+     * traversal then we break down the semantics of
+     * {@link HardReferenceQueue#add(Object)} as the eviction does not
+     * necessarily cause the queue to reduce in length. Another way to handle
+     * this is to have {@link HardReferenceQueue#add(Object)} begin to evict
+     * objects before is is actually at capacity, but that is also a bit
+     * fragile.
+     * <p>
+     * Note: The {@link #writeRetentionQueue} uses a {@link HardReferenceQueue}.
+     * This is based on a {@link RingBuffer} and is very fast. It does not use a
+     * {@link HashMap} because we can resolve the {@link WeakReference} to the
+     * child {@link Node} or {@link Leaf} using top-down navigation as long as
+     * the {@link Node} or {@link Leaf} remains strongly reachable (that is, as
+     * long as it is on the {@link #writeRetentionQueue} or otherwise strongly
+     * held). This means that lookup in a map is not required for top-down
+     * navigation.
+     * <p>
+     * The {@link LRUNexus} provides an {@link INodeData} / {@link ILeafData}
+     * data record cache based on a hash map with lookup by the address of the
+     * node or leaf. This is tested when the child {@link WeakReference} was
+     * never set or has been cleared. This cache is also used by the
+     * {@link IndexSegment} for the linked-leaf traversal pattern, which does
+     * not use top-down navigation.
+     * 
+     * @todo consider a policy that dynamically adjusts the queue capacities
+     *       based on the height of the btree in order to maintain a cache that
+     *       can contain a fixed percentage, e.g., 5% or 10%, of the nodes in
+     *       the btree. The minimum and maximum size of the cache should be
+     *       bounded. Bounding the minimum size gives better performance for
+     *       small trees. Bounding the maximum size is necessary when the trees
+     *       grow very large. (Partitioned indices may be used for very large
+     *       indices and they can be distributed across a cluster of machines.)
+     *       <p>
+     *       There is a discussion of some issues regarding such a policy in the
+     *       code inside of
+     *       {@link Node#Node(BTree btree, AbstractNode oldRoot, int nentries)}.
+     */
+    final protected IHardReferenceQueue<PO> writeRetentionQueue;
+
+    /**
+     * The #of distinct nodes and leaves on the {@link #writeRetentionQueue}.
+     */
+    protected int ndistinctOnWriteRetentionQueue;
+
+    /**
+     * Return <code>true</code> iff this B+Tree is read-only.
+     */
+    abstract public boolean isReadOnly();
+    
+    /**
+     * 
+     * @throws UnsupportedOperationException
+     *             if the B+Tree is read-only.
+     * 
+     * @see #isReadOnly()
+     */
+    final protected void assertNotReadOnly() {
+        
+        if(isReadOnly()) {
+            
+            throw new UnsupportedOperationException(ERROR_READ_ONLY);
+            
+        }
+        
+    }
+    
+//    /**
+//     * The metadata record for the index. This data rarely changes during the
+//     * life of the {@link HTree} object, but it CAN be changed.
+//     */
+//    protected IndexMetadata metadata;
+
+    /**
+     * Used to serialize and de-serialize the nodes and leaves of the tree.
+     */
+    final protected NodeSerializer nodeSer = null;
+
+    /**
+     * The backing store.
+     */
+    public IRawStore getStore() {
+    	return store;
+    }
+
+	/**
+	 * The #of bits in the address space for a hash directory page. This
+	 * constant is specified to the constructor. The #of child pages is
+	 * <code>2^addressBits</code>. When <i>addressBits</i> is <code>10</code> we
+	 * have <code>2^10 := 1024</code>. If the size of a child address is 4
+	 * bytes, then a 10 bit address space implies a 4k page size.
+	 */
+    public final int getAddressBits() {
+    	return addressBits;
+    }
+    
+    /**
+	 * @param store
+	 *            The persistence store.
+	 * @param nodeFactory
+	 *            Object that provides a factory for node and leaf objects.
+	 * @param readOnly
+	 *            <code>true</code> IFF it is <em>known</em> that the
+	 *            {@link AbstractBTree} is read-only.
+	 * @param metadata
+	 *            The {@link IndexMetadata} object for this B+Tree.
+	 * @param recordCompressorFactory
+	 *            Object that knows how to (de-)compress the serialized data
+	 *            records.
+	 */
+    protected AbstractHTree(//
+            final IRawStore store,//
+//            final INodeFactory nodeFactory,//
+            final boolean readOnly,
+            final int addressBits // TODO Move into IndexMetadata?
+//            final IndexMetadata metadata,//
+//            final IRecordCompressorFactory<?> recordCompressorFactory
+            ) {
+
+        // show the copyright banner during startup.
+        Banner.banner();
+
+        if (store == null)
+            throw new IllegalArgumentException();
+
+        if (addressBits <= 0)
+            throw new IllegalArgumentException();
+
+        if (addressBits > 32)
+            throw new IllegalArgumentException();
+
+//        if (nodeFactory == null)
+//            throw new IllegalArgumentException();
+
+        // Note: MAY be null (implies a transient BTree).
+//        assert store != null;
+
+//        if (metadata == null)
+//            throw new IllegalArgumentException();
+//
+//        // save a reference to the immutable metadata record.
+//        this.metadata = metadata;
+
+//        this.writeTuple = new Tuple(this, KEYS | VALS);
+        
+        this.store = store;
+        this.addressBits = addressBits;
+        this.branchingFactor = 1 << addressBits;
+
+//        /*
+//         * Compute the minimum #of children/values. This is the same whether
+//         * this is a Node or a Leaf.
+//         */
+//        minChildren = (branchingFactor + 1) >> 1;
+
+////        /*
+////         * The Memoizer is not used by the mutable B+Tree since it is not safe
+////         * for concurrent operations.
+////         */
+////        memo = !readOnly ? null : new ChildMemoizer(loadChild);
+//        /*
+//         * Note: The Memoizer pattern is now used for both mutable and read-only
+//         * B+Trees. This is because the real constraint on the mutable B+Tree is
+//         * that mutation may not be concurrent with any other operation but
+//         * concurrent readers ARE permitted. The UnisolatedReadWriteIndex
+//         * explicitly permits concurrent read operations by virtue of using a
+//         * ReadWriteLock rather than a single lock.
+//         */
+//        memo = new ChildMemoizer(loadChild);
+        
+        /*
+         * Setup buffer for Node and Leaf objects accessed via top-down
+         * navigation. While a Node or a Leaf remains on this buffer the
+         * parent's WeakReference to the Node or Leaf will not be cleared and it
+         * will remain reachable.
+         */
+        this.writeRetentionQueue = newWriteRetentionQueue(readOnly);
+
+//        this.nodeSer = new NodeSerializer(//
+//                store, // addressManager
+//                nodeFactory,//
+//                branchingFactor,//
+//                0, //initialBufferCapacity
+//                metadata,//
+//                readOnly,//
+//                recordCompressorFactory
+//                );
+        
+//        if (store == null) {
+//
+//            /*
+//             * Transient BTree.
+//             * 
+//             * Note: The write retention queue controls how long nodes remain
+//             * mutable. On eviction, they are coded but not written onto the
+//             * backing store (since there is none for a transient BTree).
+//             * 
+//             * The readRetentionQueue is not used for a transient BTree since
+//             * the child nodes and the parents are connected using hard links
+//             * rather than weak references.
+//             */
+//
+//            this.storeCache = null;
+//            
+////            this.globalLRU = null;
+//            
+////            this.readRetentionQueue = null;
+//            
+//        } else {
+//
+//            /*
+//             * Persistent BTree.
+//             * 
+//             * The global LRU is used to retain recently used node/leaf data
+//             * records in memory and the per-store cache provides random access
+//             * to those data records. Only the INodeData or ILeafData is stored
+//             * in the cache. This allows reuse of the data records across B+Tree
+//             * instances since the data are read-only and the data records
+//             * support concurrent read operations. The INodeData or ILeafData
+//             * will be wrapped as a Node or Leaf by the owning B+Tree instance.
+//             */
+//
+//            /*
+//             * FIXME if the LRUNexus is disabled, then use a
+//             * ConcurrentWeakValueCacheWithTimeout to buffer the leaves of an
+//             * IndexSegment. Essentially, a custom cache. Otherwise we lose some
+//             * of the performance of the leaf iterator for the index segment
+//             * since leaves are not recoverable by random access without a
+//             * cache.
+//             */
+//            this.storeCache = LRUNexus.getCache(store);
+//            
+////            this.readRetentionQueue = newReadRetentionQueue();
+//        
+//        }
+
+    }
+
+    /**
+     * Note: Method is package private since it must be overridden for some unit
+     * tests.
+     */
+    IHardReferenceQueue<PO> newWriteRetentionQueue(final boolean readOnly) {
+
+    	// FIXME Restore use of the [metadata] object.
+    	final int writeRetentionQueueCapacity = 1000;//metadata.getWriteRetentionQueueCapacity();
+    	final int writeRetentionQueueScan = 5;//metadata.getWriteRetentionQueueScan();
+    	
+        if(readOnly) {
+
+            /*
+             * This provisions an alternative hard reference queue using thread
+             * local queues to collect hard references which are then batched
+             * through to the backing hard reference queue in order to reduce
+             * contention for the lock required to write on the backing hard
+             * reference queue (no lock is required for the thread-local
+             * queues).
+             * 
+             * The danger with a true thread-local design is that a thread can
+             * come in and do some work, get some updates buffered in its
+             * thread-local array, and then never visit again. In this case
+             * those updates would remain buffered on the thread and would not
+             * in fact cause the access order to be updated in a timely manner.
+             * Worse, if you are relying on WeakReference semantics, the
+             * buffered updates would remain strongly reachable and the
+             * corresponding objects would be wired into the cache.
+             * 
+             * I've worked around this issue by scoping the buffers to the
+             * AbstractBTree instance. When the B+Tree container is closed, all
+             * buffered updates were discarded. This nicely eliminated the
+             * problems with "escaping" threads. This approach also has the
+             * maximum concurrency since there is no blocking when adding a
+             * touch to the thread-local buffer.
+             * 
+             * Another approach is to use striped locks. The infinispan BCHM
+             * does this. In this approach, the Segment is guarded by a lock and
+             * the array buffering the touches is inside of the Segment. Since
+             * the Segment is selected by the hash of the key, all Segments will
+             * be visited in a timely fashion for any reasonable workload. This
+             * ensures that updates can not "escape" and will be propagated to
+             * the shared backing buffer in a timely manner.
+             */
+            
+            return new HardReferenceQueueWithBatchingUpdates<PO>(//
+                    new HardReferenceQueue<PO>(new DefaultEvictionListener(),
+                            writeRetentionQueueCapacity, 0/* nscan */),
+//                    new DefaultEvictionListener(),//
+//                    metadata.getWriteRetentionQueueCapacity(),// shared capacity
+                    writeRetentionQueueScan,// thread local
+                    128,//64, // thread-local queue capacity @todo config
+                    64, //32 // thread-local tryLock size @todo config
+                    null // batched updates listener.
+            );
+
+        }
+        
+        return new HardReferenceQueue<PO>(//
+                new DefaultEvictionListener(),//
+                writeRetentionQueueCapacity,//
+                writeRetentionQueueScan//
+        );
+
+    }
+
+    /**
+     * <p>
+     * This method is responsible for putting the node or leaf onto the ring
+     * buffer which controls (a) how long we retain a hard reference to the node
+     * or leaf; and (b) for writes, when the node or leaf is evicted with a zero
+     * reference count and made persistent (along with all dirty children). The
+     * concurrency requirements and the implementation behavior and guarentees
+     * differ depending on whether the B+Tree is read-only or mutable for two
+     * reasons: For writers, the B+Tree is single-threaded so there is no
+     * contention. For readers, every touch on the B+Tree goes through this
+     * point, so it is vital to make this method non-blocking.
+     * </p>
+     * <h3>Writers</h3>
+     * <p>
+     * This method guarantees that the specified node will NOT be synchronously
+     * persisted as a side effect and thereby made immutable. (Of course, the
+     * node may be already immutable.)
+     * </p>
+     * <p>
+     * In conjunction with {@link DefaultEvictionListener}, this method
+     * guarentees that the reference counter for the node will reflect the #of
+     * times that the node is actually present on the
+     * {@link #writeRetentionQueue}.
+     * </p>
+     * <p>
+     * If the node is not found on a scan of the head of the queue, then it is
+     * appended to the queue and its {@link AbstractNode#referenceCount} is
+     * incremented. If a node is being appended to the queue and the queue is at
+     * capacity, then this will cause a reference to be evicted from the queue.
+     * If the reference counter for the evicted node or leaf is zero and the
+     * evicted node or leaf is dirty, then a data record will be coded for the
+     * evicted node or leaf and written onto the backing store. A subsequent
+     * attempt to modify the node or leaf will force copy-on-write for that node
+     * or leaf. Regardless of whether or not the node or leaf is dirty, it is
+     * touched on the {@link LRUNexus#getGlobalLRU()} when it is evicted from
+     * the write retention queue.
+     * </p>
+     * <p>
+     * For the mutable B+Tree we also track the #of references to the node/leaf
+     * on the ring buffer. When that reference count reaches zero we do an
+     * eviction and the node/leaf is written onto the backing store if it is
+     * dirty. Those reference counting games DO NOT matter for read-only views
+     * so we can take a code path which does not update the per-node/leaf
+     * reference count and we do not need to use either synchronization or
+     * atomic counters to track the reference counts.
+     * </p>
+     * <h3>Readers</h3>
+     * <p>
+     * In order to reduce contention for the lock required to update the backing
+     * queue, the {@link #writeRetentionQueue} is configured to collect
+     * references for touched nodes or leaves in a thread-local queue and then
+     * batch those references through to the backing hard reference queue while
+     * holding the lock.
+     * </p>
+     * 
+     * @param node
+     *            The node or leaf.
+     * 
+     * @todo The per-node/leaf reference counts and
+     *       {@link #ndistinctOnWriteRetentionQueue} fields are not guaranteed
+     *       to be consistent for of concurrent readers since no locks are held
+     *       when those fields are updated. Since the reference counts only
+     *       effect when a node is made persistent (assuming its dirty) and
+     *       since we already require single threaded access for writes on the
+     *       btree, this does not cause a problem but can lead to unexpected
+     *       values for the reference counters and
+     *       {@link #ndistinctOnWriteRetentionQueue}.
+     * 
+     * @todo This might be abstract here and final in BTree and in IndexSegment.
+     *       For {@link IndexSegment}, we know it is read-only so we do not need
+     *       to test anything. For {@link BTree}, we can break encapsulation and
+     *       check whether or not the {@link BTree} is read-only more readily
+     *       than we can in this class.
+     */
+//    synchronized
+//    final 
+    protected void touch(final AbstractPage node) {
+
+        assert node != null;
+
+        /*
+         * Note: DO NOT update the last used timestamp for the B+Tree here! This
+         * is a huge performance penalty!
+         */
+//        touch();
+
+        if (isReadOnly()) {
+
+            doTouch(node);
+
+            return;
+
+        }
+
+        /*
+         * Note: Synchronization appears to be necessary for the mutable BTree.
+         * Presumably this provides safe publication when the application is
+         * invoking operations on the same mutable BTree instance from different
+         * threads but it coordinating those threads in order to avoid
+         * concurrent operations, e.g., by using the UnisolatedReadWriteIndex
+         * wrapper class. The error which is prevented by synchronization is
+         * RingBuffer#add(ref) reporting that size == capacity, which indicates
+         * that the size was not updated consistently and hence is basically a
+         * concurrency problem.
+         * 
+         * @todo Actually, I think that this is just a fence post in ringbuffer
+         * beforeOffer() method and the code might work without the synchronized
+         * block if the fence post was fixed.
+         * 
+         * @see https://sourceforge.net/apps/trac/bigdata/ticket/201
+         */
+
+        synchronized (this) {
+
+            doTouch(node);
+
+        }
+
+    }
+    
+    private final void doTouch(final AbstractPage node) {
+        
+        /*
+         * We need to guarantee that touching this node does not cause it to be
+         * made persistent. The condition of interest would arise if the queue
+         * is full and the referenceCount on the node is zero before this method
+         * was called. Under those circumstances, simply appending the node to
+         * the queue would cause it to be evicted and made persistent.
+         * 
+         * We avoid this by incrementing the reference counter before we touch
+         * the queue. Since the reference counter will therefore be positive if
+         * the node is selected for eviction, eviction will not cause the node
+         * to be made persistent.
+         * 
+         * Note: Only mutable BTrees may have dirty nodes and the mutable BTree
+         * is NOT thread-safe so we do not need to use synchronization or an
+         * AtomicInteger for the referenceCount field.
+         * 
+         * Note: The reference counts and the #of distinct nodes or leaves on
+         * the writeRetentionQueue are not exact for a read-only B+Tree because
+         * neither synchronization nor atomic counters are used to track that
+         * information.
+         */
+
+//        assert isReadOnly() || ndistinctOnWriteRetentionQueue > 0;
+
+        node.referenceCount++;
+
+        if (!writeRetentionQueue.add(node)) {
+
+            /*
+             * A false return indicates that the node was found on a scan of the
+             * tail of the queue. In this case we do NOT want the reference
+             * counter to be incremented since we have not actually added
+             * another reference to this node onto the queue. Therefore we
+             * decrement the counter (since we incremented it above) for a net
+             * change of zero(0) across this method.
+             */
+
+            node.referenceCount--;
+
+        } else {
+
+            /*
+             * Since we just added a node or leaf to the hard reference queue we
+             * now update the #of distinct nodes and leaves on the hard
+             * reference queue.
+             * 
+             * Also see {@link DefaultEvictionListener}.
+             */
+
+            if (node.referenceCount == 1) {
+
+                ndistinctOnWriteRetentionQueue++;
+
+            }
+
+        }
+
+//      if (useFinger && node instanceof ILeafData) {
+        //
+//                    if (finger == null || finger.get() != node) {
+        //
+//   ...
 
[truncated message content]

[Bigdata-commit] SF.net SVN: bigdata:[4463] branches/QUADS_QUERY_BRANCH/bigdata/src

Fast, scalable, robust graph database platform

[Bigdata-commit] SF.net SVN: bigdata:[4463] branches/QUADS_QUERY_BRANCH/bigdata/src