From: <tho...@us...> - 2010-11-22 21:12:29
Revision: 3976 http://bigdata.svn.sourceforge.net/bigdata/?rev=3976&view=rev Author: thompsonbry Date: 2010-11-22 21:12:22 +0000 (Mon, 22 Nov 2010) Log Message: ----------- Added a skeleton for an extensible hash map and unit tests. This gets as far as needing to split a bucket. The implementation in the test class uses int32 keys and exists just to gain familiarity with extensible hashing and prove out the control logic. Added Paths: ----------- branches/QUADS_QUERY_BRANCH/bigdata/src/test/com/bigdata/htbl/ branches/QUADS_QUERY_BRANCH/bigdata/src/test/com/bigdata/htbl/TestExtensibleHashing.java Added: branches/QUADS_QUERY_BRANCH/bigdata/src/test/com/bigdata/htbl/TestExtensibleHashing.java =================================================================== --- branches/QUADS_QUERY_BRANCH/bigdata/src/test/com/bigdata/htbl/TestExtensibleHashing.java (rev 0) +++ branches/QUADS_QUERY_BRANCH/bigdata/src/test/com/bigdata/htbl/TestExtensibleHashing.java 2010-11-22 21:12:22 UTC (rev 3976) @@ -0,0 +1,877 @@ +/** + +Copyright (C) SYSTAP, LLC 2006-2007. All rights reserved. + +Contact: + SYSTAP, LLC + 4501 Tower Road + Greensboro, NC 27410 + lic...@bi... + +This program is free software; you can redistribute it and/or modify +it under the terms of the GNU General Public License as published by +the Free Software Foundation; version 2 of the License. + +This program is distributed in the hope that it will be useful, +but WITHOUT ANY WARRANTY; without even the implied warranty of +MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +GNU General Public License for more details. + +You should have received a copy of the GNU General Public License +along with this program; if not, write to the Free Software +Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA +*/ +/* + * Created on Nov 22, 2010 + */ +package com.bigdata.htbl; + +import java.util.ArrayList; +import java.util.Iterator; + +import junit.framework.TestCase2; + +/** + * Test suite for extensible hashing. + * + * <br> + * (***) Persistence capable hash table for high volume hash joins. + * + * The data will be "rows" in a "relation" modeled using binding sets. We can + * use dense encoding of these rows since they have a fixed schema (some columns + * may allow nulls). There should also be a relationship to how we encode these + * data for network IO. + * + * https://sourceforge.net/apps/trac/bigdata/ticket/203 + * + * + * Extendable hash table: + * + * - hash(byte[] key) -> IRaba page. Use IRaba for keys/values and key search. + * + * - Split if overflows the bucket size (alternative is some versioning where + * the computed hash value indexes into a logical address which is then + * translated to an IRawStore address - does the RWStore help us out here?) + * + * - ring buffer to wire in hot nodes (but expect random touches). + * + * - initially, no history (no versioning). just replace the record when it is + * evicted from the ring buffer. + * + * What follows is a summary of an extendable hash map design for bigdata. This + * covers most aspects of the hash map design, but does not drill deeply into + * the question of scale-out hash maps. The immediate goal is to develop a hash + * map which can be used for a variety of tasks, primarily pertaining to + * analytic query as described above. + * + * Extendable hashing is one form of dynamic hashing in which buckets are split + * or coalesced as necessary and in which the reorganization is performed on one + * bucket at a time. 
+ * + * Given a hash function h generating, e.g., int32 values where b is the #of + * bits in the hash code. At any point, we use 0 LTE i LTE b bits of that hash + * code as an index into a table of bucket addresses. The value of i will change + * as the #of buckets changes based on the scale of the data to be addressed. + * + * Given a key K, the bucket address table is indexed with i bits of the hash + * code, h(K). The value at that index is the address of the hash bucket. + * However, several consecutive entries in the hash table may point to the same + * hash bucket (for example, the hash index may be created with i=4, which would + * give 16 index values but only one initial bucket). The bucket address table + * entries which map onto the same hash bucket will have a common bit length, + * which may be LTE [i]. This bit length is not stored in the bucket address + * table, but each bucket knows its bit length. Given a global bit length of [i] + * and a bucket bit length of [j], there will be 2^(i-j) bucket address table + * entries which point to the same bucket. + * + * Lookup: Compute h(K) and right shift (w/o sign extension) by i bits. Use this + * to index into the bucket address table. The address in the table is the + * bucket address and may be used to directly read the bucket. + * + * Insert: Per lookup. On overflow, we need to split the bucket moving the + * existing records (and the new record) into new buckets. How this proceeds + * depends on whether the hash #of bits used in the bucket is equal to the #of + * bits used to index into the bucket address table. There are two cases: + * + * Split case 1: If i (global bits of the hash which are in use) == j (bucket + * bits of the hash which are in use), then the bucket address table is out of + * space and needs to be resized. Let i := i+1. This doubles the size of the + * bucket address table. Each original entry becomes two entries in the new + * table. For the specific bucket which is to be split, a new bucket is + * allocated and the 2nd bucket address table for that entry is set to the + * address of the new bucket. The tuples are then assigned to the original + * bucket and the new bucket by considering the additional bit of the hash code. + * Assuming that all keys are distinct, then one split will always be sufficient + * unless all tuples in the original bucket have the same hash code when their + * i+1 th bit is considered. In this case, we resort to an "overflow" bucket + * (alternatively, the bucket is allowed to be larger than the target size and + * gets treated as a blob). + * + * Split case 2: If i is GT j, then there will be at least two entries in the + * bucket address table which point to the same bucket. One of those entries is + * relabeled. Both the original bucket and the new bucket have their #of bits + * incremented by one, but the #of global bits in use does not change. Of the + * entries in the bucket address table which used to point to the original + * bucket, the 1st half are left alone and the 2nd half are updated to point to + * the new bucket. (Note that the #of entries depends on the global #of hash + * bits in use and the bucket local #of hash bits in use and will be 2 if there + * is a difference of one between those values but can be more than 2 and will + * always be an even number). The entries in the original bucket are rehashed + * and assigned based on the new #of hash bits to be considered to either the + * original bucket or the new bucket. 
The record is then inserted based on the + * new #of hash bits to be considered. If it still does not fit, then either + * handle by case (1) or case (2) as appropriate. + * + * Note that records which are in themselves larger than the bucket size must + * eventually be handled by: (A) using an overflow record; (B) allowing the + * bucket to become larger than the target page size (using a larger allocation + * slot or becoming a blob); or (C) recording the tuple as a raw record and + * maintaining only the full hash code of the tuple and its raw record address + * in the bucket (this would allow us to automatically promote long literals out + * of the hash bucket and a similar approach might be used for a B+Tree leaf, + * except that a long key will still cause a problem [also, this implies that + * deleting a bucket or leaf on the unisolated index of the RWStore might + * require a scan of the IRaba to identify blob references which must also be + * deleted, so it makes sense to track those as part of the bucket/leaf + * metadata). + * + * Delete: Buckets may be removed no later than when they become empty and doing + * this is a local operation with costs similar to splitting a bucket. Likewise, + * it is clearly possible to coalesce buckets which underflow before they become + * empty by scanning the 2^(i-j) buckets indexed from the entries in the bucket + * address table using i bits from h(K). [I need to research handling deletes a + * little more, including under what conditions it is cost effective to reduce + * the size of the bucket address table itself.] + * + * Hash table versioning can be easily implemented by: (a) a checkpoint record + * with the address of the bucket address table (which could be broken into a + * two level table comprised of 4k pages in order to make small updates faster); + * and (b) a store level policy such that we do not overwrite the modified + * records directly (though they may be recycled). This will give us the same + * consistent read behind behavior as the B+Tree. + * + * The IIndex interface will need to be partitioned appropriately such that the + * IRangeScan interface is not part of the hash table indices (an isBTree() and + * isHashMap() method might be added). + * + * While the same read-through views for shards should work with hash maps as + * work with B+Tree indices, a different scheme may be necessary to locate those + * shards and we might need to use int64 hash codes in scale-out or increase the + * page size (at least for the read-only hash segment files, which would also + * need a batch build operation). The AccessPath will also need to be updated to + * be aware of classes which do not support key-range scans, but only whole + * relation scans. + * + * Locking on hash tables without versioning should be much simpler than locking + * on B+Trees since there is no hierarchy and more operations can proceed + * without blocking in parallel. + * + * We can represent tuples (key,value pairs) in an IRaba data structure and + * reuse parts of the B+Tree infrastructure relating to compression of IRaba, + * key search, etc. In fact, we might use to lazy reordering notion from Monet + * DB cracking to only sort the keys in a bucket when it is persisted. This is + * also a good opportunity to tackling splitting the bucket if it overflows the + * target record size, e.g., 4k. We could throw out an exception if the sorted, + * serialized, and optionally compressed record exceeds the target record size + * and then split the bucket. 
All of this seems reasonable and we might be able + * to then back port those concepts into the B+Tree. + * + * We need to estimate the #of tuples which will fit within the bucket. We can + * do this based on: (a) the byte length of the keys and values (key compression + * is not going to help out much for a hash index since the keys will be evenly + * distributed even if they are ordered within a bucket); (b) the known per + * tuple overhead and per bucket overhead; (c) an estimate of the compression + * ratio for raba encoding and record compression. This estimate could be used + * to proactively split a bucket before it is evicted. This is most critical + * before anything is evicted as we would otherwise have a single very large + * bucket. So, let's make this simple and split the bucket if the sum of the key + * + val bytes exceeds 120% of the target record size (4k, 8k, etc). The target + * page size can be a property of the hash index. [Note: There is an implicit + * limit on the size of a tuple with this approach. The alternative is to fix + * the #of tuples in the bucket and allow buckets to be of whatever size they + * are for the specific data in that bucket.] + * + * - RWStore with "temporary" quality. Creates the backing file lazily on + * eviction from the write service. + * + * - RWStore with "RAM" only? (Can not exceed the #of allocated buffers or can, + * but then it might force paging out to swap?) + * + * - RWStore with "RAM" mostly. Converts to disk backed if uses all those + * buffers. Possibly just give the WriteCacheService a bunch of write cache + * buffers (10-100) and have it evict to disk *lazily* rather than eagerly (when + * the #of free buffers is down to 20%). + * + * - RWStore with memory mapped file? As I recall, the problem is that we can + * not guarantee extension or close of the file under Java. But some people seem + * to make this work... + */ +public class TestExtensibleHashing extends TestCase2 { + + public TestExtensibleHashing() { + } + + public TestExtensibleHashing(String name) { + super(name); + } + + /** + * Find the first power of two which is GTE the given value. This is used to + * compute the size of the address space (in bits) which is required to + * address a hash table with that many buckets. + */ + private static int getMapSize(final int initialCapacity) { + + if (initialCapacity <= 0) + throw new IllegalArgumentException(); + + int i = 1; + + while ((1 << i) < initialCapacity) + i++; + + return i; + + } + + /** + * Unit test for {@link #getMapSize(int)}. + */ + public void test_getMapSize() { + + assertEquals(1/* addressSpaceSize */, getMapSize(1)/* initialCapacity */); + assertEquals(1/* addressSpaceSize */, getMapSize(2)/* initialCapacity */); + assertEquals(2/* addressSpaceSize */, getMapSize(3)/* initialCapacity */); + assertEquals(2/* addressSpaceSize */, getMapSize(4)/* initialCapacity */); + assertEquals(3/* addressSpaceSize */, getMapSize(5)/* initialCapacity */); + assertEquals(3/* addressSpaceSize */, getMapSize(6)/* initialCapacity */); + assertEquals(3/* addressSpaceSize */, getMapSize(7)/* initialCapacity */); + assertEquals(3/* addressSpaceSize */, getMapSize(8)/* initialCapacity */); + assertEquals(4/* addressSpaceSize */, getMapSize(9)/* initialCapacity */); + + assertEquals(5/* addressSpaceSize */, getMapSize(32)/* initialCapacity */); + + assertEquals(10/* addressSpaceSize */, getMapSize(1024)/* initialCapacity */); + + } + + /** + * Return a bit mask which reveals only the low N bits of an int32 value. 
+ * + * @param nbits + * The #of bits to be revealed. + * @return The mask. + */ + private static int getMaskBits(final int nbits) { + + if (nbits < 0 || nbits > 32) + throw new IllegalArgumentException(); + +// int mask = 1; // mask +// int pof2 = 1; // power of two. +// while (pof2 < nbits) { +// pof2 = pof2 << 1; +// mask |= pof2; +// } + + int mask = 0; + int bit; + + for (int i = 0; i < nbits; i++) { + + bit = (1 << i); + + mask |= bit; + + } + +// System.err.println(nbits +" : "+Integer.toBinaryString(mask)); + + return mask; + + } + + /** + * Unit test for {@link #getMaskBits(int)} + */ + public void test_getMaskBits() { + + assertEquals(0x00000001, getMaskBits(1)); + assertEquals(0x00000003, getMaskBits(2)); + assertEquals(0x00000007, getMaskBits(3)); + assertEquals(0x0000000f, getMaskBits(4)); + assertEquals(0x0000001f, getMaskBits(5)); + assertEquals(0x0000003f, getMaskBits(6)); + assertEquals(0x0000007f, getMaskBits(7)); + assertEquals(0x000000ff, getMaskBits(8)); + + assertEquals(0x0000ffff, getMaskBits(16)); + + assertEquals(0xffffffff, getMaskBits(32)); + + } + +// private static int[] getMaskArray() { +// +// } + + /** + * Extensible hashing data structure. + * + * @todo allow duplicate tuples - caller can enforce distinct if they like. + * + * @todo automatically promote large tuples into raw record references, + * leaving the hash code of the key and the address of the raw record + * in the hash bucket. + * + * @todo initially manage the address table in an int[]. + * + * @todo use 4k buckets. split buckets when the sum of the data is GT 4k + * (reserve space for a 4byte checksum). use a compact record + * organization. if a tuple is deleted, bit flag it (but immediately + * delete the raw record if one is associated with the tuple). before + * splitting, compact the bucket to remove any deleted tuples. + * + * @todo the tuple / raw record promotion logic should be shared with the + * B+Tree. The only catch is that large B+Tree keys will always remain + * a stress factor. For example, TERM2ID will have large B+Tree keys + * if TERM is large and promoting to a blob will not help. In that + * case, we actually need to hash the TERM and store the hash as the + * key (or index only the first N bytes of the term). + */ + public static class ExtensibleHashBag { + + } + + /** + * An implementation of an extensible hash map using a 32 bit hash code and + * a fixed length int[] for the bucket. The keys are int32 values. The data + * stored in the hash map is just the key. Buckets provide a perfect fit for + * N keys. This is used to explore the dynamics of the extensible hashing + * algorithm using some well known examples. + * <p> + * This implementation is not thread-safe. I have not attempted to provide + * for visibility guarantees when resizing the map and I have not attempted + * to provide for concurrent updates. The implementation exists solely to + * explore the extensible hashing algorithm. + * <p> + * The hash code + */ + private static class SimpleExtensibleHashMap { + + /** + * The #of int32 positions which are available in a {@link SimpleBucket} + * . + */ + private final int bucketSize; + + /** + * The #of hash code bits which are in use by the {@link #addressMap}. + * Each hash bucket also as a local #of hash bits. Given <code>i</code> + * is the #of global hash bits and <code>j</code> is the number of hash + * bits in some bucket, there will be <code>2^(i-j)</code> addresses + * which point to the same bucket. 
+ */ + private int globalHashBits; + + /** + * The size of the address space (#of buckets addressable given the #of + * {@link #globalHashBits} in use). + */ + private int addressSpaceSize; + + /** + * The address map. You index into this map using + * {@link #globalHashBits} out of the hash code for a probe key. The + * value of the map is the index into the {@link #buckets} array of the + * bucket to which that key is hashed. + */ + private int[] addressMap; + + /** + * The buckets. The first bucket is pre-allocated when the address table + * is setup and all addresses in the table are initialized to point to + * that bucket. Thereafter, buckets are allocated when a bucket is + * split. + */ + private final ArrayList<SimpleBucket> buckets; + + /** + * An array of mask values. The index in the array is the #of bits of + * the hash code to be considered. The value at that index in the array + * is the mask to be applied to mask off to zero the high bits of the + * hash code which are to be ignored. + */ + private final int[] masks; + + /** + * The current mask for the current {@link #globalHashBits}. + */ + private int globalMask; + + /** + * + * @param initialCapacity + * The initial capacity is the #of buckets which may be + * stored in the hash table before it must be resized. It is + * expressed in buckets and not tuples because there is not + * (in general) a fixed relationship between the size of a + * bucket and the #of tuples which can be stored in that + * bucket. This will be rounded up to the nearest power of + * two. + * @param bucketSize + * The #of int tuples which may be stored in a bucket. + */ + public SimpleExtensibleHashMap(final int initialCapacity, final int bucketSize) { + + if (initialCapacity <= 0) + throw new IllegalArgumentException(); + + if (bucketSize <= 0) + throw new IllegalArgumentException(); + + this.bucketSize = bucketSize; + + /* + * Setup the hash table given the initialCapacity (in buckets). We + * need to find the first power of two which is GTE the + * initialCapacity. + */ + globalHashBits = getMapSize(initialCapacity); + + if (globalHashBits > 32) { + /* + * The map is restricted to 32-bit hash codes so we can not + * address this many buckets. + */ + throw new IllegalArgumentException(); + } + + // Populate the array of masking values. + masks = new int[32]; + + for (int i = 0; i < 32; i++) { + + masks[i] = getMaskBits(i); + + } + + // save the current masking value for the current #of global bits. + globalMask = masks[globalHashBits]; + + /* + * Now work backwards to determine the size of the address space (in + * buckets). + */ + addressSpaceSize = 1 << globalHashBits; + + /* + * Allocate and initialize the address space. All indices are + * initially mapped onto the same bucket. + */ + addressMap = new int[addressSpaceSize]; + + buckets = new ArrayList<SimpleBucket>(addressSpaceSize/* initialCapacity */); + + buckets.add(new SimpleBucket(1/* localHashBits */, bucketSize)); + + } + +// private void toString(StringBuilder sb) { +// sb.append("addressMap:"+Arrays.toString(addressMap)); +// } + + /** The hash of an int key is that int. */ + private int hash(final int key) { + return key; + } + + /** The bucket address given the hash code of a key. */ + private int addrOf(final int h) { + + final int maskedOffIndex = h & globalMask; + + return addressMap[maskedOffIndex]; + + } + + /** + * Return the pre-allocated bucket having the given address. + * + * @param addr + * The address. + * + * @return The bucket. 
+ */ + private SimpleBucket getBucket(final int addr) { + + return buckets.get(addr); + + } + + /** + * The #of hash bits which are being used by the address table. + */ + public int getGlobalHashBits() { + + return globalHashBits; + + } + + /** + * The size of the address space (the #of positions in the address + * table, which is NOT of necessity the same as the #of distinct buckets + * since many address positions can point to the same bucket). + */ + public int getAddressSpaceSize() { + + return addressSpaceSize; + + } + + /** + * The #of buckets backing the map. + */ + public int getBucketCount() { + + return buckets.size(); + + } + + /** + * The size of a bucket (the #of int32 values which may be stored + * in a bucket). + */ + public int getBucketSize() { + + return bucketSize; + + } + + /** + * Return <code>true</code> iff the hash table contains the key. + * + * @param key + * The key. + * + * @return <code>true</code> iff the key was found. + */ + public boolean contains(final int key) { + final int h = hash(key); + final int addr = addrOf(h); + final SimpleBucket b = getBucket(addr); + return b.contains(h,key); + } + + /** + * Insert the key into the hash table. Duplicates are allowed. + * + * @param key + * The key. + * + * @todo define a put() method which returns the old value (no + * duplicates). this could be just sugar over contains(), delete() + * and insert(). + */ + public void insert(final int key) { + final int h = hash(key); + final int addr = addrOf(h); + final SimpleBucket b = getBucket(addr); + b.insert(h,key); + } + + /** + * Delete the key from the hash table (in the case of duplicates, a + * random entry having that key is deleted). + * + * @param key + * The key. + * + * @return <code>true</code> iff a tuple having that key was deleted. + * + * @todo return the deleted tuple. + */ + public boolean delete(final int key) { + final int h = hash(key); + final int addr = addrOf(h); + final SimpleBucket b = getBucket(addr); + return b.delete(h,key); + } + + /** + * Visit the buckets. + * <p> + * Note: This is NOT thread-safe! + */ + public Iterator<SimpleBucket> buckets() { + return buckets.iterator(); + } + + } + + /** + * A (very) simple hash bucket. The bucket holds N int32 keys. + */ + private static class SimpleBucket { + + /** The #of hash code bits which are in use by this {@link SimpleBucket}. */ + int localHashBits; + + /** + * The #of keys stored in the bucket. The keys are stored in a dense + * array. For a given {@link #size}, the only indices of the array which + * have any data are [0:{@link #size}-1]. + */ + int size; + + /** + * The user data for the bucket. + */ + final int[] data; + + public SimpleBucket(final int localHashBits,final int bucketSize) { + + if (localHashBits <= 0 || localHashBits > 32) + throw new IllegalArgumentException(); + + this.localHashBits = localHashBits; + + this.data = new int[bucketSize]; + + } + + /** + * Return <code>true</code> if the bucket contains the key. + * + * @param h + * The hash code of the key. + * @param key + * The key. + * + * @return <code>true</code> if the key was found in the bucket. + * + * @todo passing in the hash code here makes sense when the bucket + * stores the hash values, e.g., if we always do that or if we + * have an out of bucket reference to a raw record because the + * tuple did not fit in the bucket. 
+ */ + public boolean contains(final int h, final int key) { + + for (int i = 0; i < size; i++) { + + if (data[i] == key) + return true; + + } + + return false; + + } + + /** + * Insert the key into the bucket (duplicates are allowed). It is an + * error if the bucket is full. + * + * @param h + * The hash code of the key. + * @param key + * The key. + */ + public void insert(final int h, final int key) { + + if (size == data.length) { + /* + * The bucket must be split, potentially recursively. + * + * Note: Splits need to be triggered based on information which + * is only available to the bucket when it considers the insert + * of a specific tuple, including whether the tuple is promoted + * to a raw record reference, whether the bucket has deleted + * tuples which can be compacted, etc. + * + * @todo I need to figure out where the control logic goes to + * manage the split. If the bucket handles splits, then we need + * to pass in the table reference. + */ + throw new UnsupportedOperationException(); + } + + data[size++] = key; + + } + + /** + * Delete a tuple having the specified key. If there is more than one + * such tuple, then a random tuple having the key is deleted. + * + * @param h + * The hash code of the key. + * @param key + * The key. + * + * @todo return the delete tuple. + */ + public boolean delete(final int h, final int key) { + + for (int i = 0; i < size; i++) { + + if (data[i] == key) { + + // #of tuples remaining beyond this point. + final int length = size - i - 1; + + if (length > 0) { + + // Keep the array dense by copying down by one. + System.arraycopy(data, i + 1/* srcPos */, data/* dest */, + i/* destPos */, length); + + } + + size--; + + return true; + + } + + } + + return false; + + } + + } + + /** + * Map constructor tests. + */ + public void test_ctor() { + + final SimpleExtensibleHashMap map = new SimpleExtensibleHashMap( + 1/* initialCapacity */, 3/* bucketSize */); + + assertEquals("globalHashBits", 1, map.getGlobalHashBits()); + + assertEquals("addressSpaceSize", 2, map.getAddressSpaceSize()); + + assertEquals("bucketCount", 1, map.getBucketCount()); + + assertEquals("bucketSize", 3, map.getBucketSize()); + + } + + /** + * Simple CRUD test operating against the initial bucket without triggering + * any splits. + */ + public void test_crud1() { + + final SimpleExtensibleHashMap map = new SimpleExtensibleHashMap( + 1/* initialCapacity */, 3/* bucketSize */); + + // a bunch of things which are not in the map. + for (int i : new int[] { 0, 1, -4, 31, -93, 912 }) { + + assertFalse(map.contains(i)); + + } + + /* + * Insert a record, then delete it, verifying that contains() reports + * true or false as appropriate for the pre-/post- conditions. + */ + + assertFalse(map.contains(83)); + + map.insert(83); + + assertTrue(map.contains(83)); + + map.delete(83); + + assertFalse(map.contains(83)); + + } + + /** + * CRUD test which inserts some duplicate tuples, but not enough to split + * the initial bucket, and the deletes them out again. + */ + public void test_crud2() { + + final SimpleExtensibleHashMap map = new SimpleExtensibleHashMap( + 1/* initialCapacity */, 3/* bucketSize */); + + assertEquals("bucketCount", 1, map.getBucketCount()); + + assertFalse(map.contains(83)); + + // insert once. + map.insert(83); + + assertTrue(map.contains(83)); + + // insert again. + map.insert(83); + + assertTrue(map.contains(83)); + + // did not split the bucket. + assertEquals("bucketCount", 1, map.getBucketCount()); + + // delete once. + map.delete(83); + + // still found. 
+ assertTrue(map.contains(83)); + + // delete again. + map.delete(83); + + // now gone. + assertFalse(map.contains(83)); + + } + + /** + * Test repeated insert of a key until the bucket splits. + */ + public void test_split() { + + final int bucketSize = 3; + + final SimpleExtensibleHashMap map = new SimpleExtensibleHashMap( + 1/* initialCapacity */, bucketSize); + + assertEquals("bucketCount", 1, map.getBucketCount()); + + map.insert(83); + map.insert(83); + map.insert(83); + + // still not split. + assertEquals("bucketCount", 1, map.getBucketCount()); + + // force a split. + map.insert(83); + + assertEquals("bucketCount", 2, map.getBucketCount()); + + } + + /** + * Unit test with the following configuration and insert / event sequence: + * <ul> + * <li>bucket size := 4k</li> + * <li></li> + * <li></li> + * <li></li> + * </ul> + * <pre> + * </pre> + */ + public void test_simple() { + + } + +} This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
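
[Editor's note, not part of the commit] The javadoc above spells out the two split cases (case 1: double the address table when the bucket's local bit count j equals the global bit count i; case 2: relabel half of the address entries when i > j), but the committed skeleton deliberately stops at that point: SimpleBucket.insert() throws UnsupportedOperationException when the bucket is full, with a TODO about where the split control logic should live. The following is a minimal, hypothetical sketch of that control logic, assuming the split is managed by the map rather than the bucket. It is NOT part of this commit: ExtensibleHashSketch and its members only loosely mirror the SimpleExtensibleHashMap / SimpleBucket naming, it uses low-order-bit addressing like the test code (the prose above describes a high-order shift variant), and it handles the "all hash codes collide" case with a guard rather than the overflow bucket the design calls for.

import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch (not part of this commit) of the split control logic
 * described in the javadoc: int32 keys, hash(key) == key, low-order-bit
 * addressing, no overflow buckets and no persistence.
 */
class ExtensibleHashSketch {

    static class Bucket {
        int localHashBits; // j: #of hash bits this bucket distinguishes
        final List<Integer> keys = new ArrayList<Integer>();
        Bucket(final int localHashBits) { this.localHashBits = localHashBits; }
    }

    private final int bucketSize;   // max #of keys per bucket
    private int globalHashBits = 1; // i: #of hash bits used by the address table
    private int[] addressMap = { 0, 0 };                  // 2^i entries
    private final List<Bucket> buckets = new ArrayList<Bucket>();

    ExtensibleHashSketch(final int bucketSize) {
        this.bucketSize = bucketSize;
        buckets.add(new Bucket(0)); // both table entries point at one bucket
    }

    private int hash(final int key) { return key; }

    private int addrOf(final int h) {
        return addressMap[h & ((1 << globalHashBits) - 1)];
    }

    public int getGlobalHashBits() { return globalHashBits; }

    public int getBucketCount() { return buckets.size(); }

    public void insert(final int key) {
        final int h = hash(key);
        final int addr = addrOf(h);
        final Bucket b = buckets.get(addr);
        if (b.keys.size() < bucketSize) {
            b.keys.add(key);
            return;
        }
        // Pathological case from the javadoc: every key in the full bucket
        // shares the new key's hash code, so no split can separate them. The
        // design calls for an overflow bucket here; the sketch just gives up.
        if (b.keys.stream().allMatch(k -> hash(k) == h))
            throw new UnsupportedOperationException("overflow bucket required");
        // Split case 1: the bucket already distinguishes all global bits, so
        // the address table must be doubled first. Each old entry becomes two.
        if (b.localHashBits == globalHashBits) {
            final int[] newMap = new int[addressMap.length * 2];
            for (int i = 0; i < newMap.length; i++)
                newMap[i] = addressMap[i & ((1 << globalHashBits) - 1)];
            addressMap = newMap;
            globalHashBits++;
        }
        // Split case 2 (and the tail of case 1): allocate a new bucket, bump
        // the local bit count on both buckets, and relabel those entries for
        // the old bucket whose new distinguishing bit is set.
        b.localHashBits++;
        final int newAddr = buckets.size();
        buckets.add(new Bucket(b.localHashBits));
        final int distinguishingBit = 1 << (b.localHashBits - 1);
        for (int i = 0; i < addressMap.length; i++)
            if (addressMap[i] == addr && (i & distinguishingBit) != 0)
                addressMap[i] = newAddr;
        // Rehash the old bucket's keys, then retry the pending insert (which
        // may recursively trigger a further split).
        final List<Integer> olds = new ArrayList<Integer>(b.keys);
        b.keys.clear();
        for (int k : olds)
            insert(k);
        insert(key);
    }
}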
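
A quick usage sketch for the class above, written in the style of test_split() but with distinct keys so that the redistribution is visible; the expected values come from tracing the sketch, not from any committed test:

// Hypothetical usage of ExtensibleHashSketch (defined above).
final ExtensibleHashSketch map = new ExtensibleHashSketch(3/* bucketSize */);
map.insert(1);
map.insert(2);
map.insert(3); // the single initial bucket is now full
map.insert(4); // split case 2: one address table entry is relabeled
// => getGlobalHashBits() == 1, getBucketCount() == 2
map.insert(6);
map.insert(8); // split case 1: the address table doubles, then the bucket splits
// => getGlobalHashBits() == 2, getBucketCount() == 3

With duplicate keys, as in test_split() above, the guard in the sketch throws instead, which is exactly the point at which the javadoc's overflow-bucket (or oversized "blob" bucket) alternative would have to take over.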