1. Summary
  2. Files
  3. Support
  4. Report Spam
  5. Create account
  6. Log in

Ticket #203 (closed enhancement: fixed)

Opened 3 years ago

Last modified 19 months ago

Implement an persistence capable hash table to support analytic query

Reported by: thompsonbry Owned by: martyncutcher
Priority: critical Milestone: Query
Component: Query Engine Version: TERMS_REFACTOR_BRANCH
Keywords: hash table Cc: mrpersonick, martyncutcher

Description

A hash table provides O(1) lookup, but does not support range scans. However, a hash table is suitable for a number of index requirements. For example, it would be a better fit for the ID2TERM and TERM2ID indices in the triple/quad store lexicon. A hash table is also required for various kinds of analytic query operations, including hash-joins, high data volume distincts, and high data volume fact tables for aggregation queries. The java hash table implementations are suitable for transient data structures as long as the data scale is small (when compared to the scale of a database index). However, Java garbage collectors are strongly challenged when presented with lots of long lived objects such as would be generated by a hash join. Therefore, we need to implement a persistence capable hash table to support analytic query (getting the data out of the Java heap and into either direct ByteBuffers? or onto the disk). The hash table can also be used to improve lexicon performance.

There are two broad categories of hash maps: static and dynamic. All
in all, static hashing is problematic unless the maximum #of records is known in advance. The broad class of dynamic hashing includes extendable (or extensible) hashing, which is based based on an address table) on the one hand and linear hashing and extended) spiral hashing, which do not use an address table, on the other hand. (There are also some hybrid designs. For example, in bounded disorder a B+Tree is combined with hash tables for the leaf nodes which are substantially larger than the B+Tree page size (being made up multiple pages themselves). Bounded disorder trees are an interesting hybrid which supports out of order key-range scans and has many of the benefits of a hash table (access to a record in one or two IOs assuming that the upper part of the B+Tree nodes is wired into memory). However, the index segment already possesses many of these characteristics as the nodes region is fully buffered in memory and we do one IO (guaranteed) per leaf.)

Considering the dynamic hashing algorithms, it seems likely that the choice of the algorithm will come down to whether we have a bare file index format for hashing (extended spiral hashing might be the best choice if we develop new persistence store specifically for this purpose), whether we use RWStore for page allocation management, and whether or not the hash table is integrated with the MVCC architecture (based on copy-on-write combined with eventual release or recycling of deleted records).

Change History

Changed 3 years ago by thompsonbry

  • status changed from new to accepted

Added a skeleton for an extensible hash map and unit tests. This gets as far as needing to split a bucket. The implementation in the test class uses int32 keys and exists just to gain familiarity with extensible hashing and prove out the control logic.

Changed 2 years ago by thompsonbry

Refactored to support persistence, but I have not yet defined the data record formats for the directory and bucket nodes. I am currently looking into extendible multi-level hashing designs for handling skewed data as well as managing larger address spaces in a persistence capable manner.

Changed 2 years ago by thompsonbry

I've decided to go with a model for the HTree which uses the same variable length unsigned byte[] keys. I think that that we are likely to use int32 hash codes for the vast majority of things and this will be more complicated than is required in such cases, but it will allow us to reuse the same APIs and data structures to manage the HTree to the greatest extent. I do not think that the run time cost will be too high as we will have the same compact and directly indexable persistent data structures that we use for the B+Tree. This will also make it possible for us to reuse the same logic with hash codes having greater than int32 bits, which could be valuable in very large or hash distributed hash tables.

To that end, I've added a few methods to BytesUtil? to print the binary representation of a byte[] (signed or unsigned does not matter when you are printing bits), and to obtain a slice of up to 32-bits from an unsigned byte[]. The latter will be used to extract the bits to be considered at a given level of the HTree. This will allow us to index into the individual hash directories using an int32 value with the appropriate mask to reveal only those bits which are in play at a given level of the tree.

This getBits(byte[],off,len) method uses semantics where bit index ZERO is the first MSB bit in byte[0] of the array. Those MSB semantics are consistent with grouping child pointers for the buddy pages in a contiguous part of the address space of the parent hash directory spanning those buddy pages. This is contrary to the initial prototype of the extensible hash table which had bit ZERO of the int32 word as the LSB bit rather than the MSB bit.

My next step here will be to provide a prototype of the logic for maintaining the HTree against a set of examples which I have begun to develop in a worksheet. I am going to try developing those examples using a 4-bit hash code as this will greatly simplify the state space of the hash tree while still allowing us to explore a hash tree with 3 levels.

Changed 2 years ago by thompsonbry

Pulled out interfaces for:

- IChildData -- access to child nodes.
- ITupleCountData -- access to the #of spanned tuples for a node or leaf.
- IKeysData -- access to keys associated with an index node.

Unlike the B+Tree, the htree directory does not provide access to the #of spanned tuples or the keys. This refactoring of the data access interfaces reflects this difference in the nodes of the two index structures.

These changes are not yet committed.

I have updated the math for computing local depth in the worksheet and the code. The updated worksheet is committed, but I have not yet committed to code changes as I want to reach a clear consistent point before doing that.

Changed 2 years ago by thompsonbry

Continued work on the htree. At this point I believe that I have all of the relevant bit math working with appropriate unit tests (except for BytesUtil#maskOffLSB?). This is for the HTree in contrast to the extensible hashing approach. Some of the significant changes include support for hierarchy, support for unsigned byte[] keys (with a planned optimized code path and persistence for int32 keys), and changing the bit indexing semantics such that the MSB of the key is at bit zero (rather than the LSB).

This commit includes a refactoring of the internal interfaces for the B+Tree to support some of the differences between the BTree and the HTree.

The next steps on the HTree are to write the recursive descent and tuple management logic, followed by handling splits in the buddy buckets and buddy hash tables and the gradual deepening of the hash tree.

Some of the mechanisms for persistence have been incorporated into the hash tree, but there needs to be a reconcile between the HTree and the BTree at some point so they can share more code (abstract classes) and interfaces for generalized persistent indices.

Changed 2 years ago by thompsonbry

Added HTree support for lookupFirst() and lookupAll(), insert() with split of the bucket (but not yet introduction of a new directory level), and contains() along with basic unit tests for the same.

Next steps will be:

1. A split of a bucket page consisting of a single buddy bucket (globalDepth==localDepth). This requires the introduction of a new directory page.

2. Remove.

3. More detailed tests, stress tests, etc.

Note that the data structure does not yet support persistence. I am working to get the core logic for the HTree maintence correct and tested first. I can then bring in the additional infrastructure to support incremental writes, checkpoint records, etc. Much of this is already stubbed out, but not yet implemented.

Changed 2 years ago by thompsonbry

  • priority changed from major to critical

Raised priority.

Changed 2 years ago by thompsonbry

  • version changed from QUADS_QUERY_BRANCH to TERMS_REFACTOR_BRANCH

Changed 2 years ago by thompsonbry

Worked through the split of a directory page when localDepth LT globalDepth

Changed 23 months ago by thompsonbry

I have the HTree working through the recursive creation of levels up to 5. I've started to write some stress tests. These stress tests indicate that we still have some errors in the control logic for the HTree which I will work on next.

Committed revision r4827.

Changed 23 months ago by thompsonbry

Increased coverage on unit tests for MutableKeyBuffer? and MutableValueBuffer?.

Restricted the maximum allowed addressBits to 216. This is a page with a fan out of 65536. A normal fan out is 1024 for page in 4k ~ 8k. addressBits was previously permitted up to 32, which wraps an int32 value (232 exceeds max signed int).

Committed revision r4829.

Changed 23 months ago by thompsonbry

Martyn is going to write a method to return the prefix bit length required to split a full bucket page. The method will be aware of the #of address bits and hence can compute the implied depth of the parent as the prefix length increases. This information will be used to decide when the distinction made by the prefix bits would result in the tuples being reindexed between the full bucket page and its new sibling such that no buddy bucket on the new page would overflow. This method can be used to decide the necessary structural changes to the tree (add level and split directory) required to achieve sufficient depth to split the full bucket page.

The alternative approach would be to rewrite the reindex method to have zero side effects if it fails. This could be done by modifying the existing implementation which already does not have a side effect on the original bucket's data (it swaps in a new data buffer). The implementation would also have to be extended to be aware of the depth of the parent and hence the implied #of buddies on the bucket page. It would fail if the reindex caused any buddy bucket on either the original page or its sibling to overflow. If it failed, it would restore the data buffer reference for the page to be split.

Changed 23 months ago by thompsonbry

Modified the operation which splits a full buddy bucket and indexes the tuples between the original bucket and a new sibling bucket to have no side effects if the operation fails due to the overflow of a buddy hash bucket on either the original bucket page or the new sibling bucket page. I've added one unit test for this which shows that it correctly rejects a situation in which the operation should fail and verifies that there are no side-effects for this case. I have outlined (but not implemented) additional unit tests for success. However, we need to develop a slightly different insert sequence to verify operation success. Either insert different keys such that the split and re-index is valid or develop the htree structure incrementally until the operation is valid and then verify that it succeeds.

Committed revision r4849.

Changed 23 months ago by thompsonbry

Martyn and I are having an in-depth investigation of the logic for addLevel and split-and-reindex bucket and also the logic concerning the computation of the local depth of the child in an attempt to resolve some uncertainty in the design. Martyn's going to continue to look into this for the moment and we will sync up again tomorrow morning on this.

Changed 23 months ago by thompsonbry

Continuing to work through the HTree structural transforms as bucket pages are split, new directory levels are added, and directory pages are split. We have not yet provided the integration with the persistence engine, but that will come quickly once the structural transforms are properly unit and stress tested.

Changed 23 months ago by thompsonbry

Continuing to work through unit tests.

A values() iterator has been implemented and is being incorporated into the unit tests.

A pretty printer has been added for the HTree state. It produces an output similar to the way in which we are representing the HTree internal state in the javadocs.

Changed 23 months ago by thompsonbry

Wrote a method to reindex all tuples below a directory page using a temporary htree. This has not been tested yet. I still need to setup the modified htree constructor which accepts the prefixLength, depth(root), and depth(root.parent) and I still need to modify insert() to use that information. The additional fields provided to the constructor are only used when the temporary htree is being used to rebuild a directory page (and the spanned child pages). They provide sufficient information to index into the root of the temporary htree when its depth is LT address bits. (The root directory page of a normal HTree is always at depth := addressBits. However, we need to setup the temporary HTree differently since we want to build a tree which can be stolen and spliced into the original tree.)

Committed revision r4900.

Changed 23 months ago by thompsonbry

We've been in touch with the algorithm authors to clarify the approach. It appears that we need to be handling add level and split directory from the bottom rather than the top. I'll take a look at what these changes mean in practice, but this should address the concern that we have to do non-local reindexing of bucket pages when we do an add level or split a directory page. The changes themselves should be simple enough to put into practice with near 100% code reuse outside of the test suite.

Changed 23 months ago by thompsonbry

new addLevel() implementation is done. we are now working on recursion when addLevel() is unable to reindex all tuples because they still target the same buddy bucket and it overflows. recursion should logically go through the top-level insert, but pragmatically we need a "raw" insert which carries "raw" tuple representation, including whether or not it represents a raw record on the backing store, the version counter, the delete marker, etc.

Changed 23 months ago by thompsonbry

TestHTree#addLevel() is failing again. This is because the recursive insert logic goes through insertRawTuple() rather than selecting the buddy offset itself. However, both BucketPage#insert?() and BucketPage#insertRawTuple?() are not computing the buddyOffset into the page correctly for the insert. The calling routine (insert() or insertRawTuple() is passing the buddyOffset into the parent, but the BucketPage#insert?() and BucketPage#insertRawTuple?() routines need to using the buddyOffset in the BucketPage?. Also, there is some confusion in the javadoc for the BucketPage#insert?() method concerning buddyOffset, specifically whether it is the index of the buddy bucket or the index of the slot at which the buddy bucket starts on the page.

The insert stress test runs up to 10 but breaks by 20. If run to 100 it shows a number of tuples (or at least keys) which are being incorrectly duplicated through the tree. I suspect that this has to do with the insert() problem described in this comment and, perhaps, with similar fence posts in the other methods (lookup, contains, etc).

All in all, we are very close to a running stress test after which we can focus on the persistence store integration.

Committed revision r4918.

Changed 23 months ago by thompsonbry

We have the HTree running through an insert stress test now with addressBits := 2. There are some errors when addressBits=1 (the minimum possible) and when addressBits := 10 (what we will normally use as this corresponds to a 4 ~ 8k page).

Martyn is going to look at those issues next and then continue to refine the index maintenance.

I am going to do the raw record support unit tests and the persistence store integration now. That should give us something that we can use in aggregation operators in very short order while we continue to refine the index maintenance over time.

Changed 23 months ago by thompsonbry

Folded the persistence logic into the htree. The integration with the persistence store is completely untested at this point. I have reserved from this commit about 7 files which were copied from the BTree package and need to be ported to verify persistence of the HTree.

This commit also incorporates various work by Martyn. The insert stress test is now running up to quite large values. Martyn is going to focus on some of the HTree details while I iron out the persistence store integration.

Committed revision r4926.

Changed 23 months ago by thompsonbry

More work on the persistence store integration for the HTree. Raw records are working. Several new unit tests were successfully ported from the btree package. All tests which were running are still running, but some of the newly ported tests do not succeed (typically due to persistence integration work which is ongoing).

The main blockers for persistence at this time are:

- DefaultLeafCoder? insists that the keys IRaba should report true for isKeys(). We will be either moving to cracking (potentially with a final stable sort) or maintaining an insertion order. Both of those will help in the sense that the keys of the HTree will be more similar to the keys of the BTree, but there can still be duplicate keys in the HTree so it does not have quite the same semantics. This distinction might be addressed by introducing another nuance into the IRaba or by relaxing DefaultLeafCoder?. Certainly, some IRabaCoders will only work with ordered data and many work best with ordered data, e.g., the prefix compression. All should work just fine with duplicate keys.

- Memoizer needs to be integration into DirectoryPage? and AbstractHTree per Node and AbstractBTree.

- NodeFactory? needs to mint new DirectoryPages? and BucketPages?. We may need to pass the addressBits in, which would be an API change to INodeFactory. However, the htree package has its own NodeFactory? and INodeFactory so that should be Ok.

- The unit tests for persistence for the btree package are very close to the BTree data structure and will need to be revisited in some detail to create an appropriate test for the HTree.

Committed revision r4934.

Changed 23 months ago by thompsonbry

Another issue with porting the unit tests is that we have not yet implemented "remove" methods for the HTree. While I do not see this as a requirement for the aggregation use cases, some classes, such as TestTransientHTree, rely on the ability to delete data.

We also lack a good "assertSameHTree()" method, which is why TestReopen? is failing.

Changed 23 months ago by thompsonbry

Brought across the memorizer stuff to the htree and integrated it into AbstractHTree and DirectoryPage?. All tests which were passing are still passing, but nothing is really testing materialization yet because we are not yet able to write out records ("Must be keys" error).

Committed revision r4937.

Changed 23 months ago by thompsonbry

Integrated the NodeFactory? and added appropriate constructors for Bucket and Directory pages.

Committed revision r4938.

Changed 23 months ago by thompsonbry

Basic persistence is now working. I had to write a coder for the directory pages. This still needs a lot more testing, including dedicated unit tests for the directory page coder.

Committed revision r4941.

Changed 23 months ago by thompsonbry

Added the copy-on-write pattern into BucketPage? and DirectoryPage?. Note that addLevel() really should be moved onto BucketPage? in keeping with this pattern. TestCommit? is now passing. All tests which were passing are still passing.

Bug fix to the DirectoryPage? deserialization constructor. It was failing to set the address on the deserialized page.

Added an integration test with the MemoryManager? using MemStore?. This demonstrates some problems with the incremental eviction which I have not yet debugged. Many of the tests in TestHTreeWithMemStore will pass if you increase the write retention queue capacity, but that is because incremental evictions are not occurring. As an aid to demonstrating problems, I reduced the write retention queue for this test from the default (500) to 10.

Modified SimpleRabaCoder? to track the IRaba capacity as well as its size. The old serialized version assumed that capacity == size on decode, but that was breaking some assumptions with regard to the bucket page capacity. (We might be able to work around those assumptions in the future if we keep the bucket page dense, but that is not currently guaranteed.)

Removed the invocation of tree.touch() in Leaf/BucketPage prior to the invocation of the copy-on-write pattern. copyOnWrite() was invoking touch() so this was just extra work and touch() is a hot spot.

Committed revision r4943.

Changed 23 months ago by thompsonbry

- DirectoryPage?: copy-on-write constructor was initializing with 1<<addressBits rather than addressBits.

- DirtyChildIterator?: modified to skip over references to the same child.

A problem has been observed where addLevel() can cause sufficient evictions that the parent directory becomes read-only again. This needs to be explored in more depth. We may need to "wire" pages which are being modified during structural transforms. E.g., with a set of such page references. Evictions for wired pages would be ignored. The interaction of the recursive handling of insert(), add level, and split level needs to be discussed further in the context of the copy-on-write and page eviction mechanisms.

Committed revision r4946.

Changed 22 months ago by thompsonbry

Worked through the test suite for the dirty direct child and post-order iterators, including the interaction with writeNodeRecursive().

Changes to PP() formatting to indicate whether a page is clean or dirty (a '*' suffix).

Committed revision r4947.

Changed 22 months ago by thompsonbry

Partial clean up of the remaining classes in the htree test suite. TestCopyOnWrite?, TestIncrementalWrite?, and TestTransientHTree all still have unit tests which need to be migrated to the htree index structure.

At this point the two main issues with persistence are:

1. handling of overflow pages.

2. incremental eviction can cause pages involved in a mutation to become immutable, triggering an error.

When used with 10 address bits, the htree will not demonstrate these issues for many inputs because each page allows up to 1024 (210) distinct tuples and because the large fan out tends to keep down the number of pages and hence the default retention queue does not cause (2) to be demonstrated.

There are two ways to address (1). The first is to rely on the blob mechanism of the backing store. This will result in large bucket pages when there are many duplicate keys. However, the drawback to this approach is that a large number of duplicates, such as a large number of solutions which fall into the same "group" in a GROUP BY operator, will result in a (potentially very) large page.

It seems that overflow pages could provide a better solution, specifically when we are interested in append only semantics. In this case, once a bucket page consists solely of duplicate keys we would always insert into the head (or tail) of the bucket page chain. This will be a much less expensive operation. When visiting the values in the bucket page chain, we can simply visit each page in turn. Since, by definition, page overflow only occurs when all keys on the page are duplicates, we can safely do both of these optimizations. We can also write the key ONCE per page, for a substantial space savings.

For (2), I believe that we may need to have a "non-eviction set" for the HTree. A page which is participating in a mutation operation would be inserted into that set. Page references would be removed when the operation was complete. The handshaking between this and the eviction strategy will have to be worked out in some detail through an analysis of exactly how operations such as addLevel() can give rise to a page which is undergoing mutation to become immutable.

Committed revision r4948.

Changed 22 months ago by thompsonbry

  • owner changed from thompsonbry to martyncutcher
  • status changed from accepted to assigned

Martyn has the point on the htree now. Per above, we need to coordinate on the "non-eviction set" but he should be good to go on everything else.

Changed 22 months ago by thompsonbry

Bug fix to HTree.lookupAll(), which had not been modified yet to have a single "buddy" for a bucket page.

Added IRawRecordAccess, which is implemented by Leaf and BucketPage?. This interface provides access via the owning index object (B+Tree or HTree) to a raw record via its storage address. Modified AbstractTuple? to use this new interface such that we can resolve raw records on ITupleIterator patterns for the HTree as well as the B+Tree.

Committed revision r4950.

Changed 22 months ago by thompsonbry

Martyn wrote:

I began with reviewing the code and trying to get the slotsPerPage working to make bucketPage sizes independent from the addressBits. I believe this is now okay.

I did have problem with Evictions in testing to higher numbers. An oddity is that with, say 32 slots per page, I hit the eviction error for 1 and 2 bits but also 10 and 12. This seems to indicate that the higher addressBits creates more pages on splits.

I added some simple stats to report the number of created DirectoryPages? and BucketPages?. This reported some interesting stats that I am only beginning to get my head around.

Here is a table for a 2-bit address inserting 1000 entries

bits        slotsOnPage        createdDirectories    createdBuckets
2            32                 22                    77
             64                 14                    61
             128                14                    53

4            32                 10                    107
             64                 10                    75
             128                10                    59

8            32                 7                     107
             64                 7                     75
             128                7                     59

16           32                 2                     107
             64                 2                     75
             128                2                     59

Additional stats indicate that the createdDirectories are all active but half the bucketPages. This fits with the split bucket pattern since we create 2 active for each one we dispose of.

It is clear from the stats that higher bits do NOT create more pages, so while 1 and 2 bits will clearly touch more pages (directories) for each insert, the reason for higher evictions from 10 and 16 bits than 4 and 8 is not obvious.

The eviction rates lead to tests failing on the TestHTree_stressInsert due to forced evictions.

It took me a while to orientate myself with the copyOnWrite code, I was unsure as to whether it was valid to invalidate the read-only versions (by setting childRefs to null), this seemed like it was a problem.

I've fixed some other problems associated with copyOnWrite. Specifically, ensuring that copyOnWrite is called to ensure that mutable versions of DirectoryPages? and BucketPages? are created for insert actions. I also added additional tests to track down calls to deleted pages; from some of the current errors it appears that references are retained to pages deleted from copyOnWrite, so there is more to do there.

I've modified ByteUtil?.getBits to handle bit requests that are out of range of the supplied byte array. This seems to me to make sense, but if you disagree then an alternative getBitsX might provide the extended semantics.

Committed revision r4962.

Changed 22 months ago by thompsonbry

The commit point documented above also includes the changes to the MemoryManager? to support an unbounded sector capacity.

Changed 22 months ago by thompsonbry

A review of the commit message and the code of the htree change set identified the following issues:

- I agree that the copy-on-write model needs to be better understood in the context of the HTree. In particular, the B+Tree copy-on-write never retains a reference to a read-only node which has been made mutable. This makes it possible to "steal" references to the children. This might not be true for the HTree since multiple structural changes can be required when a bucket page is split. Those structural changes can be driven by the reinsert of the tuples in the source bucket page. Since the reinsert occurs as a series of distinct operations using insertRawTuple(), we wind up holding a reference to the original bucket page. However, I am not clear why "stealing" the child references would be a problem since we should not be navigating from the top-down through a directory page which has undergone copy-on-write modification. If we never navigate top-down through a page which has undergone copy-on-write modification, then stealing the child references should be efficient and should not cause problems. Also, the original copy-on-write pattern for the B+Tree proceeds bottom-up once a leaf has been reached which will be modified. It does not invoke copy-on-write on the way down to that leaf. However, some of the HTree changes appear to be doing this. This point should also be talked through to decide whether the approach taken for the B+Tree in this regard is also appropriate to the HTree.

- The "right" pattern for testing whether a PO (Persistent Object) has been deleted is the isDeleted() method.

    void replaceChildRef(final long oldChildAddr, final AbstractPage newChild) {
    ...
            // free the oldChildAddr if the Strategy supports it
            	// - and only if not already deleted!
            	if (npointers == 0)
            		htree.deleteNodeOrLeaf(oldChildAddr);

- There are static createdPages fields on DirectoryPage? and BucketPage?. Statistics should be per-htree instance using either the BTreeCounters class (already available from AbstractHTree) or a new class, ideally sharing a common ancestor or interface with BTreeCounters. The following method signatures on AbstractPage? need to be moved to AbstractHTree or simply turned into fields on a suitable counters class:

    abstract int activeBucketPages();
    abstract int activeDirectoryPages();

Actually, I am confused about what those methods are intended to report. They are using childIterator(). That will force materialization of the children (full tree traversal). This would seem to be something better done as a utility method within the test suite as it is a very expensive operation.

- The modifications to BytesUtil#getBits?() have broken several of the unit tests. I am not sure whether a ZERO return when the starting offset is out of range makes sense. However, I do agree that we need to fix the problem in terms of the htree, which currently will request an N-bit slice, where N is address bits and can be larger than the byte boundary of the key. The unit tests will need to be updated as well.

- slotsOnPage for the BucketPage? should be left as 2addressBits for now. Given that we need to use overflow pages for fast append-only semantics for solution sets in group by operations, it may or may not be effective to specify a different value for slotsOnPage for the BucketPage?. We should explore values for BucketPage?.slotsOnPage when benchmarking, together with values of addressBits and the interaction with the slot sizes available on the backing store.

- I am unclear why this code path in BucketPage#insertRawTuple?() has been disabled:

		// unable to insert
		if (false || globalDepth == htree.addressBits) {
			// max depth so add level
			DirectoryPage np = ((HTree) htree).addLevel2(this);

			np.insertRawTuple(key, val, 0);
		} else {
			// otherwise split page by asking parent to split and re-inserting
			// values

			final DirectoryPage parent = getParentDirectory();
			parent.split(this);
		}

- It is very difficult for me to audit the change set on the HTree class. It seems likely that a method was moved within the file, which is causing difficulties with viewing the delta within eclipse. I'd like to talk through the changes in this file.

The top priorities for the htree remain overflow pages, fixing the issue with eviction (copy-on-write), and ordered keys followed by optimizations. I would like to coordinate closely on the changes to the copy-on-write scheme.

Changed 22 months ago by thompsonbry

We've tracked down the problem with eviction, which is that we were not touching the root DirectoryPage? on the mutation paths. The B+Tree code was written slightly differently and the root was always touched on descent. AbstractHTree#getRoot() has been modified to touch() the root, which solved the problem.

We have resolved some problems with insertRawTuple() where it would not handle overflow pages and where it could duplicate raw records rather than copying their references.

We are currently looking at two stack traces, both of which appear to have a common cause. Basically, we are seeing a DirectoryPage? whose data is read-only but which is not persistent. One thought is that this problem might arise if the various bits of code which scan the childRefs[] and childAddr[] fail to correctly detect and update all references/addresses when the child is replaced by copy-on-write or by a structural transformation, such as splitting a bucket page, adding a new level above a bucket page, or splitting a directory page. We are going to pick this up again tomorrow and examine the state of the HTree surrounding the DirectoryPage? which is tripping the assert. The issue only shows up with the MemStore? in combination with smaller values for address bits (LE 6) and large #s of inserts (100000).

Changed 22 months ago by thompsonbry

Martyn nailed the eviction problem. I had failed to set the persistent identity on the BucketPage? when it was read in.

We have on remaining problem which shows up in the stress tests for large inserts. It seems likely to be a problem with the dirty child iterator failing to visit some dirty children. The children wind up remaining dirty even though their parent has been evicted. Martyn is looking at this one right now.

Changed 22 months ago by thompsonbry

We've nailed the last problem and are now inserting up to 1M tuples into the HTree over the memory store in unit tests w/o problems. Of interest, ordered value scan is extremely fast -- less than 1 second for 1,000,000 values. Random lookup is quite a bit slower, even when the random lookups are performed in key order. This appears to be due to the use of a linear scan of the bucket page when searching for a key. We plan to change the bucket page to be similar to the B+Tree leaf and maintain its keys and values in order (though it must allow duplicate keys). That should speed up the random lookups quite a bit.

The following show the performance characteristics when inserting 1M values. In both cases we insert the sequential integers [0:1M). However, in the 2nd case we insert the hash code of the integer rather than its original bit pattern. Working with the hash code actually speeds up the insert (slightly) and lookupFirst (a little more significantly). It also appears to slightly improve the values() iterator. The first two effects make some sense. I can't quite explain the latter. However, I am also quite certain that ordered inserts would be far more efficient if we were running the test against DISK rather than RAM.

With 1M sequential int32 values:

Inserted: 1000000 tuples in 38875ms, lookupFirst(all)=19250, valueScan(all)=16765, addressBits=1, nnodes=500019, nleaves=500019
Inserted: 1000000 tuples in 13140ms, lookupFirst(all)=7360, valueScan(all)=5875, addressBits=2, nnodes=83343, nleaves=83343
Inserted: 1000000 tuples in 7579ms, lookupFirst(all)=5671, valueScan(all)=3516, addressBits=3, nnodes=35721, nleaves=35721
Inserted: 1000000 tuples in 3844ms, lookupFirst(all)=4859, valueScan(all)=2375, addressBits=4, nnodes=4172, nleaves=4172
Inserted: 1000000 tuples in 3125ms, lookupFirst(all)=6765, valueScan(all)=1954, addressBits=5, nnodes=8069, nleaves=8069
Inserted: 1000000 tuples in 2422ms, lookupFirst(all)=10250, valueScan(all)=1281, addressBits=6, nnodes=3971, nleaves=3971
Inserted: 1000000 tuples in 1907ms, lookupFirst(all)=33203, valueScan(all)=843, addressBits=8, nnodes=18, nleaves=18
Inserted: 1000000 tuples in 2594ms, lookupFirst(all)=124297, valueScan(all)=812, addressBits=10, nnodes=246, nleaves=246

With hash codes of 1M sequential int32 values:

Inserted: 1000000 tuples in 31157ms, lookupFirst(all)=15515, valueScan(all)=13938, addressBits=1, nnodes=500019, nleaves=500019
Inserted: 1000000 tuples in 10657ms, lookupFirst(all)=6031, valueScan(all)=4703, addressBits=2, nnodes=83343, nleaves=83343
Inserted: 1000000 tuples in 5954ms, lookupFirst(all)=4359, valueScan(all)=2750, addressBits=3, nnodes=35721, nleaves=35721
Inserted: 1000000 tuples in 3063ms, lookupFirst(all)=3797, valueScan(all)=1890, addressBits=4, nnodes=4172, nleaves=4172
Inserted: 1000000 tuples in 2515ms, lookupFirst(all)=5016, valueScan(all)=1219, addressBits=5, nnodes=8069, nleaves=8069
Inserted: 1000000 tuples in 2172ms, lookupFirst(all)=7500, valueScan(all)=969, addressBits=6, nnodes=3971, nleaves=3971
Inserted: 1000000 tuples in 1609ms, lookupFirst(all)=26344, valueScan(all)=781, addressBits=8, nnodes=18, nleaves=18
Inserted: 1000000 tuples in 2812ms, lookupFirst(all)=99844, valueScan(all)=547, addressBits=10, nnodes=246, nleaves=246

Changed 22 months ago by thompsonbry

We now have something which works and can be used at scale. Feature and efficiency improvements to follow will include:

- maintaining the keys and values in dense order (time efficiency)
- eliding bucket pages and directory pages which do not have any data (space efficiency)
- support for bucket page overflow (handling large numbers of duplicate keys).

Changed 22 months ago by thompsonbry

Partial test suite cleanup for the HTree, but there are still some unit tests which need to be revisited and fixed. Added the HTree package to CI.

Committed revision r4990.

Changed 22 months ago by thompsonbry

Martyn and I talked through the overflow page design today. We settled on a "BlobPage?" which would be the parent of a chain of "BucketPage?"s. The DirectoryPage? links to a BucketPage? until the bucket page satisfies the overflow criteria (full and all keys are duplicates). At that point we will replace the reference in the parent with a reference to a "BlobPage?" having two children: (1) the original BucketPage?; and (2) a new BucketPage? into which the new insert will be directed. The advantage of this design over chaining pointers is that copy-on-write would cause us to materialize and make mutable all predecessors in the chain. This is not so much a problem for insert as we would always insert at the head of the chain and link backwards to the older bucket pages. However, on remove of a tuple we could be forced to materialize via copy-on-write 1/2 of the bucket pages in the overflow chain on average. The BlobPage? design avoids this problem by storing an array of references/addresses to the child bucket pages. Copy on write of a child bucket page will force the blob page to become mutable (if it is not), but it will not effect the other bucket pages for the same key.

Obviously, there will need to be some shared abstraction between a BlobPage? and a BucketPage?. Presumably, both the blob page and bucket page can implement native HTree methods which operate on a logical bucket page. The blob would have to indirect as appropriate to the child bucket pages. Any given bucket page would just manage its tuples.

Changed 22 months ago by thompsonbry

Support for overflow pages worked out with Martyn. We are good to 5000 overflow pages with a 10 bit address space. We have not yet optimized the keys raba for the overflow bucket pages.

Committed revision r5031.

Changed 21 months ago by thompsonbry

Martyn has continued to push the overflow pages capability, which we expect to leverage quite heavily for temporary solution sets. A few bugs have also been uncovered and fixed.

Changed 21 months ago by thompsonbry

mroycsi wrote: One exception I got now:

Caused by: java.lang.ClassCastException: com.bigdata.htree.MutableBucketData
cannot be cast to com.bigdata.btree.IRawRecordAccess
	at com.bigdata.btree.AbstractTuple.copy(AbstractTuple.java:390)
~[bigdata-terms/:na]
	at
com.bigdata.htree.BucketPage$InnerBucketPageTupleIterator.next(BucketPage.ja
va:821) ~[bigdata-terms/:na]
	at
com.bigdata.htree.BucketPage$InnerBucketPageTupleIterator.next(BucketPage.ja
va:1) ~[bigdata-terms/:na]
	at
cutthecrap.utils.striterators.Expanderator.getNext(Expanderator.java:27)
~[na:na]
	at
cutthecrap.utils.striterators.Expanderator.getNext(Expanderator.java:46)
~[na:na]
	at
cutthecrap.utils.striterators.Prefetch.checkInit(Prefetch.java:12) ~[na:na]
	at cutthecrap.utils.striterators.Prefetch.hasNext(Prefetch.java:20)
~[na:na]
	at
cutthecrap.utils.striterators.Striterator.hasNext(Striterator.java:80)
~[na:na]
	at
com.bigdata.htree.DirectoryPage$4.hasNext(DirectoryPage.java:1853)
~[bigdata-terms/:na]
	at
com.bigdata.bop.join.HashJoinUtility.hashJoin(HashJoinUtility.java:279)
~[bigdata-terms/:na]
	at
com.bigdata.bop.controller.NamedSubqueryIncludeOp$ChunkTask.doHashJoin(Named
SubqueryIncludeOp.java:294) ~[bigdata-terms/:na]
	at
com.bigdata.bop.controller.NamedSubqueryIncludeOp$ChunkTask.call(NamedSubque
ryIncludeOp.java:244) ~[bigdata-terms/:na]
	at
com.bigdata.bop.controller.NamedSubqueryIncludeOp$ChunkTask.call(NamedSubque
ryIncludeOp.java:1) ~[bigdata-terms/:na]
	at
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
~[na:1.6.0_26]
	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
~[na:1.6.0_26]
	at
com.bigdata.bop.engine.ChunkedRunningQuery$ChunkTask.call(ChunkedRunningQuer
y.java:1186) ~[bigdata-terms/:na]

Martyn, can you please look at this and see whether it is a concurrent eviction problem or a problem with the striterator tail recursion?

Thanks,
bryan

Changed 21 months ago by thompsonbry

As a possible fix I modified the named subquery operator to checkpoint the HTree containing the solution set, reload a read-only view of the HTree from that checkpoint, and then save a reference to the read-only view on the query attributes. This makes the named solution set save for concurrent readers, which is a real possibility if the subquery is included into more than one place in the query.

Committed revision r5116.

Changed 21 months ago by thompsonbry

Ok. The problem is BucketPage??#821. It needs to be changed from:

			tuple.copy(nextNonEmptySlot, data

to this:

			tuple.copy(nextNonEmptySlot, BucketPage.this);

Committed revision r5117.

(I documented this on the wrong ticket).

Changed 21 months ago by thompsonbry

Martyn wrote: Implement ordered insert and iteration for BucketPages?

Committed Revision: r5167

Changed 20 months ago by thompsonbry

Martyn is currently working on delete support. Following that he will take up storage optimizations (elision and lazy allocation of pages).

Changed 20 months ago by thompsonbry

Martyn wrote: Updates to the HTree supporting removal, use of the FrontCodedRaba? for ordered key values and the lazy creation of BucketPages?.

Committed revision r5312.

Changed 20 months ago by thompsonbry

Martyn is going to work through the page ellision logic under removal next. He is also going to work up a demonstration program showing on you can have a very small JVM heap (32M) and a very large HTree instance (as much as you have RAM).

Changed 20 months ago by thompsonbry

The page elision on insert work is done. Martyn is now working on page elision on removal. He is also working on a performance comparison of the HTree versus the Java HashMap? / LinkedHashMap? classes so we can have some sense of what that performance tradeoff looks like.

Changed 19 months ago by thompsonbry

Martyn fixed two bugs pertaining to ensuring that rawRecords array is updated when by insertRawTuple. These bugs were blocking the high volume htree hash join queries.

Changed 19 months ago by thompsonbry

  • status changed from assigned to closed
  • resolution set to fixed

Bug fix dealing with flush as triggered by TestReopen?.

Note: See TracTickets for help on using tickets.