From: <tho...@us...> - 2011-06-28 19:52:07
Revision: 4812
          http://bigdata.svn.sourceforge.net/bigdata/?rev=4812&view=rev
Author:   thompsonbry
Date:     2011-06-28 19:52:01 +0000 (Tue, 28 Jun 2011)

Log Message:
-----------
Branch for https://sourceforge.net/apps/trac/bigdata/ticket/342 (Too much IO scatter for TERMS index). This is a branch off of the TERMS_REFACTOR_BRANCH.

Here is my summary of what needs to be changed. Mostly, this requires re-importing from the QUADS branch some classes which were dropped during the TERMS refactor. We also have to create a distinction between a TermId, which I think should once again refer to an IV which is a key into the ID2TERM index, and an IV which is a key into the "TERMS" index. I am tempted to rename the TERMS index as the BLOBS index, which would give us a BlobIV for that index. Given that the TERMS index has too much IO scatter, this rename seems to be in line with using it primarily for large objects (say 256 bytes and up).

The concept of the "NullIV" should go to the ID2TERM IVs, where it can once again be represented by a 0L termId. This is significantly simpler than inserting 4 distinct NullIVs into the TERMS index.

Outside of the re-imported classes and the introduction of a BlobIV, most of the changes will be restricted to the LexiconRelation.

I plan to go with the introduction of the "extension" byte described above for the TERMS index. This will provide extensibility for additional storage options, including the inlining of URIs with a combination of a "byte" or "short" namespaceIV and a Unicode localName, and specialized storage for block-oriented, cloud-aware storage systems such as Amazon's S3. Given that we will restrict the use of the TERMS index to RDF Values with more than (say) 256 characters of data, this additional byte in the key for the TERMS index will be lost in the noise.
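
To make the key layout concrete, here is a minimal Java sketch of a TERMS/BLOBS key carrying the extra extension byte, together with a size check that routes a value to that index. It is not the bigdata key builder. The class name, the field order (flags, extension, hash code, counter), the extension codes, and the exact threshold comparison are all assumptions made for illustration; only the idea of an extension byte and the "say 256" threshold come from the log message.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical sketch only; names and field order are assumptions,
// not the actual bigdata key encoding.
final class BlobKeySketch {

    // Assumed threshold ("say 256" in the log message); measured here in chars.
    static final int BLOB_THRESHOLD = 256;

    // Assumed extension codes; the log message only says such a byte will exist.
    static final byte EXT_BLOB = 0;
    static final byte EXT_NAMESPACE_URI = 1;

    /** Assumed layout: flags(1) + extension(1) + hashCode(4) + counter(1) = 7 bytes. */
    static byte[] blobKey(byte flags, byte extension, int hashCode, byte counter) {
        return ByteBuffer.allocate(7)
                .put(flags)
                .put(extension)
                .putInt(hashCode)   // big-endian by default
                .put(counter)
                .array();
    }

    /** True when the value is long enough to be stored in the TERMS/BLOBS index. */
    static boolean isBlob(String value) {
        return value.length() >= BLOB_THRESHOLD;
    }

    public static void main(String[] args) {
        byte[] key = blobKey((byte) 1, EXT_BLOB, "some long literal".hashCode(), (byte) 0);
        System.out.println(Arrays.toString(key));
        System.out.println(isBlob("short literal"));
    }
}
```

Under this assumed layout the extension byte costs one byte per key, which is small next to a value of 256 or more characters.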

Since there will be far fewer entries in the TERMS index, the possibility of a hash collision bucket becoming full is exceedingly remote (the excellent distribution of hash codes is the reason why the TERMS index is causing too much IO), but we could always take one more byte for the counter, making it into a short. That would be an 8 byte TERMS index key, which is what we already have for the ID2TERM index (actually, it's probably 9 bytes since it includes the flags byte as well).

{{{
- LexiconRelation: 3 indices (TERM2ID, ID2TERM, TERMS/BLOBS)
- LexiconRelation#newAccessPath() will have to support point lookup against ID2TERM and TERMS.
- Recover the write tasks, write procedures, and unit tests for the TERM2ID and ID2TERM indices from the QUADS branch.
- Update the scale-out data loader (recover the classes which handle the TERM2ID and ID2TERM indices). I will also have to change the latch/guard conditions since both TERM2ID and TERMS are assigning IVs and that is what we need to wait on for various events.
- The text index should be ok, but review the integration anyway.
- NullIVs should be modeled as a 0L termId for the ID2TERM index rather than using the TERMS index. The TERMS index does not need to have the "NullIV" versions of a URI, BNode, Literal, or SID inserted when it is created.
}}}

Configuration file settings for the TERM2ID and ID2TERM indices will also need to be recovered for scale-out.

Some thought should be given to the buffering of large objects for the TERMS index in the scale-out data loader. Assuming that we choose the threshold for the ID2TERM/TERMS index appropriately, we should observe a moderate amount of activity on that index write pipeline. But very large RDF Values (megabytes) should probably be flushed through immediately. This suggests bounding the asynchronous index write pipeline on memory consumed and not (just) on the number of tuples buffered (chunkSize).
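
The buffering idea in the last paragraph can be sketched as a buffer that flushes on whichever bound is hit first: tuple count, total bytes held, or a single very large value. This is a simplified, single-threaded illustration, not the bigdata asynchronous write pipeline. The class name and the three parameters (`maxTuples`, `maxBytes`, `flushNowBytes`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only; not the bigdata asynchronous index write pipeline.
final class BoundedWriteBuffer {

    private final int maxTuples;       // plays the role of chunkSize
    private final long maxBytes;       // bound on the total payload buffered
    private final long flushNowBytes;  // a single value this large is flushed at once

    private final List<byte[]> pending = new ArrayList<>();
    private long pendingBytes = 0;
    private int flushCount = 0;

    BoundedWriteBuffer(int maxTuples, long maxBytes, long flushNowBytes) {
        this.maxTuples = maxTuples;
        this.maxBytes = maxBytes;
        this.flushNowBytes = flushNowBytes;
    }

    /** Buffer one value, flushing when any of the three bounds is reached. */
    void add(byte[] value) {
        pending.add(value);
        pendingBytes += value.length;
        if (value.length >= flushNowBytes
                || pending.size() >= maxTuples
                || pendingBytes >= maxBytes) {
            flush();
        }
    }

    /** Drain the buffer; a real pipeline would hand the chunk to the index writer here. */
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        pending.clear();
        pendingBytes = 0;
        flushCount++;
    }

    int flushCount() { return flushCount; }

    int pendingTuples() { return pending.size(); }

    long pendingBytes() { return pendingBytes; }
}
```

With bounds of, say, 10 tuples, 1000 bytes, and a 500-byte immediate-flush limit, a run of small values is batched as before, while a single 600-byte value goes through on its own as soon as it is added.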

This branch will be merged back to the TERMS_REFACTOR_BRANCH once it passes its test suite.

Added Paths:
-----------
    branches/TIDS_PLUS_BLOBS_BRANCH/
    branches/TIDS_PLUS_BLOBS_BRANCH/TIDS_PLUS_BLOBS_BRANCH/