[Bigdata-commit] SF.net SVN: bigdata:[6356] branches/BIGDATA_RELEASE_1_2_0

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 422-6466

Revision: 6356
          http://bigdata.svn.sourceforge.net/bigdata/?rev=6356&view=rev
Author:   thompsonbry
Date:     2012-06-25 15:41:02 +0000 (Mon, 25 Jun 2012)
Log Message:
-----------
Mike has been reporting significant performance improvements associated with the use of the POS index for point tests on fully bound access paths for triple patterns such as "P??" and "PO?".  The rationale for improved performance is that the index accesses are concentrated in the same region of the index because the P is a constant in this triple pattern.  

Historically, we have been using SPO (or SPOC for quads) whenever the predicate became fully bound during query evaluation (as a result of the propagation of variable bindings from solutions).  After talking this through with Mike, we have decided to change the code in SPOKeyOrder#getKeyOrder()/2 to use the index associated with the original triple pattern rather than the SPO/SPOC index.

The ASTStaticJoinOptimizer currently annotations the predicates with AST2BOpBase.Annotations.ORIGINAL_INDEX.  This is the index associated with the original triple or quad pattern without regard to any propagation of variable bindings.  Therefore, I have modified SPOKeyOrder#getKeyOrder()/2 to use this ORIGINAL_INDEX annotation (if present) when the predicate is fully bound during evaluation and to default to the SPO / SPOC index if the predicate is fully bound by the ORIGINAL_INDEX annotation is not present.

The only drawback with this approach is that the ORIGINAL_INDEX annotation is not present if the static join order optimizer is disabled.  However, the ORIGINAL_INDEX annotation COULD be lifted into another ASTOptimizer in order to guarantee that it is always predicate.

Since point tests can now occur against any statement index, I also modified SPORelation#getStatementIndexMetadata() to enable the bloom filter for ALL statement indices, not just SPO.  Also, through an oversight, we were not enabling the bloom filter for ANY of the statement indices in quads mode.  It is now enabled it both triples and quads modes.

Changes to SPOKeyOrder#getKeyOrder()/2 and SPORelation#getStatementIndexMetadata().

@see https://sourceforge.net/apps/trac/bigdata/ticket/150 (choosing the index for testing fully bound access paths based on index locality)

Modified Paths:
--------------
    branches/BIGDATA_RELEASE_1_2_0/bigdata/src/java/com/bigdata/bop/IPredicate.java
    branches/BIGDATA_RELEASE_1_2_0/bigdata-rdf/src/java/com/bigdata/rdf/spo/SPOKeyOrder.java
    branches/BIGDATA_RELEASE_1_2_0/bigdata-rdf/src/java/com/bigdata/rdf/spo/SPORelation.java

Modified: branches/BIGDATA_RELEASE_1_2_0/bigdata/src/java/com/bigdata/bop/IPredicate.java
===================================================================

--- branches/BIGDATA_RELEASE_1_2_0/bigdata/src/java/com/bigdata/bop/IPredicate.java	2012-06-22 18:23:43 UTC (rev 6355)
+++ branches/BIGDATA_RELEASE_1_2_0/bigdata/src/java/com/bigdata/bop/IPredicate.java	2012-06-25 15:41:02 UTC (rev 6356)
@@ -500,7 +500,7 @@
      * query optimizer.
      * 
      * @return The assigned {@link IKeyOrder} or <code>null</code> if the
-     *         {@link IKeyOrder} was not overriden.
+     *         {@link IKeyOrder} was not overridden.
      * 
      * @see Annotations#KEY_ORDER
      */

Modified: branches/BIGDATA_RELEASE_1_2_0/bigdata-rdf/src/java/com/bigdata/rdf/spo/SPOKeyOrder.java
===================================================================
--- branches/BIGDATA_RELEASE_1_2_0/bigdata-rdf/src/java/com/bigdata/rdf/spo/SPOKeyOrder.java	2012-06-22 18:23:43 UTC (rev 6355)
+++ branches/BIGDATA_RELEASE_1_2_0/bigdata-rdf/src/java/com/bigdata/rdf/spo/SPOKeyOrder.java	2012-06-25 15:41:02 UTC (rev 6356)
@@ -46,7 +46,9 @@
 import com.bigdata.rdf.internal.IVUtility;
 import com.bigdata.rdf.internal.constraints.RangeBOp;
 import com.bigdata.rdf.model.StatementEnum;
+import com.bigdata.rdf.sparql.ast.eval.AST2BOpBase;
 import com.bigdata.striterator.AbstractKeyOrder;
+import com.bigdata.striterator.IKeyOrder;
 
 /**
  * Represents the key order used by an index for a triple relation.
@@ -876,6 +878,10 @@
      *       the keyArity parameter. That parameter is only there because we
      *       support two distinct families of natural orders in this class: one
      *       for triples and one for quads.
+     * 
+     * @see <a href="https://sourceforge.net/apps/trac/bigdata/ticket/150" >
+     *      Choosing the index for testing fully bound access paths based on
+     *      index locality</a>
      */
     static public SPOKeyOrder getKeyOrder(final IPredicate<ISPO> predicate,
             final int keyArity) {
@@ -906,7 +912,24 @@
             // Note: Context is ignored!
 
             if (s && p && o) {
-                return SPO;
+                
+                /*
+                 * If the access path is all bound, then we want to use the
+                 * index associated with the original predicate (which typically
+                 * had variables in one or more positions). This index will have
+                 * better locality since it will naturally group the index
+                 * accesses in the same region of the index.
+                 * 
+                 * @see https://sourceforge.net/apps/trac/bigdata/ticket/150
+                 * (chosing the index for testing fully bound access paths based
+                 * on index locality)
+                 */
+
+                final SPOKeyOrder tmp = predicate.getProperty(
+                        AST2BOpBase.Annotations.ORIGINAL_INDEX, SPO);
+                
+                return tmp; // SPO
+                
             } else if (s && p) {
                 return SPO;
             } else if (s && o) {
@@ -956,8 +979,23 @@
                 return SOPC;
             }
 
-            return SPOC;
+            /*
+             * If the access path is all bound, then we want to use the
+             * index associated with the original predicate (which typically
+             * had variables in one or more positions). This index will have
+             * better locality since it will naturally group the index
+             * accesses in the same region of the index.
+             * 
+             * @see https://sourceforge.net/apps/trac/bigdata/ticket/150
+             * (chosing the index for testing fully bound access paths based
+             * on index locality)
+             */
 
+            final SPOKeyOrder tmp = predicate.getProperty(
+                    AST2BOpBase.Annotations.ORIGINAL_INDEX, SPOC);
+
+            return tmp;// SPOC;
+
         }
 
     }

Modified: branches/BIGDATA_RELEASE_1_2_0/bigdata-rdf/src/java/com/bigdata/rdf/spo/SPORelation.java
===================================================================
--- branches/BIGDATA_RELEASE_1_2_0/bigdata-rdf/src/java/com/bigdata/rdf/spo/SPORelation.java	2012-06-22 18:23:43 UTC (rev 6355)
+++ branches/BIGDATA_RELEASE_1_2_0/bigdata-rdf/src/java/com/bigdata/rdf/spo/SPORelation.java	2012-06-25 15:41:02 UTC (rev 6356)
@@ -652,6 +652,10 @@
     
     /**
      * Overrides for the statement indices.
+     * 
+     * @see <a href="https://sourceforge.net/apps/trac/bigdata/ticket/150" >
+     *      Choosing the index for testing fully bound access paths based on
+     *      index locality</a>
      */
     protected IndexMetadata getStatementIndexMetadata(final SPOKeyOrder keyOrder) {
 
@@ -714,19 +718,48 @@
 
         }
 
-        if (bloomFilter && keyOrder.equals(SPOKeyOrder.SPO)) {
+        if (bloomFilter) {// && keyOrder.equals(SPOKeyOrder.SPO)) {
             
+//          * Enable the bloom filter for the SPO index only.
+//          * 
+//          * Note: This SPO index is used any time we have an access path that
+//          * is a point test. Therefore this is the only index for which it
+//          * makes sense to maintain a bloom filter.
+            
             /*
-             * Enable the bloom filter for the SPO index only.
+             * Enable the bloom filter.
              * 
-             * Note: This SPO index is used any time we have an access path that
-             * is a point test. Therefore this is the only index for which it
-             * makes sense to maintain a bloom filter.
+             * Historically, the bloom filter was only enabled in for the SPO
+             * index because the SPO index was always used for a point test.
+             * Further, by oversight, it was never enabled for a quads mode KB
+             * instance. However, in order to improve the locality of reference
+             * for point tests, the SPOKeyOrder#getKeyOrder() method was
+             * modified to use the index which would be used if we evaluated the
+             * original predicate rather than the "as-bound" predicate. This
+             * change provides better locality of access since the variables
+             * indicate the region of variability in the index while the
+             * constants indicate the region of locality. Therefore, fully bound
+             * (aka point tests) can now occur on ANY statement index and the
+             * bloom filter is now enabled on all statement indices.
              * 
+             * @see https://sourceforge.net/apps/trac/bigdata/ticket/150
+             * (chosing the index for testing fully bound access paths based on
+             * index locality)
+             */
+            
+            /*
              * Note: The maximum error rate (maxP) applies to the mutable BTree
              * only. For scale-out indices, there is one mutable BTree per index
              * partition and a new (empty) BTree is allocated each time the live
              * journal for the index partitions overflows.
+             * 
+             * Note: The bloom filter is disabled once the error rate grows too
+             * large. However, in scale-out, we are able to put a perfect fit
+             * bloom filter in each index segment. This can significantly reduce
+             * the IO costs associated with point tests. This can not be done in
+             * the single machine database mode because the error rate of the
+             * bloom filter is a function of the size of the bloom filter and
+             * the #of entries in the index.
              */
 
 //            // good performance up to ~2M triples.

This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.





[Bigdata-commit] SF.net SVN: bigdata:[6356] branches/BIGDATA_RELEASE_1_2_0

Fast, scalable, robust graph database platform

[Bigdata-commit] SF.net SVN: bigdata:[6356] branches/BIGDATA_RELEASE_1_2_0