|
From: Mike P. <mi...@sy...> - 2010-08-27 18:53:44
|
You would definitely get better locality by using POS instead of SPO in that case since the P and O remain constant for every asBound version of the (?c, d, e) tail. However, just playing devil's advocate, we've had the argument before over whether locality or concurrency wins in scale-out. For example, recall the discussion we had over forward or reverse properties (excerpted below). I argued that using forward properties would always win because you'd have better locality. You argued that the locality would be offset by increased parallelism from using reverse properties. How is this argument any different? Is it because we are talking about only one tail instead of many tails?
...
The case where you were gathering multiple properties for A instead of just one is a little different:
Forward:
select ?a,?b,?c,?d
where {
?a type A .
?a AtoB ?b .
?a AtoC ?c .
?a AtoD ?d .
}
Reverse:
select ?a,?b,?c,?d
where {
?a type A .
?b BtoA ?a .
?c CtoA ?a .
?d DtoA ?a .
}
In the case of forward you'd visit the POS index and then all the properties for a given ?a binding would be in the same place for the remainder of the tails (same SPO shard). For reverse you'd visit the POS index and then you'd have a set of bound PO tuples that would need to be visited using POS shards. Again, unlikely they'd be in the same place, which means network movement to complete the join. I assert that forward will always beat reverse in this case because you have all the data you need for tails 2-4 locally. Bryan asserts that reverse can overcome the lack of locality via concurrency.
-----Original Message-----
From: Bryan Thompson
Sent: Friday, August 27, 2010 12:42 PM
To: Mike Personick
Cc: big...@li...
Subject: chosing the index for testing fully bound access paths based on index locality
Mike,
David and I talked through a scenario where we might benefit by choosing a non-SPO index when testing a fully bound triple. Consider:
query :- (a b ?c) AND (?c, d, e).
Assuming that (a b ?c) is more selective, we would normally choose the SPO index for both access paths. However, the POS index will have better locality for (?c d e), so perhaps we would do better by sending the binding sets to that index?
The guiding priciple would be, "when fully bound, choose the index with better locality based on the variable(s) in the triple pattern."
If you think this makes sense, let's file and issue for this. I would have to review LUBM/BSBM to be certain, but I would not be suprised if both of those benchmarks included queries which had the same characteristic. As the number of variables in the second triple pattern increases, there will be less locality in the index so this might work better for one unbound triple patterns than for two unbound triple patterns.
Thanks,
Bryan
|