Thread: [Postgres-xc-developers] shipping outer joins

Hi All,
Currently, we do not ship an OUTER join to the datanodes unless both sides
of the join are replicated. But there are cases where shipping an OUTER
join between distributed/replicated tables should be possible. Those cases
are stated below. For each case, a justification (or proof of correctness)
of the claimed shippability follows.

1. An equi-outer-join on the distribution columns of two distributed
tables is shippable to the datanodes where the tables are distributed,
provided that the tables are distributed on the same set of nodes, use the
same distribution strategy, and have distribution columns of the same
datatype.

justification for the claim
---------------------------------
Any (full) outer join A OJ B between relations A and B is defined as A IJ B
(term I) + rows of A which are not part of I, with NULL values for the
columns of B (term II) + rows of B which are not part of I, with NULL
values for the columns of A (term III). A IJ B is the inner join between A
and B. I will append n to OJ, IJ, or to terms I, II, III to denote the
operation executed on the nth node, e.g. A IJn B or IIn.

In such a case, rows with the same value for the distribution column
reside on the same datanode for both tables. Hence, a given row on a given
node from either table cannot join with a row of the other table on any
node other than the one where it resides (referred to as (a)). Hence if we
collect A IJn B across all the nodes, it produces A IJ B. Because of (a), a
row which is part of IIn or IIIn will also be part of II or III
respectively. Since a row resides on only a single node, a row r1 which is
part of IIn cannot be part of IIm (n != m). Similarly for any row in IIIn.
Thus if we collect IIn and IIIn across all nodes, we get II and III
respectively. Now In + IIn + IIIn is nothing but A OJn B. Hence if we
collect A OJn B from all the nodes, it produces A OJ B. Hence the above
result.
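
The argument for case 1 can be checked on a toy example. What follows is
only a minimal sketch in Python, not Postgres-XC code: relations are lists
of (key, value) rows, nodes are plain lists, and the names full_outer_join,
distribute and shipped_join, as well as the modulo placement functions, are
hypothetical helpers made up for illustration. It compares the union of the
per-node outer joins with the outer join computed over the whole tables,
once with both tables placed by the same function and once with different
placement functions.

```python
from collections import Counter


def full_outer_join(a, b):
    """Full outer equi-join of (key, value) rows on key, as a multiset."""
    out = Counter()
    matched_b = set()
    for key_a, val_a in a:
        matched = False
        for i, (key_b, val_b) in enumerate(b):
            if key_a == key_b:
                out[(key_a, val_a, key_b, val_b)] += 1  # term I
                matched_b.add(i)
                matched = True
        if not matched:
            out[(key_a, val_a, None, None)] += 1  # term II
    for i, (key_b, val_b) in enumerate(b):
        if i not in matched_b:
            out[(None, None, key_b, val_b)] += 1  # term III
    return out


def distribute(rows, nodes, node_for):
    """Place each row on exactly one node, chosen from its key (assumed scheme)."""
    parts = [[] for _ in range(nodes)]
    for row in rows:
        parts[node_for(row[0])].append(row)
    return parts


def shipped_join(parts_a, parts_b):
    """Run the outer join on every node and collect the per-node results."""
    total = Counter()
    for part_a, part_b in zip(parts_a, parts_b):
        total += full_outer_join(part_a, part_b)
    return total


NODES = 3
A = [(1, "a1"), (2, "a2"), (2, "a2b"), (3, "a3"), (7, "a7")]
B = [(2, "b2"), (3, "b3"), (3, "b3b"), (4, "b4"), (8, "b8")]


def same(key):
    # Placement used for both tables in the shippable case.
    return key % NODES


def shifted(key):
    # A different placement, used to break the precondition on purpose.
    return (key + 1) % NODES


expected = full_outer_join(A, B)
shipped = shipped_join(distribute(A, NODES, same), distribute(B, NODES, same))
mismatched = shipped_join(distribute(A, NODES, same), distribute(B, NODES, shifted))

print(shipped == expected)     # same placement on both sides
print(mismatched == expected)  # different placement on the two sides
```

With the same placement for both tables, the collected per-node result
equals the global outer join. With different placements, matching rows land
on different nodes and show up as spurious NULL-extended rows, which is why
the conditions on the distribution strategy are needed.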

2. An equi-outer-join between a distributed and a replicated relation is
shippable to the datanodes where the distributed relation is distributed,
if the replicated relation is replicated on those nodes and the outer side
of the join is the distributed relation.

justification
---------------
Any left outer join A LOJ B between relations A and B is defined as A IJ B
(term I) + rows of A which are not part of I, with NULL values for the
columns of B (term II). A IJ B is the inner join between A and B. I will
append n (or m) to LOJ, IJ, or to terms I, II to denote the operation
executed on the nth (or mth) node. In this case, A is distributed and B is
replicated.

Since all the rows of B are available on every node where A is distributed,
the join of a given row of A on a given node with the rows of B can be
evaluated on that node (a). Since a given row of A exists on only a single
node (b), the intersection of A IJn B and A IJm B (n != m) should be empty.
Thus if we collect A IJn B from all nodes, it will produce A IJ B. Because
of (a) and (b), a row of A on a given node n which is not part of A IJn B
will be part of IIn as well as II, and cannot be part of IIm (m != n). Thus
if we collect all IIn, we get II. Since A IJn B + IIn is A LOJn B, we can
collect A LOJn B from all the nodes to produce A LOJ B. Hence the claim
above.
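
The same kind of toy check can be done for case 2. Again this is only a
sketch under stated assumptions, not Postgres-XC code: left_outer_join and
distribute are hypothetical helpers, A is placed by key modulo the number
of nodes, and B is simply visible in full on every node. The last part
swaps the roles so that the replicated relation is the outer side, to show
why the claim requires the outer side to be the distributed relation.

```python
from collections import Counter


def left_outer_join(outer, inner):
    """Left outer equi-join of (key, value) rows on key, as a multiset."""
    out = Counter()
    for key_o, val_o in outer:
        matches = [(k, v) for k, v in inner if k == key_o]
        if matches:
            for key_i, val_i in matches:
                out[(key_o, val_o, key_i, val_i)] += 1  # term I
        else:
            out[(key_o, val_o, None, None)] += 1  # term II
    return out


def distribute(rows, nodes):
    """Place each row on exactly one node by key modulo (assumed scheme)."""
    parts = [[] for _ in range(nodes)]
    for row in rows:
        parts[row[0] % nodes].append(row)
    return parts


NODES = 3
A = [(1, "a1"), (2, "a2"), (3, "a3"), (4, "a4"), (5, "a5")]  # distributed
B = [(2, "b2"), (2, "b2b"), (4, "b4"), (9, "b9")]            # replicated

parts_a = distribute(A, NODES)

# Distributed relation on the outer side: A LOJn B on each node, collected.
expected = left_outer_join(A, B)
shipped = Counter()
for part_a in parts_a:
    shipped += left_outer_join(part_a, B)

# Replicated relation on the outer side: B LOJn A on each node, collected.
expected_swapped = left_outer_join(B, A)
shipped_swapped = Counter()
for part_a in parts_a:
    shipped_swapped += left_outer_join(B, part_a)

print(shipped == expected)
print(shipped_swapped == expected_swapped)
print(shipped_swapped[(9, "b9", None, None)])  # copies of one unmatched B row
```

When A is the outer side, the collected result equals A LOJ B. When the
replicated B is the outer side, every node sees every row of B, so a row of
B without a match on that node is emitted NULL-extended by that node, and
the collected result contains extra rows.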

Does anybody see holes in those arguments?
-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Enterprise Postgres Company
