Hello, I have installed postgres-xl on 2 servers: gtm + coord1 + datanode1 on server A, and coord2 + datanode2 on server B. I have 2 fairly large tables distributed by hash(c) and a fairly small master data table that is replicated to all nodes.
In my first example: you can see the problem: I've joined 2 very large tables by their distribution key, but the explain plan definitely shows pg-xl redistributing both of the tables by hash(c) again (even though it was already distributed by hash(c) ). I thought the ability to push-down co-distributed joins without re-distributing was a big difference between pg-xc and pg-xl:
[local]:5432 pgxl = # explain select sum(my_data1.b) from my_data1, my_data2 **where my_data1.c = my_data2.c**; QUERY PLAN Aggregate (cost=20580574.24..20580574.25 rows=1 width=4) -> Remote Subquery Scan on all (dn1,dn2) (cost=3364881.84..20330900.00 rows=99869696 width=4) -> Aggregate (cost=3364881.84..20330900.00 rows=1 width=4) -> Hash Join (cost=3364881.84..20330900.00 rows=99869696 width=4) Hash Cond: ((my_data2.c)::text = (my_data1.c)::text) -> **Remote Subquery Scan on all** (dn1,dn2) (cost=100.00..3064213.62 rows=99869696 width=6) **Distribute results by H: c** -> Seq Scan on my_data2 (cost=0.00..1965546.96 rows=99869696 width=6) -> Hash (cost=3326710.21..3326710.21 rows=99974304 width=12) -> **Remote Subquery Scan on all** (dn1,dn2) (cost=100.00..3326710.21 rows=99974304 width=12) **Distribute results by H: c** -> Seq Scan on my_data1 (cost=0.00..1627047.04 rows=99974304 width=12)
In my second example, you'll notice the same explain plan as above, even though the 2nd query is not joining on the distribution key:
[local]:5432 pgxl = # explain select sum(my_data1.b) from my_data1, my_data2 where my_data1.b = my_data2.b; QUERY PLAN Aggregate (cost=20385312.24..20385312.25 rows=1 width=4) -> Remote Subquery Scan on all (dn1,dn2) (cost=3267250.84..20135638.00 rows=99869696 width=4) -> Aggregate (cost=3267250.84..20135638.00 rows=1 width=4) -> Hash Join (cost=3267250.84..20135638.00 rows=99869696 width=4) Hash Cond: (my_data2.b = my_data1.b) ->** Remote Subquery Scan on all (dn1,dn2)** (cost=100.00..2864474.22 rows=99869696 width=4) **Distribute results by H: b** -> Seq Scan on my_data2 (cost=0.00..1965546.96 rows=99869696 width=4) -> Hash (cost=2526915.78..2526915.78 rows=99974304 width=4) -> **Remote Subquery Scan on all** (dn1,dn2) (cost=100.00..2526915.78 rows=99974304 width=4) **Distribute results by H: b** -> Seq Scan on my_data1 (cost=0.00..1627047.04 rows=99974304 width=4) (12 rows)
In my third example, you can see pg-xl doing a great job of joining to a replicated table. There is no re-distribution of the data.
[local]:5432 pgxl = # explain select sum(b) from my_data1, master_data where my_data1.a = master_data.a; QUERY PLAN Aggregate (cost=3251630.55..3251630.56 rows=1 width=4) -> Remote Subquery Scan on all (dn1,dn2) (cost=1.07..3001694.79 rows=99974304 width=4) -> Aggregate (cost=1.07..3001694.79 rows=1 width=4) -> Hash Join (cost=1.07..3001694.79 rows=99974304 width=4) Hash Cond: ((my_data1.a)::text = (master_data.a)::text) -> Seq Scan on my_data1 (cost=0.00..1627047.04 rows=99974304 width=6) -> Hash (cost=1.03..1.03 rows=3 width=2) -> Seq Scan on master_data (cost=0.00..1.03 rows=3 width=2)
Is there a way to force pg-xl to join co-distributed tables without re-distributing the data?
Thank you,
Bob
Sorry, I forgot to mention that the 2nd and 3rd examples above have the good-looking explain plans that I expected. But I expected the 1st explain plan (co-distributed join) to look more like the 3rd explain plan (join to a replicated table) because each node has all of the data it needs to perform the join without re-distribution. Unfortunately, the 1st explain plan (co-distributed join) is identical to the 2nd explain plan (join on non-distributed field).
My expectation is based on this quote from the pg-xl documentation:
Otherwise, I really do like the product. I just need to figure out this one piece.