From: Ashutosh B. <ash...@en...> - 2014-03-04 03:50:25
|
On Tue, Mar 4, 2014 at 1:30 AM, David E. Wheeler <da...@ju...>wrote: > On Mar 3, 2014, at 10:36 AM, David E. Wheeler <da...@ju...> > wrote: > > > Even with JOIN clauses added to the query such that it *should* run in > whole on each node, the plan still fetches an entire (very large) table > from each node. I will keep fiddling with the query to see if I can figure > out how to get it to run on the nodes. > > Okay, here's a simplified example: > > =# EXPLAIN VERBOSE > SELECT tr.subscriber_id AS subscriber_id > , s.subscriber AS subscriber_name > , tr.txn_id AS tracking_number > FROM subscriber_:SUB_CODE.transactions tr > LEFT JOIN config.subscribers s ON tr.subscriber_id = s.subscriber_id > WHERE tr.txn_timestamp_utc >= :'RPT_DAY'::timestamp - (:DAYS_BACK || ' > days')::interval + (:CUTOFF_HOUR || ' hours')::interval > AND tr.txn_timestamp_utc< :'RPT_DAY'::timestamp + (:CUTOFF_HOUR || ' > hours')::interval > ; > > QUERY > PLAN > > --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- > Hash Right Join (cost=0.01..0.06 rows=1 width=44) > Output: tr.subscriber_id, s.subscriber, tr.txn_id > Hash Cond: (s.subscriber_id = tr.subscriber_id) > -> Data Node Scan on subscribers "_REMOTE_TABLE_QUERY_" > (cost=0.00..0.00 rows=1000 width=36) > Output: s.subscriber, s.subscriber_id > Node/s: rptdata01 > Remote query: SELECT subscriber, subscriber_id FROM ONLY > config.subscribers s WHERE true > -> Hash (cost=0.00..0.00 rows=1000 width=12) > Output: tr.subscriber_id, tr.txn_id > -> Data Node Scan on transactions "_REMOTE_TABLE_QUERY_" > (cost=0.00..0.00 rows=1000 width=12) > Output: tr.subscriber_id, tr.txn_id > Node/s: rptdata01, rptdata02, rptdata03, rptdata04 > Remote query: SELECT subscriber_id, txn_id, > txn_timestamp_utc FROM ONLY subscriber_482900.transactions tr WHERE true > Coordinator quals: ((tr.txn_timestamp_utc < ('2014-01-16 > 00:00:00'::timestamp without time zone + ('0 hours'::cstring)::interval)) > AND (tr.txn_timestamp_utc >= (('2014-01-16 00:00:00'::timestamp without > time zone - ('1 days'::cstring)::interval) + ('0 > hours'::cstring)::interval))) > > This line is the problem: > > Remote query: SELECT subscriber_id, txn_id, txn_timestamp_utc FROM > ONLY subscriber_482900.transactions tr WHERE true > > Why aren't the WHERE clauses getting pushed down to the data nodes? That > transactions table is very large, and there is an index on > txn_timestamp_utc, so I can see no reason why the planner would choose to > do a full table scan on every data node and filter on the coordinator. > That's a really poor choice. This forced table scan is eating all the > memory, of course. > > Is there a reason the planner isn't pushing down the WHERE clauses? > > Although, it looks like the quals are shippable, there may be something which is being deemed as volatile (function or operator or a cast) or some parameter. Unfortunately, I don't have time to dig through it. Can you please check what happens to the JOIN if you say removing WHERE clause. With that the entire join should be shippable, assuming that the two relations are distributed on subscriber_id. > Thanks, > > David > > -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company |