From: Manikandan S. <ma...@vt...> - 2015-06-04 06:42:04
Yes, both the tables are distributed by 'id'.

On Thu, 2015-06-04 at 12:08 +0530, Ashutosh Bapat wrote:
>
> On Thu, Jun 4, 2015 at 6:38 AM, Abbas Butt <abb...@en...> wrote:
>
> > On Wed, Jun 3, 2015 at 5:10 PM, Manikandan Soundarapandian
> > <ma...@vt...> wrote:
> >
> > > Hi Abbas,
> > >
> > > I just looked at the plan for the query that I am running. This is
> > > the query:
> > >
> > >     SELECT d.col1 RVALUE, ROUND(COUNT(d.col2)/10) COUNT
> > >     FROM Table1 d, Table2 i
> > >     WHERE d.id = i.id
> > >       AND i.time BETWEEN TRUNC(CURRENT_DATE - TO_DATE('01 Jan 2000','DD Mon YYYY') + 1 - 365)
> > >                      AND TRUNC(CURRENT_DATE - TO_DATE('01 Jan 2000','DD Mon YYYY'))
> > >     GROUP BY d.col1
> > >     ORDER BY d.col1 DESC;
> > >
> > > And the query plan for a configuration with 4 datanodes and 1
> > > coordinator:
> > >
> > >     GroupAggregate  (cost=0.07..0.10 rows=1 width=56)
> > >       ->  Sort  (cost=0.07..0.08 rows=1 width=56)
> > >             Sort Key: d.col1
> > >             ->  Hash Join  (cost=0.01..0.06 rows=1 width=56)
> > >                   Hash Cond: (d.id = i.id)
>
> Are both the tables distributed by id?
>
> The clauses with mutable functions like current_date are being pushed
> down to the individual relations. That prevents the join from being
> pushed down to the datanodes if id is the distribution key. I think
> that should be fixed. The clause movement should be reversed, back to
> the Join node.
>
> > >                   ->  Data Node Scan on Table1 "_REMOTE_TABLE_QUERY_"  (cost=0.00..0.00 rows=1000 width=56)
> > >                         Node/s: datanode_c2_d1, datanode_c3_d1, datanode_c4_d1, datanode_c5_d1
> > >                   ->  Hash  (cost=0.00..0.00 rows=1000 width=32)
> > >                         ->  Data Node Scan on Table2 "_REMOTE_TABLE_QUERY_"  (cost=0.00..0.00 rows=1000 width=32)
> > >                               Node/s: datanode_c2_d1, datanode_c3_d1, datanode_c4_d1, datanode_c5_d1
> > >                               Coordinator quals: (((i.time)::double precision <= trunc(((('now'::cstring)::date - to_date('01 Jan 2000'::text, 'DD Mon YYYY'::text)))::double precision)) AND ((i.time)::double precision >= trunc(((((('now'::cstring)::date - to_date('01 Jan 2000'::text, 'DD Mon YYYY'::text)) + 1) - 365))::double precision)))
> > >
> > > Table1 contains 22 million records and Table2 contains 10 million
> > > records. This query took several minutes (almost 35) to get
> > > results, and one third of that time is spent in FetchTuple. Do you
> > > see something wrong with the query plan?
> >
> > Can you provide the output of
> >
> >     \d+ Table1
> >     \d+ Table2
> >
> > I am interested in the distribution column of the tables.
> >
> > Also please provide the output of EXPLAIN VERBOSE on the query.
> >
> > > Since the multiplexed IO isn't the real bottleneck here, my
> > > proposal would be to change the way the TupleStore at the
> > > coordinator is populated and introduce parallelism there. Is this
> > > the right direction? Please let me know your thoughts.
> > >
> > > Thanks
> > > Mani
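To illustrate the point above about mutable functions, here is a minimal
sketch of a client-side workaround. It reuses the table and column names
from the thread; the numeric bounds in step 2 are purely hypothetical
stand-ins for whatever step 1 returns, and whether a given XC version then
ships the whole join is still up to its planner — the sketch only removes
the mutable-function obstacle from the quals.

    -- Step 1: evaluate the mutable date arithmetic once, on its own.
    SELECT TRUNC(CURRENT_DATE - TO_DATE('01 Jan 2000','DD Mon YYYY') + 1 - 365) AS lo,
           TRUNC(CURRENT_DATE - TO_DATE('01 Jan 2000','DD Mon YYYY'))           AS hi;

    -- Step 2: re-issue the query with the two results inlined as
    -- constants (5269 and 5633 are hypothetical values of lo and hi).
    -- The BETWEEN qual no longer references CURRENT_DATE, so nothing
    -- forces it to be evaluated on the coordinator.
    SELECT d.col1 RVALUE, ROUND(COUNT(d.col2)/10) COUNT
    FROM Table1 d, Table2 i
    WHERE d.id = i.id
      AND i.time BETWEEN 5269 AND 5633
    GROUP BY d.col1
    ORDER BY d.col1 DESC;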
> > > On Wed, 2015-06-03 at 01:49 -0700, Abbas Butt wrote:
> > > >
> > > > On Tue, Jun 2, 2015 at 10:17 PM, Manikandan Soundarapandian
> > > > <ma...@vt...> wrote:
> > > >
> > > > > Hi Abbas,
> > > > >
> > > > > Thanks for the reply. I understand how pgxc_node_receive works
> > > > > to get incoming data from the datanodes.
> > > > > Yes, I have made experiments to study the Postgres-XC system in
> > > > > a setup with 12 datanodes and a coordinator, and performed
> > > > > join, group by and order by operations on two tables with
> > > > > around 20 million and 10 million records respectively. I have
> > > > > found that around 20-30% of the time is spent in the
> > > > > 'FetchTuple' method, which reads one datarow at a time into the
> > > > > provided tuple slot from the combiner's buffer, which is in
> > > > > turn filled by the pgxc_node_receive method.
> > > >
> > > > The XC planner pushes parts of query processing down to the
> > > > datanodes to speed things up.
> > > >
> > > > In the queries you tried, did the query planner perform this
> > > > optimization?
> > > >
> > > > If not, then you can gain advantage by improving the planner to
> > > > generate optimized plans that avoid unnecessary materialization
> > > > at the coordinator.
> > > >
> > > > This will bring the FetchTuple time down, since there will be
> > > > fewer tuples to fetch.
> > > >
> > > > If the planner is already producing optimized plans, then what's
> > > > your proposal?
> > > >
> > > > > Thanks
> > > > >
> > > > > On Wed, 2015-06-03 at 09:45 +0500, Abbas Butt wrote:
> > > > > >
> > > > > > On Wed, Jun 3, 2015 at 4:11 AM, Manikandan Soundarapandian
> > > > > > <ma...@vt...> wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > I am a graduate student working on my research in parallel
> > > > > > > databases. I would like to know how the Postgres-XC
> > > > > > > coordinator works. I understand that the datanodes run the
> > > > > > > query in parallel and the results are collected by the
> > > > > > > coordinator, which runs any further computation that is
> > > > > > > required, or just provides the output to the client that
> > > > > > > requested the query. I would like to know whether the
> > > > > > > coordinator does this data collection from the datanodes in
> > > > > > > a sequential fashion.
> > > > > >
> > > > > > The coordinator uses multiplexed IO, using select on all the
> > > > > > fds of the datanodes. For more details please see the
> > > > > > pgxc_node_receive function in pgxcnode.c. The loop for
> > > > > > reading data on all set fds is sequential, but the
> > > > > > coordinator does not wait for data from the datanode to which
> > > > > > it had sent the query first.
> > > > > >
> > > > > > > For example, let's consider we want to run a query on table
> > > > > > > table_x, which is hash distributed among 10 datanodes:
> > > > > > >
> > > > > > >     select count(*) from table_x;
> > > > > > >
> > > > > > > Each datanode will run the query and give its local count,
> > > > > > > and the coordinator has to collect the individual counts
> > > > > > > and come up with the final count before sending the output.
> > > > > > > Is the data collection process at the coordinator done in a
> > > > > > > sequential fashion? I am actually looking to introduce some
> > > > > > > kind of parallelism in this data collection if it is
> > > > > > > sequential, and do performance studies. Please clarify.
> > > > > >
> > > > > > To improve the performance of any system, first study the
> > > > > > bottleneck, and target to widen that. Have you done any study
> > > > > > of Postgres-XC to find where the performance bottleneck is?
> > > > > >
> > > > > > > --
> > > > > > > Thanks
> > > > > > > Mani
> > > > > > > Department of Computer Science
> > > > > > > Virginia Tech
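For the count(*) example above, the division of labour between the
datanodes and the coordinator can be sketched in plain SQL. The per-node
numbers below are hypothetical stand-ins for whatever the 10 datanodes
would return; only three are shown for brevity.

    -- Pushed down to each datanode (conceptually):
    --     SELECT count(*) FROM table_x;

    -- Combined at the coordinator, conceptually equivalent to summing
    -- the partial counts as they arrive:
    SELECT sum(partial_count) AS total
    FROM (VALUES (1000123), (998877), (1003456)) AS per_node(partial_count);

Since the combine step sees at most one row per datanode, the collection
itself is cheap for fully pushed-down aggregates; the FetchTuple cost
measured above appears when a plan materializes many rows at the
coordinator, as in the join plan earlier in the thread.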
>
> ------------------------------------------------------------------------------
> _______________________________________________
> Postgres-xc-developers mailing list
> Pos...@li...
> https://lists.sourceforge.net/lists/listinfo/postgres-xc-developers
>
> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company