From: Ashutosh B. <ash...@en...> - 2015-06-04 06:38:24
On Thu, Jun 4, 2015 at 6:38 AM, Abbas Butt <abb...@en...> wrote:
>
> On Wed, Jun 3, 2015 at 5:10 PM, Manikandan Soundarapandian <ma...@vt...> wrote:
>>
>> Hi Abbas,
>>
>> I just looked at the plan for the query that I am running. This is the query:
>>
>> SELECT d.col1 RVALUE, ROUND(COUNT(d.col2)/10) COUNT FROM Table1 d,
>> Table2 i WHERE d.id=i.id AND i.time BETWEEN
>> TRUNC(CURRENT_DATE-TO_DATE('01 Jan 2000','DD Mon YYYY')+1-365) AND
>> TRUNC(CURRENT_DATE-TO_DATE('01 Jan 2000','DD Mon YYYY')) GROUP BY d.col1
>> ORDER BY d.col1 DESC;
>>
>> And the query plan for a configuration with 4 datanodes and 1 coordinator:
>>
>> GroupAggregate  (cost=0.07..0.10 rows=1 width=56)
>>   ->  Sort  (cost=0.07..0.08 rows=1 width=56)
>>         Sort Key: d.col1
>>         ->  Hash Join  (cost=0.01..0.06 rows=1 width=56)
>>               Hash Cond: (d.id = i.id)

Are both the tables distributed by id? The clauses with mutable functions like current_date are being pushed down to the individual relations. That prevents the join from being pushed down to the datanodes if id is the distribution key. I think that should be fixed: the clause movement should be reversed, back up to the Join node.
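[A rough way to test this hypothesis, as a sketch only: the DISTRIBUTE BY clause is Postgres-XC DDL, the column types are guesses, and the parameterized rewrite assumes the date bounds are computed once up front so that the remaining time qual contains no mutable function.]

```sql
-- If both tables are hash-distributed on the join column, e.g.:
--   CREATE TABLE Table1 (id int, col1 int, col2 int) DISTRIBUTE BY HASH (id);
--   CREATE TABLE Table2 (id int, time numeric)       DISTRIBUTE BY HASH (id);
-- then the equi-join on id can in principle be shipped whole to the datanodes.

-- Step 1: evaluate the CURRENT_DATE arithmetic once, outside the main query:
SELECT TRUNC(CURRENT_DATE - TO_DATE('01 Jan 2000','DD Mon YYYY') + 1 - 365) AS lo,
       TRUNC(CURRENT_DATE - TO_DATE('01 Jan 2000','DD Mon YYYY'))           AS hi;

-- Step 2: run the join with plain parameters, so the time qual is no longer
-- a "Coordinator qual" candidate for being evaluated above the remote scans:
PREPARE q(numeric, numeric) AS
SELECT d.col1 AS rvalue, ROUND(COUNT(d.col2)/10) AS count
FROM Table1 d JOIN Table2 i ON d.id = i.id
WHERE i.time BETWEEN $1 AND $2
GROUP BY d.col1
ORDER BY d.col1 DESC;
-- EXECUTE q(<lo>, <hi>);
```

If the resulting plan shows a single remote query per datanode instead of a coordinator-side Hash Join, the mutable-function clause placement is indeed what blocks the pushdown.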
>>               ->  Data Node Scan on Table1 "_REMOTE_TABLE_QUERY_"  (cost=0.00..0.00 rows=1000 width=56)
>>                     Node/s: datanode_c2_d1, datanode_c3_d1, datanode_c4_d1, datanode_c5_d1
>>               ->  Hash  (cost=0.00..0.00 rows=1000 width=32)
>>                     ->  Data Node Scan on Table2 "_REMOTE_TABLE_QUERY_"  (cost=0.00..0.00 rows=1000 width=32)
>>                           Node/s: datanode_c2_d1, datanode_c3_d1, datanode_c4_d1, datanode_c5_d1
>>                           Coordinator quals: (((i.time)::double precision <= trunc(((('now'::cstring)::date - to_date('01 Jan 2000'::text, 'DD Mon YYYY'::text)))::double precision)) AND ((i.time)::double precision >= trunc(((((('now'::cstring)::date - to_date('01 Jan 2000'::text, 'DD Mon YYYY'::text)) + 1) - 365))::double precision)))
>>
>> Table1 contains 22 million records and Table2 contains 10 million records.
>> This query took several minutes (almost 35) to get results, and one third
>> of that time is spent in FetchTuple. Do you see something wrong with the
>> query plan?
>
> Can you provide the output of
> \d+ Table1
> \d+ Table2
> I am interested in the distribution column of the tables.
>
> Also please provide the output of EXPLAIN VERBOSE on the query.
>
>> Since the multiplexed IO isn't the real bottleneck here, my proposal
>> would be to change the way the TupleStore at the coordinator is
>> populated and introduce parallelism there. Is this the right direction?
>> Please let me know your thoughts.
>>
>> Thanks
>> Mani
>>
>> On Wed, 2015-06-03 at 01:49 -0700, Abbas Butt wrote:
>>
>> On Tue, Jun 2, 2015 at 10:17 PM, Manikandan Soundarapandian <ma...@vt...> wrote:
>>
>> Hi Abbas,
>>
>> Thanks for the reply. I understand how pgxc_node_receive works to
>> get incoming data from the datanodes.
>> Yes, I have made experiments to study the Postgres-XC system in a setup
>> with 12 datanodes and a coordinator, and performed join, group by and
>> order by operations on two tables with around 20 million and 10 million
>> records respectively. I have found that around 20-30% of the time is
>> spent in the FetchTuple method, which reads into the provided tuple slot
>> one datarow at a time from the combiner's buffer, which is in turn filled
>> by the pgxc_node_receive method.
>>
>> The XC planner pushes parts of query processing to the datanodes to speed
>> things up.
>>
>> In the queries you tried, did the query planner perform this optimization?
>>
>> If not, then you can gain an advantage by improving the planner to generate
>> optimized plans that avoid unnecessary materialization at the coordinator.
>>
>> This will bring the FetchTuple time down, since there will be fewer
>> tuples to fetch.
>>
>> If the planner is already producing optimized plans, then what's your proposal?
>>
>> Thanks
>>
>> On Wed, 2015-06-03 at 09:45 +0500, Abbas Butt wrote:
>> >
>> > On Wed, Jun 3, 2015 at 4:11 AM, Manikandan Soundarapandian <ma...@vt...> wrote:
>> > Hi,
>> >
>> > I am a graduate student working on my research in parallel
>> > databases. I would like to know how the Postgres-XC
>> > coordinator works. I understand that the datanodes run the
>> > query in parallel and the results are collected by the
>> > coordinator, which runs any further computation that is required,
>> > or just provides the output to the client that requested the
>> > query. I would like to know whether the coordinator does this
>> > data collection from the datanodes in a sequential fashion.
>> >
>> > The coordinator uses multiplexed IO, using select on all fds of the
>> > datanodes. For more details please see the pgxc_node_receive function in
>> > pgxcnode.c.
>> > The loop for reading data on all set fds is sequential,
>> > but the coordinator does not wait for data from the datanode to which
>> > the coordinator had sent the query first.
>> >
>> > For example, let's consider that we want to run a query on table
>> > table_x, which is hash distributed among 10 datanodes:
>> > select count(*) from table_x;
>> > Each datanode will run the query and give its local count,
>> > and the coordinator has to collect the individual counts and
>> > come up with the final count before sending the output. Is the
>> > data collection process at the coordinator done in a
>> > sequential fashion? I am actually looking to introduce some
>> > kind of parallelism in this data collection if it is
>> > sequential, and do performance studies. Please clarify.
>> >
>> > To improve the performance of any system, first study the bottleneck, and
>> > target widening that. Have you done any study of Postgres-XC to find
>> > where the performance bottleneck is?
>> >
>> > --
>> > Thanks
>> > Mani
>> > Department of Computer Science
>> > Virginia Tech
>> >
>> ------------------------------------------------------------------------------
>> >
>> > _______________________________________________
>> > Postgres-xc-developers mailing list
>> > Pos...@li...
>> > https://lists.sourceforge.net/lists/listinfo/postgres-xc-developers
>> >
>> > --
>> > Abbas
>> > Architect
>> >
>> > Ph: 92.334.5100153
>> > Skype ID: gabbasb
>> > www.enterprisedb.com
>> >
>> > Follow us on Twitter
>> > @EnterpriseDB
>> >
>> > Visit EnterpriseDB for tutorials, webinars, whitepapers and more

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
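[A side note on the count(*) example discussed above: conceptually the coordinator combines per-datanode partial aggregates. The combine step can be illustrated in plain SQL as a sketch; the per-node partial counts below are hypothetical values, and Postgres-XC performs this combination internally in its executor, not via a rewritten query.]

```sql
-- Each datanode computes a partial count over its hash partition:
--   SELECT count(*) AS partial_count FROM table_x;   -- runs on each node
-- Suppose the 10 datanodes return these partial counts (hypothetical):
WITH datanode_partials(partial_count) AS (
    VALUES (102), (98), (101), (99), (100), (97), (103), (100), (98), (102)
)
-- The coordinator's combine step is then equivalent to:
SELECT sum(partial_count) AS total_count FROM datanode_partials;
-- total_count = 1000
```

The combine step itself is trivially cheap; as noted in the thread, the cost worth studying is how fast the partial results can be drained from the datanode connections.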