From: Manikandan S. <ma...@vt...> - 2015-06-04 06:42:04
Yes, both the tables are distributed by 'id'.
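
(For reference, a sketch of the DDL; the column types here are
hypothetical, only the DISTRIBUTE BY clause matters:

    -- column types are illustrative; only the distribution clause matters
    CREATE TABLE Table1 (id integer, col1 integer, col2 integer)
        DISTRIBUTE BY HASH (id);
    CREATE TABLE Table2 (id integer, time integer)
        DISTRIBUTE BY HASH (id);
)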
On Thu, 2015-06-04 at 12:08 +0530, Ashutosh Bapat wrote:
>
>
> On Thu, Jun 4, 2015 at 6:38 AM, Abbas Butt
> <abb...@en...> wrote:
>
>
> On Wed, Jun 3, 2015 at 5:10 PM, Manikandan Soundarapandian
> <ma...@vt...> wrote:
> Hi Abbas,
>
> I just looked at the plan for the query that I am
> running. This is the query:
>
>     SELECT d.col1 RVALUE, ROUND(COUNT(d.col2)/10) COUNT
>     FROM Table1 d, Table2 i
>     WHERE d.id = i.id
>       AND i.time BETWEEN TRUNC(CURRENT_DATE - TO_DATE('01 Jan 2000','DD Mon YYYY') + 1 - 365)
>                      AND TRUNC(CURRENT_DATE - TO_DATE('01 Jan 2000','DD Mon YYYY'))
>     GROUP BY d.col1
>     ORDER BY d.col1 DESC;
>
> And here is the query plan for a configuration with 4
> datanodes and 1 coordinator:
>
>     GroupAggregate  (cost=0.07..0.10 rows=1 width=56)
>       ->  Sort  (cost=0.07..0.08 rows=1 width=56)
>             Sort Key: d.col1
>             ->  Hash Join  (cost=0.01..0.06 rows=1 width=56)
>                   Hash Cond: (d.id = i.id)
>
>
> Are both the tables distributed by id?
>
> The clauses with mutable functions like current_date are being pushed
> down to the individual relations. That prevents the join from being
> pushed down to the datanodes even when id is the distribution key. I
> think that should be fixed: such clauses should be moved back up to
> the Join node.
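>
> (As an illustration of a workaround I would expect to help until
> that is fixed, not something I have tested on your setup: if the
> date bounds are computed once on the client and passed in as
> constants, no mutable function is left in the qual, so nothing
> should stop the whole join from being shipped. A sketch, using
> hypothetical psql variables for the precomputed bounds:
>
>     -- placeholder values; compute the real bounds from CURRENT_DATE client-side
>     \set lower_bound 5249
>     \set upper_bound 5613
>     SELECT d.col1 RVALUE, ROUND(COUNT(d.col2)/10) COUNT
>     FROM Table1 d, Table2 i
>     WHERE d.id = i.id
>       AND i.time BETWEEN :lower_bound AND :upper_bound
>     GROUP BY d.col1
>     ORDER BY d.col1 DESC;
> )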
>
>
>                   ->  Data Node Scan on Table1 "_REMOTE_TABLE_QUERY_"  (cost=0.00..0.00 rows=1000 width=56)
>                         Node/s: datanode_c2_d1, datanode_c3_d1, datanode_c4_d1, datanode_c5_d1
>                   ->  Hash  (cost=0.00..0.00 rows=1000 width=32)
>                         ->  Data Node Scan on Table2 "_REMOTE_TABLE_QUERY_"  (cost=0.00..0.00 rows=1000 width=32)
>                               Node/s: datanode_c2_d1, datanode_c3_d1, datanode_c4_d1, datanode_c5_d1
>                               Coordinator quals: (((i.time)::double precision <= trunc(((('now'::cstring)::date - to_date('01 Jan 2000'::text, 'DD Mon YYYY'::text)))::double precision)) AND ((i.time)::double precision >= trunc(((((('now'::cstring)::date - to_date('01 Jan 2000'::text, 'DD Mon YYYY'::text)) + 1) - 365))::double precision)))
>
>
> Table1 contains 22 million records and Table2 contains
> 10 million records. This query took almost 35 minutes to
> return results, and about one third of that time was spent
> in FetchTuple. Do you see something wrong with the query
> plan?
>
>
>
> Can you provide the output of
>
> \d+ Table1
> \d+ Table2
>
> I am interested in the distribution column of each table.
>
> Also, please provide the output of EXPLAIN VERBOSE on the
> query.
>
>
>
>
> Since the multiplexed IO isn't the real bottleneck
> here, my proposal is to change how the TupleStore at
> the coordinator is populated and to introduce
> parallelism there. Is this the right direction? Please
> let me know your thoughts.
>
> Thanks
> Mani
>
>
> On Wed, 2015-06-03 at 01:49 -0700, Abbas Butt wrote:
> >
> >
> > On Tue, Jun 2, 2015 at 10:17 PM, Manikandan
> > Soundarapandian <ma...@vt...> wrote:
> > Hi Abbas,
> >
> > Thanks for the reply. I understand how pgxc_node_receive
> > works to get incoming data from the datanodes.
> > Yes, I have run experiments to study the Postgres-XC
> > system in a setup with 12 datanodes and a coordinator,
> > and performed join, group by and order by operations on
> > two tables with around 20 million and 10 million records
> > respectively. I have found that around 20-30% of the
> > time is spent in the FetchTuple method, which reads one
> > datarow at a time into the provided tupleslot from the
> > combiner's buffer, which is in turn filled by the
> > pgxc_node_receive method.
> >
> > The XC planner pushes parts of query processing down to
> > the datanodes to speed things up.
> >
> > In the queries you tried, did the query planner
> > perform this optimization?
> >
> > If not, then you can gain an advantage by improving the
> > planner to generate optimized plans that avoid
> > unnecessary materialization at the coordinator.
> >
> > This will bring the FetchTuple time down, since there
> > will be fewer tuples to fetch.
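> >
> > (A quick way to check, as a sketch only; the table and
> > column names here are placeholders:
> >
> >     EXPLAIN VERBOSE
> >     SELECT col1, count(*) FROM some_table GROUP BY col1;
> >
> > If the plan is a single Data Node Scan whose remote query
> > already contains the aggregate and GROUP BY, the work was
> > shipped to the datanodes; if you instead see a
> > GroupAggregate above the Data Node Scans, or Coordinator
> > quals attached to them, that part of the work is being
> > done at the coordinator.)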
> >
> >
> > If the planner is already producing optimized plans,
> > then what's your proposal?
> >
> >
> >
> >
> > Thanks
> >
> > On Wed, 2015-06-03 at 09:45 +0500, Abbas Butt wrote:
> > >
> > > On Wed, Jun 3, 2015 at 4:11 AM, Manikandan Soundarapandian
> > > <ma...@vt...> wrote:
> > > Hi,
> > >
> > > I am a graduate student working on my research in
> > > parallel databases. I would like to know how the
> > > Postgres-XC coordinator works. I understand that the
> > > datanodes run the query in parallel and the results
> > > are collected by the coordinator, which runs any
> > > further computation that is required, or simply
> > > returns the output to the client that requested the
> > > query. I would like to know whether the coordinator
> > > does this data collection from the datanodes in a
> > > sequential fashion.
> > >
> > >
> > > The coordinator uses multiplexed IO, using select on
> > > all fds of the datanodes. For more details please see
> > > the pgxc_node_receive function in pgxcnode.c. The loop
> > > for reading data on all set fds is sequential, but the
> > > coordinator does not wait for data from the datanode to
> > > which it sent the query first.
> > >
> > > For example, let's consider that we want to run the
> > > following query on table table_x, which is hash
> > > distributed among 10 datanodes:
> > >
> > >     select count(*) from table_x;
> > >
> > > Each datanode will run the query and give its local
> > > count, and the coordinator has to collect the
> > > individual counts and come up with the final count
> > > before sending the output. Is the data collection
> > > process at the coordinator done in a sequential
> > > fashion? I am actually looking to introduce some kind
> > > of parallelism in this data collection, if it is
> > > sequential, and do performance studies. Please clarify.
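> > >
> > > (To make the split of work concrete, here is a sketch
> > > only; the per-node counts shown are made-up numbers:
> > >
> > >     -- each of the 10 datanodes runs the pushed-down query locally:
> > >     select count(*) from table_x;
> > >
> > >     -- the coordinator then conceptually computes, over the collected
> > >     -- per-node results (three of the ten shown, values invented):
> > >     SELECT sum(partial_count)
> > >     FROM (VALUES (2000000), (1800000), (2200000)) AS t(partial_count);
> > >
> > > My question is whether those per-node results are
> > > collected one datanode at a time.)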
> > >
> > >
> > > To improve the performance of any system, first find
> > > the bottleneck, and then work to widen it. Have you
> > > done any study of Postgres-XC to find where the
> > > performance bottleneck is?
> > >
> > >
> > >
> > > --
> > > Thanks
> > > Mani
> > > Department of Computer Science
> > > Virginia Tech
> > >
> > >
>
>
>
>
>
> --
> Abbas
> Architect
>
> Ph: 92.334.5100153
> Skype ID: gabbasb
> www.enterprisedb.com
>
> Follow us on Twitter @EnterpriseDB
>
> Visit EnterpriseDB for tutorials, webinars, whitepapers and more
>
> ------------------------------------------------------------------------------
>
> _______________________________________________
> Postgres-xc-developers mailing list
> Pos...@li...
> https://lists.sourceforge.net/lists/listinfo/postgres-xc-developers
>
>
>
>
> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company
>