|
From: Nick M. <nm...@gm...> - 2012-08-22 19:22:02
|
All,
I am trying to run the following query which never finishes running:
SELECT sub_list.*
FROM sub_list
JOIN list AS listquery
ON listquery.sub_list_id = sub_list.sub_list_id
JOIN parent AS parentquery
ON parentquery.list_id = listquery.list_id
WHERE parentquery.time > 20000 AND
parentquery.time < 30000;
Please excuse my dumb table and column names this is just an example with
the same structure as a real query of mine.
CREATE TABLE parent (
name text,
time bigint,
list_id bigserial
);
CREATE TABLE list (
list_id bigint,
name text,
time bigint,
sub_list_id bigserial
);
CREATE TABLE sub_list (
sub_list_id bigint,
element_name text,
element_value numeric,
time bigint
);
I have tried rewriting the same query in the following ways, none of them
finish running:
SELECT sub_list.*
FROM sub_list,
(SELECT list.sub_list_id
FROM list,
(SELECT list_id
FROM parent
WHERE time > 20000 AND
time < 30000
) AS parentquery
WHERE list.list_id = parentquery.list_id
) AS listquery
WHERE sub_list.sub_list_id = listquery.sub_list_id;
SELECT sub_list.*
FROM sub_list
WHERE sub_list_id IN (SELECT sub_list_id
FROM list
WHERE list_id IN (SELECT list_id
FROM parent
WHERE time > 20000 AND
time < 30000);
By "not finishing running" i mean that i run them in psql and they hang
there, i notice that my datanode processes (postgres process name) are
pegged at 100% CPU usage, but iotop doesn't show any disk activity. When i
try to run EXPLAIN ANALYZE i get the same results and have to Ctrl-C out.
My parent table has 1 million rows, my list table has 10 million rows, and
my sub_list_table has 100 million rows.
I am able to run all of these queries just fine on regular Postgres with
the same amount of data in the tables.
Any help is greatly appreciated,
Nick
|
|
From: Magorn <ma...@gm...> - 2012-08-22 20:35:50
|
Hi, Do you have indexes on your tables ? Can you show the explain plan with your regular database ? Regards, On Wed, Aug 22, 2012 at 9:21 PM, Nick Maludy <nm...@gm...> wrote: > All, > > I am trying to run the following query which never finishes running: > > SELECT sub_list.* > FROM sub_list > JOIN list AS listquery > ON listquery.sub_list_id = sub_list.sub_list_id > JOIN parent AS parentquery > ON parentquery.list_id = listquery.list_id > WHERE parentquery.time > 20000 AND > parentquery.time < 30000; > > Please excuse my dumb table and column names this is just an example with > the same structure as a real query of mine. > > CREATE TABLE parent ( > name text, > time bigint, > list_id bigserial > ); > > CREATE TABLE list ( > list_id bigint, > name text, > time bigint, > sub_list_id bigserial > ); > > CREATE TABLE sub_list ( > sub_list_id bigint, > element_name text, > element_value numeric, > time bigint > ); > > I have tried rewriting the same query in the following ways, none of them > finish running: > > SELECT sub_list.* > FROM sub_list, > (SELECT list.sub_list_id > FROM list, > (SELECT list_id > FROM parent > WHERE time > 20000 AND > time < 30000 > ) AS parentquery > WHERE list.list_id = parentquery.list_id > ) AS listquery > WHERE sub_list.sub_list_id = listquery.sub_list_id; > > SELECT sub_list.* > FROM sub_list > WHERE sub_list_id IN (SELECT sub_list_id > FROM list > WHERE list_id IN (SELECT list_id > FROM parent > WHERE time > 20000 AND > time < 30000); > > By "not finishing running" i mean that i run them in psql and they hang > there, i notice that my datanode processes (postgres process name) are > pegged at 100% CPU usage, but iotop doesn't show any disk activity. When i > try to run EXPLAIN ANALYZE i get the same results and have to Ctrl-C out. > > My parent table has 1 million rows, my list table has 10 million rows, and > my sub_list_table has 100 million rows. > > I am able to run all of these queries just fine on regular Postgres with > the same amount of data in the tables. > > Any help is greatly appreciated, > Nick > > > ------------------------------------------------------------------------------ > Live Security Virtual Conference > Exclusive live event will cover all the ways today's security and > threat landscape has changed and how IT managers can respond. Discussions > will include endpoint security, mobile security and the latest in malware > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ > _______________________________________________ > Postgres-xc-general mailing list > Pos...@li... > https://lists.sourceforge.net/lists/listinfo/postgres-xc-general > > -- Magorn |
|
From: Mason S. <ma...@st...> - 2012-08-22 20:53:44
|
On Wed, Aug 22, 2012 at 3:21 PM, Nick Maludy <nm...@gm...> wrote: > All, > > I am trying to run the following query which never finishes running: > > SELECT sub_list.* > FROM sub_list > JOIN list AS listquery > ON listquery.sub_list_id = sub_list.sub_list_id > JOIN parent AS parentquery > ON parentquery.list_id = listquery.list_id > WHERE parentquery.time > 20000 AND > parentquery.time < 30000; > > Please excuse my dumb table and column names this is just an example with > the same structure as a real query of mine. > > CREATE TABLE parent ( > name text, > time bigint, > list_id bigserial > ); > > CREATE TABLE list ( > list_id bigint, > name text, > time bigint, > sub_list_id bigserial > ); > > CREATE TABLE sub_list ( > sub_list_id bigint, > element_name text, > element_value numeric, > time bigint > ); > > I have tried rewriting the same query in the following ways, none of them > finish running: > > SELECT sub_list.* > FROM sub_list, > (SELECT list.sub_list_id > FROM list, > (SELECT list_id > FROM parent > WHERE time > 20000 AND > time < 30000 > ) AS parentquery > WHERE list.list_id = parentquery.list_id > ) AS listquery > WHERE sub_list.sub_list_id = listquery.sub_list_id; > > SELECT sub_list.* > FROM sub_list > WHERE sub_list_id IN (SELECT sub_list_id > FROM list > WHERE list_id IN (SELECT list_id > FROM parent > WHERE time > 20000 AND > time < 30000); > > By "not finishing running" i mean that i run them in psql and they hang > there, i notice that my datanode processes (postgres process name) are > pegged at 100% CPU usage, but iotop doesn't show any disk activity. When i > try to run EXPLAIN ANALYZE i get the same results and have to Ctrl-C out. > > My parent table has 1 million rows, my list table has 10 million rows, and > my sub_list_table has 100 million rows. > > I am able to run all of these queries just fine on regular Postgres with the > same amount of data in the tables. > > Any help is greatly appreciated, Please add DISTRIBUTE BY HASH(list_id) as a clause at the end of your CREATE TABLE statement. Otherwise what it is doing is pulling up a lot of data to the coordinator for joining. Also, let us know what indexes you have created. > Nick > > ------------------------------------------------------------------------------ > Live Security Virtual Conference > Exclusive live event will cover all the ways today's security and > threat landscape has changed and how IT managers can respond. Discussions > will include endpoint security, mobile security and the latest in malware > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ > _______________________________________________ > Postgres-xc-general mailing list > Pos...@li... > https://lists.sourceforge.net/lists/listinfo/postgres-xc-general > -- Mason Sharp StormDB - http://www.stormdb.com The Database Cloud |
|
From: Nick M. <nm...@gm...> - 2012-08-22 20:59:34
|
Sorry, yes i forgot to including my indexes, they are as follows:
// parent indexes
CREATE INDEX parent_list_id_index ON parent(list_id);
CREATE INDEX parent_time_index ON parent(time);
// list indexes
CREATE INDEX list_list_id_index ON list(list_id);
CREATE INDEX list_sub_list_id_index ON list(sub_list_id);
// sub list indexes
CREATE INDEX sub_list_sub_list_id_index ON sub_list(sub_list_id);
*EXPLAIN ANALYZE from regular Postgres (8.4):*
test_db=# EXPLAIN ANALYZE
SELECT sub_list.*
FROM sub_list
JOIN list AS listquery
ON listquery.sub_list_id = sub_list.sub_list_id
JOIN parent AS parentquery
ON parentquery.list_id = listquery.list_id
WHERE parentquery.time > 20000 AND
parentquery.time < 30000;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------
Nested Loop (cost=596.23..102441.82 rows=29973 width=253) (actual
time=25.015..488.914 rows=39996 loops=1)
-> Hash Join (cost=596.23..51970.75 rows=20602 width=8) (actual
time=25.002..446.462 rows=19998 loops=1)
Hash Cond: (listquery.list_id = parentquery.list_id)
-> Seq Scan on list listquery (cost=0.00..35067.80 rows=1840080
width=16) (actual time=0.055..160.550 rows=2000000 loops=1)
-> Hash (cost=456.28..456.28 rows=11196 width=8) (actual
time=14.105..14.105 rows=9999 loops=1)
-> Index Scan using parent_time_index on parent parentquery
(cost=0.00..456.28 rows=11196 width=8) (actual time=0.061..8.450 rows=9999
l
oops=1)
Index Cond: ((time > 20000) AND (time < 30000))
-> Index Scan using sub_list_sub_list_id_index on sub_list
(cost=0.00..2.42 rows=2 width=253) (actual time=0.001..0.
002 rows=2 loops=19998)
Index Cond: (sub_list.sub_list_id = listquery.sub_list_id)
Total runtime: 491.447 ms
*///////////////////////////////////////*
*
*
test_db=# EXPLAIN ANALYZE
SELECT sub_list.*
FROM sub_list,
(SELECT list.sub_list_id
FROM list,
(SELECT list_id
FROM parent
WHERE time > 20000 AND
time < 30000
) AS parentquery
WHERE list.list_id = parentquery.list_id
) AS listquery
WHERE sub_list.sub_list_id = listquery.sub_list_id;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------
Nested Loop (cost=596.23..102441.82 rows=29973 width=253) (actual
time=25.493..494.204 rows=39996 loops=1)
-> Hash Join (cost=596.23..51970.75 rows=20602 width=8) (actual
time=25.479..452.275 rows=19998 loops=1)
Hash Cond: (list.list_id = parent.list_id)
-> Seq Scan on list (cost=0.00..35067.80 rows=1840080 width=16)
(actual time=0.051..161.849 rows=2000000 loops=1)
-> Hash (cost=456.28..456.28 rows=11196 width=8) (actual
time=15.444..15.444 rows=9999 loops=1)
-> Index Scan using parent_time_index on parent
(cost=0.00..456.28 rows=11196 width=8) (actual time=0.046..9.568 rows=9999
loops=1)
Index Cond: ((time > 20000) AND (time < 30000))
-> Index Scan using sub_list_sub_list_id_index on sub_list
(cost=0.00..2.42 rows=2 width=253) (actual time=0.001..0.
002 rows=2 loops=19998)
Index Cond: (sub_list.sub_list_id = list.sub_list_id)
Total runtime: 496.729 ms
*///////////////////////////////////////*
test_db=# EXPLAIN ANALYZE
SELECT sub_list.*
FROM sub_list
WHERE sub_list_id IN (SELECT sub_list_id
FROM list
WHERE list_id IN (SELECT list_id
FROM parent
WHERE time > 20000 AND
time < 30000
)
);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
----
Hash Semi Join (cost=42074.51..123747.06 rows=47078 width=253) (actual
time=460.406..1376.375 rows=39996 loops=1)
Hash Cond: (sub_list.sub_list_id = list.sub_list_id)
-> Seq Scan on sub_list (cost=0.00..72183.31 rows=2677131 width=253)
(actual time=0.080..337.862 rows=4000000 loops=1)
-> Hash (cost=41781.08..41781.08 rows=23475 width=8) (actual
time=439.640..439.640 rows=19998 loops=1)
-> Hash Semi Join (cost=596.23..41781.08 rows=23475 width=8)
(actual time=19.608..436.503 rows=19998 loops=1)
Hash Cond: (list.list_id = parent.list_id)
-> Seq Scan on list (cost=0.00..35067.80 rows=1840080
width=16) (actual time=0.022..159.590 rows=2000000 loops=1)
-> Hash (cost=456.28..456.28 rows=11196 width=8) (actual
time=9.989..9.989 rows=9999 loops=1)
-> Index Scan using parent_time_index on parent
(cost=0.00..456.28 rows=11196 width=8) (actual time=0.046..6.125 rows=9999
loops
=1)
Index Cond: ((base_time > 20000) AND (base_time
< 30000))
Total runtime: 1378.711 ms
Regards,
Nick
On Wed, Aug 22, 2012 at 4:53 PM, Mason Sharp <ma...@st...> wrote:
> On Wed, Aug 22, 2012 at 3:21 PM, Nick Maludy <nm...@gm...> wrote:
> > All,
> >
> > I am trying to run the following query which never finishes running:
> >
> > SELECT sub_list.*
> > FROM sub_list
> > JOIN list AS listquery
> > ON listquery.sub_list_id = sub_list.sub_list_id
> > JOIN parent AS parentquery
> > ON parentquery.list_id = listquery.list_id
> > WHERE parentquery.time > 20000 AND
> > parentquery.time < 30000;
> >
> > Please excuse my dumb table and column names this is just an example with
> > the same structure as a real query of mine.
> >
> > CREATE TABLE parent (
> > name text,
> > time bigint,
> > list_id bigserial
> > );
> >
> > CREATE TABLE list (
> > list_id bigint,
> > name text,
> > time bigint,
> > sub_list_id bigserial
> > );
> >
> > CREATE TABLE sub_list (
> > sub_list_id bigint,
> > element_name text,
> > element_value numeric,
> > time bigint
> > );
> >
> > I have tried rewriting the same query in the following ways, none of them
> > finish running:
> >
> > SELECT sub_list.*
> > FROM sub_list,
> > (SELECT list.sub_list_id
> > FROM list,
> > (SELECT list_id
> > FROM parent
> > WHERE time > 20000 AND
> > time < 30000
> > ) AS parentquery
> > WHERE list.list_id = parentquery.list_id
> > ) AS listquery
> > WHERE sub_list.sub_list_id = listquery.sub_list_id;
> >
> > SELECT sub_list.*
> > FROM sub_list
> > WHERE sub_list_id IN (SELECT sub_list_id
> > FROM list
> > WHERE list_id IN (SELECT list_id
> > FROM parent
> > WHERE time > 20000 AND
> > time < 30000);
> >
> > By "not finishing running" i mean that i run them in psql and they hang
> > there, i notice that my datanode processes (postgres process name) are
> > pegged at 100% CPU usage, but iotop doesn't show any disk activity. When
> i
> > try to run EXPLAIN ANALYZE i get the same results and have to Ctrl-C out.
> >
> > My parent table has 1 million rows, my list table has 10 million rows,
> and
> > my sub_list_table has 100 million rows.
> >
> > I am able to run all of these queries just fine on regular Postgres with
> the
> > same amount of data in the tables.
> >
> > Any help is greatly appreciated,
>
> Please add DISTRIBUTE BY HASH(list_id) as a clause at the end of your
> CREATE TABLE statement. Otherwise what it is doing is pulling up a lot
> of data to the coordinator for joining.
>
> Also, let us know what indexes you have created.
>
> > Nick
> >
> >
> ------------------------------------------------------------------------------
> > Live Security Virtual Conference
> > Exclusive live event will cover all the ways today's security and
> > threat landscape has changed and how IT managers can respond. Discussions
> > will include endpoint security, mobile security and the latest in malware
> > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
> > _______________________________________________
> > Postgres-xc-general mailing list
> > Pos...@li...
> > https://lists.sourceforge.net/lists/listinfo/postgres-xc-general
> >
>
>
>
> --
> Mason Sharp
>
> StormDB - http://www.stormdb.com
> The Database Cloud
>
|
|
From: Nick M. <nm...@gm...> - 2012-08-22 21:19:08
|
Mason, I tried adding the DISTRIBUTE BY HASH() and got the same results. Below are my new table definitions: CREATE TABLE parent ( name text, time bigint, list_id bigserial ) DISTRIBUTE BY HASH (list_id); CREATE TABLE list ( list_id bigint, name text, time bigint, sub_list_id bigserial ) DISTRIBUTE BY HASH (list_id); CREATE TABLE sub_list ( sub_list_id bigint, element_name text, element_value numeric, time bigint ) DISTRIBUTE BY HASH (sub_list_id); -Nick On Wed, Aug 22, 2012 at 4:58 PM, Nick Maludy <nm...@gm...> wrote: > Sorry, yes i forgot to including my indexes, they are as follows: > > // parent indexes > CREATE INDEX parent_list_id_index ON parent(list_id); > CREATE INDEX parent_time_index ON parent(time); > > // list indexes > CREATE INDEX list_list_id_index ON list(list_id); > CREATE INDEX list_sub_list_id_index ON list(sub_list_id); > > // sub list indexes > CREATE INDEX sub_list_sub_list_id_index ON sub_list(sub_list_id); > > *EXPLAIN ANALYZE from regular Postgres (8.4):* > > test_db=# EXPLAIN ANALYZE > SELECT sub_list.* > FROM sub_list > JOIN list AS listquery > ON listquery.sub_list_id = sub_list.sub_list_id > JOIN parent AS parentquery > ON parentquery.list_id = listquery.list_id > WHERE parentquery.time > 20000 AND > parentquery.time < 30000; > > > QUERY PLAN > > > > -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- > ------------------------ > Nested Loop (cost=596.23..102441.82 rows=29973 width=253) (actual > time=25.015..488.914 rows=39996 loops=1) > -> Hash Join (cost=596.23..51970.75 rows=20602 width=8) (actual > time=25.002..446.462 rows=19998 loops=1) > Hash Cond: (listquery.list_id = parentquery.list_id) > -> Seq Scan on list listquery (cost=0.00..35067.80 rows=1840080 > width=16) (actual time=0.055..160.550 rows=2000000 loops=1) > -> Hash (cost=456.28..456.28 rows=11196 width=8) (actual > time=14.105..14.105 rows=9999 loops=1) > -> Index Scan using parent_time_index on parent > parentquery (cost=0.00..456.28 rows=11196 width=8) (actual > time=0.061..8.450 rows=9999 l > oops=1) > Index Cond: ((time > 20000) AND (time < 30000)) > -> Index Scan using sub_list_sub_list_id_index on sub_list > (cost=0.00..2.42 rows=2 width=253) (actual time=0.001..0. > 002 rows=2 loops=19998) > Index Cond: (sub_list.sub_list_id = listquery.sub_list_id) > Total runtime: 491.447 ms > > *///////////////////////////////////////* > * > * > test_db=# EXPLAIN ANALYZE > SELECT sub_list.* > FROM sub_list, > (SELECT list.sub_list_id > FROM list, > (SELECT list_id > FROM parent > WHERE time > 20000 AND > time < 30000 > ) AS parentquery > WHERE list.list_id = parentquery.list_id > ) AS listquery > WHERE sub_list.sub_list_id = listquery.sub_list_id; > > QUERY PLAN > > > > -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- > ------------------------ > Nested Loop (cost=596.23..102441.82 rows=29973 width=253) (actual > time=25.493..494.204 rows=39996 loops=1) > -> Hash Join (cost=596.23..51970.75 rows=20602 width=8) (actual > time=25.479..452.275 rows=19998 loops=1) > Hash Cond: (list.list_id = parent.list_id) > -> Seq Scan on list (cost=0.00..35067.80 rows=1840080 width=16) > (actual time=0.051..161.849 rows=2000000 loops=1) > -> Hash (cost=456.28..456.28 rows=11196 width=8) (actual > time=15.444..15.444 rows=9999 loops=1) > -> Index Scan using parent_time_index on parent > (cost=0.00..456.28 rows=11196 width=8) (actual time=0.046..9.568 rows=9999 > loops=1) > Index Cond: ((time > 20000) AND (time < 30000)) > -> Index Scan using sub_list_sub_list_id_index on sub_list > (cost=0.00..2.42 rows=2 width=253) (actual time=0.001..0. > 002 rows=2 loops=19998) > Index Cond: (sub_list.sub_list_id = list.sub_list_id) > Total runtime: 496.729 ms > > *///////////////////////////////////////* > > test_db=# EXPLAIN ANALYZE > SELECT sub_list.* > FROM sub_list > WHERE sub_list_id IN (SELECT sub_list_id > FROM list > WHERE list_id IN (SELECT list_id > FROM parent > WHERE time > 20000 AND > time < 30000 > ) > ); > > QUERY PLAN > > > > -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- > ---- > Hash Semi Join (cost=42074.51..123747.06 rows=47078 width=253) (actual > time=460.406..1376.375 rows=39996 loops=1) > Hash Cond: (sub_list.sub_list_id = list.sub_list_id) > -> Seq Scan on sub_list (cost=0.00..72183.31 rows=2677131 width=253) > (actual time=0.080..337.862 rows=4000000 loops=1) > -> Hash (cost=41781.08..41781.08 rows=23475 width=8) (actual > time=439.640..439.640 rows=19998 loops=1) > -> Hash Semi Join (cost=596.23..41781.08 rows=23475 width=8) > (actual time=19.608..436.503 rows=19998 loops=1) > Hash Cond: (list.list_id = parent.list_id) > -> Seq Scan on list (cost=0.00..35067.80 rows=1840080 > width=16) (actual time=0.022..159.590 rows=2000000 loops=1) > -> Hash (cost=456.28..456.28 rows=11196 width=8) (actual > time=9.989..9.989 rows=9999 loops=1) > -> Index Scan using parent_time_index on parent > (cost=0.00..456.28 rows=11196 width=8) (actual time=0.046..6.125 rows=9999 > loops > =1) > Index Cond: ((base_time > 20000) AND (base_time > < 30000)) > Total runtime: 1378.711 ms > > Regards, > Nick > > On Wed, Aug 22, 2012 at 4:53 PM, Mason Sharp <ma...@st...> wrote: > >> On Wed, Aug 22, 2012 at 3:21 PM, Nick Maludy <nm...@gm...> wrote: >> > All, >> > >> > I am trying to run the following query which never finishes running: >> > >> > SELECT sub_list.* >> > FROM sub_list >> > JOIN list AS listquery >> > ON listquery.sub_list_id = sub_list.sub_list_id >> > JOIN parent AS parentquery >> > ON parentquery.list_id = listquery.list_id >> > WHERE parentquery.time > 20000 AND >> > parentquery.time < 30000; >> > >> > Please excuse my dumb table and column names this is just an example >> with >> > the same structure as a real query of mine. >> > >> > CREATE TABLE parent ( >> > name text, >> > time bigint, >> > list_id bigserial >> > ); >> > >> > CREATE TABLE list ( >> > list_id bigint, >> > name text, >> > time bigint, >> > sub_list_id bigserial >> > ); >> > >> > CREATE TABLE sub_list ( >> > sub_list_id bigint, >> > element_name text, >> > element_value numeric, >> > time bigint >> > ); >> > >> > I have tried rewriting the same query in the following ways, none of >> them >> > finish running: >> > >> > SELECT sub_list.* >> > FROM sub_list, >> > (SELECT list.sub_list_id >> > FROM list, >> > (SELECT list_id >> > FROM parent >> > WHERE time > 20000 AND >> > time < 30000 >> > ) AS parentquery >> > WHERE list.list_id = parentquery.list_id >> > ) AS listquery >> > WHERE sub_list.sub_list_id = listquery.sub_list_id; >> > >> > SELECT sub_list.* >> > FROM sub_list >> > WHERE sub_list_id IN (SELECT sub_list_id >> > FROM list >> > WHERE list_id IN (SELECT list_id >> > FROM parent >> > WHERE time > 20000 AND >> > time < 30000); >> > >> > By "not finishing running" i mean that i run them in psql and they hang >> > there, i notice that my datanode processes (postgres process name) are >> > pegged at 100% CPU usage, but iotop doesn't show any disk activity. >> When i >> > try to run EXPLAIN ANALYZE i get the same results and have to Ctrl-C >> out. >> > >> > My parent table has 1 million rows, my list table has 10 million rows, >> and >> > my sub_list_table has 100 million rows. >> > >> > I am able to run all of these queries just fine on regular Postgres >> with the >> > same amount of data in the tables. >> > >> > Any help is greatly appreciated, >> >> Please add DISTRIBUTE BY HASH(list_id) as a clause at the end of your >> CREATE TABLE statement. Otherwise what it is doing is pulling up a lot >> of data to the coordinator for joining. >> >> Also, let us know what indexes you have created. >> >> > Nick >> > >> > >> ------------------------------------------------------------------------------ >> > Live Security Virtual Conference >> > Exclusive live event will cover all the ways today's security and >> > threat landscape has changed and how IT managers can respond. >> Discussions >> > will include endpoint security, mobile security and the latest in >> malware >> > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ >> > _______________________________________________ >> > Postgres-xc-general mailing list >> > Pos...@li... >> > https://lists.sourceforge.net/lists/listinfo/postgres-xc-general >> > >> >> >> >> -- >> Mason Sharp >> >> StormDB - http://www.stormdb.com >> The Database Cloud >> > > |
|
From: Michael P. <mic...@gm...> - 2012-08-23 00:22:55
|
I had a look at your table structures, and you have to do a double join as
you use 2 different keys to join if you use a hash distribution for all
your tables.
The first one (list_id) links parent and list, the second one (sub_list_id)
links list and sublist.
If you do that you will finish with such ressource consuming queries like
(this is similar to your query above):
postgres=# explain verbose select sub_list.* from sub_list, list, parent
where parent.list_id = list.list_id AND list.sub_list_id =
sub_list.sub_list_id AND parent.time > 20000 AND parent.time < 30000;
QUERY
PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-----------------------------------
Hash Join (cost=0.01..0.06 rows=1 width=80)
Output: sub_list.sub_list_id, sub_list.element_name,
sub_list.element_value, sub_list."time"
Hash Cond: (sub_list.sub_list_id = list.sub_list_id)
-> Data Node Scan on sub_list "_REMOTE_TABLE_QUERY_" (cost=0.00..0.00
rows=1000 width=80)
Output: sub_list.sub_list_id, sub_list.element_name,
sub_list.element_value, sub_list."time"
Node/s: dn1, dn2
Remote query: SELECT sub_list_id, element_name, element_value,
"time" FROM ONLY sub_list WHERE true
-> Hash (cost=0.00..0.00 rows=1000 width=8)
Output: list.sub_list_id
-> Data Node Scan on "_REMOTE_TABLE_QUERY_" (cost=0.00..0.00
rows=1000 width=8)
Output: list.sub_list_id
Node/s: dn1, dn2
Remote query: SELECT l.a_2 FROM ((SELECT list.list_id,
list.sub_list_id FROM ONLY list WHERE true) l(a_1, a_2) JOIN (SELECT
parent.list_id FROM ONLY parent WHERE ((parent."time" > 20000) AND
(parent."time" < 30000))) r(a_1
) ON (true)) WHERE (l.a_1 = r.a_1)
(13 rows)
The best advice I could give you here is to mix replicated and hash tables.
For example:
postgres=# alter table parent distribute by hash(list_id);
ALTER TABLE
postgres=# alter table list distribute by hash(list_id);
ALTER TABLE
postgres=# alter table sub_list distribute by replication;
ALTER TABLE
Or:
postgres=# alter table parent distribute by replication;
ALTER TABLE
postgres=# alter table list distribute by replication;
ALTER TABLE
postgres=# alter table sub_list distribute by hash(sub_list_id)
ALTER TABLE
If you do that you queries will have plans like this one:
postgres=# EXPLAIN
ANALYZE
SELECT sub_list.*
FROM sub_list,
(SELECT list.sub_list_id
FROM list,
(SELECT list_id
FROM parent
WHERE time > 20000 AND
time < 30000
) AS parentquery
WHERE list.list_id = parentquery.list_id
) AS listquery
WHERE sub_list.sub_list_id = listquery.sub_list_id;
QUERY
PLAN
----------------------------------------------------------------------------------------------------------------------------
Data Node Scan on "_REMOTE_TABLE_QUERY_" (cost=0.00..0.00 rows=1000
width=80) (actual time=19.633..19.633 rows=0 loops=1)
Node/s: dn1, dn2
Total runtime: 19.666 ms
This means that your query can be directly pushed to Datanodes, so you get
the best performance possible.
As a last advice, you need to choose as replicated the tables that have the
less number of writes. This depends on your application.
On Thu, Aug 23, 2012 at 6:18 AM, Nick Maludy <nm...@gm...> wrote:
> Mason,
>
> I tried adding the DISTRIBUTE BY HASH() and got the same results. Below
> are my new table definitions:
>
> CREATE TABLE parent (
> name text,
> time bigint,
> list_id bigserial
> ) DISTRIBUTE BY HASH (list_id);
>
> CREATE TABLE list (
> list_id bigint,
> name text,
> time bigint,
> sub_list_id bigserial
> ) DISTRIBUTE BY HASH (list_id);
>
> CREATE TABLE sub_list (
> sub_list_id bigint,
> element_name text,
> element_value numeric,
> time bigint
> ) DISTRIBUTE BY HASH (sub_list_id);
>
> -Nick
>
> On Wed, Aug 22, 2012 at 4:58 PM, Nick Maludy <nm...@gm...> wrote:
>
>> Sorry, yes i forgot to including my indexes, they are as follows:
>>
>> // parent indexes
>> CREATE INDEX parent_list_id_index ON parent(list_id);
>> CREATE INDEX parent_time_index ON parent(time);
>>
>> // list indexes
>> CREATE INDEX list_list_id_index ON list(list_id);
>> CREATE INDEX list_sub_list_id_index ON list(sub_list_id);
>>
>> // sub list indexes
>> CREATE INDEX sub_list_sub_list_id_index ON sub_list(sub_list_id);
>>
>> *EXPLAIN ANALYZE from regular Postgres (8.4):*
>>
>> test_db=# EXPLAIN ANALYZE
>> SELECT sub_list.*
>> FROM sub_list
>> JOIN list AS listquery
>> ON listquery.sub_list_id = sub_list.sub_list_id
>> JOIN parent AS parentquery
>> ON parentquery.list_id = listquery.list_id
>> WHERE parentquery.time > 20000 AND
>> parentquery.time < 30000;
>>
>>
>> QUERY PLAN
>>
>>
>>
>> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>> ------------------------
>> Nested Loop (cost=596.23..102441.82 rows=29973 width=253) (actual
>> time=25.015..488.914 rows=39996 loops=1)
>> -> Hash Join (cost=596.23..51970.75 rows=20602 width=8) (actual
>> time=25.002..446.462 rows=19998 loops=1)
>> Hash Cond: (listquery.list_id = parentquery.list_id)
>> -> Seq Scan on list listquery (cost=0.00..35067.80
>> rows=1840080 width=16) (actual time=0.055..160.550 rows=2000000 loops=1)
>> -> Hash (cost=456.28..456.28 rows=11196 width=8) (actual
>> time=14.105..14.105 rows=9999 loops=1)
>> -> Index Scan using parent_time_index on parent
>> parentquery (cost=0.00..456.28 rows=11196 width=8) (actual
>> time=0.061..8.450 rows=9999 l
>> oops=1)
>> Index Cond: ((time > 20000) AND (time < 30000))
>> -> Index Scan using sub_list_sub_list_id_index on sub_list
>> (cost=0.00..2.42 rows=2 width=253) (actual time=0.001..0.
>> 002 rows=2 loops=19998)
>> Index Cond: (sub_list.sub_list_id = listquery.sub_list_id)
>> Total runtime: 491.447 ms
>>
>> *///////////////////////////////////////*
>> *
>> *
>> test_db=# EXPLAIN ANALYZE
>> SELECT sub_list.*
>> FROM sub_list,
>> (SELECT list.sub_list_id
>> FROM list,
>> (SELECT list_id
>> FROM parent
>> WHERE time > 20000 AND
>> time < 30000
>> ) AS parentquery
>> WHERE list.list_id = parentquery.list_id
>> ) AS listquery
>> WHERE sub_list.sub_list_id = listquery.sub_list_id;
>>
>> QUERY PLAN
>>
>>
>>
>> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>> ------------------------
>> Nested Loop (cost=596.23..102441.82 rows=29973 width=253) (actual
>> time=25.493..494.204 rows=39996 loops=1)
>> -> Hash Join (cost=596.23..51970.75 rows=20602 width=8) (actual
>> time=25.479..452.275 rows=19998 loops=1)
>> Hash Cond: (list.list_id = parent.list_id)
>> -> Seq Scan on list (cost=0.00..35067.80 rows=1840080
>> width=16) (actual time=0.051..161.849 rows=2000000 loops=1)
>> -> Hash (cost=456.28..456.28 rows=11196 width=8) (actual
>> time=15.444..15.444 rows=9999 loops=1)
>> -> Index Scan using parent_time_index on parent
>> (cost=0.00..456.28 rows=11196 width=8) (actual time=0.046..9.568 rows=9999
>> loops=1)
>> Index Cond: ((time > 20000) AND (time < 30000))
>> -> Index Scan using sub_list_sub_list_id_index on sub_list
>> (cost=0.00..2.42 rows=2 width=253) (actual time=0.001..0.
>> 002 rows=2 loops=19998)
>> Index Cond: (sub_list.sub_list_id = list.sub_list_id)
>> Total runtime: 496.729 ms
>>
>> *///////////////////////////////////////*
>>
>> test_db=# EXPLAIN ANALYZE
>> SELECT sub_list.*
>> FROM sub_list
>> WHERE sub_list_id IN (SELECT sub_list_id
>> FROM list
>> WHERE list_id IN (SELECT list_id
>> FROM parent
>> WHERE time > 20000 AND
>> time < 30000
>> )
>> );
>>
>> QUERY PLAN
>>
>>
>>
>> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>> ----
>> Hash Semi Join (cost=42074.51..123747.06 rows=47078 width=253) (actual
>> time=460.406..1376.375 rows=39996 loops=1)
>> Hash Cond: (sub_list.sub_list_id = list.sub_list_id)
>> -> Seq Scan on sub_list (cost=0.00..72183.31 rows=2677131 width=253)
>> (actual time=0.080..337.862 rows=4000000 loops=1)
>> -> Hash (cost=41781.08..41781.08 rows=23475 width=8) (actual
>> time=439.640..439.640 rows=19998 loops=1)
>> -> Hash Semi Join (cost=596.23..41781.08 rows=23475 width=8)
>> (actual time=19.608..436.503 rows=19998 loops=1)
>> Hash Cond: (list.list_id = parent.list_id)
>> -> Seq Scan on list (cost=0.00..35067.80 rows=1840080
>> width=16) (actual time=0.022..159.590 rows=2000000 loops=1)
>> -> Hash (cost=456.28..456.28 rows=11196 width=8)
>> (actual time=9.989..9.989 rows=9999 loops=1)
>> -> Index Scan using parent_time_index on parent
>> (cost=0.00..456.28 rows=11196 width=8) (actual time=0.046..6.125 rows=9999
>> loops
>> =1)
>> Index Cond: ((base_time > 20000) AND
>> (base_time < 30000))
>> Total runtime: 1378.711 ms
>>
>> Regards,
>> Nick
>>
>> On Wed, Aug 22, 2012 at 4:53 PM, Mason Sharp <ma...@st...> wrote:
>>
>>> On Wed, Aug 22, 2012 at 3:21 PM, Nick Maludy <nm...@gm...> wrote:
>>> > All,
>>> >
>>> > I am trying to run the following query which never finishes running:
>>> >
>>> > SELECT sub_list.*
>>> > FROM sub_list
>>> > JOIN list AS listquery
>>> > ON listquery.sub_list_id = sub_list.sub_list_id
>>> > JOIN parent AS parentquery
>>> > ON parentquery.list_id = listquery.list_id
>>> > WHERE parentquery.time > 20000 AND
>>> > parentquery.time < 30000;
>>> >
>>> > Please excuse my dumb table and column names this is just an example
>>> with
>>> > the same structure as a real query of mine.
>>> >
>>> > CREATE TABLE parent (
>>> > name text,
>>> > time bigint,
>>> > list_id bigserial
>>> > );
>>> >
>>> > CREATE TABLE list (
>>> > list_id bigint,
>>> > name text,
>>> > time bigint,
>>> > sub_list_id bigserial
>>> > );
>>> >
>>> > CREATE TABLE sub_list (
>>> > sub_list_id bigint,
>>> > element_name text,
>>> > element_value numeric,
>>> > time bigint
>>> > );
>>> >
>>> > I have tried rewriting the same query in the following ways, none of
>>> them
>>> > finish running:
>>> >
>>> > SELECT sub_list.*
>>> > FROM sub_list,
>>> > (SELECT list.sub_list_id
>>> > FROM list,
>>> > (SELECT list_id
>>> > FROM parent
>>> > WHERE time > 20000 AND
>>> > time < 30000
>>> > ) AS parentquery
>>> > WHERE list.list_id = parentquery.list_id
>>> > ) AS listquery
>>> > WHERE sub_list.sub_list_id = listquery.sub_list_id;
>>> >
>>> > SELECT sub_list.*
>>> > FROM sub_list
>>> > WHERE sub_list_id IN (SELECT sub_list_id
>>> > FROM list
>>> > WHERE list_id IN (SELECT list_id
>>> > FROM parent
>>> > WHERE time > 20000 AND
>>> > time < 30000);
>>> >
>>> > By "not finishing running" i mean that i run them in psql and they hang
>>> > there, i notice that my datanode processes (postgres process name) are
>>> > pegged at 100% CPU usage, but iotop doesn't show any disk activity.
>>> When i
>>> > try to run EXPLAIN ANALYZE i get the same results and have to Ctrl-C
>>> out.
>>> >
>>> > My parent table has 1 million rows, my list table has 10 million rows,
>>> and
>>> > my sub_list_table has 100 million rows.
>>> >
>>> > I am able to run all of these queries just fine on regular Postgres
>>> with the
>>> > same amount of data in the tables.
>>> >
>>> > Any help is greatly appreciated,
>>>
>>> Please add DISTRIBUTE BY HASH(list_id) as a clause at the end of your
>>> CREATE TABLE statement. Otherwise what it is doing is pulling up a lot
>>> of data to the coordinator for joining.
>>>
>>> Also, let us know what indexes you have created.
>>>
>>> > Nick
>>> >
>>> >
>>> ------------------------------------------------------------------------------
>>> > Live Security Virtual Conference
>>> > Exclusive live event will cover all the ways today's security and
>>> > threat landscape has changed and how IT managers can respond.
>>> Discussions
>>> > will include endpoint security, mobile security and the latest in
>>> malware
>>> > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
>>> > _______________________________________________
>>> > Postgres-xc-general mailing list
>>> > Pos...@li...
>>> > https://lists.sourceforge.net/lists/listinfo/postgres-xc-general
>>> >
>>>
>>>
>>>
>>> --
>>> Mason Sharp
>>>
>>> StormDB - http://www.stormdb.com
>>> The Database Cloud
>>>
>>
>>
>
>
> ------------------------------------------------------------------------------
> Live Security Virtual Conference
> Exclusive live event will cover all the ways today's security and
> threat landscape has changed and how IT managers can respond. Discussions
> will include endpoint security, mobile security and the latest in malware
> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
> _______________________________________________
> Postgres-xc-general mailing list
> Pos...@li...
> https://lists.sourceforge.net/lists/listinfo/postgres-xc-general
>
>
--
Michael Paquier
http://michael.otacoo.com
|
|
From: Mason S. <ma...@st...> - 2012-08-23 00:23:19
|
On Wed, Aug 22, 2012 at 5:18 PM, Nick Maludy <nm...@gm...> wrote: > Mason, > > I tried adding the DISTRIBUTE BY HASH() and got the same results. Below are > my new table definitions: > > CREATE TABLE parent ( > name text, > time bigint, > list_id bigserial > ) DISTRIBUTE BY HASH (list_id); > > CREATE TABLE list ( > list_id bigint, > name text, > time bigint, > sub_list_id bigserial > ) DISTRIBUTE BY HASH (list_id); > > CREATE TABLE sub_list ( > sub_list_id bigint, > element_name text, > element_value numeric, > time bigint > ) DISTRIBUTE BY HASH (sub_list_id); > > -Nick > I took a closer look. Actually, your biggest tables join on sub_list_id, so you should distribute on that for list and sub_list. How large do you expect parent to grow? Will you always have those proportions? Is it completely static? You may be able to get away with distributing parent by REPLICATION. If you do that, that join should be folded in on the same step with other join. > On Wed, Aug 22, 2012 at 4:58 PM, Nick Maludy <nm...@gm...> wrote: >> >> Sorry, yes i forgot to including my indexes, they are as follows: >> >> // parent indexes >> CREATE INDEX parent_list_id_index ON parent(list_id); >> CREATE INDEX parent_time_index ON parent(time); >> >> // list indexes >> CREATE INDEX list_list_id_index ON list(list_id); >> CREATE INDEX list_sub_list_id_index ON list(sub_list_id); >> >> // sub list indexes >> CREATE INDEX sub_list_sub_list_id_index ON sub_list(sub_list_id); >> >> EXPLAIN ANALYZE from regular Postgres (8.4): >> >> test_db=# EXPLAIN ANALYZE >> SELECT sub_list.* >> FROM sub_list >> JOIN list AS listquery >> ON listquery.sub_list_id = sub_list.sub_list_id >> JOIN parent AS parentquery >> ON parentquery.list_id = listquery.list_id >> WHERE parentquery.time > 20000 AND >> parentquery.time < 30000; >> >> >> QUERY PLAN >> >> >> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- >> ------------------------ >> Nested Loop (cost=596.23..102441.82 rows=29973 width=253) (actual >> time=25.015..488.914 rows=39996 loops=1) >> -> Hash Join (cost=596.23..51970.75 rows=20602 width=8) (actual >> time=25.002..446.462 rows=19998 loops=1) >> Hash Cond: (listquery.list_id = parentquery.list_id) >> -> Seq Scan on list listquery (cost=0.00..35067.80 rows=1840080 >> width=16) (actual time=0.055..160.550 rows=2000000 loops=1) >> -> Hash (cost=456.28..456.28 rows=11196 width=8) (actual >> time=14.105..14.105 rows=9999 loops=1) >> -> Index Scan using parent_time_index on parent >> parentquery (cost=0.00..456.28 rows=11196 width=8) (actual >> time=0.061..8.450 rows=9999 l >> oops=1) >> Index Cond: ((time > 20000) AND (time < 30000)) >> -> Index Scan using sub_list_sub_list_id_index on sub_list >> (cost=0.00..2.42 rows=2 width=253) (actual time=0.001..0. >> 002 rows=2 loops=19998) >> Index Cond: (sub_list.sub_list_id = listquery.sub_list_id) >> Total runtime: 491.447 ms >> >> /////////////////////////////////////// >> >> test_db=# EXPLAIN ANALYZE >> SELECT sub_list.* >> FROM sub_list, >> (SELECT list.sub_list_id >> FROM list, >> (SELECT list_id >> FROM parent >> WHERE time > 20000 AND >> time < 30000 >> ) AS parentquery >> WHERE list.list_id = parentquery.list_id >> ) AS listquery >> WHERE sub_list.sub_list_id = listquery.sub_list_id; >> >> QUERY PLAN >> >> >> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- >> ------------------------ >> Nested Loop (cost=596.23..102441.82 rows=29973 width=253) (actual >> time=25.493..494.204 rows=39996 loops=1) >> -> Hash Join (cost=596.23..51970.75 rows=20602 width=8) (actual >> time=25.479..452.275 rows=19998 loops=1) >> Hash Cond: (list.list_id = parent.list_id) >> -> Seq Scan on list (cost=0.00..35067.80 rows=1840080 width=16) >> (actual time=0.051..161.849 rows=2000000 loops=1) >> -> Hash (cost=456.28..456.28 rows=11196 width=8) (actual >> time=15.444..15.444 rows=9999 loops=1) >> -> Index Scan using parent_time_index on parent >> (cost=0.00..456.28 rows=11196 width=8) (actual time=0.046..9.568 rows=9999 >> loops=1) >> Index Cond: ((time > 20000) AND (time < 30000)) >> -> Index Scan using sub_list_sub_list_id_index on sub_list >> (cost=0.00..2.42 rows=2 width=253) (actual time=0.001..0. >> 002 rows=2 loops=19998) >> Index Cond: (sub_list.sub_list_id = list.sub_list_id) >> Total runtime: 496.729 ms >> >> /////////////////////////////////////// >> >> test_db=# EXPLAIN ANALYZE >> SELECT sub_list.* >> FROM sub_list >> WHERE sub_list_id IN (SELECT sub_list_id >> FROM list >> WHERE list_id IN (SELECT list_id >> FROM parent >> WHERE time > 20000 AND >> time < 30000 >> ) >> ); >> >> QUERY PLAN >> >> >> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- >> ---- >> Hash Semi Join (cost=42074.51..123747.06 rows=47078 width=253) (actual >> time=460.406..1376.375 rows=39996 loops=1) >> Hash Cond: (sub_list.sub_list_id = list.sub_list_id) >> -> Seq Scan on sub_list (cost=0.00..72183.31 rows=2677131 width=253) >> (actual time=0.080..337.862 rows=4000000 loops=1) >> -> Hash (cost=41781.08..41781.08 rows=23475 width=8) (actual >> time=439.640..439.640 rows=19998 loops=1) >> -> Hash Semi Join (cost=596.23..41781.08 rows=23475 width=8) >> (actual time=19.608..436.503 rows=19998 loops=1) >> Hash Cond: (list.list_id = parent.list_id) >> -> Seq Scan on list (cost=0.00..35067.80 rows=1840080 >> width=16) (actual time=0.022..159.590 rows=2000000 loops=1) >> -> Hash (cost=456.28..456.28 rows=11196 width=8) (actual >> time=9.989..9.989 rows=9999 loops=1) >> -> Index Scan using parent_time_index on parent >> (cost=0.00..456.28 rows=11196 width=8) (actual time=0.046..6.125 rows=9999 >> loops >> =1) >> Index Cond: ((base_time > 20000) AND (base_time >> < 30000)) >> Total runtime: 1378.711 ms >> >> Regards, >> Nick >> >> On Wed, Aug 22, 2012 at 4:53 PM, Mason Sharp <ma...@st...> wrote: >>> >>> On Wed, Aug 22, 2012 at 3:21 PM, Nick Maludy <nm...@gm...> wrote: >>> > All, >>> > >>> > I am trying to run the following query which never finishes running: >>> > >>> > SELECT sub_list.* >>> > FROM sub_list >>> > JOIN list AS listquery >>> > ON listquery.sub_list_id = sub_list.sub_list_id >>> > JOIN parent AS parentquery >>> > ON parentquery.list_id = listquery.list_id >>> > WHERE parentquery.time > 20000 AND >>> > parentquery.time < 30000; >>> > >>> > Please excuse my dumb table and column names this is just an example >>> > with >>> > the same structure as a real query of mine. >>> > >>> > CREATE TABLE parent ( >>> > name text, >>> > time bigint, >>> > list_id bigserial >>> > ); >>> > >>> > CREATE TABLE list ( >>> > list_id bigint, >>> > name text, >>> > time bigint, >>> > sub_list_id bigserial >>> > ); >>> > >>> > CREATE TABLE sub_list ( >>> > sub_list_id bigint, >>> > element_name text, >>> > element_value numeric, >>> > time bigint >>> > ); >>> > >>> > I have tried rewriting the same query in the following ways, none of >>> > them >>> > finish running: >>> > >>> > SELECT sub_list.* >>> > FROM sub_list, >>> > (SELECT list.sub_list_id >>> > FROM list, >>> > (SELECT list_id >>> > FROM parent >>> > WHERE time > 20000 AND >>> > time < 30000 >>> > ) AS parentquery >>> > WHERE list.list_id = parentquery.list_id >>> > ) AS listquery >>> > WHERE sub_list.sub_list_id = listquery.sub_list_id; >>> > >>> > SELECT sub_list.* >>> > FROM sub_list >>> > WHERE sub_list_id IN (SELECT sub_list_id >>> > FROM list >>> > WHERE list_id IN (SELECT list_id >>> > FROM parent >>> > WHERE time > 20000 AND >>> > time < 30000); >>> > >>> > By "not finishing running" i mean that i run them in psql and they hang >>> > there, i notice that my datanode processes (postgres process name) are >>> > pegged at 100% CPU usage, but iotop doesn't show any disk activity. >>> > When i >>> > try to run EXPLAIN ANALYZE i get the same results and have to Ctrl-C >>> > out. >>> > >>> > My parent table has 1 million rows, my list table has 10 million rows, >>> > and >>> > my sub_list_table has 100 million rows. >>> > >>> > I am able to run all of these queries just fine on regular Postgres >>> > with the >>> > same amount of data in the tables. >>> > >>> > Any help is greatly appreciated, >>> >>> Please add DISTRIBUTE BY HASH(list_id) as a clause at the end of your >>> CREATE TABLE statement. Otherwise what it is doing is pulling up a lot >>> of data to the coordinator for joining. >>> >>> Also, let us know what indexes you have created. >>> >>> > Nick >>> > >>> > >>> > ------------------------------------------------------------------------------ >>> > Live Security Virtual Conference >>> > Exclusive live event will cover all the ways today's security and >>> > threat landscape has changed and how IT managers can respond. >>> > Discussions >>> > will include endpoint security, mobile security and the latest in >>> > malware >>> > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ >>> > _______________________________________________ >>> > Postgres-xc-general mailing list >>> > Pos...@li... >>> > https://lists.sourceforge.net/lists/listinfo/postgres-xc-general >>> > >>> >>> >>> >>> -- >>> Mason Sharp >>> >>> StormDB - http://www.stormdb.com >>> The Database Cloud >> >> > -- Mason Sharp StormDB - http://www.stormdb.com The Database Cloud |
|
From: Ashutosh B. <ash...@en...> - 2012-08-23 07:15:03
|
Hi Nick, One of the problems here is that in 1.0, we used only nested loop joins to join between the tables, if those joins can not be evaluated (see details below). But in 2.0, we have changed that to use other kinds of JOINs like merge and hash join. If the solutions mentioned by others do not solve your problem, you may want to try the latest code. But it's not stable yet and the GA for this code is not scheduled yet and may take long. Other thing that you may try, is to set enable_nestloop to false, only for these queries. That's not recommended in production, I guess. Please check PostgreSQL documentation for use of this GUC. For further analysis, please send the EXPLAIN (no ANALYSE) outputs on Postgres-XC. Running EXPLAIN (no ANALYSE) on that query won't hang, I promise ;). Note on when a JOIN can be evaluated on the datanode, The simple rule is if all the rows that can possibly be par of the result of JOIN are located on the same datanode, the JOIN can be evaluated (in XC terminology, shipped) on datanode. This means, 1. any JOIN between two replicated tables, which have atleast one datanode in common, can be evaluated on the datanode. 2. An INNER join on two distributed tables can be shipped to the datanode, if their distribution columns have same type, are distributed on the same set of datanodes, have the same distribution strategy HASH or MODULO. 3. A join between ROUND ROBIN tables can never be shipped to datanodes. 4. If one of the JOINing table is replicated and other is distributed, an INNER join can be shipped to the datanode/s if replicated table is replicated on all the tables where the distributed table is distributed. On Thu, Aug 23, 2012 at 5:52 AM, Mason Sharp <ma...@st...> wrote: > On Wed, Aug 22, 2012 at 5:18 PM, Nick Maludy <nm...@gm...> wrote: > > Mason, > > > > I tried adding the DISTRIBUTE BY HASH() and got the same results. Below > are > > my new table definitions: > > > > CREATE TABLE parent ( > > name text, > > time bigint, > > list_id bigserial > > ) DISTRIBUTE BY HASH (list_id); > > > > CREATE TABLE list ( > > list_id bigint, > > name text, > > time bigint, > > sub_list_id bigserial > > ) DISTRIBUTE BY HASH (list_id); > > > > CREATE TABLE sub_list ( > > sub_list_id bigint, > > element_name text, > > element_value numeric, > > time bigint > > ) DISTRIBUTE BY HASH (sub_list_id); > > > > -Nick > > > > I took a closer look. Actually, your biggest tables join on > sub_list_id, so you should distribute on that for list and sub_list. > > How large do you expect parent to grow? Will you always have those > proportions? Is it completely static? You may be able to get away with > distributing parent by REPLICATION. If you do that, that join should > be folded in on the same step with other join. > > > > > > On Wed, Aug 22, 2012 at 4:58 PM, Nick Maludy <nm...@gm...> wrote: > >> > >> Sorry, yes i forgot to including my indexes, they are as follows: > >> > >> // parent indexes > >> CREATE INDEX parent_list_id_index ON parent(list_id); > >> CREATE INDEX parent_time_index ON parent(time); > >> > >> // list indexes > >> CREATE INDEX list_list_id_index ON list(list_id); > >> CREATE INDEX list_sub_list_id_index ON list(sub_list_id); > >> > >> // sub list indexes > >> CREATE INDEX sub_list_sub_list_id_index ON sub_list(sub_list_id); > >> > >> EXPLAIN ANALYZE from regular Postgres (8.4): > >> > >> test_db=# EXPLAIN ANALYZE > >> SELECT sub_list.* > >> FROM sub_list > >> JOIN list AS listquery > >> ON listquery.sub_list_id = sub_list.sub_list_id > >> JOIN parent AS parentquery > >> ON parentquery.list_id = listquery.list_id > >> WHERE parentquery.time > 20000 AND > >> parentquery.time < 30000; > >> > >> > >> QUERY PLAN > >> > >> > >> > -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- > >> ------------------------ > >> Nested Loop (cost=596.23..102441.82 rows=29973 width=253) (actual > >> time=25.015..488.914 rows=39996 loops=1) > >> -> Hash Join (cost=596.23..51970.75 rows=20602 width=8) (actual > >> time=25.002..446.462 rows=19998 loops=1) > >> Hash Cond: (listquery.list_id = parentquery.list_id) > >> -> Seq Scan on list listquery (cost=0.00..35067.80 > rows=1840080 > >> width=16) (actual time=0.055..160.550 rows=2000000 loops=1) > >> -> Hash (cost=456.28..456.28 rows=11196 width=8) (actual > >> time=14.105..14.105 rows=9999 loops=1) > >> -> Index Scan using parent_time_index on parent > >> parentquery (cost=0.00..456.28 rows=11196 width=8) (actual > >> time=0.061..8.450 rows=9999 l > >> oops=1) > >> Index Cond: ((time > 20000) AND (time < 30000)) > >> -> Index Scan using sub_list_sub_list_id_index on sub_list > >> (cost=0.00..2.42 rows=2 width=253) (actual time=0.001..0. > >> 002 rows=2 loops=19998) > >> Index Cond: (sub_list.sub_list_id = listquery.sub_list_id) > >> Total runtime: 491.447 ms > >> > >> /////////////////////////////////////// > >> > >> test_db=# EXPLAIN ANALYZE > >> SELECT sub_list.* > >> FROM sub_list, > >> (SELECT list.sub_list_id > >> FROM list, > >> (SELECT list_id > >> FROM parent > >> WHERE time > 20000 AND > >> time < 30000 > >> ) AS parentquery > >> WHERE list.list_id = parentquery.list_id > >> ) AS listquery > >> WHERE sub_list.sub_list_id = listquery.sub_list_id; > >> > >> QUERY PLAN > >> > >> > >> > -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- > >> ------------------------ > >> Nested Loop (cost=596.23..102441.82 rows=29973 width=253) (actual > >> time=25.493..494.204 rows=39996 loops=1) > >> -> Hash Join (cost=596.23..51970.75 rows=20602 width=8) (actual > >> time=25.479..452.275 rows=19998 loops=1) > >> Hash Cond: (list.list_id = parent.list_id) > >> -> Seq Scan on list (cost=0.00..35067.80 rows=1840080 > width=16) > >> (actual time=0.051..161.849 rows=2000000 loops=1) > >> -> Hash (cost=456.28..456.28 rows=11196 width=8) (actual > >> time=15.444..15.444 rows=9999 loops=1) > >> -> Index Scan using parent_time_index on parent > >> (cost=0.00..456.28 rows=11196 width=8) (actual time=0.046..9.568 > rows=9999 > >> loops=1) > >> Index Cond: ((time > 20000) AND (time < 30000)) > >> -> Index Scan using sub_list_sub_list_id_index on sub_list > >> (cost=0.00..2.42 rows=2 width=253) (actual time=0.001..0. > >> 002 rows=2 loops=19998) > >> Index Cond: (sub_list.sub_list_id = list.sub_list_id) > >> Total runtime: 496.729 ms > >> > >> /////////////////////////////////////// > >> > >> test_db=# EXPLAIN ANALYZE > >> SELECT sub_list.* > >> FROM sub_list > >> WHERE sub_list_id IN (SELECT sub_list_id > >> FROM list > >> WHERE list_id IN (SELECT list_id > >> FROM parent > >> WHERE time > 20000 AND > >> time < 30000 > >> ) > >> ); > >> > >> QUERY PLAN > >> > >> > >> > -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- > >> ---- > >> Hash Semi Join (cost=42074.51..123747.06 rows=47078 width=253) (actual > >> time=460.406..1376.375 rows=39996 loops=1) > >> Hash Cond: (sub_list.sub_list_id = list.sub_list_id) > >> -> Seq Scan on sub_list (cost=0.00..72183.31 rows=2677131 > width=253) > >> (actual time=0.080..337.862 rows=4000000 loops=1) > >> -> Hash (cost=41781.08..41781.08 rows=23475 width=8) (actual > >> time=439.640..439.640 rows=19998 loops=1) > >> -> Hash Semi Join (cost=596.23..41781.08 rows=23475 width=8) > >> (actual time=19.608..436.503 rows=19998 loops=1) > >> Hash Cond: (list.list_id = parent.list_id) > >> -> Seq Scan on list (cost=0.00..35067.80 rows=1840080 > >> width=16) (actual time=0.022..159.590 rows=2000000 loops=1) > >> -> Hash (cost=456.28..456.28 rows=11196 width=8) > (actual > >> time=9.989..9.989 rows=9999 loops=1) > >> -> Index Scan using parent_time_index on parent > >> (cost=0.00..456.28 rows=11196 width=8) (actual time=0.046..6.125 > rows=9999 > >> loops > >> =1) > >> Index Cond: ((base_time > 20000) AND > (base_time > >> < 30000)) > >> Total runtime: 1378.711 ms > >> > >> Regards, > >> Nick > >> > >> On Wed, Aug 22, 2012 at 4:53 PM, Mason Sharp <ma...@st...> wrote: > >>> > >>> On Wed, Aug 22, 2012 at 3:21 PM, Nick Maludy <nm...@gm...> > wrote: > >>> > All, > >>> > > >>> > I am trying to run the following query which never finishes running: > >>> > > >>> > SELECT sub_list.* > >>> > FROM sub_list > >>> > JOIN list AS listquery > >>> > ON listquery.sub_list_id = sub_list.sub_list_id > >>> > JOIN parent AS parentquery > >>> > ON parentquery.list_id = listquery.list_id > >>> > WHERE parentquery.time > 20000 AND > >>> > parentquery.time < 30000; > >>> > > >>> > Please excuse my dumb table and column names this is just an example > >>> > with > >>> > the same structure as a real query of mine. > >>> > > >>> > CREATE TABLE parent ( > >>> > name text, > >>> > time bigint, > >>> > list_id bigserial > >>> > ); > >>> > > >>> > CREATE TABLE list ( > >>> > list_id bigint, > >>> > name text, > >>> > time bigint, > >>> > sub_list_id bigserial > >>> > ); > >>> > > >>> > CREATE TABLE sub_list ( > >>> > sub_list_id bigint, > >>> > element_name text, > >>> > element_value numeric, > >>> > time bigint > >>> > ); > >>> > > >>> > I have tried rewriting the same query in the following ways, none of > >>> > them > >>> > finish running: > >>> > > >>> > SELECT sub_list.* > >>> > FROM sub_list, > >>> > (SELECT list.sub_list_id > >>> > FROM list, > >>> > (SELECT list_id > >>> > FROM parent > >>> > WHERE time > 20000 AND > >>> > time < 30000 > >>> > ) AS parentquery > >>> > WHERE list.list_id = parentquery.list_id > >>> > ) AS listquery > >>> > WHERE sub_list.sub_list_id = listquery.sub_list_id; > >>> > > >>> > SELECT sub_list.* > >>> > FROM sub_list > >>> > WHERE sub_list_id IN (SELECT sub_list_id > >>> > FROM list > >>> > WHERE list_id IN (SELECT list_id > >>> > FROM parent > >>> > WHERE time > 20000 AND > >>> > time < 30000); > >>> > > >>> > By "not finishing running" i mean that i run them in psql and they > hang > >>> > there, i notice that my datanode processes (postgres process name) > are > >>> > pegged at 100% CPU usage, but iotop doesn't show any disk activity. > >>> > When i > >>> > try to run EXPLAIN ANALYZE i get the same results and have to Ctrl-C > >>> > out. > >>> > > >>> > My parent table has 1 million rows, my list table has 10 million > rows, > >>> > and > >>> > my sub_list_table has 100 million rows. > >>> > > >>> > I am able to run all of these queries just fine on regular Postgres > >>> > with the > >>> > same amount of data in the tables. > >>> > > >>> > Any help is greatly appreciated, > >>> > >>> Please add DISTRIBUTE BY HASH(list_id) as a clause at the end of your > >>> CREATE TABLE statement. Otherwise what it is doing is pulling up a lot > >>> of data to the coordinator for joining. > >>> > >>> Also, let us know what indexes you have created. > >>> > >>> > Nick > >>> > > >>> > > >>> > > ------------------------------------------------------------------------------ > >>> > Live Security Virtual Conference > >>> > Exclusive live event will cover all the ways today's security and > >>> > threat landscape has changed and how IT managers can respond. > >>> > Discussions > >>> > will include endpoint security, mobile security and the latest in > >>> > malware > >>> > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ > >>> > _______________________________________________ > >>> > Postgres-xc-general mailing list > >>> > Pos...@li... > >>> > https://lists.sourceforge.net/lists/listinfo/postgres-xc-general > >>> > > >>> > >>> > >>> > >>> -- > >>> Mason Sharp > >>> > >>> StormDB - http://www.stormdb.com > >>> The Database Cloud > >> > >> > > > > > > -- > Mason Sharp > > StormDB - http://www.stormdb.com > The Database Cloud > > > ------------------------------------------------------------------------------ > Live Security Virtual Conference > Exclusive live event will cover all the ways today's security and > threat landscape has changed and how IT managers can respond. Discussions > will include endpoint security, mobile security and the latest in malware > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ > _______________________________________________ > Postgres-xc-general mailing list > Pos...@li... > https://lists.sourceforge.net/lists/listinfo/postgres-xc-general > -- Best Wishes, Ashutosh Bapat EntepriseDB Corporation The Enterprise Postgres Company |
|
From: Nick M. <nm...@gm...> - 2012-08-23 20:39:37
|
Thanks for the suggestions, based on all of your ideas i have come up with
the following structure:
CREATE TABLE parent (
__id bigserial PRIMARY KEY, // primary key
name text,
time bigint,
) DISTRIBUTE BY HASH (__id);
CREATE INDEX parent___id_index ON parent(__id);
CREATE INDEX parent_time_index ON parent(time);
CREATE TABLE list (
__root_id bigint REFERENCES parent(__id),
__id bigserial, // primary key
name text,
time bigint,
) DISTRIBUTE BY HASH (__root_id);
CREATE INDEX list__root_id_index ON list(__root_id);
CREATE TABLE sub_list (
__root_id bigint REFERENCES parent(__id),
__list_id bigint, // foreign key to list.__id
__id bigserial, // primary key
element_name text,
element_value numeric,
time bigint
) DISTRIBUTE BY HASH (__root_id);
CREATE INDEX sub_list__root_id_index ON sub_list(__root_id);
... etc ub_sub_list
//////////////////
// Query //
/////////////////
SELECT sub_list.*
FROM sub_list
JOIN parent AS parentquery
ON parentquery.__id = sub_list.__root_id
WHERE parentquery.time > 20000 AND
parentquery.time < 30000;
My queries now finish, however they are taking quite a bit of time (about 1
second a piece)
QUERY PLAN
------------------------------------------------------------------------------------------------------
------------------------
Data Node Scan on "__REMOTE_FQS_QUERY__" (cost=0.00..0.00 rows=0 width=0)
(actual time=183.641..965.710 rows=9998 loops=1)
Node/s: datanode_nick, datanode_lenovo, datanode_alien
Total runtime: 967.672 ms
(3 rows)
If i run this same query locally on regular Postgres i get the following:
QUERY PLAN
------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------
-------------------------
Nested Loop (cost=0.00..72845.41 rows=37269 width=269) (actual
time=0.086..38.557 rows=39996 loops=1
)
-> Index Scan using parent_time_index on parent parentquery
(cost=0.00..408.38 rows=9951 width=8) (actual time=0.065..2.149 rows=9999
loops=1)
Index Cond: ((time > 20000) AND (time < 30000))
-> Index Scan using sub_list__root_index on sub_list (cost=0.00..7.20
rows=6 width=269) (actual time=0.002..
0.002 rows=4 loops=9999)
Index Cond: (sub_list.__root_id = parentquery._
_id)
Total runtime: 41.469 ms
(6 rows)
I'm guessing the extra time is simply network overhead?
|
|
From: Ashutosh B. <ash...@en...> - 2012-08-24 04:48:45
|
On Fri, Aug 24, 2012 at 2:08 AM, Nick Maludy <nm...@gm...> wrote: > Thanks for the suggestions, based on all of your ideas i have come up with > the following structure: > > CREATE TABLE parent ( > __id bigserial PRIMARY KEY, // primary key > name text, > time bigint, > ) DISTRIBUTE BY HASH (__id); > CREATE INDEX parent___id_index ON parent(__id); > CREATE INDEX parent_time_index ON parent(time); > > CREATE TABLE list ( > __root_id bigint REFERENCES parent(__id), > __id bigserial, // primary key > name text, > time bigint, > ) DISTRIBUTE BY HASH (__root_id); > CREATE INDEX list__root_id_index ON list(__root_id); > > CREATE TABLE sub_list ( > __root_id bigint REFERENCES parent(__id), > __list_id bigint, // foreign key to list.__id > __id bigserial, // primary key > element_name text, > element_value numeric, > time bigint > ) DISTRIBUTE BY HASH (__root_id); > CREATE INDEX sub_list__root_id_index ON sub_list(__root_id); > > ... etc ub_sub_list > > ////////////////// > // Query // > ///////////////// > SELECT sub_list.* > FROM sub_list > JOIN parent AS parentquery > ON parentquery.__id = sub_list.__root_id > WHERE parentquery.time > 20000 AND > parentquery.time < 30000; > > My queries now finish, however they are taking quite a bit of time (about > 1 second a piece) > > QUERY PLAN > > > > ------------------------------------------------------------------------------------------------------ > ------------------------ > Data Node Scan on "__REMOTE_FQS_QUERY__" (cost=0.00..0.00 rows=0 > width=0) (actual time=183.641..965.710 rows=9998 loops=1) > Node/s: datanode_nick, datanode_lenovo, datanode_alien > Total runtime: 967.672 ms > (3 rows) > > If i run this same query locally on regular Postgres i get the following: > > QUERY PLAN > > > > ------------------------------------------------------------------------------------------------------ > > ------------------------------------------------------------------------------------------------------ > ------------------------- > Nested Loop (cost=0.00..72845.41 rows=37269 width=269) (actual > time=0.086..38.557 rows=39996 loops=1 > ) > -> Index Scan using parent_time_index on parent parentquery > (cost=0.00..408.38 rows=9951 width=8) (actual time=0.065..2.149 rows=9999 > loops=1) > Index Cond: ((time > 20000) AND (time < 30000)) > -> Index Scan using sub_list__root_index on sub_list (cost=0.00..7.20 > rows=6 width=269) (actual time=0.002.. > 0.002 rows=4 loops=9999) > Index Cond: (sub_list.__root_id = parentquery._ > _id) > Total runtime: 41.469 ms > (6 rows) > > I'm guessing the extra time is simply network overhead? > The tag REMOTE_FQS_QUERY tells that the query was directly sent to datanodes and coordinator only acted as a proxy. That's a good sign. Yes, for a single query the time take might be because of XC specific overheads, which also includes the network overhead. But hopefully you get higher throughput when you run more of those queries/DMLs simultaneously. -- Best Wishes, Ashutosh Bapat EntepriseDB Corporation The Enterprise Postgres Company |