Re: [Postgres-xc-general] the cluster cost for normalized tables

SourceForge Headquarters 1320 Columbia Street Suite 310 San Diego, CA 92101 +1 (858) 422-6466

I don't really know what you are looking for, but based on your example,
here simplified:
create table org (id int primary key, name varchar(50) not null);
create table poc (id int primary key, firstname varchar(50) not null,
lastname varchar(50) not null);
create table link (org_id int, poc_id int, function varchar(2), primary key
(org_id,poc_id));

Depending on the distribution type of each table, the key to performance is
to get queries that are completely shipped to Datanodes (remote nodes)
without having to materialize data from multiple Datanodes on Coordinators
(where application connects).

Your query is this one, I just changed the primary keys to integers:
select poc.firstname, poc.lastname, org.id, link.function from poc,org,
link where org.name = 'whatever' and org.id = link.org_id and link.poc_id =
poc.id;

For example, if I create all the tables below as replicated:
create table org (id int primary key, name varchar(50) not null) distribute
by replication;
create table poc (id int primary key, firstname varchar(50) not null,
lastname varchar(50) not null) distribute by replication;
create table link (org_id int, poc_id int, function varchar(2), primary key
(org_id,poc_id)) distribute by replication;
Your SELECT query becomes like this:
postgres=# explain select poc.firstname, poc.lastname, org.id,
link.function from poc,org, link where org.name = 'whatever' and org.id =
link.org_id and link.poc_id = poc.id;
                                 QUERY PLAN
----------------------------------------------------------------------------
 Data Node Scan on "__REMOTE_FQS_QUERY__"  (cost=0.00..0.00 rows=0 width=0)
   Node/s: dn1
(2 rows)
This means that the query is completely shipped to a single node, normal as
the data of all the tables are replicated on all the nodes.

If your tables become distributed:
create table org (id int primary key, name varchar(50) not null) distribute
by hash(id);
create table poc (id int primary key, firstname varchar(50) not null,
lastname varchar(50) not null) distribute by hash(id);
create table link (org_id int, poc_id int, function varchar(2), primary key
(org_id,poc_id)) distribute by hash(org_id);
postgres=# explain select poc.firstname, poc.lastname, org.id,
link.function from poc,org, link where org.name = 'whatever' and org.id =
link.org_id and link.poc_id = poc.id;
                                QUERY PLAN
--------------------------------------------------------------------------
 Nested Loop  (cost=0.00..0.03 rows=1 width=252)
   Join Filter: (link.org_id = org.id)
   ->  Nested Loop  (cost=0.00..0.01 rows=1 width=252)
         Join Filter: (poc.id = link.poc_id)
         ->  Data Node Scan on poc  (cost=0.00..0.00 rows=1000 width=240)
               Node/s: dn1, dn2
         ->  Data Node Scan on link  (cost=0.00..0.00 rows=1000 width=20)
               Node/s: dn1, dn2
   ->  Data Node Scan on org  (cost=0.00..0.00 rows=1000 width=4)
         Node/s: dn1, dn2
(10 rows)
You need to use a nested loop, meaning that you need first to gather the
data from remote nodes, materialize it on Coordinator, then return the
results to client. This is extremely bad for performance.
So your goal to performance is to find the correct combination of table
distribution to have a maximum number of queries being shipped to remote
nodes.

I do not say that it is necessary to replicate all the tables. By doing so,
of course your application will be good in read but write operations will
have to run on all the nodes where table data is located, lowering
performance if your application does a lot of read, so the secret is to
have a good balance.

In XC, you can also located data only on a portion of nodes. This can also
help tune your applications.
http://postgres-xc.sourceforge.net/docs/1_0/sql-createtable.html
CREATE TABLE uses an extension called TO GROUP/NODE to locate data of a
table only on a portion of nodes.
Regards,

On Thu, Apr 19, 2012 at 9:30 PM, Michael Vitale <mic...@ar...> wrote:

>  Thank you for your response.  Can I just give a simple schema example
> and query resulting from it and see if it would suffer in a cluster
> solution using the primary keys, which are system generated names (GUIDs)?
>
> Table ORG (ORG_HANDLE VARCHAR(50) NOT NULL, ORG_NAME VARCHAR2(150) NOT
> NULL);
> primary key: ORG_HANDLE
>
> TABLE POC (POC_HANDLE VARCHAR(50), FIRST_NAME(VARCHAR50) NOT NULL,
> LAST_NAME VARCHAR(50) NOT NULL);
> primary key: POC_HANDLE
>
> Table ORG_POC_LINK (ORG_HANDLE VARCHAR(50) NOT NULL, POC_FUNCTION
> VARCHAR(2) NOT NULL, POC_HANDLE VARCHAR(50) NOT NULL);
> primary key: ORG_HANDLE,POC_FUNCTION,POC_HANDLE
>
>
> Query:
> select POC.FIRST_NAME, POC.LAST_NAME, ORG.HANDLE, OPL.POC_FUNCTION FROM
> POC POC, ORG ORG, ORG_POC_LINK OPL
> WHERE ORG.ORG_NAME = 'whatever' and ORG.ORG_HANDLE = OPL.ORG_HANDLE and
> OPL.POC_HANDLE = POC.POC_HANDLE
>  ------------------------------
> *From:* Ashutosh Bapat [ash...@en...]
> *Sent:* Thursday, April 19, 2012 8:05 AM
> *To:* Michael Vitale
> *Cc:* pos...@li...;
> pos...@li...
> *Subject:* Re: [Postgres-xc-general] the cluster cost for normalized
> tables
>
>  HI Michael,
> The distribution of data depends upon the distribution strategy used. In
> Postgres-XC, we distribute data based on the hash/modulo of the given
> column. It's usually advisable to choose the same distribution for the
> tables which have equi-joins on their distribution columns.
>
> Choosing the right distribution for the tables involved is an art. We need
> the knowledge of table definitions and set of queries to decide the exact
> distribution. If the queries are such that they join on collocated data,
> the performance is greatly improved.
>
> On Thu, Apr 19, 2012 at 4:56 PM, Michael Vitale <mic...@ar...> wrote:
>
>>  Hi you most honorable cluster folks!
>>
>> Our company is moving from Oracle to PostgreSQL.  We initially thought we
>> would be moving to MySQL Cluster, but an investigation of how clustering
>> works in MySQL Cluster revealed that performance would suffer substantially
>> since it is predicated on keys that segregate SQL-requested data to
>> specific nodes and not to all or most of the nodes.  A highly normalized
>> database would suffer in this situation where a result set would normally
>> consist of rows gathered from most, if not all, of the back-end nodes.
>>
>> Do you all have the same problem with Clustered PostgreSQL (Postgres-XC)?
>>
>> Respectfully Yours,
>>
>> Michael Vitale
>> ARIN DBA
>> mic...@ar...
>> 703-227-9885
>>
>>
>>
>> ------------------------------------------------------------------------------
>> For Developers, A Lot Can Happen In A Second.
>> Boundary is the first to Know...and Tell You.
>> Monitor Your Applications in Ultra-Fine Resolution. Try it FREE!
>> http://p.sf.net/sfu/Boundary-d2dvs2
>> _______________________________________________
>> Postgres-xc-general mailing list
>> Pos...@li...
>> https://lists.sourceforge.net/lists/listinfo/postgres-xc-general
>>
>>
>
>
> --
> Best Wishes,
> Ashutosh Bapat
> EntepriseDB Corporation
> The Enterprise Postgres Company
>
>
>
> ------------------------------------------------------------------------------
> For Developers, A Lot Can Happen In A Second.
> Boundary is the first to Know...and Tell You.
> Monitor Your Applications in Ultra-Fine Resolution. Try it FREE!
> http://p.sf.net/sfu/Boundary-d2dvs2
> _______________________________________________
> Postgres-xc-general mailing list
> Pos...@li...
> https://lists.sourceforge.net/lists/listinfo/postgres-xc-general
>
>

-- 
Michael Paquier
http://michael.otacoo.com