Re: [Postgres-xc-general] Data Node Scan Performance

SourceForge Headquarters 1320 Columbia Street Suite 310 San Diego, CA 92101 +1 (858) 422-6466

Also, if it helps, this is actually an operational query, not a data warehouse type query.  In this case, we have an object that owns a considerable amount of data below it (in this case, 1:N:M relationships where rows in table A have children in table B and grandchildren in table C).  The operation I want to perform is a scalable clone of the data.  The actually query looks more like the following:

INSERT INTO MyTable SELECT 500 as Id, Foo, Bar from MyTable WHERE Id = $1

For testing purposes (I'm simply tinkering), the data is distributed on Foo and indexed on Id.  So in this case I'm copying the "values" from the old entity to the new entity.  The plan appears to require the data to be retrieved into the coordinator and then dispersed back down into the data nodes.

An optimization for this specific problem would be to push the 'INSERT INTO ... SELECT' directly down into the data nodes since there isn't any inherent benefit to the coordinator consuming the data.  The distribution key is identical - so the data will be sent right back to the data node it came from.

Any other thoughts on how to make this performant - if not, I'll go back to my tinkering table.

________________________________
From: Aaron Jackson [aja...@re...]
Sent: Tuesday, April 29, 2014 12:06 AM
To: pos...@li...
Subject: [Postgres-xc-general] Data Node Scan Performance

I have a table that I've distributed by some key K.  When I want to query by some other dimension Q, the coordinator explain plan indicates that it does a Data Node Scan on *table* "_REMOTE_TABLE_QUERY"

Now what I've noticed is that if I have 4 nodes, the coordinator based scan may take 65 seconds, however, the individual date nodes usually finish within 5-10 seconds.  The individual explain plains from each data node reveal nothing.  So my question is, does the coordinator execute the data node scan serially or in parallel - and if it's serially, is there any thought around how to make it parallel?

In the event it is already parallel, is the time differential I'm seeing simply attributed to the coordinator gathering results in preparation to return to the requesting client?

Thanks