From: Joseph G. <jos...@or...> - 2012-07-04 09:46:22
On 4 July 2012 17:40, Michael Paquier <mic...@gm...> wrote:
>
>
> On Wed, Jul 4, 2012 at 2:31 PM, Joseph Glanville
> <jos...@or...> wrote:
>>
>> Hey guys,
>>
>> This is more of a feature request/question regarding how HA could be
>> implemented with Postgres-XC in the future.
>>
>> Would it be possible to have a composite table type which could
>> replicate to X nodes and distribute to Y nodes, in such a way that
>> at least X copies of every row are maintained while the table is
>> sharded across Y datanodes?
>
> The answer is yes. It is possible.
>>
>>
>> For example, in a cluster of 6 nodes one would be able to configure a
>> table with REPLICATION 2, DISTRIBUTE 3 BY HASH etc. (I can't remember
>> what the table definitions look like), such that the table would be
>> replicated to 2 sets of 3 nodes.
>
> As you seem to be aware, XC currently only supports horizontal partitioning,
> meaning that tuples are present on each node in a complete form with all the
> column data.
> So let's call your feature partial horizontal partitioning... or
> something like that.
I prefer to think of it as true horizontal scaling rather than a form
of partitioning, as partitioning is only part of what it would do. :)
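To make the idea concrete, hypothetical DDL (the syntax here is entirely
made up, it is just to illustrate the intent) might look something like:

    CREATE TABLE orders (
        id      bigint,
        payload text
    )
    DISTRIBUTE BY HASH (id) TO 3 NODES
    REPLICATE 2;

i.e. rows are hashed across 3 shards and each shard is kept on 2
datanodes, giving 6 datanodes in total.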
>
>>
>> This is interesting because it can provide a flexible tradeoff between
>> full write scalability (current Postgres-XC distribute) and full read
>> scalability (Postgres-XC replicate or other slave solutions).
>> What is most useful about this setup is that with Postgres-XC it can be
>> maintained transparently without middleware and configured to be fully
>> synchronous multi-master etc.
>
> Do you have some examples of applications that might require that?
The applications are no different; it is more about the SLA/uptime
requirements and an overall reduction in complexity.
In the current XC architecture datanodes need to be made highly
available themselves; this change would shift the onus of high
availability away from individual datanodes to the coordinators etc.
The main advantage here is the reduction in moving parts and the query
engine's better awareness of the state of the system.
In theory, if something along these lines could be implemented, you
could use the REPLICATE/DISTRIBUTE strategy above to maintain the
ability to service queries with up to 3 out of 6 servers down, as long
as you lost the right 3 (the entirety of one DISTRIBUTE cluster).
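To illustrate, assume a layout along these lines (node names made up):

    replica set A: shard1 on n1, shard2 on n2, shard3 on n3
    replica set B: shard1 on n4, shard2 on n5, shard3 on n6

Losing n4, n5 and n6 together still leaves every shard available on
set A, whereas losing n1 and n4 together takes shard1 offline entirely.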
As you are probably already aware, current replication solutions for
Postgres don't play nicely with other middleware, as there hasn't
really been much integration up until now (streaming replication is
starting to change this, but its integration with other middleware and
applications is still poor overall).
>
>>
>>
>> Are there significant technical challenges to the above, and is this
>> something the Postgres-XC team would be interested in?
>
> The code would need to be changed in many places and might require some
> effort, especially for cursors and join determination on the planner side.
>
> Another critical choice I see here is related to the preference strategy
> for node choice.
> For example, in your case, the table is replicated on 3 nodes and
> distributed on 3 nodes by hash.
> When a simple read query arrives at the XC level, we need to make XC
> aware of which set of nodes to prefer.
> A simple table-based session parameter could manage that, but is it
> user-friendly?
> A way to choose the set of nodes automatically would be to evaluate, with
> a global system of statistics, the read/write load each set of nodes
> carries for each table, and to pick the least-loaded set of nodes at the
> moment the query is planned. This is far more complicated, however.
This is true. My first thought was quite similar.
Taking the same example as above, where one has a total of 6 datanodes
forming 2 sets of a 3-node distributed table, there are 2 nodes that
can service each read request.
One could use a simple round-robin approach to generate the
aforementioned preference table, which would look somewhat like this:
       | shard1 | shard2 | shard3
  rep1 |   1    |   2    |   1
  rep2 |   2    |   1    |   2
(Here each cell is the read priority of that replica set for that shard.)
This would allow both online and offline optimisation, either by
internal processes or by manual intervention from the operator.
Being so simple, such a table is very easy to autogenerate; see the
sketch below. For a HASH-style distribution, read queries should end up
uniformly distributed across shard replicas.
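As a rough sketch of autogenerating the priorities (this is plain
PostgreSQL, just to show the round-robin assignment; the shard and
replica numbering is illustrative):

    -- rotate the preferred replica set per shard:
    -- shard1 -> rep1, shard2 -> rep2, shard3 -> rep1
    SELECT shard,
           2 - (shard % 2) AS first_choice_replica,
           1 + (shard % 2) AS second_choice_replica
    FROM generate_series(1, 3) AS shard;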
Personally I think the more complicated part is restoring shard
replicas that have been out of the cluster for some time.
In my opinion it would be best to have XC do a row-based restore,
because XC has a lot of information that could make this process very
fast.
In the case where one has many replicas configured (say 3 or more),
the read queries required to bring either a stale replica or a
completely new, empty replica up to date could be distributed across
the other replica members.
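As a very rough sketch of the kind of row-based catch-up I have in mind
(plain PostgreSQL using the dblink extension; table, column and node
names are made up, and XC would presumably drive this internally and
spread the reads across the surviving replicas):

    -- copy any rows the recovering replica is missing from a healthy peer
    INSERT INTO my_table
    SELECT t.id, t.payload
    FROM dblink('host=healthy_node dbname=app',
                'SELECT id, payload FROM my_table')
         AS t(id bigint, payload text)
    WHERE NOT EXISTS (SELECT 1 FROM my_table m WHERE m.id = t.id);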
> --
> Michael Paquier
> http://michael.otacoo.com
I am aware that the proposal is quite broad (from a technical
perspective), but what I am mainly trying to ascertain is whether it is
in conflict with the current XC team's vision.
Joseph.
--
CTO | Orion Virtualisation Solutions | www.orionvm.com.au
Phone: 1300 56 99 52 | Mobile: 0428 754 846