From: Joseph G. <jos...@or...> - 2012-07-06 21:33:53
On 7 July 2012 06:55, Nikhil Sontakke <ni...@st...> wrote:
>> In terms of how difficult it is to integrate into core vs. using other
>> middleware to achieve HA properties - I don't think it's easy to come
>> up with an answer (at least one that isn't highly opinionated).
>> I spent a few days building an XC cluster with streaming replication
>> for each datanode plus scripting failover events, recovery, etc.
>> The main issues I found were, effectively, a lack of integration.
>> Configuring each datanode with different WAL archive stores and
>> recovery commands is very painful, and it is difficult to understand
>> the implications of doing so.
>> I did make an attempt at fixing this with even more middleware
>> (pgpool + repmgr) but gave up after deciding that it's far too many
>> moving parts for a DBMS for me to consider using it.
>> I just can't see how so many pieces of completely disparate software
>> can possibly know enough about the state of the system to make
>> reasonable decisions with my data, which leaves me with developing my
>> own manager to control them all.
>> Streaming replication is also quite limited, as it only allows you to
>> replicate entire nodes.
>>
>> But enough opinion. Some facts from current DBMSs that are using
>> similar replication strategies.
>> I say similar because none of them have quite the same architecture as XC.
>>
>> Cassandra[1] uses consistent hashing plus a replica count to achieve
>> both horizontal partitioning and replication for read/write
>> scalability. This has some interesting challenges for them, mostly
>> stemming from the cluster size changing dynamically: maintaining the
>> consistent hashing ring and resilvering nodes.
>> In my opinion this is made harder by the fact that it uses cluster
>> gossip without any node coordinator, along with its eventual
>> consistency guarantees.
>>
>> Riak[2] also uses consistent hashing, but on a per-'bucket' basis
>> where you can set a replication count.
>>
>> There are a bunch more too, like LightCloud, Voldemort, DynamoDB,
>> BigTable, HBase, etc.
>>
>> I appreciate these aren't RDBMS systems, but I don't believe that is a
>> big deal; it's perfectly viable to have a fully horizontally scaling
>> RDBMS too, it just doesn't exist yet.
>> In fact, by having proper global transaction management I think this
>> is made considerably easier and more reliable. I don't think eventual
>> consistency and the lack of an actual master node are good concessions
>> to make.
>> For the most part, having a global picture of the state of all data is
>> probably the biggest advantage of implementing this in XC vs. other
>> solutions.
>>
>> Other major advantages are:
>>
>> a) Service impact from the loss of datanodes is minimized
>> (non-existent in the case of losing only replicas), whereas using
>> middleware requires an orchestrated failover
>> b) Time to recovery (in terms of read performance) is reduced
>> considerably because XC is able to implement a distributed recovery of
>> out-of-date nodes
>> c) Per-table replication management (XC already has this, but it would
>> be even more valuable with composite partitioning)
>> d) Increased read performance, where replicas can be used to speed up
>> read-heavy workloads and lessen the impact of read hotspots
>> e) In-band heartbeat can be used to determine fail-over requirements;
>> no scripting or other points of failure
>> f) Components required to facilitate recovery could also be used to do
>> online repartitioning (i.e. increasing the size of the cluster)
>> g) Probably the world's first real distributed RDBMS
>>
>> Obvious disadvantages are:
>> a) A lot of work; difficult, hard, etc. (this is actually the biggest
>> barrier - there are lots of very difficult challenges in partitioning
>> data)
>> b) Making use of most of the features of said composite table
>> partitioning is quite difficult; it would take a long time to optimize
>> the query planner to make good use of them.
>>
>> There are probably more, but they would require a proper devil's
>> advocate to reveal them (I am human and set in my opinions,
>> unfortunately).
>>
>
> Excellent research and summary Joseph!
>
> The (a) in the disadvantages mentioned above really stands out. First
> the work needs to be quantified in terms of how best to get HA going,
> and then it just needs to be done over whatever time period it takes.
>
> However, I believe we can mitigate some of the issues with (a) by using
> a mixed approach of employing off-the-shelf technologies and then
> modifying the core just so as to make it amenable for them.
>
> For example, the corosync/pacemaker stack is a very solid platform to
> base HA work on. Have you looked at it and do you have any thoughts
> around it?

Yes, I have worked on several projects that use it as a messaging layer
and think it's a great base. :)

> And although you mentioned setting up replicas as painful and
> cumbersome, I think it's not such a "difficult" process really and can
> even be automated. Having replicas for datanodes helps us do away with
> the custom replication/partitioning strategy that you point out above.
> I believe that also does away with some of the technical challenges
> that it poses, as you pointed out in the case of Cassandra above too.
> So this can be a huge plus in terms of keeping things simple
> technology wise.
>
> Corosync/Pacemaker stack, replicas and focussed enhancements to the
> core to enable sane behavior in case of failover seems to me to be a
> simple and doable strategy.

Are you suggesting something along the lines of full node replication
using streaming replication but managed by XC? I think that is most
definitely a decent place to start; it's a lot less radical but provides
a large number of the aforementioned benefits for less effort.
If XC is fully aware of the replication it can also use the standby
datanodes as read-slaves with very little work. I might explore how easy
that is to implement this weekend.
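
For reference, this is roughly what currently has to be repeated by hand
for every datanode master/standby pair (hostnames, ports, paths and the
replication role below are purely illustrative):

  # datanode master, postgresql.conf - standard 9.1-style streaming replication
  wal_level = hot_standby
  max_wal_senders = 3
  archive_mode = on
  archive_command = 'cp %p /wal_archive/datanode1/%f'   # separate archive per datanode

  # datanode standby, recovery.conf
  standby_mode = 'on'
  primary_conninfo = 'host=dn1-master port=15432 user=replicator'
  restore_command = 'cp /wal_archive/datanode1/%f %p'

  # datanode standby, postgresql.conf - so it can serve reads as a read-slave
  hot_standby = on

Multiply that (plus the failover scripting around it) by every datanode,
and the lack of integration I complained about above becomes pretty
clear; if XC generated and monitored this itself, most of that pain
would go away.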
>
> Regards,
> Nikhils
> --
> StormDB - http://www.stormdb.com
> The Database Cloud

Joseph.

--
CTO | Orion Virtualisation Solutions | www.orionvm.com.au
Phone: 1300 56 99 52 | Mobile: 0428 754 846 |