From: Nikhil S. <ni...@st...> - 2012-07-06 20:56:17
> In terms of how difficult it is to integrate into core vs. using other
> middleware to achieve HA properties - I don't think it's easy to come
> up with an answer (at least not one that isn't highly opinionated).
> I spent a few days building an XC cluster with streaming replication
> for each datanode + scripting failover events and recovery etc.
> The main issues I found effectively came down to a lack of
> integration. Configuring each datanode with different WAL archive
> stores and recovery commands is very painful, and it is difficult to
> understand the implications of.
> I did make an attempt at fixing this with even more middleware
> (pgpool+repmgr) but gave up after deciding that it's far too many
> moving parts for a DBMS for me to consider using it.
> I just can't see how so many pieces of completely disparate software
> can possibly know enough about the state of the system to make
> reasonable decisions with my data, which leaves me with developing my
> own manager to control them all.
> Streaming replication is also quite limited, as it only allows you to
> replicate entire nodes.
>
> But enough opinion. Some facts from current DBMSs that use similar
> replication strategies. I say similar because none of them has quite
> the same architecture as XC.
>
> Cassandra[1] uses consistent hashing + a replica count to achieve both
> horizontal partitioning and replication for read/write scalability.
> This has some interesting challenges for them, mostly stemming from
> the cluster size changing dynamically and from maintaining and
> resilvering the consistent hashing rings.
> In my opinion this is made harder by the fact that it uses cluster
> gossip without any node coordinator, along with its eventual
> consistency guarantees.
>
> Riak[2] also uses consistent hashing, but on a per-'bucket' basis
> where you can set a replication count.
>
> There are a bunch more too, like LightCloud, Voldemort, DynamoDB,
> BigTable, HBase etc.
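
To make the consistent hashing + replica count scheme described above a
bit more concrete, here is a rough sketch of that style of placement (a
toy ring in plain Python, not Cassandra's or Riak's actual code; the
datanode names, virtual-node count and replica count are made up for
illustration):

    import bisect
    import hashlib

    def ring_hash(key):
        # Map a string onto a position on the hash ring.
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    class Ring:
        def __init__(self, nodes, vnodes=64):
            # Each physical node owns several virtual points on the ring,
            # which evens out the key distribution.
            self._points = sorted(
                (ring_hash("%s#%d" % (node, i)), node)
                for node in nodes
                for i in range(vnodes)
            )

        def replicas(self, key, n=3):
            # Walk clockwise from the key's ring position and collect the
            # first n distinct physical nodes; those hold the replicas.
            owners = []
            start = bisect.bisect(self._points, (ring_hash(key),))
            for i in range(len(self._points)):
                node = self._points[(start + i) % len(self._points)][1]
                if node not in owners:
                    owners.append(node)
                if len(owners) == n:
                    break
            return owners

    ring = Ring(["dn1", "dn2", "dn3", "dn4"])
    print(ring.replicas("some_row_key", n=2))  # two distinct datanodes for this key

Losing a node then only affects the keys whose replica sets included it,
which is essentially the property the systems above rely on.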
> I appreciate these aren't RDBMS systems, but I don't believe that is a
> big deal; it's perfectly viable to have a fully horizontally scaling
> RDBMS too, it just doesn't exist yet.
> In fact, by having proper global transaction management I think this
> is made considerably easier and more reliable. I don't think eventual
> consistency and the lack of an actual master node are good concessions
> to make.
> For the most part, having a global picture of the state of all data is
> probably the biggest advantage of implementing this in XC vs. other
> solutions.
>
> Other major advantages are:
>
> a) Service impact from the loss of datanodes is minimized
> (non-existent in the case of losing only replicas), whereas using
> middleware requires an orchestrated failover
> b) Time to recovery (in terms of read performance) is reduced
> considerably, because XC is able to implement a distributed recovery
> of out-of-date nodes
> c) Per-table replication management (XC already has this, but it would
> be even more valuable with composite partitioning)
> d) Increased read performance, where replicas can be used to speed up
> read-heavy workloads and lessen the impact of read hotspots
> e) In-band heartbeat can be used to determine failover requirements -
> no scripting or other points of failure
> f) Components required to facilitate recovery could also be used to do
> online repartitioning (i.e. increasing the size of the cluster)
> g) Probably the world's first real distributed RDBMS
>
> Obvious disadvantages are:
> a) A lot of work, difficult, hard, etc. (this is actually the biggest
> barrier; there are lots of very difficult challenges in partitioning
> data)
> b) Making use of most of the features of said composite table
> partitioning is quite difficult; it would take a long time to optimize
> the query planner to make good use of them.
>
> There are probably more, but they would most likely require a proper
> devil's advocate to reveal them (I am human and set in my opinions,
> unfortunately).

Excellent research and summary, Joseph!

The (a) in the disadvantages mentioned above really stands out. First the
work needs to be quantified in terms of how best to get HA going, and then
it just needs to be done over whatever time period it takes.

However, I believe we can mitigate some of the issues with (a) by using a
mixed approach: employing off-the-shelf technologies and then modifying
the core just enough to make it amenable to them. For example, the
corosync/pacemaker stack is a very solid platform to base HA work on. Have
you looked at it, and do you have any thoughts around it?

And although you described setting up replicas as painful and cumbersome,
I think it's not such a "difficult" process really, and it can even be
automated. Having replicas for datanodes helps us do away with the custom
replication/partitioning strategy that you point out above. I believe that
also does away with some of the technical challenges it poses, as you
pointed out in the case of Cassandra above. So this can be a huge plus in
terms of keeping things simple technology-wise.

The corosync/pacemaker stack, replicas, and focused enhancements to the
core to enable sane behavior in case of failover seem to me to be a simple
and doable strategy.

Regards,
Nikhils

--
StormDB - http://www.stormdb.com
The Database Cloud
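
For reference, one datanode replica of the kind discussed above comes down
to roughly the following under 9.1-era streaming replication; the paths,
hostname, port and archive location are only placeholders, and a real
setup would still need the failover handling (e.g. via pacemaker) layered
on top:

    # postgresql.conf on the datanode being replicated
    wal_level = hot_standby
    max_wal_senders = 3
    archive_mode = on
    archive_command = 'cp %p /archive/dn1/%f'      # per-datanode archive store

    # recovery.conf on that datanode's replica
    standby_mode = 'on'
    primary_conninfo = 'host=dn1-host port=15432 user=replicator'
    restore_command = 'cp /archive/dn1/%f %p'      # must match the archive above
    trigger_file = '/tmp/dn1_promote'              # touched to promote on failover

Repeating this once per datanode, with a different archive path each time,
is exactly the kind of repetition that could be automated or folded into
the core as discussed above.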