From: Koichi S. <koi...@gm...> - 2012-07-09 00:49:27
I also appreciate the excellent summary and findings on XC and HA capabilities/potential. What I'm wondering about on the HA side is:

1) Can we fix the target HA middleware for wider use? I agree Corosync/Pacemaker is a very nice platform and it will be a good idea to start with it. Should we do some more work to support other HA middleware, not necessarily open source?

2) Level of integration. Is it sufficient to add resource agents for the HA middleware? I wonder whether this could be too primitive. Do we need some more sophisticated configuration tools? Is monitoring and automatic failover alone sufficient? As pointed out, XC can continue cluster operation even though some nodes are gone. Can Corosync/Pacemaker decide when the XC cluster needs to be shut down and when it can continue operation? (So far, I'm afraid this situation is too complicated for Corosync/Pacemaker to handle.)

3) Other middleware integration. For example, do we need some more tools to work with other operation support tools?

I believe we need many more ideas, much more experience and further discussion on these issues, even while we begin with more primitive things. To make the discussion a little more concrete, two rough sketches follow at the end of this mail: one of what an external failover agent for a datanode would have to do, and one of the consistent hashing plus replica count scheme Joseph describes. I would really appreciate any further input on this.

Best Regards;
----------
Koichi Suzuki


2012/7/7 Nikhil Sontakke <ni...@st...>:
>> In terms of how difficult it is to integrate into core vs. using other
>> middleware to achieve HA properties - I don't think it's easy to
>> come up with an answer (at least one that isn't highly opinionated).
>> I spent a few days building an XC cluster with streaming replication
>> for each datanode + scripting failover events and recovery etc.
>> The main issues I found were effectively along the lines of lack of
>> integration. Configuring each datanode with different WAL archive
>> stores and recovery commands is very painful, and it is difficult to
>> understand the implications of doing so.
>> I did make an attempt at fixing this with even more middleware
>> (pgpool+repmgr) but gave up after deciding that it's far too many
>> moving parts for a DBMS for me to consider using it.
>> I just can't see how so many pieces of completely disparate software
>> can possibly know enough about the state of the system to make
>> reasonable decisions with my data, which leaves me with developing my
>> own manager to control them all.
>> Streaming replication is also quite limited, as it allows you to
>> replicate entire nodes only.
>>
>> But enough opinion. Some facts from current DBMSs that are using
>> similar replication strategies.
>> I say similar because none of them have quite the same architecture as XC.
>>
>> Cassandra[1] uses consistent hashing + a replica count to achieve both
>> horizontal partitioning and replication for read/write scalability.
>> This has some interesting challenges for them, mostly stemming from the
>> cluster size changing dynamically and from maintaining the consistent
>> hashing rings and resilvering them.
>> In my opinion this is made harder by the fact that it uses cluster gossip
>> without any node coordinator, along with its eventual consistency
>> guarantees.
>>
>> Riak[2] also uses consistent hashing, however on a per-'bucket'
>> basis where you can set a replication count.
>>
>> There are a bunch more too, like LightCloud, Voldemort, DynamoDB,
>> BigTable, HBase etc.
>>
>> I appreciate these aren't RDBMS systems, but I don't believe that is a
>> big deal; it's perfectly viable to have a fully horizontally scaling
>> RDBMS too, it just doesn't exist yet.
>> In fact, by having proper global transaction management I think this is
>> made considerably easier and more reliable. Eventual consistency and
>> no actual master node are not, I think, good concessions to make.
>> For the most part, having a global picture of the state of all data is
>> probably the biggest advantage of implementing this in XC vs. other
>> solutions.
>>
>> Other major advantages are:
>>
>> a) Service impact from the loss of datanodes is minimized (non-existent
>> in the case of losing only replicas), whereas using middleware requires
>> an orchestrated failover
>> b) Time to recovery (in terms of read performance) is reduced
>> considerably because XC is able to implement a distributed recovery of
>> out-of-date nodes
>> c) Per-table replication management (XC already has this, but it would
>> be even more valuable with composite partitioning)
>> d) Increased read performance, where replicas can be used to speed up
>> read-heavy workloads and lessen the impact of read hotspots
>> e) An in-band heartbeat can be used to determine failover requirements:
>> no scripting or other points of failure
>> f) Components required to facilitate recovery could also be used to do
>> online repartitioning (i.e. increasing the size of the cluster)
>> g) Probably the world's first real distributed RDBMS
>>
>> Obvious disadvantages are:
>> a) A lot of work, difficult, hard etc. (this is actually the biggest
>> barrier; there are lots of very difficult challenges in partitioning
>> data)
>> b) Making use of most of the features of said composite table
>> partitioning is quite difficult; it would take a long time to optimize
>> the query planner to make good use of them.
>>
>> There are probably more, but they would most probably require a proper
>> devil's advocate to reveal them (I am human and set in my opinions,
>> unfortunately).
>>
>
> Excellent research and summary, Joseph!
>
> The (a) in the disadvantages mentioned above really stands out. First
> the work needs to be quantified in terms of how best to get HA going,
> and then it just needs to be done over whatever time period it takes.
>
> However, I believe we can mitigate some of the issues with (a) by using
> a mixed approach of employing off-the-shelf technologies and then
> modifying the core just enough to make it amenable to them.
>
> For example, the Corosync/Pacemaker stack is a very solid platform to
> base HA work on. Have you looked at it, and do you have any thoughts
> around it?
>
> And although you mentioned setting up replicas as painful and cumbersome,
> I think it's not such a "difficult" process really, and it can even be
> automated. Having replicas for datanodes helps us do away with the
> custom replication/partitioning strategy that you point out above. I
> believe that also does away with some of the technical challenges it
> poses, as you pointed out in the case of Cassandra above. So
> this can be a huge plus in terms of keeping things simple technology-wise.
>
> The Corosync/Pacemaker stack, replicas and focused enhancements to the
> core to enable sane behavior in case of failover seem to me to be a
> simple and doable strategy.
>
> Regards,
> Nikhils
> --
> StormDB - http://www.stormdb.com
> The Database Cloud
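
As mentioned above, here is a rough sketch of what an external failover agent for a single datanode - whether a Pacemaker resource agent or a hand-rolled monitor of the kind Joseph describes - would have to do: promote the streaming-replication standby, then repoint every coordinator at it. Please take it only as a sketch under assumptions: the host names, ports, paths, the ssh-based promotion and the psycopg2 driver are placeholders, not a design. The XC-specific step, if I remember the DDL correctly, is ALTER NODE followed by pgxc_pool_reload().

    # Rough sketch of one datanode failover step - NOT production code.
    # Assumptions (all illustrative): each datanode has one streaming-
    # replication standby, ssh access is available for pg_ctl promote,
    # and psycopg2 is used to reach each coordinator.

    import subprocess
    import psycopg2

    COORDINATORS = [("coord1-host", 5432), ("coord2-host", 5432)]   # placeholders
    STANDBY = {"host": "dn1-standby-host", "port": 15432,           # placeholders
               "pgdata": "/data/dn1", "pgbin": "/usr/local/pgsql/bin"}

    def promote_standby():
        # Promote the standby of the failed datanode so it accepts writes.
        subprocess.check_call([
            "ssh", STANDBY["host"],
            "%s/pg_ctl promote -D %s" % (STANDBY["pgbin"], STANDBY["pgdata"])])

    def repoint_coordinators(node_name):
        # Tell every coordinator that the datanode now lives on the standby,
        # then reload its connection pool so new sessions go there.
        for host, port in COORDINATORS:
            conn = psycopg2.connect(host=host, port=port,
                                    dbname="postgres", user="postgres")
            conn.autocommit = True
            cur = conn.cursor()
            # node_name is an identifier, so it is formatted in directly
            # (trusted input in this sketch); host/port go in as parameters.
            cur.execute("ALTER NODE {0} WITH (HOST = %s, PORT = %s)".format(node_name),
                        (STANDBY["host"], STANDBY["port"]))
            cur.execute("SELECT pgxc_pool_reload()")
            conn.close()

    if __name__ == "__main__":
        promote_standby()
        repoint_coordinators("dn1")

Even this small sketch shows why I wonder whether a plain resource agent is too primitive: the agent has to touch every coordinator, and it has to know whether the cluster as a whole can keep running with the nodes that remain.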
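
And, for completeness, a minimal sketch of the consistent hashing plus replica count scheme Joseph describes for Cassandra and Riak. This is only the textbook version (no virtual nodes, no gossip, no hinted handoff), so please do not read it as either system's actual implementation:

    # Textbook consistent hash ring with a fixed replica count.
    import bisect
    import hashlib

    class HashRing(object):
        def __init__(self, nodes, replicas=2):
            self.replicas = replicas   # copies of each key, primary included
            self.ring = sorted((self._hash(n), n) for n in nodes)
            self.tokens = [t for t, _ in self.ring]

        @staticmethod
        def _hash(key):
            return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

        def nodes_for(self, key):
            # The first node clockwise from the key's hash is the primary;
            # the next distinct nodes around the ring hold the replicas.
            start = bisect.bisect(self.tokens, self._hash(key)) % len(self.ring)
            picked = []
            for i in range(len(self.ring)):
                node = self.ring[(start + i) % len(self.ring)][1]
                if node not in picked:
                    picked.append(node)
                if len(picked) == self.replicas:
                    break
            return picked

    ring = HashRing(["dn1", "dn2", "dn3", "dn4"], replicas=2)
    print(ring.nodes_for("customer:42"))   # a primary plus one replica out of dn1..dn4

Adding or removing a node only remaps the keys adjacent to it on the ring; that remapping is the "resilvering" work Joseph mentions, and it is exactly the kind of bookkeeping we would avoid by keeping XC's fixed datanode list plus per-datanode replicas, as Nikhil suggests.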