From: Joseph G. <jos...@or...> - 2012-07-06 21:33:53
On 7 July 2012 06:55, Nikhil Sontakke <ni...@st...> wrote:
>> In terms of how difficult it is to integrate into core vs. using other
>> middleware to achieve HA properties - I don't think it's easy to come
>> up with an answer (at least one that isn't highly opinionated).
>> I spent a few days building an XC cluster with streaming replication
>> for each datanode plus scripting failover events, recovery, etc.
>> The main issues I found were, effectively, a lack of integration.
>> Configuring each datanode with different WAL archive stores and
>> recovery commands is very painful, and it is difficult to understand
>> the implications of doing so.
>> I did make an attempt at fixing this with even more middleware
>> (pgpool + repmgr) but gave up after deciding that it's far too many
>> moving parts for a DBMS for me to consider using it.
>> I just can't see how so many pieces of completely disparate software
>> can possibly know enough about the state of the system to make
>> reasonable decisions with my data, which leaves me with developing my
>> own manager to control them all.
>> Streaming replication is also quite limited, as it only allows you to
>> replicate entire nodes.
>>
>> But enough opinion. Some facts from current DBMSs that are using
>> similar replication strategies.
>> I say similar because none of them have quite the same architecture as XC.
>>
>> Cassandra[1] uses consistent hashing plus a replica count to achieve
>> both horizontal partitioning and replication for read/write
>> scalability. This has some interesting challenges for them, mostly
>> stemming from the cluster size changing dynamically: maintaining the
>> consistent hashing ring and resilvering nodes.
>> In my opinion this is made harder by the fact that it uses cluster
>> gossip without any node coordinator, along with its eventual
>> consistency guarantees.
>>
>> Riak[2] also uses consistent hashing, but on a per-'bucket' basis
>> where you can set a replication count.
>>
>> There are a bunch more too, like LightCloud, Voldemort, DynamoDB,
>> BigTable, HBase, etc.
>>
>> I appreciate these aren't RDBMS systems, but I don't believe that is a
>> big deal; it's perfectly viable to have a fully horizontally scaling
>> RDBMS too, it just doesn't exist yet.
>> In fact, by having proper global transaction management I think this
>> is made considerably easier and more reliable. I don't think eventual
>> consistency and the lack of an actual master node are good concessions
>> to make.
>> For the most part, having a global picture of the state of all data is
>> probably the biggest advantage of implementing this in XC vs. other
>> solutions.
>>
>> Other major advantages are:
>>
>> a) Service impact from the loss of datanodes is minimized
>> (non-existent in the case of losing only replicas), whereas using
>> middleware requires an orchestrated failover
>> b) Time to recovery (in terms of read performance) is reduced
>> considerably because XC is able to implement a distributed recovery of
>> out-of-date nodes
>> c) Per-table replication management (XC already has this, but it would
>> be even more valuable with composite partitioning)
>> d) Increased read performance, where replicas can be used to speed up
>> read-heavy workloads and lessen the impact of read hotspots
>> e) In-band heartbeat can be used to determine fail-over requirements;
>> no scripting or other points of failure
>> f) Components required to facilitate recovery could also be used to do
>> online repartitioning (i.e. increasing the size of the cluster)
>> g) Probably the world's first real distributed RDBMS
>>
>> Obvious disadvantages are:
>> a) A lot of work; difficult, hard, etc. (this is actually the biggest
>> barrier - there are lots of very difficult challenges in partitioning
>> data)
>> b) Making use of most of the features of said composite table
>> partitioning is quite difficult; it would take a long time to optimize
>> the query planner to make good use of them.
>>
>> There are probably more, but they would require a proper devil's
>> advocate to reveal them (I am human and set in my opinions,
>> unfortunately).
>>
>
> Excellent research and summary Joseph!
>
> The (a) in the disadvantages mentioned above really stands out. First
> the work needs to be quantified in terms of how best to get HA going,
> and then it just needs to be done over whatever time period it takes.
>
> However, I believe we can mitigate some of the issues with (a) by using
> a mixed approach of employing off-the-shelf technologies and then
> modifying the core just so as to make it amenable for them.
>
> For example, the corosync/pacemaker stack is a very solid platform to
> base HA work on. Have you looked at it and do you have any thoughts
> around it?

Yes, I have worked on several projects that use it as a messaging layer
and think it's a great base. :)

> And although you mentioned setting up replicas as painful and
> cumbersome, I think it's not such a "difficult" process really and can
> even be automated. Having replicas for datanodes helps us do away with
> the custom replication/partitioning strategy that you point out above.
> I believe that also does away with some of the technical challenges
> that it poses, as you pointed out in the case of Cassandra above too.
> So this can be a huge plus in terms of keeping things simple
> technology wise.
>
> Corosync/Pacemaker stack, replicas and focussed enhancements to the
> core to enable sane behavior in case of failover seems to me to be a
> simple and doable strategy.

Are you suggesting something along the lines of full node replication
using streaming replication but managed by XC? I think that is most
definitely a decent place to start; it's a lot less radical but provides
a large number of the aforementioned benefits for less effort.
If XC is fully aware of the replication it can also use the standby
datanodes as read-slaves with very little work. I might explore how easy
that is to implement this weekend.
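
For reference, this is roughly what currently has to be repeated by hand
for every datanode master/standby pair (hostnames, ports, paths and the
replication role below are purely illustrative):

  # datanode master, postgresql.conf - standard 9.1-style streaming replication
  wal_level = hot_standby
  max_wal_senders = 3
  archive_mode = on
  archive_command = 'cp %p /wal_archive/datanode1/%f'   # separate archive per datanode

  # datanode standby, recovery.conf
  standby_mode = 'on'
  primary_conninfo = 'host=dn1-master port=15432 user=replicator'
  restore_command = 'cp /wal_archive/datanode1/%f %p'

  # datanode standby, postgresql.conf - so it can serve reads as a read-slave
  hot_standby = on

Multiply that (plus the failover scripting around it) by every datanode,
and the lack of integration I complained about above becomes pretty
clear; if XC generated and monitored this itself, most of that pain
would go away.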
>
> Regards,
> Nikhils
> --
> StormDB - http://www.stormdb.com
> The Database Cloud

Joseph.

--
CTO | Orion Virtualisation Solutions | www.orionvm.com.au
Phone: 1300 56 99 52 | Mobile: 0428 754 846 |