From: Mark L. <ml...@re...> - 2009-01-30 12:46:39
Hi Paul. Sorry for the delay in replying ... On 9 Dec 2008, at 18:33, Paul Ferraro wrote: >> >> OK, so failures are agreed across all clients and all members of the >> replica group? What's the protocol for rejoining the group? Does >> there >> have to be a quiescent period, for instance? > > Database deactivations (i.e., the action taken in response to singular > failures) are broadcast to all client nodes. This is not so much an > agreement as it is a decree - it cannot be vetoed. A replica group > member (i.e. a database) has no knowledge of the clients, nor other > replica group members. > Rejoining the replica group (i.e. for a database to become active > again), does indeed require a quiescent period. This is enforced by a > distributed read-write lock. On the whole database or just on the data that needs updating? The former is obviously easier but leads to more false locking, whereas the latter can be harder to ensure but does allow for continuous use of the remaining database tables (cf. page-level locking versus object-level locking). > New transactions (both local and XA) > acquire local read locks on begin and release them on commit/rollback. Locking what? > The > acquisition of a distributed write lock effectively blocks any > database > writes across the whole cluster, while still allowing database reads. OK, so this sounds like it's a lock on the database itself rather than any specific data items. However, it also seems like you're not using the traditional exclusive-writer, concurrent-readers model. So how do you prevent dirty reads of data in that case? Are you assuming optimistic locking? > > The activation process is: > 1. Acquire distributed write lock > 2. Synchronize against an active database using a specified > synchronization strategy > 3. Broadcast an activation message to all client nodes I don't quite get the need for 3, but then this may have been covered before. Why doesn't the newly activated replica simply rejoin the group? 
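[Editor's sketch] The read-write lock discipline Paul describes above - transactions take the read side on begin, database activation takes the write side - can be sketched in Java. This is an illustrative toy, not HA-JDBC code: a local ReentrantReadWriteLock stands in for the distributed lock, and the synchronize/broadcast steps are injected as Runnables.

```java
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of the lock discipline in the discussion above.
// A real deployment would use a *distributed* read-write lock; a local
// ReentrantReadWriteLock merely stands in for it here.
class ActivationLock {
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    // Called on transaction begin; the read lock is held until commit/rollback.
    public void beginTransaction() { lock.readLock().lock(); }
    public void endTransaction()   { lock.readLock().unlock(); }

    // Database activation: blocks until in-flight transactions finish, then
    // excludes new writes cluster-wide while the replica re-synchronizes.
    public void activate(Runnable synchronize, Runnable broadcastActivation) {
        lock.writeLock().lock();          // 1. acquire distributed write lock
        try {
            synchronize.run();            // 2. sync against an active database
            broadcastActivation.run();    // 3. tell all client nodes
        } finally {
            lock.writeLock().unlock();    // 4. release distributed write lock
        }
    }
}
```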
Aren't we just using the groupid to communicate and not sending messages from clients to specific replicas, i.e., multicast versus multiple unicasts? (BTW, I'm assuming "client" here means database user.) > > 4. Release distributed write lock > > N.B. Suspended transactions will block the activation process, since a > newly joined replica will have missed the prepare phase. > Heuristically completed transactions (as defined by the XAResource > proxy > (discussed below)) will also block the activation process. OK, that sounds good. >>> >>> This is not really an issue in HA-JDBC. In the XA scenario, HA-JDBC >>> exposes a single XAResource for the whole database cluster to the >>> transaction manager, while internally it manages an XAResource per >>> database. >> >> OK, so this single XAResource acts like a transaction coordinator >> then? With a log? > > Nope. The XAResource HA-JDBC exposes to the transaction manager is > merely a proxy to a set of underlying XAResources (one per database) > that implements HA behavior. It has no log. I'll go into detail > about > how this HA logic handles heuristic decisions later... OK, I'll wait ;-) > > >> Plus, the TM will not necessarily go through 2PC, >> but may in fact use a 1PC protocol. The problem here is that there >> are >> actually multiple resources registered (c.f. interposition or >> subordinate coordinators). The failures that can be returned from a >> 1PC interaction on an XAResource are actually different than those >> that can be returned during the full 2PC. What this means is that if >> you run into these failures behind your XAResource veneer, you will >> be >> incapable of giving the right information (right error codes and/or >> exceptions) back to the transaction manager and hence to the >> application. > > 1PC isn't any different from the perspective of HA-JDBC's XAResource > proxy. 
> For a given method invocation, HA-JDBC's XAResource proxy coordinates > the results from the individual XAResources it manages, deactivating > the > corresponding database if it responds anomalously (including > exceptions > with anomalous error codes). Yes - 1PC commits can throw additional > XA_RB* exceptions, but the corresponding HA logic is algorithmically > the > same. No, you misunderstand. If you are proxying as you mentioned, then you're effectively acting as a subordinate coordinator. Your XAResource is hiding 1..N different XAResource instances. In the case of the transaction manager, it sees only your XAResource. This means that if there are no other XAResources involved in the transaction, the coordinator is not going to drive your XAResource through 2PC: it'll call commit(true) and trigger the one-phase optimization. Now your XAResource knows that there are 1..N instances in reality that need coordinating. Let's assume it's 2..N because 1..1 is the degenerate case and is trivial to manage. This means that your XAResource is a subordinate coordinator and must run 2PC on behalf of the root coordinator. But unfortunately the route through the 2PC protocol allows for a number of different error conditions than the route through the 1PC termination protocol. So your XAResource will have to translate 2PC errors, for example, into 1PC (and sometimes that mapping is not one-to-one.) Do you handle that? > > >> As a slight aside: what do you do about non-deterministic SQL >> expressions such as may be involved with the current time? > > HA-JDBC can be configured to replace non-deterministic SQL expressions > with client generated values. This behavior is enabled per expression > type (e.g. eval-current-timestamp="true"): > http://ha-jdbc.sourceforge.net/doc.html#cluster > Of course, to avoid the cost of regular expression parsing, it's > best to > avoid these functions altogether. 
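[Editor's sketch] The eval-current-timestamp behaviour Paul mentions above can be illustrated with a small sketch. The names and the regular expression here are hypothetical, not HA-JDBC's actual parser: the idea is simply to rewrite the non-deterministic expression into a literal generated once on the client, so that every replica executes an identical statement.

```java
import java.sql.Timestamp;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch (not HA-JDBC's actual code) of replacing a
// non-deterministic SQL expression with a client-generated literal, so
// that every replica in the group executes the statement with the same value.
class ExpressionRewriter {
    private static final Pattern CURRENT_TIMESTAMP =
        Pattern.compile("CURRENT_TIMESTAMP", Pattern.CASE_INSENSITIVE);

    static String evalCurrentTimestamp(String sql, long clockMillis) {
        // One client-side value for the whole cluster; the replicas no
        // longer each evaluate their own (diverging) clocks.
        String literal = "'" + new Timestamp(clockMillis) + "'";
        return CURRENT_TIMESTAMP.matcher(sql)
            .replaceAll(Matcher.quoteReplacement(literal));
    }
}
```

As Paul notes, this regex pass has a per-statement cost, which is why he suggests avoiding such functions where possible.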
Yes, but if we want to opaquely replace JDBC with HA-JDBC then we need to assume that the client won't be doing anything to help us. > > >> Plus have >> you thought about serializability or linearizability issues across >> the >> replicas when multiple clients are involved? > > Yes. The biggest headaches are sequences and identity columns. > Currently, sequence next-value functions and inserts into tables > containing identity columns require distributed synchronous access > (per > sequence name, or per table name). This is very cumbersome, so I > generally encourage HA-JDBC users to use alternative, more > cluster-friendly identifier generation mechanisms. The cost for > sequences can be mitigated somewhat using a hi-lo algorithm (e.g. > SequenceHiLoGenerator in Hibernate). > For databases where DatabaseMetaData.supportsGetGeneratedKeys() = > true, > I plan to implement a master-slave approach, where the sequence next > value is executed against the master database, and alter sequence > statements are executed against the slaves. This is cleaner than the > current approach, though vendor support for > Statement.getGeneratedKeys() > is still fairly sparse. What are the performance implications so far and with your future plans? > > >>> Consequently, HA-JDBC can tolerate asymmetric failures in any >>> phase of a transaction (i.e. a phase fails against some databases, >>> but >>> succeeds against others) without forcing the whole transaction to >>> fail, >>> or become in-doubt, by pessimistically deactivating the failed >>> databases. >> >> I'm not so convinced that the impact on the transaction is so opaque, >> as I mentioned above. Obviously what we need to do is make a replica >> group appear to be identical to a single instance only more >> available. >> Simply gluing together multiple instances behind a single XAResource >> abstraction doesn't do it. Plus what about transaction recovery? 
By >> hiding the fact that some replicas may have generated heuristic >> outcomes whereas others haven't (because of failures), the >> transaction >> manager may go ahead and delete the log because it thinks that the >> transaction has completed successfully. That would be a very bad >> thing >> to do (think of crossing the streams in Ghostbusters ;-) Simply >> overwriting the failed heuristic replica db with a fresh copy from a >> successful db replica would also be a very bad thing to do (think >> letting the Keymaster and Gozer meet in Ghostbusters) because >> heuristics need to be resolved by the sys admin and the transaction >> log should not vanish until that has happened. So some kinds of >> failures of dbs need to be treated differently than just deactivating >> them. Can you cope with that? > > Perhaps I should go into more detail about what happens behind the > XAResource proxy. One of the primary responsibilities of the > XAResource > proxy is to coordinate the return values and/or exceptions from a > given > method invocation - and present them to the caller (in this case, the > transaction manager) as a single return value or exception. Yes, I kind of figured that, since it's a subordinate coordinator (see above). But if you're not recording these outcomes durably then your proxy is going to cause problems itself for transaction integrity. Or can you explain to me what happens in the situation where the proxy has obtained some outcomes from the 1..N XAResources but not all and then it crashes? > > > Let's look at a couple of result matrices of return values and error > codes for each XAResource for a replica group of 2 databases and the > coordinated result from the HA-JDBC XAResource proxy: > N.B. I've numbered each scenario for discussion purposes. 
> I've used wildcard (*) in some error codes for brevity > Exception codes are distinguished from return values using [] > > prepare(Xid): > > REF DB1 DB2 HA-JDBC > 1 XA_OK XA_OK XA_OK > 2 XA_RDONLY XA_RDONLY XA_RDONLY > 3 XA_OK XA_RDONLY XA_OK, deactivate db2 This should mean more than deactivating the database. If one resource thinks that there was no work done on the data it controls whereas the other does, that's a serious issue and you should rollback! commit isn't always the right way forward! This is different to some of the cases below because there are no errors reported from the resource. As such the problem is at the application level, or somewhere in the middleware stack. That fact alone means that anything the committable resource has done is suspect immediately. > > 4 XA_OK [XA_RB*] XA_OK, rollback and deactivate db2 > 5 XA_RDONLY [XA_RB*] XA_RDONLY, rollback and deactivate db2 > 6 [XA_RB*] [XA_RB*] [XA_RB*] (only if error code matches) > 7 [XA_RBBASE] [XA_RBEND] [XA_RBEND] (prefer higher error code), > rollback and deactivate db1 > 8 XA_OK [XAER_*] XA_OK, deactivate db2 > 9 XA_RDONLY [XAER_*] XA_RDONLY, deactivate db2 > 10 [XA_RB*] [XAER_*] [XA_RB*], deactivate db2 > 11 [XAER_*] [XAER_*] [XAER_*], chain exceptions if errors not the > same Hmmm, well that's not good either, because XAER_* errors can mean very different things (in the standard as well as in the implementations). So simply assuming that all of the databases are still in a position where they can be treated as a single instance is not necessarily a good idea. I understand all of what you're trying to do (it's just using the available copies replication protocol, which has been around for several decades). But even that required a level of durability at the recipient in order to deal with failures between it and the client. The recipient in this case is your XAResource proxy. 
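[Editor's sketch] The return-value rows of the prepare() matrix above (scenarios 1-3) reduce to a small coordination rule: prefer XA_OK, and deactivate any database whose vote deviates from the coordinated result. A toy rendering follows; the names are illustrative, the XA_RB*/XAER_* exception rows are omitted, and this is not HA-JDBC's actual code.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.transaction.xa.XAResource;

// Toy coordination of prepare() return values across a replica group,
// per scenarios 1-3 of the matrix above. Exception handling is omitted.
class PrepareCoordinator {
    static int coordinate(Map<String, Integer> votes, List<String> deactivated) {
        // XA_OK wins if any database returned it (scenario 3); otherwise
        // everyone voted XA_RDONLY (scenario 2).
        int result = votes.containsValue(XAResource.XA_OK)
            ? XAResource.XA_OK : XAResource.XA_RDONLY;
        for (Map.Entry<String, Integer> vote : votes.entrySet()) {
            if (vote.getValue() != result) {
                deactivated.add(vote.getKey());   // the deviant leaves the group
            }
        }
        return result;
    }
}
```

Mark's objection to scenario 3 is precisely about this rule: when one resource reports read-only work and another reports writes, deactivating the deviant may not be enough.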
> > > In summary, when the return values/exceptions from each database do > not > match, HA-JDBC makes a heuristic decision of its own. HA-JDBC makes > the reasonable assumption that the JDBC client (i.e. application) > expects a given database transaction to succeed, therefore, an XA_OK > is > preferable to a rollback request (i.e. XA_RB* exception). +1 > > > commit(Xid, false): > > REF DB1 DB2 HA-JDBC > 1 void void void > 2 void [XA_HEURCOM] void, forget db2 What happens if the call to forget fails, or your proxy crashes before it can be made? > > 3 void [XA_HEURRB] void, forget and deactivate db2 > 4 void [XA_HEURMIX] void, forget and deactivate db2 > 5 void [XA_HEURHAZ] void, forget and deactivate db2 Same question as for 2. > > 6 void [XAER_*] void, deactivate db2 > 7 [XA_HEURCOM] [XA_HEURCOM] [XA_HEURCOM] > 8 [XA_HEURCOM] [XA_HEURRB] [XA_HEURCOM], forget and deactivate db2 > > 9 [XA_HEURCOM] [XA_HEURMIX] [XA_HEURCOM], forget and deactivate db2 This is more concerning because you've got a mixed heuristic outcome which means that this resource was probably a subordinate coordinator itself or modifying multiple entries in the database. That'll make automatic recovery interesting unless you expect to copy one entire database image over the top of another? > > 10 [XA_HEURCOM] [XA_HEURHAZ] [XA_HEURCOM], forget and deactivate db2 As above, though this time we don't know what state db2 is in. > > 11 [XA_HEURCOM] [XAER_*] [XA_HEURCOM], deactivate db2 > 12 [XA_HEURRB] [XA_HEURRB] [XA_HEURRB] > 13 [XA_HEURRB] [XA_HEURMIX] [XA_HEURRB], forget and deactivate db2 > 14 [XA_HEURRB] [XA_HEURHAZ] [XA_HEURRB], forget and deactivate db2 > 15 [XA_HEURRB] [XAER_*] [XA_HEURRB], deactivate db2 > 16 [XA_HEURMIX] [XA_HEURMIX] [XA_HEURMIX], forget and deactivate > all but one db > 17 [XA_HEURMIX] [XA_HEURHAZ] [XA_HEURMIX], forget and deactivate db2 > 18 [XA_HEURMIX] [XAER_*] [XA_HEURMIX], deactivate db2 > 19 [XA_HEURHAZ] [XA_HEURHAZ] [XA_HEURHAZ], 
forget and deactivate > all but one db > 20 [XA_HEURHAZ] [XAER_*] [XA_HEURHAZ], deactivate db2 > 21 [XAER_*] [XAER_*] [XAER_*], chain exceptions if errors not the > same > > commit(Xid, true): > N.B. There are no heuristic outcomes in the 1PC case, since there is > no > prepare. Erm, not true as I mentioned above. If your XAResource proxy is driven through 1PC and there is more than one database then you cannot simply drive them through 1PC - you must run 2PC and act as a subordinate. If you do not then you are risking more heuristic outcomes and still your XAResource proxy will require durability. > > > DB1 DB2 HA-JDBC > void void void > void XA_RB* void, deactivate db2 > XA_RB* XA_RB* XA_RB* (only if error code matches) > XA_RBBASE XA_RBEND XA_RBEND (prefer higher error code), rollback > and deactivate db1 > > In general, the HA logic picks the best outcomes, auto-forgets > heuristic > decisions that turn out to be correct, aggressively deactivates the > deviants from the replica group, and tolerates at most 1 > incorrect/mixed/hazard heuristic decision to bubble back to the > transaction manager. > > I don't think that your Ghostbusters analogy is very accurate. Hiding > heuristic decisions (e.g. #2-5) from the transaction manager is not > dangerous in the realm of HA-JDBC. If a heuristic decision turns > out to > be incorrect, then we can deactivate that database. It will get > re-sync'ed before it is reactivated, so the fact that a heuristic > decision was made at all is no longer relevant. Nope. And this is what concerns me. You need to understand what heuristics are about and why they have been developed over the past 4 decades. The reason that transaction managers don't ignore them and try to "make things right" (or hide them) is because we don't have the necessary semantic information to resolve them automatically. You're saying that you mask a heuristic database and take it out of the group so it won't cause problems. 
Later it'll be fixed by having the state overwritten by some good copy. But you are ignoring the fact that you do not have recovery capabilities built in to your XAResource as far as I can tell and are assuming that calling forget and/or deactivating the database will limit the problem. In reality, particularly with heuristics that cause the resource to commit or roll back, there may be systems and services tied in to those databases at the back end which are triggered on commit/rollback and go and do something else (e.g., on rollback trigger the sending of a bad credit report on the user). In the real world, the TM would propagate the heuristic back to the end user and the sys admin would then go and either stop the credit report from being sent, or nullify it somehow. How can he do that? Because he understands the semantics of the operation that was being executed. Does simply making the table entry the same as in one of the copies that didn't throw the heuristic help here? No of course it doesn't. It's just part of the solution (and if you were the person with the bad credit report, I think you'd want the whole compensation not just a part.) > Before the resync - we > recover and forget any prepared/heuristically completed branches. See above. It doesn't help. Believe me: I've been in your position before, back in the early 1990's ;-) > If > the XAResource proxy itself throws an exception indicating an invalid > heuristic decision, only then do we let on to the transaction manager. > In this case, we are forced to resolve the heuristic, either > automatically (by the transaction manager) The transaction manager will not do this automatically in most cases. > or manually (by an admin) > before any deactivated databases are reactivated - to protect data > consistency. > Remember the end of Ghostbusters - crossing the streams has its uses > and > is often the only way to send Gozer back to her dimension. 
Yes, but you'd also be sending most of New York back with her ;) > > > Transaction recovery should work fine. Confidence ;-) Have you tried it? We have a very thorough test suite for JBossTS that's been built up over the past 12 years (in Java). More than 50% of it is aimed at crash recovery tests, so it would be good to try it out. > HA-JDBC's > XAResource.recover(int) will return the aggregated set of Xids from > each > resource manager. What happens if some replicas are unavailable during recovery? Do you exclude them? > HA-JDBC currently assumes that the list of prepared > and heuristically completed branches is the same for each active > database. This is only true because of the protected-commit > functionality, and the single tolerance policy of HA-JDBC's XAResource > for incorrect/mixed/hazard heuristic outcomes. > >>> >>> >>>>> >>>>> HA-JDBC also considers failure of the JDBC client itself, e.g. the >>>>> application server. A client that fails - then comes back online >>>>> must >>>>> get its state (i.e. which databases are presently active) either >>>>> from >>>>> one of the other client nodes (in the distributed case) or from a >>>>> local >>>>> store (in the non-distributed case). >>>> >>>> >>>> OK, but what if it fails during an update? >>> >>> During an update of what? >> >> The client is sending an update request to the db and fails. How does >> it figure out the final outcome of the request? What about partial >> updates? How does it know? >> > > I think I already answered this question below when discussing failure > of a client node during transaction commit/rollback. > >>> >>> >>>> How are you achieving consensus across the replicas in this case? >>> >>> By replica, I assume you mean database? >> >> Yes. > > Again, discussed below. Consensus is achieved via distributed > acknowledgements. Well hopefully we all know that consensus is impossible to achieve in an asynchronous environment with failures, correct :-) ? 
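[Editor's sketch] Paul's description of recover() above - the proxy returning the aggregated set of Xids from each resource manager - amounts to a set union. A toy sketch follows; the names are illustrative, real code would invoke recover(flag) on each underlying XAResource, and it still has to answer Mark's question about replicas that are unreachable during recovery.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import javax.transaction.xa.Xid;

// Toy aggregation of in-doubt branches across the replica group: the
// proxy's recover() is the union of what each resource manager reports.
class RecoverAggregator {
    static Xid[] aggregate(List<Xid[]> perResourceManager) {
        Set<Xid> union = new LinkedHashSet<>();
        for (Xid[] xids : perResourceManager) {
            for (Xid xid : xids) {
                union.add(xid);   // the same branch reported twice collapses
            }
        }
        return union.toArray(new Xid[0]);
    }
}
```

Note that Paul's stated assumption - that every active database reports the same branch list - is exactly what makes this union well defined; an unreachable replica silently shrinks it.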
> > >>> There is no communication >>> between replicas - only between HA-JDBC clients. When HA-JDBC >>> starts - >>> the current database cluster state (i.e. which database are >>> active) is >>> fetched from the JGroups group coordinator. If it's the first >>> member, >>> or if not running in distributed mode, then the state is fetched >>> from a >>> local store (currently uses java.util.prefs, though this will be >>> pluggable in the future). >>> >>>> For instance, the client says "commit" to the replica group (I >>>> assume >>>> that the client has a group-like abstraction for the servers so >>>> it's >>>> not dealing with them all directly?) >>> >>> Which client are you talking about? >> >> The user of the JDBC connection. > > Yes - there's a group-like abstraction for the servers. OK, so I'm even more confused than I was given what you said way back (that made me ask the unicast/multicast question). > > >>> The application client doesn't know >>> (or care) whether its accessing a single database (via normal JDBC) >>> or a >>> cluster of databases (via HA-JDBC). >> >> Well I agree in theory, but as I've pointed out so far it may very >> well have to care if the information we give back is not the same as >> in the single case. >> >>> HA-JDBC is instrumented completely >>> behind the standard JDBC interfaces. >>> >>>> and there is a failure of the client and some of the replicas >>>> during >>>> the commit operation. What replica consistency protocol are you >>>> using >>>> to enforce the state updates? >>> >>> Ah - this is discussed below. >>> >>>> Plus how many replicas do you need to tolerate N failures (N+1, 2N >>>> +1, >>>> 3N+1)? >>>> >>> >>> N+1. If only 1 active database remains in the cluster, then HA-JDBC >>> behaves like a normal JDBC driver. >> >> So definitely no network partition failures then. > > I do attempt to handle them as best as I can. > There are two scenarios we consider: > 1. Network partition affects connectivity to all databases > 2. 
Network partition affects connectivity to all but one database > (e.g. > one database is local) > > The first is actually easier, because the effect on all replicas is > the > same. In general, HA-JDBC doesn't ever deactivate all databases. The > client would end up seeing SQLExceptions until the network is > resolved. What about multiple clients? So far you've really been talking about one client using a replica group for JDBC. That's pretty easy to solve, relative to concurrent clients. Especially when you throw in network partitions. > > > The second scenario is problematic. Consider an environment of two > application server nodes, No, let's talk about more than 2. If you only do 2 then it's better to consider a primary copy or coordinator-cohort approach rather than true active replication. > and an HA-JDBC cluster of 2 databases, where > the databases are co-located on each application server machine. > During > a network partition, each application server sees that its local > database is alive, but connectivity to the other database is down. > Rather than allowing each server to proceed writing to its local > database, creating a synchronization nightmare, (split brain syndrome is the term :-) > the HA-JDBC driver goes > into cluster panic, and requires manual resolution. Luckily, this > isn't > that common, since a network partition likely affects connectivity to > the application client as well. We are really only concerned with > database write requests in process when the partition occurs. Well I'm more concerned with the case where we have concurrent clients who see different replica groups because of network partitioning. So let's assume we have 3 replicas and they're not co-located with any application server. Then assume that we have a partition that splits them 2:1 so that one client sees 1 replica and assumes 2 failures and vice versa. > > >>>> There's the application client, but the important "client" for this >>>> failure is the coordinator. 
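[Editor's sketch] The panic-versus-proceed choice in the two partition scenarios above can be sketched as a small decision rule. This is one illustrative reading of the discussion, not HA-JDBC's actual logic, and all names are hypothetical.

```java
import java.util.Set;

// Toy split-brain guard, per the two scenarios discussed above: if this
// node can reach only a strict subset of the databases it believes are
// active, it must not quietly deactivate the unreachable ones (a node on
// the other side of the partition would do the same in reverse). Entering
// a "panic" state needing manual resolution is the conservative choice.
class PartitionGuard {
    enum Decision { PROCEED, PANIC }

    static Decision onConnectivityLoss(Set<String> activeDatabases,
                                       Set<String> reachableDatabases) {
        if (reachableDatabases.isEmpty()) {
            // Scenario 1: nothing is reachable -- keep every database active
            // and let clients see SQLExceptions until the network heals.
            return Decision.PROCEED;
        }
        // Scenario 2: a partial view. With only N+1 replication there is no
        // quorum to consult, and deactivating the unreachable databases risks
        // two sides writing independently -- so panic instead.
        return reachableDatabases.containsAll(activeDatabases)
            ? Decision.PROCEED : Decision.PANIC;
    }
}
```

Mark's 2:1 split example is precisely the case where both sides of this rule would panic, since neither can distinguish partition from failure.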
How are you assuming this failure mode >>>> will be dealt with by the coordinator if it can't determine who did >>>> what when? >>> >>> Let me elaborate... >>> If protected-commit mode is enabled, then commits and rollbacks >>> will trigger a 2-phase acknowledgement broadcast to the other >>> nodes of >>> the JGroups application cluster. >> >> Great. Who is maintaining the log for that then? Seems to me that >> this >> is starting to look more and more like what a transaction coordinator >> would do ;-) > > Essentially, that's what the XAResource is doing - coordinating the > atomic behavior of the resource managers that it proxies. The advantage > is - instead of reporting mixed/hazard heuristic outcomes due to in-doubt > transactions - we can report a definitive outcome to the transaction > manager, since the in-doubt resource managers can be deactivated. No you can't. Not unless you've got durability in the XAResource proxy. Do you? If not, can you explain how you can return a definitive answer when the proxy fails? > > >> >>> >>> If transaction-mode="serial" then within the HA-JDBC driver, >>> commits/rollbacks will execute against each database >>> sequentially. In >>> this case, we send a series of notifications prior to the >>> commit/rollback invocation on each database; and a second series of >>> acknowledgements upon completion. If the application server >>> executing >>> the transaction fails in the middle, the other nodes will detect the >>> view change and check to see if there are any unacknowledged >>> commits/rollbacks. If so, the unacknowledged databases will be >>> deactivated. >> >> OK, but what about concurrent and conflicting updates on the >> databases >> by multiple clients? Is it possible that two clients start at either >> end of the replica list and meet in the middle and find that they >> have >> already committed conflicting updates because they did not know about >> each other? (A white board would be a lot easier to explain this.) 
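[Editor's sketch] The serial protected-commit handshake Paul describes above - notify, commit, acknowledge, per database - might look roughly like this. The broadcast and commit actions are injected, and all names are illustrative rather than HA-JDBC's actual API.

```java
import java.util.List;
import java.util.function.Consumer;

// Toy rendering of protected-commit in serial mode: a notification is
// broadcast before the commit against each database and an acknowledgement
// after it, so that on a view change the surviving JGroups members can tell
// exactly which databases have an unacknowledged commit and deactivate them.
class SerialProtectedCommit {
    static void commit(List<String> databases,
                       Consumer<String> broadcast,
                       Consumer<String> commitOn) {
        for (String db : databases) {
            broadcast.accept("commit-begin:" + db);   // peers record the intent
            commitOn.accept(db);                      // the actual commit on db
            broadcast.accept("commit-ack:" + db);     // peers clear the record
            // If this node dies between begin and ack, the other members see
            // the unacknowledged entry on view change and deactivate this db.
        }
    }
}
```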
> > This doesn't happen, as the replica list is consistently ordered across > multiple clients. What ordering protocol do you use? Causal, atomic, global? > > >>> If transaction-mode="parallel" then within the HA-JDBC driver, >>> commits/rollbacks will execute against each database in parallel. >>> In >>> this case, we send a single notification prior to the invocations, >>> and a >>> second acknowledgement after the invocations complete. If, on view >>> change, the other nodes find an unacknowledged commit/rollback, then >>> the >>> database cluster enters a panic state and requires manual >>> resolution. >> >> I assume parallel is non-deterministic ordering of updates too across >> replicas? > > Yes - this mode is useful (i.e. faster) when the likelihood of > concurrent updates of the same data is small. What do you do in the case of conflicts? > > >>> [OT] I'm considering adding a third "hybrid" transaction mode, where >>> transactions execute first on one database, then the rest in >>> parallel. This mode seeks to address the risk of deadlocks with >>> concurrent updates affecting parallel mode, while enabling better >>> performance than serial mode, for clusters containing more than 2 >>> databases. This mode would benefit from fine-grained transaction >>> acknowledgements, while requiring less overhead than serial mode >>> when >>> cluster size > 2. >> >> What is the benefit of this over using a transaction coordinator to >> do >> all of this coordination for you in the first place? > > HA-JDBC is designed to function in both JTA and non-JTA environments. > The above logic will guard against in-doubt transactions in both > environments. I may have more to say on this once we get over the question of durability in your XAResource ;-) > > >>> >>> >>>> What about a complete failure of every node in the system, i.e., >>>> the >>>> application, coordinator and all of the JDBC replicas fails? 
In the >>>> wonderful world of transactions, we still need to guarantee >>>> consistency in this case. >>>> >>> >>> This is also handled. >>> Just as cluster state changes are both broadcast to other nodes and >>> recorded locally, so are commit/rollback acknowledgements. >> >> And heuristic failure? In fact, all responses from each XAResource >> operation? > > > Hopefully, I already answered this question above. If not, can you > elaborate? > I don't think we're quite finished yet ;-) Mark. --- Mark Little ml...@re... JBoss, a Division of Red Hat Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod Street, Windsor, Berkshire, SL4 1TE, United Kingdom. Registered in UK and Wales under Company Registration No. 3798903 Directors: Michael Cunningham (USA), Charlie Peters (USA), Matt Parsons (USA) and Brendan Lane (Ireland).