From: Paul F. <pau...@re...> - 2009-02-10 22:46:38
On Fri, 2009-01-30 at 12:20 +0000, Mark Little wrote:

> Hi Paul. Sorry for the delay in replying ...

No worries. Thanks for the thought and time you've put into this conversation so far.

> On 9 Dec 2008, at 18:33, Paul Ferraro wrote:

> >> OK, so failures are agreed across all clients and all members of the replica group? What's the protocol for rejoining the group? Does there have to be a quiescent period, for instance?

> > Database deactivations (i.e., the action taken in response to singular failures) are broadcast to all client nodes. This is not so much an agreement as it is a decree - it cannot be vetoed. A replica group member (i.e. a database) has no knowledge of the clients, nor of other replica group members. Rejoining the replica group (i.e. for a database to become active again) does indeed require a quiescent period. This is enforced by a distributed read-write lock.

> On the whole database or just on the data that needs updating? The former is obviously easier but leads to more false locking, whereas the latter can be harder to ensure but does allow for continuous use of the remaining database tables (cf. page-level locking versus object-level locking.)

> > New transactions (both local and XA) acquire a local read lock on begin and release it on commit/rollback.

> Locking what?

A single ReadWriteLock (per logical cluster) within the HA-JDBC driver.

> > The acquisition of a distributed write lock effectively blocks any database writes across the whole cluster, while still allowing database reads.

> OK, so this sounds like it's a lock on the database itself rather than any specific data items. However, it also seems like you're not using the traditional exclusive writer, concurrent readers. So how do you prevent dirty reads of data in that case? Are you assuming optimistic locking?

I'm talking about a client-side RWLock, not a database lock. Remember, HA-JDBC is entirely client-side. The purpose of the lock is to isolate transaction boundaries, so that the synchronization process (through acquisition of the write side of this lock) can ensure that no transactions are in progress (including local auto-commits). This is the traditional single writer, concurrent readers - however, in this case the "writer" is the synchronization process, and the readers are transaction blocks.

> > The activation process is:
> > 1. Acquire distributed write lock
> > 2. Synchronize against an active database using a specified synchronization strategy
> > 3. Broadcast an activation message to all client nodes

> I don't quite get the need for 3, but then this may have been covered before. Why doesn't the newly activated replica simply rejoin the group? Aren't we just using the groupid to communicate and not sending messages from clients to specific replicas, i.e., multicast versus multiple unicasts? (BTW, I'm assuming "client" here means database user.)

Because the replica is not a group member. A few points of architectural clarification... I use the term "client node" to refer to a DatabaseCluster object (the highest abstraction in HA-JDBC). This is the object with which all end-user JDBC facades interact. The DatabaseCluster manages 1 or more Database objects - a simple encapsulation of the connectivity details for a specific database.
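Roughly, the pieces fit together like this (an illustrative sketch only - not the actual HA-JDBC internals; the method names, the plain ReentrantReadWriteLock, and the SynchronizationStrategy signature below stand in for the real, JGroups-backed implementation):

// Illustrative sketch, not HA-JDBC source: the real write lock is distributed
// (JGroups-based), and the real classes carry far more detail.
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArraySet;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class Database { /* connectivity details (url, credentials, ...) for one database */ }

interface SynchronizationStrategy {
    void synchronize(Database source, Database target);
}

class DatabaseCluster {
    private final Map<String, Database> databases = new ConcurrentHashMap<>(); // keyed by id, e.g. "db1"
    private final Set<String> active = new CopyOnWriteArraySet<>();            // cluster state
    private final ReadWriteLock lock = new ReentrantReadWriteLock();           // one per logical cluster

    // Transactions (and local auto-commits) are the "readers" of the lock.
    void beginTransaction() { lock.readLock().lock(); }
    void endTransaction()   { lock.readLock().unlock(); }

    // Activation is the single "writer": steps 1-4 from the description above.
    void activate(String databaseId, SynchronizationStrategy strategy) {
        lock.writeLock().lock();                                      // 1. acquire write lock
        try {
            Database source = databases.get(active.iterator().next());
            strategy.synchronize(source, databases.get(databaseId)); // 2. sync from an active db
            active.add(databaseId);
            broadcastActivation(databaseId);                          // 3. tell the other client nodes
        } finally {
            lock.writeLock().unlock();                                // 4. release write lock
        }
    }

    void deactivate(String databaseId) { active.remove(databaseId); /* + broadcast to the group */ }

    private void broadcastActivation(String databaseId) { /* JGroups message to the group */ }
}

The point of the sketch is only the lock semantics: transaction blocks take the read side, so the synchronization/activation process can use the write side to guarantee a quiescent period.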
The Database object comes in 4 flavors: Driver, DataSource, ConnectionPoolDataSource, and XADataSource. When the client requests a connection from the HA-JDBC driver, HA-JDBC DataSource, or app server DataSource backed by an HA-JDBC ConnectionPoolDataSource or XADataSource, the object returned is a dynamic proxy backed by the DatabaseCluster, containing a map of Connections to each active database in the cluster. The Connection proxy begets Statement proxies, etc.

One of the prime duties of the DatabaseCluster is managing the cluster state, i.e. the set of active databases. This is represented as a set of string identifiers. In the distributed use case, i.e. where the HA-JDBC config file defines the cluster as <distributable/> (a la web.xml), the cluster state is replicated to all participating application servers via JGroups. The JGroups membership is the set of app servers accessing a given HA-JDBC cluster (not the replicas themselves). The activation process I described above occurs on a given node of the application server cluster. The activation message to which I referred tells each group member to update its state, e.g. from [db1, db2] -> [db1, db2, db3].

> > 4. Release distributed write lock
> >
> > N.B. Suspended transactions will block the activation process, since a newly joined replica will have missed the prepare phase. Heuristically completed transactions (as defined by the XAResource proxy, discussed below) will also block the activation process.

> OK, that sounds good.

> >>> This is not really an issue in HA-JDBC. In the XA scenario, HA-JDBC exposes a single XAResource for the whole database cluster to the transaction manager, while internally it manages an XAResource per database.

> >> OK, so this single XAResource acts like a transaction coordinator then? With a log?

> > Nope. The XAResource HA-JDBC exposes to the transaction manager is merely a proxy to a set of underlying XAResources (one per database) that implements HA behavior. It has no log. I'll go into detail about how this HA logic handles heuristic decisions later...

> OK, I'll wait ;-)

> >> Plus, the TM will not necessarily go through 2PC, but may in fact use a 1PC protocol. The problem here is that there are actually multiple resources registered (c.f. interposition or subordinate coordinators). The failures that can be returned from a 1PC interaction on an XAResource are actually different than those that can be returned during the full 2PC. What this means is that if you run into these failures behind your XAResource veneer, you will be incapable of giving the right information (right error codes and/or exceptions) back to the transaction manager and hence to the application.

> > 1PC isn't any different from the perspective of HA-JDBC's XAResource proxy. For a given method invocation, HA-JDBC's XAResource proxy coordinates the results from the individual XAResources it manages, deactivating the corresponding database if it responds anomalously (including exceptions with anomalous error codes). Yes - 1PC commits can throw additional XA_RB* exceptions, but the corresponding HA logic is algorithmically the same.

> No, you misunderstand. If you are proxying as you mentioned, then you're effectively acting as a subordinate coordinator. Your XAResource is hiding 1..N different XAResource instances.
> In the case of the transaction manager, it sees only your XAResource. This means that if there are no other XAResources involved in the transaction, the coordinator is not going to drive your XAResource through 2PC: it'll call commit(true) and trigger the one-phase optimization. Now your XAResource knows that there are 1..N instances in reality that need coordinating. Let's assume it's 2..N because 1..1 is the degenerate case and is trivial to manage. This means that your XAResource is a subordinate coordinator and must run 2PC on behalf of the root coordinator. But unfortunately the route through the 2PC protocol allows for a number of different error conditions than the route through the 1PC termination protocol. So your XAResource will have to translate 2PC errors, for example, into 1PC (and sometimes that mapping is not one-to-one.) Do you handle that?

I see what you're saying now - but I disagree with your assertion that HA-JDBC's XAResource proxy "must run 2PC on behalf of the root coordinator" in the single-phase optimization scenario. The point of 2PC is to ensure consensus that a unit of work can be committed across multiple resources. In HA-JDBC, since the resources managed by the proxy are presumed to be identical by definition, the prepare phase is unnecessary, since we expect the response to be unanimous one way or the other. Therefore, the single-phase optimization is entirely appropriate.

> >> As a slight aside: what do you do about non-deterministic SQL expressions such as may be involved with the current time?

> > HA-JDBC can be configured to replace non-deterministic SQL expressions with client-generated values. This behavior is enabled per expression type (e.g. eval-current-timestamp="true"): http://ha-jdbc.sourceforge.net/doc.html#cluster Of course, to avoid the cost of regular expression parsing, it's best to avoid these functions altogether.

> Yes, but if we want to opaquely replace JDBC with HA-JDBC then we need to assume that the client won't be doing anything to help us.

The client never needs to provide runtime hints on behalf of HA-JDBC. Non-deterministic expression evaluation is enabled/disabled in the HA-JDBC configuration file.

> >> Plus have you thought about serializability or linearizability issues across the replicas when multiple clients are involved?

> > Yes. The biggest headaches are sequences and identity columns. Currently, sequence next-value functions and inserts into tables containing identity columns require distributed synchronous access (per sequence name, or per table name). This is very cumbersome, so I generally encourage HA-JDBC users to use alternative, more cluster-friendly identifier generation mechanisms. The cost for sequences can be mitigated somewhat using a hi-lo algorithm (e.g. SequenceHiLoGenerator in Hibernate). For databases where DatabaseMetaData.supportsGetGeneratedKeys() = true, I plan to implement a master-slave approach, where the sequence next value is executed against the master database, and alter sequence statements are executed against the slaves. This is cleaner than the current approach, though vendor support for Statement.getGeneratedKeys() is still fairly sparse.

> What are the performance implications so far and with your future plans?
Currently, a given "next value for sequence" requires a JGroups RPC call to acquire an exclusive lock (per sequence) on each client node, and a subsequent RPC call to release the lock. The future approach doesn't improve performance (it makes it slightly worse, actually, by requiring an additional SQL statement), but rather reliability, since the first approach doesn't verify that the next sequence value was actually the same on each database, but trusts that it is, given its exclusive access. As I write this, it occurs to me that Statement.getGeneratedKeys() can still be leveraged to verify consistency without requiring separate ALTER SEQUENCE statements on the "slave" databases. So at this point, I need to think about this more.

> >>> Consequently, HA-JDBC can tolerate asymmetric failures in any phase of a transaction (i.e. a phase fails against some databases, but succeeds against others) without forcing the whole transaction to fail, or become in-doubt, by pessimistically deactivating the failed databases.

> >> I'm not so convinced that the impact on the transaction is so opaque, as I mentioned above. Obviously what we need to do is make a replica group appear to be identical to a single instance only more available. Simply gluing together multiple instances behind a single XAResource abstraction doesn't do it. Plus what about transaction recovery? By hiding the fact that some replicas may have generated heuristic outcomes whereas others haven't (because of failures), the transaction manager may go ahead and delete the log because it thinks that the transaction has completed successfully. That would be a very bad thing to do (think of crossing the streams in Ghostbusters ;-) Simply overwriting the failed heuristic replica db with a fresh copy from a successful db replica would also be a very bad thing to do (think letting the Keymaster and Gozer meet in Ghostbusters) because heuristics need to be resolved by the sys admin and the transaction log should not vanish until that has happened. So some kinds of failures of dbs need to be treated differently than just deactivating them. Can you cope with that?

> > Perhaps I should go into more detail about what happens behind the XAResource proxy. One of the primary responsibilities of the XAResource proxy is to coordinate the return values and/or exceptions from a given method invocation - and present them to the caller (in this case, the transaction manager) as a single return value or exception.

> Yes, I kind of figured that, since it's a subordinate coordinator (see above). But if you're not recording these outcomes durably then your proxy is going to cause problems itself for transaction integrity. Or can you explain to me what happens in the situation where the proxy has obtained some outcomes from the 1..N XAResources but not all and then it crashes?

I've already described this (i.e. protected commit mode) at some other point in the thread. We record the result of the commits for each database as it completes. In the single client node case (i.e. non-distributed), this situation (i.e. partial commit completion) is detected when HA-JDBC restarts. In the distributed case, the commit results are also replicated to remote client nodes via JGroups. The remote nodes will detect the incomplete commits when they detect the group membership change.
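To make that concrete, here is roughly the kind of bookkeeping that protected-commit mode implies (a hypothetical sketch - the class and method names are made up, and the persistence and JGroups replication are only indicated by comments):

// Hypothetical sketch of protected-commit bookkeeping; not the actual HA-JDBC code.
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class CommitLog {
    // transaction id -> databases that have not yet acknowledged their commit/rollback
    private final Map<String, Set<String>> pending = new ConcurrentHashMap<>();

    // Recorded durably (and broadcast to the other client nodes) *before* commit is invoked.
    void beginCommit(String txId, Set<String> activeDatabases) {
        Set<String> remaining = ConcurrentHashMap.newKeySet();
        remaining.addAll(activeDatabases);
        pending.put(txId, remaining);
        persistAndBroadcast();
    }

    // Recorded as each database's commit/rollback returns.
    void acknowledged(String txId, String databaseId) {
        Set<String> remaining = pending.get(txId);
        if (remaining != null) {
            remaining.remove(databaseId);
            if (remaining.isEmpty()) pending.remove(txId);   // fully completed
        }
        persistAndBroadcast();
    }

    // Consulted on restart (single-node case) or on a JGroups view change (distributed case):
    // any entry still present is a partial commit, and the databases listed in it are deactivated.
    Map<String, Set<String>> incomplete() { return pending; }

    private void persistAndBroadcast() { /* local store write + JGroups broadcast */ }
}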
> > Let's look at a couple of result matrices of return values and error codes for each XAResource for a replica group of 2 databases and the coordinated result from the HA-JDBC XAResource proxy:
> > N.B. I've numbered each scenario for discussion purposes.
> > I've used a wildcard (*) in some error codes for brevity.
> > Exception codes are distinguished from return values using [].
> >
> > prepare(Xid):
> >
> > REF  DB1          DB2          HA-JDBC
> > 1    XA_OK        XA_OK        XA_OK
> > 2    XA_RDONLY    XA_RDONLY    XA_RDONLY
> > 3    XA_OK        XA_RDONLY    XA_OK, deactivate db2

> This should mean more than deactivating the database. If one resource thinks that there was no work done on the data it controls whereas the other does, that's a serious issue and you should rollback! commit isn't always the right way forward! This is different to some of the cases below because there are no errors reported from the resource. As such the problem is at the application level, or somewhere in the middleware stack. That fact alone means that anything the committable resource has done is suspect immediately.

Good point. #3 is really an exception condition, since it violates HA-JDBC's assumption that all databases are identical. There should really be a more flexible mechanism for handling this - perhaps a configuration option to determine how HA-JDBC should resolve conflicts. Possible values for this would be:

* optimistic: Assume that the user transaction expects success (current behavior)
* pessimistic: Roll back if possible, otherwise halt the cluster and log the conflict.
* paranoid: Halt the cluster and log the conflict.
* chain-of-command: Assume the result of the 1st available database is the authoritative response - deactivate any deviants.

A "chain-of-command" strategy is really a more appropriate default for HA-JDBC, as opposed to the current optimistic behavior (sketched further below).

> > 4    XA_OK        [XA_RB*]     XA_OK, rollback and deactivate db2
> > 5    XA_RDONLY    [XA_RB*]     XA_RDONLY, rollback and deactivate db2
> > 6    [XA_RB*]     [XA_RB*]     [XA_RB*] (only if error code matches)
> > 7    [XA_RBBASE]  [XA_RBEND]   [XA_RBEND] (prefer higher error code), rollback and deactivate db1
> > 8    XA_OK        [XAER_*]     XA_OK, deactivate db2
> > 9    XA_RDONLY    [XAER_*]     XA_RDONLY, deactivate db2
> > 10   [XA_RB*]     [XAER_*]     [XA_RB*], deactivate db2
> > 11   [XAER_*]     [XAER_*]     [XAER_*], chain exceptions if errors not the same

> Hmmm, well that's not good either, because XAER_* errors can mean very different things (in the standard as well as in the implementations). So simply assuming that all of the databases are still in a position where they can be treated as a single instance is not necessarily a good idea.

Good point. If the XAER_* error codes are not the same, I should respond using the configured conflict resolution strategy. The optimistic strategy would keep the database(s) with the highest error code (i.e. least severe error) and deactivate the others. The pessimistic/paranoid strategies would halt the cluster. The chain-of-command strategy would return the XAER_* from db1 and deactivate db2.

> I understand all of what you're trying to do (it's just using the available copies replication protocol, which has been around for several decades.) But even that required a level of durability at the recipient in order to deal with failures between it and the client. The recipient in this case is your XAResource proxy.

This durability does exist in both the XAResource proxy and the Connection proxy (for non-JTA transactions) - see the section on protected commit mode. Isn't that sufficient?
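For reference, the configuration switch I have in mind would be no more than something like this (hypothetical only - none of these names exist in HA-JDBC today):

// Hypothetical sketch of the proposed conflict-resolution configuration option.
enum ConflictResolutionStrategy {
    OPTIMISTIC,       // assume the user transaction expects success (current behavior)
    PESSIMISTIC,      // roll back if possible, otherwise halt the cluster and log the conflict
    PARANOID,         // halt the cluster and log the conflict
    CHAIN_OF_COMMAND  // the first available database's result is authoritative; deactivate deviants
}

Applied to scenario #3 above, a chain-of-command setting would presumably still resolve to db1's XA_OK (db1 being first) and deactivate db2, while a pessimistic setting would roll the transaction back instead.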
> > In summary, when the return values/exceptions from each database do not match, HA-JDBC makes a heuristic decision of its own. HA-JDBC makes the reasonable assumption that the JDBC client (i.e. application) expects a given database transaction to succeed; therefore, an XA_OK is preferable to a rollback request (i.e. an XA_RB* exception).

> +1

> > commit(Xid, false):
> >
> > REF  DB1            DB2            HA-JDBC
> > 1    void           void           void
> > 2    void           [XA_HEURCOM]   void, forget db2

> What happens if the call to forget fails, or your proxy crashes before it can be made?

If forget fails, the error is logged and db2 is deactivated. If the proxy crashes, then we follow the "protected-commit" logic discussed earlier.

> > 3    void           [XA_HEURRB]    void, forget and deactivate db2
> > 4    void           [XA_HEURMIX]   void, forget and deactivate db2
> > 5    void           [XA_HEURHAZ]   void, forget and deactivate db2

> Same question as for 2.

If forget fails, we log the failure and proceed to deactivate db2.

> > 6    void           [XAER_*]       void, deactivate db2
> > 7    [XA_HEURCOM]   [XA_HEURCOM]   [XA_HEURCOM]
> > 8    [XA_HEURCOM]   [XA_HEURRB]    [XA_HEURCOM], forget and deactivate db2
> > 9    [XA_HEURCOM]   [XA_HEURMIX]   [XA_HEURCOM], forget and deactivate db2

> This is more concerning because you've got a mixed heuristic outcome which means that this resource was probably a subordinate coordinator itself or modifying multiple entries in the database. That'll make automatic recovery interesting unless you expect to copy one entire database image over the top of another?

During synchronization prior to database activation, all prepared and heuristically completed branches on the target database are forgotten. Synchronization will not occur (i.e. it will not be allowed to start, since it won't be able to acquire the DatabaseCluster's write lock described earlier) if the source database has any heuristically completed branches.

> > 10   [XA_HEURCOM]   [XA_HEURHAZ]   [XA_HEURCOM], forget and deactivate db2

> As above, though this time we don't know what state db2 is in.

> > 11   [XA_HEURCOM]   [XAER_*]       [XA_HEURCOM], deactivate db2
> > 12   [XA_HEURRB]    [XA_HEURRB]    [XA_HEURRB]
> > 13   [XA_HEURRB]    [XA_HEURMIX]   [XA_HEURRB], forget and deactivate db2
> > 14   [XA_HEURRB]    [XA_HEURHAZ]   [XA_HEURRB], forget and deactivate db2
> > 15   [XA_HEURRB]    [XAER_*]       [XA_HEURRB], deactivate db2
> > 16   [XA_HEURMIX]   [XA_HEURMIX]   [XA_HEURMIX], forget and deactivate all but one db
> > 17   [XA_HEURMIX]   [XA_HEURHAZ]   [XA_HEURMIX], forget and deactivate db2
> > 18   [XA_HEURMIX]   [XAER_*]       [XA_HEURMIX], deactivate db2
> > 19   [XA_HEURHAZ]   [XA_HEURHAZ]   [XA_HEURHAZ], forget and deactivate all but one db
> > 20   [XA_HEURHAZ]   [XAER_*]       [XA_HEURHAZ], deactivate db2
> > 21   [XAER_*]       [XAER_*]       [XAER_*], chain exceptions if errors not the same
> >
> > commit(Xid, true):
> > N.B. There are no heuristic outcomes in the 1PC case, since there is no prepare.

> Erm, not true as I mentioned above. If your XAResource proxy is driven through 1PC and there is more than one database then you cannot simply drive them through 1PC - you must run 2PC and act as a subordinate. If you do not then you are risking more heuristic outcomes and still your XAResource proxy will require durability.

As I replied earlier - it's OK, I think, to drive the databases through 1PC. And, yes, durability does exist in the XAResource proxy.
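As a rough illustration of how the proxy condenses the commit matrices above into a single response (a simplification with hypothetical names - it ignores, for instance, the "deactivate all but one" rule for matching XA_HEURMIX/XA_HEURHAZ outcomes and the chaining of unequal XAER_* errors):

// Simplified, hypothetical sketch of the commit-outcome coordination described above.
import java.util.Map;
import java.util.function.Consumer;
import javax.transaction.xa.XAException;
import javax.transaction.xa.XAResource;
import javax.transaction.xa.Xid;

final class CommitOutcomeCoordinator {

    // Preference order for what is reported to the transaction manager.
    private static int rank(Integer code) {
        if (code == null) return 0;                      // commit() returned normally
        switch (code) {
            case XAException.XA_HEURCOM: return 1;
            case XAException.XA_HEURRB:  return 2;
            case XAException.XA_HEURMIX: return 3;
            case XAException.XA_HEURHAZ: return 4;
            default:                     return 5;       // XAER_*
        }
    }

    private static boolean heuristic(Integer code) {
        int r = rank(code);
        return r >= 1 && r <= 4;
    }

    // outcomes: database id -> XAException error code, or null if commit() succeeded
    static void coordinate(Xid xid, Map<String, Integer> outcomes,
                           Map<String, XAResource> resources,
                           Consumer<String> deactivate) throws XAException {
        int best = outcomes.values().stream()
                .mapToInt(CommitOutcomeCoordinator::rank).min().orElse(0);

        Integer reported = null;
        for (Map.Entry<String, Integer> entry : outcomes.entrySet()) {
            Integer code = entry.getValue();
            if (rank(code) == best) reported = code;
            if (heuristic(code)) {
                resources.get(entry.getKey()).forget(xid);   // auto-forget the completed branch
            }
            // A heuristic commit "agrees" with a clean commit; any other deviation
            // from the reported outcome gets its database deactivated.
            boolean agrees = rank(code) == best || (best == 0 && rank(code) == 1);
            if (!agrees) deactivate.accept(entry.getKey());
        }
        if (best > 0) throw new XAException(reported);        // surface one outcome to the TM
    }
}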
> > DB1         DB2        HA-JDBC
> > void        void       void
> > void        XA_RB*     void, deactivate db2
> > XA_RB*      XA_RB*     XA_RB* (only if error code matches)
> > XA_RBBASE   XA_RBEND   XA_RBEND (prefer higher error code), rollback and deactivate db1
> >
> > In general, the HA logic picks the best outcomes, auto-forgets heuristic decisions that turn out to be correct, aggressively deactivates the deviants from the replica group, and tolerates at most 1 incorrect/mixed/hazard heuristic decision to bubble back to the transaction manager.
> >
> > I don't think that your Ghostbusters analogy is very accurate. Hiding heuristic decisions (e.g. #2-5) from the transaction manager is not dangerous in the realm of HA-JDBC. If a heuristic decision turns out to be incorrect, then we can deactivate that database. It will get re-sync'ed before it is reactivated, so the fact that a heuristic decision was made at all is no longer relevant.

> Nope. And this is what concerns me. You need to understand what heuristics are about and why they have been developed over the past 4 decades.

I thought the whole point of making a heuristic decision was to improve resource availability by allowing the resource to proceed with the second phase of a transaction branch according to its response to the first phase, in the event that the transaction manager fails to initiate the second phase within some pre-determined time - so that resources are not blocked forever. What other motivations are there?

> The reason that transaction managers don't ignore them and try to "make things right" (or hide them) is because we don't have the necessary semantic information to resolve them automatically.

On the contrary, HA-JDBC does have one vital piece of information that the transaction manager does not - the results from other identical resources for the same action. If one database made a heuristic decision to commit a transaction, while another database returns a successful commit (scenario #2), then I think HA-JDBC *does* have sufficient information to forget the heuristically completed branch and return normally. A transaction manager could not safely make this decision. Likewise, if one database made a heuristic decision to roll back a transaction, while another database successfully commits (scenario #3), then I think HA-JDBC has sufficient information to conclude that the heuristic decision was incorrect and can forget the heuristically completed branch on the first database, deactivate the first database, and return normally.

> You're saying that you mask a heuristic database and take it out of the group so it won't cause problems. Later it'll be fixed by having the state overwritten by some good copy. But you are ignoring the fact that you do not have recovery capabilities built in to your XAResource as far as I can tell and are assuming that calling forget and/or deactivating the database will limit the problem.

What recovery capabilities do you think are missing?

> In reality, particularly with heuristics that cause the resource to commit or roll back, there may be systems and services tied in to those databases at the back end which are triggered on commit/rollback and go and do something else (e.g., on rollback trigger the sending of a bad credit report on the user).

This is not a valid use case for HA-JDBC. One of HA-JDBC's requirements is that all access to the databases of the cluster go through the HA-JDBC driver.
A trigger like this is unsupported since it would:
a) have to exist on every database - and would result in N bad credit reports - BAD.
b) Given that any database can very well become inactive at any time, you cannot rely on a trigger on a single database cluster member.
Unfortunate, yes, but there's only so much I can support from the client side.

> In the real world, the TM would propagate the heuristic back to the end user and the sys admin would then go and either stop the credit report from being sent, or nullify it somehow.

Like I said earlier, because HA-JDBC also has the results from other identical resources available, it can reasonably determine the correctness of a heuristic decision.

> How can he do that? Because he understands the semantics of the operation that was being executed. Does simply making the table entry the same as in one of the copies that didn't throw the heuristic help here? No of course it doesn't. It's just part of the solution (and if you were the person with the bad credit report, I think you'd want the whole compensation not just a part.)

> > Before the resync - we recover and forget any prepared/heuristically completed branches.

> See above. It doesn't help. Believe me: I've been in your position before, back in the early 1990's ;-)

I'll need to read your responses to the above before I'm convinced.

> > If the XAResource proxy itself throws an exception indicating an invalid heuristic decision, only then do we let on to the transaction manager. In this case, we are forced to resolve the heuristic, either automatically (by the transaction manager)

> The transaction manager will not do this automatically in most cases.

That's fine. The main side effect of an unresolved heuristic decision is the inability to activate/synchronize any inactive databases.

> > or manually (by an admin) before any deactivated databases are reactivated - to protect data consistency. Remember the end of Ghostbusters - crossing the streams has its uses and is often the only way to send Gozer back to her dimension.

> Yes, but you'd also be sending most of New York back with her ;)

> > Transaction recovery should work fine.

> Confidence ;-) Have you tried it?

No, actually. The durability logic is still in a development branch and has yet to be tested thoroughly.

> We have a very thorough test suite for JBossTS that's been built up over the past 12 years (in Java). More than 50% of it is aimed at crash recovery tests, so it would be good to try it out.

This would be an invaluable resource.

> > HA-JDBC's XAResource.recover(int) will return the aggregated set of Xids from each resource manager.

> What happens if some replicas are unavailable during recovery? Do you exclude them?

Yes - they would be deactivated.

> > HA-JDBC currently assumes that the list of prepared and heuristically completed branches is the same for each active database. This is only true because of the protected-commit functionality, and the single tolerance policy of HA-JDBC's XAResource for incorrect/mixed/hazard heuristic outcomes.

> >>>>> HA-JDBC also considers failure of the JDBC client itself, e.g. the application server. A client that fails - then comes back online - must get its state (i.e.
> >>>>> which databases are presently active) either from one of the other client nodes (in the distributed case) or from a local store (in the non-distributed case).

> >>>> OK, but what if it fails during an update?

> >>> During an update of what?

> >> The client is sending an update request to the db and fails. How does it figure out the final outcome of the request? What about partial updates? How does it know?

> > I think I already answered this question below when discussing failure of a client node during transaction commit/rollback.

> >>>> How are you achieving consensus across the replicas in this case?

> >>> By replica, I assume you mean database?

> >> Yes.

> > Again, discussed below. Consensus is achieved via distributed acknowledgements.

> Well hopefully we all know that consensus is impossible to achieve in an asynchronous environment with failures, correct :-) ?

I think we're talking apples and oranges... By consensus, I'm referring to the conflict resolution strategy. As I mentioned earlier, this currently uses an optimistic approach, though as you pointed out, and I agreed, this should change.

> >>> There is no communication between replicas - only between HA-JDBC clients. When HA-JDBC starts, the current database cluster state (i.e. which databases are active) is fetched from the JGroups group coordinator. If it's the first member, or if not running in distributed mode, then the state is fetched from a local store (currently uses java.util.prefs, though this will be pluggable in the future).

> >>>> For instance, the client says "commit" to the replica group (I assume that the client has a group-like abstraction for the servers so it's not dealing with them all directly?)

> >>> Which client are you talking about?

> >> The user of the JDBC connection.

> > Yes - there's a group-like abstraction for the servers.

> OK, so I'm even more confused than I was given what you said way back (that made me ask the unicast/multicast question).

Oh - sorry - I misunderstood your original question... By "group-like abstraction for the servers", you meant: does the HA-JDBC driver use multicast for statement execution? No.

> >>> The application client doesn't know (or care) whether it's accessing a single database (via normal JDBC) or a cluster of databases (via HA-JDBC).

> >> Well I agree in theory, but as I've pointed out so far it may very well have to care if the information we give back is not the same as in the single case.

> >>> HA-JDBC is instrumented completely behind the standard JDBC interfaces.

> >>>> and there is a failure of the client and some of the replicas during the commit operation. What replica consistency protocol are you using to enforce the state updates?

> >>> Ah - this is discussed below.

> >>>> Plus how many replicas do you need to tolerate N failures (N+1, 2N+1, 3N+1)?

> >>> N+1. If only 1 active database remains in the cluster, then HA-JDBC behaves like a normal JDBC driver.

> >> So definitely no network partition failures then.

> > I do attempt to handle them as best as I can. There are two scenarios we consider:
> > 1. Network partition affects connectivity to all databases
> > 2. Network partition affects connectivity to all but one database (e.g. one database is local)
> >
> > The first is actually easier, because the effect on all replicas is the same. In general, HA-JDBC doesn't ever deactivate all databases. The client would end up seeing SQLExceptions until the network is resolved.

> What about multiple clients? So far you've really been talking about one client using a replica group for JDBC. That's pretty easy to solve, relative to concurrent clients. Especially when you throw in network partitions.

Hopefully I cleared up this misconception earlier. I *have* been talking about multiple clients. Multicast is not used for statement execution, but rather for cluster state management. Sorry for the confusion.

> > The second scenario is problematic. Consider an environment of two application server nodes,

> No, let's talk about more than 2. If you only do 2 then it's better to consider a primary copy or coordinator-cohort approach rather than true active replication.

> > and an HA-JDBC cluster of 2 databases, where the databases are co-located on each application server machine. During a network partition, each application server sees that its local database is alive, but connectivity to the other database is down. Rather than allowing each server to proceed writing to its local database, creating a synchronization nightmare,

> (split brain syndrome is the term :-)

> > the HA-JDBC driver goes into cluster panic, and requires manual resolution. Luckily, this isn't that common, since a network partition likely affects connectivity to the application client as well. We are really only concerned with database write requests in process when the partition occurs.

> Well I'm more concerned with the case where we have concurrent clients who see different replica groups because of network partitioning. So let's assume we have 3 replicas and they're not co-located with any application server. Then assume that we have a partition that splits them 2:1 so that one client sees 1 replica and assumes 2 failures and vice versa.

If your network were asymmetric in such a way, i.e. one client + 2 databases on one subnet, and 2 clients + 1 database on another subnet, you would still use the local="true|false" setting, where local implies local to the subnet - so this scenario can still be detected.

> >>>> There's the application client, but the important "client" for this failure is the coordinator. How are you assuming this failure mode will be dealt with by the coordinator if it can't determine who did what when?

> >>> Let me elaborate... If protected-commit mode is enabled, then commits and rollbacks will trigger a 2-phase acknowledgement broadcast to the other nodes of the JGroups application cluster.

> >> Great. Who is maintaining the log for that then? Seems to me that this is starting to look more and more like what a transaction coordinator would do ;-)

> > Essentially, that's what the XAResource is doing - coordinating the atomic behavior of the resource managers that it proxies. The advantage is - instead of reporting mixed/hazard heuristic outcomes due to in-doubt transactions - we can report a definitive outcome to the transaction manager, since the in-doubt resource managers can be deactivated.

> No you can't. Not unless you've got durability in the XAResource proxy. Do you?
> If not, can you explain how you can return a definitive answer when the proxy fails?

Yes - there is durability in both the XAResource and Connection proxies. The log for this is maintained on each client node, and also replicated via JGroups so that remote nodes can recover from partial commits instead of waiting for the crashed node to recover itself.

> >>> If transaction-mode="serial" then within the HA-JDBC driver, commits/rollbacks will execute against each database sequentially. In this case, we send a series of notifications prior to the commit/rollback invocation on each database, and a second series of acknowledgements upon completion. If the application server executing the transaction fails in the middle, the other nodes will detect the view change and check to see if there are any unacknowledged commits/rollbacks. If so, the unacknowledged databases will be deactivated.

> >> OK, but what about concurrent and conflicting updates on the databases by multiple clients? Is it possible that two clients start at either end of the replica list and meet in the middle and find that they have already committed conflicting updates because they did not know about each other? (A white board would be a lot easier to explain this.)

> > This doesn't happen, as the replica list uses a consistent ordering across multiple clients.

> What ordering protocol do you use? Causal, atomic, global?

I didn't mean to imply a broadcast protocol - the commits will occur in alphanumeric order of their ids (when transaction-mode="serial").

> >>> If transaction-mode="parallel" then within the HA-JDBC driver, commits/rollbacks will execute against each database in parallel. In this case, we send a single notification prior to the invocations, and a second acknowledgement after the invocations complete. If, on view change, the other nodes find an unacknowledged commit/rollback, then the database cluster enters a panic state and requires manual resolution.

> >> I assume parallel is non-deterministic ordering of updates too across replicas?

> > Yes - this mode is useful (i.e. faster) when the likelihood of concurrent updates of the same data is small.

> What do you do in the case of conflicts?

If there are concurrent updates of the same data and transaction-mode="parallel", and transaction 1 happens to get to resource 1 first, but transaction 2 happens to get to resource 2 first, then each will wait for the other until their transactions time out, triggering a rollback. Again, this mode is an optimization for systems with a very low likelihood of concurrent updates.

> >>> [OT] I'm considering adding a third "hybrid" transaction mode, where transactions execute first on one database, then on the rest in parallel. This mode seeks to address the risk of deadlocks with concurrent updates affecting parallel mode, while enabling better performance than serial mode for clusters containing more than 2 databases. This mode would benefit from fine-grained transaction acknowledgements, while requiring less overhead than serial mode when cluster size > 2.

> >> What is the benefit of this over using a transaction coordinator to do all of this coordination for you in the first place?

> > HA-JDBC is designed to function in both JTA and non-JTA environments.
> > The above logic will guard against in-doubt transactions in both environments.

> I may have more to say on this once we get over the question of durability in your XAResource ;-)

What other questions do you have about durability in HA-JDBC?

> >>>> What about a complete failure of every node in the system, i.e., the application, coordinator and all of the JDBC replicas fails? In the wonderful world of transactions, we still need to guarantee consistency in this case.

> >>> This is also handled. Just as cluster state changes are both broadcast to other nodes and recorded locally, so are commit/rollback acknowledgements.

> >> And heuristic failure? In fact, all responses from each XAResource operation?

> > Hopefully, I already answered this question above. If not, can you elaborate?

> I don't think we're quite finished yet ;-)

Me neither. It occurs to me now that the current durability behavior is not yet sufficient for XA transactions, since the Xid is only recorded before commit/rollback. This will only detect crashes during commit/rollback, but will not detect heuristic decisions made if the client node crashes after prepare() but *before* commit/rollback. To address this, I really ought to record Xids in prepare(). When the unfinished tx is detected (either locally after restart or remotely after a JGroups membership change), I'll need to perform an XAResource.recover() + rollback() for each recorded Xid, and resolve any conflicting results.

Anything else?

Paul