From: amit s. <ami...@gm...> - 2013-03-14 11:34:53
Hello,
I am using the DataSource configuration to create a high-availability cluster of database servers. The dialect used is OracleDialect.
The issue is that when the primary server (the one with the higher weight) fails, the HA-JDBC library checks whether the SQLException is an instance of java.sql.SQLNonTransientConnectionException. If it is not, it does not fall back to the next database instance. In the case of Oracle, when a planned outage occurs (by executing SHUTDOWN IMMEDIATE), the SQLException thrown is not an instance of SQLNonTransientConnectionException; it has SQL state 66000.
Below is the configuration code:

DataSourceDatabaseClusterConfiguration config = new DataSourceDatabaseClusterConfiguration();
config.setDatabases(Lists.newArrayList(db1, db2));
config.setDialectFactory(new OracleDialectFactory());
config.setStateManagerFactory(new SimpleStateManagerFactory());
config.setDatabaseMetaDataCacheFactory(new SimpleDatabaseMetaDataCacheFactory());
config.setDurabilityFactory(new NoDurabilityFactory());
config.setBalancerFactory(new SimpleBalancerFactory());

Map<String, SynchronizationStrategy> synchronizationStrategies = Maps.newHashMap();
synchronizationStrategies.put("passive", new PassiveSynchronizationStrategy());
config.setSynchronizationStrategyMap(synchronizationStrategies);
config.setDefaultSynchronizationStrategy("passive");

DataSource ds = new DataSource();
ds.setCluster("cluster");
ds.setConfigurationFactory(new SimpleDatabaseClusterConfigurationFactory<>(config));
Is there something in the configuration I am missing?
Thanks,
Amit.
P.S. - There is a hack for this - extend the dialect and override the indicateFailure() method (sketched below). But that defeats the purpose of using the library :).
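For reference, a minimal sketch of that workaround. It assumes indicateFailure() takes the thrown SQLException and returns whether the database should be treated as failed, and that OracleDialect can be subclassed; the exact signature, package, and factory wiring may differ in your HA-JDBC version:

import java.sql.SQLException;

// Signature and superclass assumed; check the Dialect/OracleDialect API of your HA-JDBC release.
public class PlannedOutageOracleDialect extends OracleDialect {
    // SQL state Oracle reports during a planned outage (SHUTDOWN IMMEDIATE)
    private static final String SHUTDOWN_SQL_STATE = "66000";

    @Override
    public boolean indicateFailure(SQLException e) {
        // Treat the planned-outage error as a connection failure so the
        // cluster deactivates the primary and falls back to the next database.
        if (SHUTDOWN_SQL_STATE.equals(e.getSQLState())) {
            return true;
        }
        return super.indicateFailure(e);
    }
}

You would then supply it via a custom DialectFactory in place of OracleDialectFactory in the configuration above.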
From: Mark L. <ml...@re...> - 2009-12-07 09:19:05
Ah we meet again 6 months later :-) This has been stuck in my drafts for a while. I'll cut it down so we can concentrate on anything that really needs more discussion. >> >>> The purpose of the lock is to isolate >>> transaction boundaries, so that the synchronization process (through >>> acquisition of the write side of this lock) can ensure that no >>> transactions are in progress (including local auto-commits). This >>> is >>> the traditional single writer, concurrent readers - however, in this >>> case the "writer" is the synchronization process, and the readers >>> are >>> transaction blocks. >> >> Do you have any diagrams or text on how this is supposed to work with >> multiple clients running on disparate VMs? > > No - but it is simple enough to explain. This is done via a single > ReadWriteLock in the HA-JDBC driver. Interceptors acquire and release > read locks at the client side JDBC/JTA transaction boundaries (acquire > on begin, release on end). The successful acquisition of the write > lock > means that all current transactions have completed and no new ones can > start, establishing a transactional quiescent period for the > duration of > the activation process. > This, again, relies on the assumption that all database access will > use > the HA-JDBC driver. > When the same cluster is accessed by multiple JVMs, as indicated by > the > addition of a <distributable/> element in ha-jdbc's config file, we > use > a distributable implementation of the ReadWriteLock. While the read > side of the lock works identically to the local case, the write side > of > the lock use JGroups rpc. The write lock will not succeed until it > can > be acquired on all nodes. We track lock ownership per node such > that a > dropped group member will have all its owned locks released. We may have covered this already (the discussion has been going on so long I'm starting to forget what we have covered). However, my query here has to do with ensuring that you don't have multiple clients locking the same group at the same time and ending up in a situation where they compete due to the inability to ensure message ordering. It sounds like from your description that there is only one lock because there's only a single drive instance for each group, so if there were multiple clients/users of the same group they'd have to share this same driver instance (single point of failure?). Correct? > >> Or maybe you assume the >> database will sort that out? > > Since the above process establishes a period of transactional > quiescence, there is no need for explicit locking within the databases > themselves. Agreed, assuming you have a single coordinator that all clients must use. But that brings issues in and of itself. >> >> OK. One further clarification: in step 1 this is a lock on the entire >> cluster (all database instances) so any clients running anywhere else >> are blocked until the group view membership change has completed? > > This is a lock on all HA-JDBC *driver* instances. > Clients are blocked until group *state* change is complete. > > Point of clarification: > group view = set of ha-jdbc clients > group state = set of active databases, e.g. [db1, db2, db3] OK, so this would seem to indicate that if I have an HA-JDBC driver A, and clients C1, C2 and C3, then all clients (C1...3) must interact with the databases through the same client, A? 
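For readers following along, a minimal sketch of the locking pattern described above for the local (single-JVM) case, using the standard java.util.concurrent ReadWriteLock; the class and method names are illustrative only, not HA-JDBC's actual API. In the distributed case the same pattern applies, except that acquiring the write side becomes a JGroups RPC that must succeed on every group member.

import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class ClusterQuiescenceGate {
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    // Interceptor hook: acquire the read side on transaction begin and hold it
    // for the life of the transaction (or the single auto-commit statement).
    Lock transactionBegan() {
        Lock readLock = lock.readLock();
        readLock.lock();
        return readLock; // the interceptor releases it on commit/rollback
    }

    // Synchronization/activation hook: the write side is only granted once all
    // in-flight transactions have released their read locks, which establishes
    // the transactional quiescent period for the duration of activation.
    void runWhileQuiescent(Runnable activation) {
        Lock writeLock = lock.writeLock();
        writeLock.lock();
        try {
            activation.run();
        } finally {
            writeLock.unlock();
        }
    }
}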
> >> You cannot assume that because all >> of the replicas have the same initial state and receive the same set >> of messages in the same order that they will always come to the same >> final state. > > I don't assume that. Replicas are assumed to have the same final > state > if the above is true *and* the responses from a given transactional > operation were the same. Seems like a valid assumption to me. Not really. Only if you can assume that they operate in lock-step, have the same concept of time and, realistically, synchronize their operations every time they take a decision that is potentially non- deterministic. There are a number of papers on this, but Voltan (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.11.1757 ) is one system that springs to my mind immediately. But basically this is why we have multiple types of replica consistency protocol, typically active and passive. There are trade-offs with them, as I'm sure you know. Passive is often poorer in performance due to fail-over requirements, but can work in any environment with any application being replicated because we would impose one view (the primary's) over all of the other replicas (the backups or cohorts). Active is often faster, but assumes that it's impossible to have non-deterministic behavior, i.e., that each replica makes the exact same choices and ends up at the exact same state given the same start state and sequence of input messages. Unfortunately in the real world that is often not possible (see Voltan). So this is my concern here. > >> That's an assumption we make in many active replication >> protocols and it's fine in a very select environment. But throw in >> time, subatomic/high energy particles, and the 3rd law of >> thermodynamics, and it falls down. Which is why primary copy/ >> coordinator-cohort replication is the dominant replica consistency >> protocol in the world. > > Perhaps I should be more precise: > In a typical 2PC use case, unanimity is necessary because all > resources > involved matter, i.e. transaction success is contingent on the success > of all members. In HA-JDBC, unanimity is not necessary - > ultimately, we > only require the response of at least 1 resource - the rest are > technically dispensable (i.e. subject to deactivation). It is the > responsibility of the response coordination strategy to decide which > response should be taken as the definitive one. OK, so this works if you ensure that all of the backup replicas then get the state of the primary if they don't already. I can see how you could do that, e.g., as soon as you get the first response you exclude all of the backups until they either check they have that state or some recover subsystem imposes that state. But it's starting to sound like passive replication. > e.g. > Take a logical cluster of 3 databases (A, B, C) using ordered (as > opposed to optimistic, pessimistic, etc.) response coordination. If > it > helps, think of A as the master, B as the primary slave, and C as the > secondary slave. Think of HA-JDBC as synchronous replication, but > instrumented in the driver, rather than the master database (the > typical > mechanism). If the commit of A succeeds or fails, then we expect/ > hope B > and C to also succeed or fail accordingly. However, the successful > commit on A is *not* predicated on the successful commit of B and C - > which would otherwise be the use case for 2PC. 
I never thought it was predicated, but that doesn't mean they'll have the same state ;) > If B and/or C do not > behave as we expected, we simply deactivate those databases whose > responses deviate from A's. Worse case scenario, the responses for B > and C are not consistent with A and HA-JDBC degrades to the behavior > of > a single database (i.e. a slave-less master). This is valid, albeit > not > ideal - but not likely either. So you check the states of B and C *before* returning a result and *before* allowing other clients to use this replica group? Unless you exclude B and C as soon as you get an answer from A I'm concerned that with B and C remaining in the group, a failure at this point (before any of those options I mentioned happens) a subsequent client could come in and read from B, whose state is different to A (in this scenario). > However, if A's failure is detected to be catastrophic (determined > by a > subsequent simple query), then A is deactivated and B now assumes > master > status and becomes the new authoritative response. If C's response is > consistent with B's, great, if not deactivate. This second scenario > is > where HA-JDBC aims to add value. > > Hopefully, that's a better explanation. > That said, do you still think I'm wrong? I think you're saying the same thing, or rather that we're converging. However, could you work through a scenario starting with two clients and 3 replicas; the clients want to use the group at roughly the same time and will be modifying/looking at the same data. Let's ignore failures initially, and then inject failures. I'm looking to ensure that the above situation, where the second client reads wrong data from a backup, doesn't happen. As I said, if the first client doesn't get a response until the entire group is guaranteed to be consistent (even if that means contracting the group size to a single replica) then I think I'm with you. > > Thinking this over, perhaps the addition of a configuration option > allowing HA-JDBC's XAResource proxy to implement 2PC in the 1PC > scenario > might, on average, decrease the likelihood of cluster degradation, so > maybe it's a worthwhile effort. > > N.B. HA-JDBC configured with "ordered" response coordination and > "simple" balancer (i.e. reads always go to master) will behave like > synchronous primary-copy replication. > > >> If your approach were valid then we wouldn't need transactions at >> all, >> now would we ;-) ? > > Sorry - your last statement seems like a non sequitur. An application > should not care that their data is committed to *all* databases in the > HA-JDBC cluster, only the ones that remain active after the commit is > complete. Yes, but as you can hopefully see, I'm not sure if that is the case here. > >>> There should >>> really be a more flexible mechanism for handling this. Perhaps a >>> configuration option to determine how to HA-JDBC should resolve >>> conflicts. Possible values for this would be: >>> * optimistic: Assume that user transaction expects success (current >>> behavior) >>> * pessimistic: Rollback if possible, otherwise halt cluster and log >>> conflict. >>> * paranoid: Halt cluster, log conflict. >> >> This would worry me if I were a client. "I'm using transactions for a >> reason. Please don't break the ACID semantics without informing >> me." ;) > > I don't follow you... Exactly which semantic (A, C, I, or D) do you > believe is broken from the perspective of the client's transaction? 
Atomic and consistent if it's possible for the 2nd client to read information that the 1st didn't. But let's work through that example. >>> >> >> Not if you're not running 2PC between the XAResource instances within >> the proxy. Consider failures of the proxy as well as failures of the >> participants. > > The level of durability is configurable, durability="fine|coarse| > none": > * fine : Uses database granularity. Records/broadcasts Xid, database > id, and responses before prepare and after commit/rollback per > database. > Can detect and recover from unacknowledged commit/rollback by > deactivating the appropriate database(s). This is the preferred mode > for the distributed use case. > * coarse : Uses proxy granularity. Records/broadcasts Xid before > proxy > prepare and after processing of proxy commit/rollback. Can only > detect > that an unacknowledged commit may have occurred, but lacks the > detail of > which databases were affected. Requires manual recovery. This mode > is > more efficient than fine and is adequate for the non-distributed use > case. > * none : Nothing recorded. Optimization for read-only clusters. This is one we may have discussed already, but where is this information recorded? > >>> If forget fails, the error is logged and db2 is deactivated. >>> If the proxy crashes, then we follow the "protected-commit" logic >>> discussed earlier. >> >> Can you take me through a scenario where you have a proxy crash while >> interacting with the participants? > > The recovery process is run either by a peer node that detected the > crash when the node dropped from group membership, or when HA-JDBC > restarts. > Below are the recovery processes for each durability mode: > > fine: > * Check commit/rollback durability records: > - If unacknowledged commits/rollbacks exist, deactivate these > databases. > - If uninitiated commits/rollbacks exist, deactivate these > databases. > - If unprocessed commit/rollback responses exist, process these > according to the configured response coordination strategy. > * Check prepare durability records: > - If unacknowledged prepares exist, deactivate these databases. > - If uninitiated prepares exist, deactivate these databases. > - If unprocessed prepare responses exist, process these according to > the configured response coordination strategy. > > coarse: > * Check commit/rollback durability records: > - If unacknowledged commits/rollbacks exist, stop the cluster, and > log reason. > * Check prepare durability records: > - If unacknowledged prepares exist, stop the cluster, and log reason > > none: > * No recovery > > Note that the recovery process above operates independently from any > transaction manager recovery. This seems to assume the information is stored with the HA-JDBC driver, right? So this comes back to an earlier statement/assumption I'm making: in all cases all clients/users of a replica group must go through the same driver instance? If that's not the case how can you ensure the above, unless the individual replicas are maintaining interaction state and a new driver instance can somehow recreate the exact state that the previous instance had, in order to ensure consistency. 
> >>>>> 6 void [XAER_*] void, deactivate db2 >>>>> 7 [XA_HEURCOM] [XA_HEURCOM] [XA_HEURCOM] >>>>> 8 [XA_HEURCOM] [XA_HEURRB] [XA_HEURCOM], forget and deactivate >>>>> db2 >>>> >>>>> >>>>> 9 [XA_HEURCOM] [XA_HEURMIX] [XA_HEURCOM], forget and deactivate >>>>> db2 >>>> >>>> This is more concerning because you've got a mixed heuristic >>>> outcome >>>> which means that this resource was probably a subordinate >>>> coordinator >>>> itself or modifying multiple entries in the database. That'll make >>>> automatic recovery interesting unless you expect to copy one entire >>>> database image over the top of another? >>> >>> During synchronization prior to database activation, all prepared >>> and >>> heuristically completed branches on the target database are >>> forgotten. >> >> But you can't just forget autonomously. Have you looked at how >> heuristic resolution works in the real world? If we could simply call >> forget as soon as we get a heuristic then it's be easy ;-) >> Unfortunately heuristic resolution often requires semantic >> information >> about the application and data, which typically means manual >> intervention. A heuristic commit is not the same as a commit, even if >> commit was called by the application. Likewise for a heuristic >> rollback. > > Why not? The "target" database I'm talking about above, against which > any heuristic decisions are to be forgotten, is the database that is > about to be re-sync'ed. The outcome of any heuristic decisions - or > even the fact that they happened - are no longer of any consequence. > What would be the point of a manual heuristic resolution, just to have > the data wiped and re-written? Because a heuristic means that something happened that could have been read by others and propagated outside of the system (to other tables in the same database, to people elsewhere in the world, to another system you don't control etc.) Unless you can guarantee that each database instance can only ever be ready through your protocol, then it is wrong to autonomously forget them. However, I think you are making that assumption, right, i.e., that when a database instance is tied into HA-JDBC it is not allowed to be used through any other mechanism directly? > > I think you are confusing this with heuristic decisions on the *active > databases*. These indeed require resolution of the kind to which you > refer. As I mentioned before, resolution of any heuristic decisions > are > required before any inactive databases can be re-activated. No, I'm not confusing this at all. >> >> I disagree. Maybe we need a whiteboard ;-) > > I think you're still operating under the assumption that each database > of the cluster must present a consistent view when accessed externally > (i.e. not using HA-JDBC). Again, this is not a supported use case. OK, I think that may answer my question above. > > Otherwise, I don't see how allowing 1PC would risk more heuristic > outcomes. Well you're likely to get more heuristic outcomes, it's just that you are going to mask them so as far as a client/user is concerned there won't be any difference. >> >> No because you do not know what else happened to cause the heuristic. >> Tell me: what do you think happens to generate a heuristic outcome? >> What's going on at the participant? What about other clients that may >> be looking at the same database? Have you read up on cascading >> rollbacks as well? > > You seem to be suggesting that heuristically completed branches are > allowed to violate isolation semantics. Is this true? 
A heuristic outcome like heuristic-rollback means that the RM has rolled back any state changes you may have been making. That then means another client(s) could come in and read the wrong state or modify that state. As far as the RM is concerned the transaction ended when it rolled back, so there is no violation of isolation. Unfortunately as far as the original transaction is concerned it breaks atomicity and, potentially, isolation. >> >> Ah, OK. So what are your assumptions. > > 1. All access to the databases is done via the HA-JDBC driver, i.e. > only > Java clients are allowed. > 2. If multiple JVMs access the same HA-JDBC cluster, then: > a. The HA-JDBC configuration for each process must be identical > b. The HA-JDBC cluster is designated as distributed (i.e. jgroups > enabled), enabling the replication of cluster state. By 2a you mean they must all use the same driver instance? > >>> One of HA-JDBC's requirements is that all access to the databases of >>> the >>> cluster go through the HA-JDBC driver. >> >> But what about concurrent clients on *different* VMs. Today they'd >> all >> use different JDBC driver instances. If they were running within >> transactions then we'd get data consistency. Heuristic outcomes would >> be resolved by sys admins (typically). What's the equivalent? > > The equivalent for: > * driver > - Each VM uses it's own instance of the HA-JDBC driver, which manages > a JDBC driver instances to N locations. The cluster state, i.e. set > of > active databases, is replicated across each instance. Are these N locations the database replicas, or do you form a replica group of HA-JDBC drivers so that each VM is essentially coordinating with the others? > * data consistency > - No additional mechanism is needed to ensure consistency across > databases. Data consistency across databases is maintained via local > transaction isolation, combined with consistent execution order across > the individual databases in the cluster. Typical database writes will > execute against the 1st database in cluster, establishing > exclusivity to > that data, then the remaining databases in the cluster in parallel. > Any > subsequent requests to modify the same data will block against the 1st > database, removing the possibility of lock contention that might cause > data inconsistency across databases. If each HA-JDBC driver executes on the same database replicas in isolation to the other driver instances then there's still an issue with overlapping requests, particularly where locking is concerned. Then throw in failures and it becomes more interesting. So I'm hoping that you form an HA-JDBC driver replica group too :-) > * heuristic resolution > - Heuristic outcomes are still resolved by sys admins, but only need > to happen those databases considered active. If a heuristic outcome > exists on one active database, it will exist on all active > databases, as > maintained by the response coordination mechanism. > >> >> Your heuristic conflict/resolution matrix is flawed. > > I get it - you're not a fan of my optimistic response coordination > strategy. Would you say the same for the equivalent (and more > trivial, > i.e. always use column A) matrix for the ordered response coordination > strategy? I am no longer so sure about that either ;-) >>> >> >> When does the durability happen and what is recorded? > > Depends on the use case and the configured durability level. 
> durability="fine": > * JTA, 2PC: > - Record Xid and database id before prepare(Xid), per database > - Record prepare response after prepare(Xid), per database > - Record Xid and database id before commit(Xid, false) or > rollback(Xid), per database > - Record commit/rollback response per database > - Clear above records after proxy processes commit/rollback responses > * JTA, 1PC: > - Record Xid and database id before commit(Xid, true), per database > - Record commit response per database > - Clear above records after proxy processes commit responses > * non-JTA, autocommit=false: > - Record local tx id and database id before commit(), per database > - Record commit response per database > - Clear above records after proxy processes commit responses > * non-JTA, autocommit=true: > - Record local tx id and database id before statement execution, > per database > - Record statement execution response per database > - Clear above records after proxy processes statement execution > responses Ouch - a lot of access to some durable store. When you consider that your friendly transaction manager is trying to optimize performance as much as possible, typically by using presumed abort, this really starts to look like a lot of overhead. You're imposing a presumed- nothing protocol (or close enough to it) on top of whatever the transaction manager is doing. > > durability="coarse": > * JTA, 2PC: > - Record Xid before prepare(Xid) > - Clear above records after proxy processes commit(Xid, false)/ > rollback(Xid) responses > * JTA, 1PC: > - Record Xid before commit(Xid, true) > - Clear record after proxy processes commit responses > * non-JTA, autocommit=false: > - Record local tx id before commit() > - Clear record after proxy processes commit responses > * non-JTA, autocommit=true: > - Record local tx id before statement execution > - Clear record after proxy processes statement execution responses You should be able to integrate replication of datasources/resource managers in a way that works with the transaction manager protocol, rather than against it. Mark. --- Mark Little ml...@re... JBoss, by Red Hat Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod Street, Windsor, Berkshire, SI4 1TE, United Kingdom. Registered in UK and Wales under Company Registration No. 3798903 Directors: Michael Cunningham (USA), Charlie Peters (USA), Matt Parsons (USA) and Brendan Lane (Ireland). |
From: Paul F. <pau...@re...> - 2009-06-26 21:23:51
On Thu, 2009-03-12 at 15:53 +0000, Mark Little wrote: > Hi Paul. Here's my response for this quarter ;-) > > On 10 Feb 2009, at 22:05, Paul Ferraro wrote: > > > On Fri, 2009-01-30 at 12:20 +0000, Mark Little wrote: > >> Hi Paul. Sorry for the delay in replying ... > > > > No worries. Thanks for the thought and time you've put into this > > conversation so far. > > It's interesting. Although I think it would be a lot easier to thrash > this out face-to-face: the intermittent responses mean I have a lot of > context switching to do ;-) I'm sure the same is true for you. Yes - that would be ideal. I had hoped that JBoss World might be that opportunity, but, given that both of my presentation proposals were rejected, there is little chance I will be attending. > >> On the whole database or just on the data that needs updating? The > >> former is obviously easier but leads to more false locking, whereas > >> the latter can be harder to ensure but does allow for continuous use > >> of the remaining database tables (cf page level locking versus object > >> level locking.) > >> > >>> New transactions (both local and XA) > >>> acquire a local read locks on begin and release on commit/rollback. > >> > >> Locking what? > > > > A single ReadWriteLock (per logical cluster) within the HA-JDBC > > driver. > > OK, but is this on the entire data stored within the cluster instance? The initial question got snipped - but your next response seems to indicate that this question was already answered. > >>> The > >>> acquisition of a distributed write lock effectively blocks any > >>> database > >>> writes across the whole cluster, while still allowing database > >>> reads. > >> > >> OK, so this sounds like it's a lock on the database itself rather > >> than > >> any specific data items. However, it also seems like you're not using > >> the traditional exclusive writer, concurrent readers. So how do you > >> prevent dirty reads of data in that case? Are you assuming optimistic > >> locking? > > > > I'm talking about a client-side RWLock, not a database lock. > > Remember, > > HA-JDBC is entirely client-side. > > Good point. I did remember that in some of the initial discussions. > > > The purpose of the lock is to isolate > > transaction boundaries, so that the synchronization process (through > > acquisition of the write side of this lock) can ensure that no > > transactions are in progress (including local auto-commits). This is > > the traditional single writer, concurrent readers - however, in this > > case the "writer" is the synchronization process, and the readers are > > transaction blocks. > > Do you have any diagrams or text on how this is supposed to work with > multiple clients running on disparate VMs? No - but it is simple enough to explain. This is done via a single ReadWriteLock in the HA-JDBC driver. Interceptors acquire and release read locks at the client side JDBC/JTA transaction boundaries (acquire on begin, release on end). The successful acquisition of the write lock means that all current transactions have completed and no new ones can start, establishing a transactional quiescent period for the duration of the activation process. This, again, relies on the assumption that all database access will use the HA-JDBC driver. When the same cluster is accessed by multiple JVMs, as indicated by the addition of a <distributable/> element in ha-jdbc's config file, we use a distributable implementation of the ReadWriteLock. 
While the read side of the lock works identically to the local case, the write side of the lock use JGroups rpc. The write lock will not succeed until it can be acquired on all nodes. We track lock ownership per node such that a dropped group member will have all its owned locks released. > Or maybe you assume the > database will sort that out? Since the above process establishes a period of transactional quiescence, there is no need for explicit locking within the databases themselves. > >>> The activation process is: > >>> 1. Acquire distributed write lock > >>> 2. Synchronize against an active database using a specified > >>> synchronization strategy > >>> 3. Broadcast an activation message to all client nodes > >> > >> I don't quite get the need for 3, but then this may have been covered > >> before. Why doesn't the newly activated replica simply rejoin the > >> group? Aren't we just using the groupid to communicate and not > >> sending > >> messages from clients to specific replicas, i.e., multicast versus > >> multiple unicasts? (BTW, I'm assuming "client" here means database > >> user.) > > > > Because the replica is not a group member. > > A few points of architectural clarification... > > I use the term "client node" to refer to a DatabaseCluster object (the > > highest abstraction in HA-JDBC). This is the object to which all > > end-user JDBC facades interact. The DatabaseCluster manages 1 or more > > Database objects - a simple encapsulation for the connectivity details > > for a specific database. The Database object comes in 4 flavors: > > Driver, DataSource, ConnectionPoolDataSource, and XADataSource > > Yes, that much you can assume to be taken for granted ;-) Not necessarily - Sequoia (http://sequoia.continuent.org) only instruments the application exposed DataSource and so is particularly naive when it comes to distributed transactions. ;) > > When the client requests a connection from the HA-JDBC driver, HA-JDBC > > DataSource, or app server DataSource backed by an HA-JDBC > > ConnectionPoolDataSource or XADataSource, the object returned is a > > dynamic proxy backed by the DatabaseCluster, containing a map of > > Connections to each active database in the cluster. The Connection > > proxy begets Statement proxies, etc. > > One of the prime duties of the DatabaseCluster is managing the cluster > > state, i.e. the set of active databases. This is represented as a set > > of string identifiers. > > In the distributed use case, i.e. where the HA-JDBC config file > > defines > > the cluster as <distributable/> (a la web.xml), the cluster state is > > replicated all participating application servers via jgroups. The > > jgroups membership is the set of app servers accessing a given HA-JDBC > > cluster (not the replicas themselves). The activation process I > > described above occurs on a given node of the application server > > cluster. The activation message to which I referred tells each group > > member to update its state, e.g. from [db1, db2] -> [db1, db2, db3] > > OK. One further clarification: in step 1 this is a lock on the entire > cluster (all database instances) so any clients running anywhere else > are blocked until the group view membership change has completed? This is a lock on all HA-JDBC *driver* instances. Clients are blocked until group *state* change is complete. Point of clarification: group view = set of ha-jdbc clients group state = set of active databases, e.g. [db1, db2, db3] > >> No, you misunderstand. 
If you are proxying as you mentioned, then > >> you're effectively acting as a subordinate coordinator. Your > >> XAResource is hiding 1..N different XAResource instances. In the case > >> of the transaction manager, it sees only your XAResource. This means > >> that if there are no other XAResources involved in the transaction, > >> the coordinator is not going to drive your XAResource through 2PC: > >> it'll call commit(true) and trigger the one-phase optimization. Now > >> your XAResource knows that there are 1..N instances in reality that > >> need coordinating. Let's assume it's 2..N because 1..1 is the > >> degenerate case and is trivial to manage. This means that your > >> XAResource is a subordinate coordinator and must run 2PC on behalf of > >> the root coordinator. But unfortunately the route through the 2PC > >> protocol allows for a number of different error conditions than the > >> route through the 1PC termination protocol. So your XAResource will > >> have to translate 2PC errors, for example, into 1PC (and sometimes > >> that mapping is not one-to-one.) Do you handle that? > > > > I see what you're saying now - but I disagree with your assertion that > > HA-JDBC's XAResource proxy "must run 2PC on behalf of the root > > coordinator" in the single phase optimization scenario. > > The point of 2PC is to ensure consensus that a unit of work can be > > committed across multiple resources. > > I think you can assume I know the point of 2PC ;-) > > > In HA-JDBC, since the resources > > managed by the proxy are presumed to be identical by definition, the > > prepare phase is unnecessary since we expect the response to be > > unanimous one way or the other. Therefore, the single phase > > optimization is entirely appropriate. > > Nope, you're wrong. You can make all of the assumptions you want, but > in reality they are worthless. > Failures may happen for a number of reasons and not result in crashes. Of course - I don't think I made any claims to the contrary. > You cannot assume that because all > of the replicas have the same initial state and receive the same set > of messages in the same order that they will always come to the same > final state. I don't assume that. Replicas are assumed to have the same final state if the above is true *and* the responses from a given transactional operation were the same. Seems like a valid assumption to me. > That's an assumption we make in many active replication > protocols and it's fine in a very select environment. But throw in > time, subatomic/high energy particles, and the 3rd law of > thermodynamics, and it falls down. Which is why primary copy/ > coordinator-cohort replication is the dominant replica consistency > protocol in the world. Perhaps I should be more precise: In a typical 2PC use case, unanimity is necessary because all resources involved matter, i.e. transaction success is contingent on the success of all members. In HA-JDBC, unanimity is not necessary - ultimately, we only require the response of at least 1 resource - the rest are technically dispensable (i.e. subject to deactivation). It is the responsibility of the response coordination strategy to decide which response should be taken as the definitive one. e.g. Take a logical cluster of 3 databases (A, B, C) using ordered (as opposed to optimistic, pessimistic, etc.) response coordination. If it helps, think of A as the master, B as the primary slave, and C as the secondary slave. 
Think of HA-JDBC as synchronous replication, but instrumented in the driver, rather than the master database (the typical mechanism). If the commit of A succeeds or fails, then we expect/hope B and C to also succeed or fail accordingly. However, the successful commit on A is *not* predicated on the successful commit of B and C - which would otherwise be the use case for 2PC. If B and/or C do not behave as we expected, we simply deactivate those databases whose responses deviate from A's. Worse case scenario, the responses for B and C are not consistent with A and HA-JDBC degrades to the behavior of a single database (i.e. a slave-less master). This is valid, albeit not ideal - but not likely either. However, if A's failure is detected to be catastrophic (determined by a subsequent simple query), then A is deactivated and B now assumes master status and becomes the new authoritative response. If C's response is consistent with B's, great, if not deactivate. This second scenario is where HA-JDBC aims to add value. Hopefully, that's a better explanation. That said, do you still think I'm wrong? Thinking this over, perhaps the addition of a configuration option allowing HA-JDBC's XAResource proxy to implement 2PC in the 1PC scenario might, on average, decrease the likelihood of cluster degradation, so maybe it's a worthwhile effort. N.B. HA-JDBC configured with "ordered" response coordination and "simple" balancer (i.e. reads always go to master) will behave like synchronous primary-copy replication. > Take a look at some of the papers I referenced in initial emails. Many > of them talk about integrating transactions with active replication. I'll check them out. > If your approach were valid then we wouldn't need transactions at all, > now would we ;-) ? Sorry - your last statement seems like a non sequitur. An application should not care that their data is committed to *all* databases in the HA-JDBC cluster, only the ones that remain active after the commit is complete. > >>>> As a slight aside: what do you do about non-deterministic SQL > >>>> expressions such as may be involved with the current time? > >>> > >>> HA-JDBC can be configured to replace non-deterministic SQL > >>> expressions > >>> with client generated values. This behavior is enabled per > >>> expression > >>> type (e.g. eval-current-timestamp="true"): > >>> http://ha-jdbc.sourceforge.net/doc.html#cluster > >>> Of course, to avoid the cost of regular expression parsing, it's > >>> best to > >>> avoid these functions altogether. > >> > >> Yes, but if we want to opaquely replace JDBC with HA-JDBC then we > >> need > >> to assume that the client won't be doing anything to help us. > > > > The client never needs to provide runtime hints on behalf of HA-JDBC. > > Non-deterministic expression evaluation is enabled/disabled in the > > HA-JDBC configuration file. > > How does it know which statements to look for? Based on the configured dialect (abstraction for vendor specific SQL - Hibernate has an analogous construct). e.g. http://ha-jdbc.sourceforge.net/api/net/sf/hajdbc/Dialect.html#evaluateCurrentDate(java.lang.String,%20java.sql.Date) > >> What are the performance implications so far and with your future > >> plans? > > > > Currently, a given "next value for sequence" requires a jgroups rpc > > call > > to acquire an exclusive lock (per sequence) on each client node, and > > an > > subsequent rpc call to release the lock. 
> > The future approach doesn't improve performance (makes it slightly > > worse > > actually, by requiring an additional SQL statement), but rather > > reliability, since the first approach doesn't verify that the next > > sequence value was actually the same on each database, but trusts that > > it is, given its exclusive access. As I write this, it occurs to me > > that Statement.getGeneratedKeys() can still be leveraged to verify > > consistency without requiring separate ALTER SEQUENCE statements on > > the > > "slave" databases. So at this point, I need to think about this more. > > OK. I'm interested. At this point, I'm leaning towards keeping the current approach, and adding the getGeneratedKeys() validation, if supported by the driver. This avoid the need to modify the SQL sent to the underlying databases. > >> This should mean more than deactivating the database. If one resource > >> thinks that there was no work done on the data it controls whereas > >> the > >> other does, that's a serious issue and you should rollback! commit > >> isn't always the right way forward! This is different to some of the > >> cases below because there are no errors reported from the resource. > >> As > >> such the problem is at the application level, or somewhere in the > >> middleware stack. That fact alone means that anything the committable > >> resource has done is suspect immediately. > > > > Good point. #3 is really an exception condition, since it violates > > HA-JDBC's assumption that all databases are identical. > > See what I meant above ;-) ? The response matrix (also snipped) described optimistic response coordination. Ordered response coordination would have always used the result of the first database. > > There should > > really be a more flexible mechanism for handling this. Perhaps a > > configuration option to determine how to HA-JDBC should resolve > > conflicts. Possible values for this would be: > > * optimistic: Assume that user transaction expects success (current > > behavior) > > * pessimistic: Rollback if possible, otherwise halt cluster and log > > conflict. > > * paranoid: Halt cluster, log conflict. > > This would worry me if I were a client. "I'm using transactions for a > reason. Please don't break the ACID semantics without informing me." ;) I don't follow you... Exactly which semantic (A, C, I, or D) do you believe is broken from the perspective of the client's transaction? > > > > * chain-of-command: Assume the result of the 1st available database is > > the authoritative response - deactivate any deviants > > Have you looked at Available Copies replication. I did when - while there analogous aspects to HA-JDBC, there are several mechanistic differences. > > A "chain-of-command" strategy is really a more appropriate default for > > HA-JDBC, as opposed to the current optimistic behavior. > > Maybe. Why not? BTW, I've begun to referring to this strategy as "ordered". 
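To make the "ordered" (chain-of-command) strategy concrete, here is a sketch of the coordination step it implies: take the response of the first (master) database as authoritative and mark any database whose response deviates for deactivation. The types and method names are invented for illustration and are not HA-JDBC's real interfaces.

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

final class OrderedResponseCoordinator<R> {
    // responsesByDatabase is keyed by database id in cluster order, master first.
    // Returns the authoritative (master) response; any database whose response
    // deviates from it is added to toDeactivate.
    R coordinate(LinkedHashMap<String, R> responsesByDatabase, Set<String> toDeactivate) {
        Map.Entry<String, R> master = responsesByDatabase.entrySet().iterator().next();
        for (Map.Entry<String, R> entry : responsesByDatabase.entrySet()) {
            if (!Objects.equals(master.getValue(), entry.getValue())) {
                toDeactivate.add(entry.getKey());
            }
        }
        return master.getValue();
    }
}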
> >>> 4 XA_OK [XA_RB*] XA_OK, rollback and deactivate db2 > >>> 5 XA_RDONLY [XA_RB*] XA_RDONLY, rollback and deactivate db2 > >>> 6 [XA_RB*] [XA_RB*] [XA_RB*] (only if error code matches) > >>> 7 [XA_RBBASE] [XA_RBEND] [XA_RBEND] (prefer higher error code), > >>> rollback and deactivate db1 > >>> 8 XA_OK [XAER_*] XA_OK, deactivate db2 > >>> 9 XA_RDONLY [XAER_*] XA_RDONLY, deactivate db2 > >>> 10 [XA_RB*] [XAER_*] [XA_RB*], deactivate db2 > >>> 11 [XAER_*] [XAER_*] [XAER_*], chain exceptions if errors not the > >>> same > >> > >> Hmmm, well that's not good either, because XAER_* errors can mean > >> very > >> different things (in the standard as well as in the implementations). > >> So simply assuming that all of the databases are still in a position > >> where they can be treated as a single instance is not necessarily a > >> good idea. > > > > Good point. If the XAER_* error codes are not the same, I should > > respond using the configured conflict resolution strategy. > > Optimistic strategy would keep the database(s) with the highest error > > code (i.e. least severe error), and deactivate the others. > > Pessimistic/Paranoid strategy would halt the cluster. > > Chain-of-command strategy would return the XAER_* from db1 and > > deactivate db2. > > Yes. > > >> I understand all of what you're trying to do (it's just using the > >> available copies replication protocol, which has been around for > >> several decades.) But even that required a level of durability at the > >> recipient in order to deal with failures between it and the client. > >> The recipient in this case is your XAResource proxy. > > > > This durability does exists both in the XAResource proxy and the > > Connection proxy (for non-JTA transactions) - see the section on > > protected commit mode. Isn't that sufficient? > > Not if you're not running 2PC between the XAResource instances within > the proxy. Consider failures of the proxy as well as failures of the > participants. The level of durability is configurable, durability="fine|coarse|none": * fine : Uses database granularity. Records/broadcasts Xid, database id, and responses before prepare and after commit/rollback per database. Can detect and recover from unacknowledged commit/rollback by deactivating the appropriate database(s). This is the preferred mode for the distributed use case. * coarse : Uses proxy granularity. Records/broadcasts Xid before proxy prepare and after processing of proxy commit/rollback. Can only detect that an unacknowledged commit may have occurred, but lacks the detail of which databases were affected. Requires manual recovery. This mode is more efficient than fine and is adequate for the non-distributed use case. * none : Nothing recorded. Optimization for read-only clusters. > > If forget fails, the error is logged and db2 is deactivated. > > If the proxy crashes, then we follow the "protected-commit" logic > > discussed earlier. > > Can you take me through a scenario where you have a proxy crash while > interacting with the participants? The recovery process is run either by a peer node that detected the crash when the node dropped from group membership, or when HA-JDBC restarts. Below are the recovery processes for each durability mode: fine: * Check commit/rollback durability records: - If unacknowledged commits/rollbacks exist, deactivate these databases. - If uninitiated commits/rollbacks exist, deactivate these databases. 
- If unprocessed commit/rollback responses exist, process these according to the configured response coordination strategy. * Check prepare durability records: - If unacknowledged prepares exist, deactivate these databases. - If uninitiated prepares exist, deactivate these databases. - If unprocessed prepare responses exist, process these according to the configured response coordination strategy. coarse: * Check commit/rollback durability records: - If unacknowledged commits/rollbacks exist, stop the cluster, and log reason. * Check prepare durability records: - If unacknowledged prepares exist, stop the cluster, and log reason none: * No recovery Note that the recovery process above operates independently from any transaction manager recovery. > >>> 6 void [XAER_*] void, deactivate db2 > >>> 7 [XA_HEURCOM] [XA_HEURCOM] [XA_HEURCOM] > >>> 8 [XA_HEURCOM] [XA_HEURRB] [XA_HEURCOM], forget and deactivate db2 > >> > >>> > >>> 9 [XA_HEURCOM] [XA_HEURMIX] [XA_HEURCOM], forget and deactivate > >>> db2 > >> > >> This is more concerning because you've got a mixed heuristic outcome > >> which means that this resource was probably a subordinate coordinator > >> itself or modifying multiple entries in the database. That'll make > >> automatic recovery interesting unless you expect to copy one entire > >> database image over the top of another? > > > > During synchronization prior to database activation, all prepared and > > heuristically completed branches on the target database are forgotten. > > But you can't just forget autonomously. Have you looked at how > heuristic resolution works in the real world? If we could simply call > forget as soon as we get a heuristic then it's be easy ;-) > Unfortunately heuristic resolution often requires semantic information > about the application and data, which typically means manual > intervention. A heuristic commit is not the same as a commit, even if > commit was called by the application. Likewise for a heuristic rollback. Why not? The "target" database I'm talking about above, against which any heuristic decisions are to be forgotten, is the database that is about to be re-sync'ed. The outcome of any heuristic decisions - or even the fact that they happened - are no longer of any consequence. What would be the point of a manual heuristic resolution, just to have the data wiped and re-written? I think you are confusing this with heuristic decisions on the *active databases*. These indeed require resolution of the kind to which you refer. As I mentioned before, resolution of any heuristic decisions are required before any inactive databases can be re-activated. > > Synchronization will not occur (i.e. not be allowed to start, since it > > won't be able to acquire the DatabaseCluster's write lock described > > earlier) if the source database has any heuristically completed > > branches. > > See above. > > >>> > >> > >> Erm, not true as I mentioned above. If your XAResource proxy is > >> driven > >> through 1PC and there is more than one database then you cannot > >> simply > >> drive them through 1PC - you must run 2PC and act as a subordinate. > >> If > >> you do not then you are risking more heuristic outcomes and still > >> your > >> XAResource proxy will require durability. > > > > As I replied earlier - it's OK, I think, to drive the databases > > through 1PC. > > And, yes, durability does exist in XAResource proxy. > > I disagree. 
Maybe we need a whiteboard ;-) I think you're still operating under the assumption that each database of the cluster must present a consistent view when accessed externally (i.e. not using HA-JDBC). Again, this is not a supported use case. Otherwise, I don't see how allowing 1PC would risk more heuristic outcomes. > >> Nope. And this is what concerns me. You need to understand what > >> heuristics are about and why they have been developed over the past 4 > >> decades. > > > > I thought the whole point of making a heuristic decision was to > > improve > > resource availability by allow the resource to proceed with the second > > phase of a transaction branch according to the its response to first > > phase, in the event that the transaction manager fails to initiate the > > second phase within some pre-determined time - so that resources are > > not > > blocked forever. What other motivations are there? > > No. A heuristic decision doesn't have to match the original prepare > response. That's why commit can throw heuristic rollback (and that's > not just from commit-one-phase). The traditional (strict) 2PC is a > blocking protocol. Heuristics were introduced to cater for coordinator > failure originally. A participant that has prepared can make any > autonomous choice it wants to after prepare. So it can rollback just > as easily as it can commit. There are no rules apart from the fact > that the participant must remember that decision durably until told to > forget. Understood. > >> The reason that transaction managers don't ignore them and > >> try to "make things right" (or hide them) is because we don't have > >> the > >> necessary semantic information to resolve them automatically. > > > > On the contrary, HA-JDBC does have one vital piece of information that > > the transaction manager does not - the results from other identical > > resources to the same action. > > On the assumption that each of the replicas has identical state, > correct? That *and* given the configured strategy for coordinating responses. Using an ordered strategy, the vital piece of information we have is the response from the master database, with which the responses from the other databases must agree if they are to continue to be active. > > If one database made a heuristic decision > > to commit a transaction, while another database returns a successful > > commit (scenario #2), then I think HA-JDBC *does* have sufficient > > information to forget the heuristically completed branch and return > > normally. > > No because you do not know what else happened to cause the heuristic. > Tell me: what do you think happens to generate a heuristic outcome? > What's going on at the participant? What about other clients that may > be looking at the same database? Have you read up on cascading > rollbacks as well? You seem to be suggesting that heuristically completed branches are allowed to violate isolation semantics. Is this true? > > A transaction manager could not safely make this decision. > > A transaction manager with replication support cannot safely make that > decision either ;-) Again, it depends on the response coordination strategy. If using an ordered strategy, and the master says commit, but the slave makes a heuristic commit, then, by definition, the heuristic decision was correct. If it was the master that made the heuristic commit, and a slave commits normally - then the heuristic commit is what's returned and the slave database is deactivated. Isn't this more in line with what you are expecting? 
> > > > Likewise, if one database made a heuristic decision to rollback a > > transaction, while another database successfully commit (scenario #3), > > then I think HA-JDBC has sufficient information to conclude that the > > heuristic decision was incorrect and can forget the heuristically > > completed branch on the first database, deactivate the first database, > > and return normally. > > Nope. Then you wouldn't want to use an optimistic response coordination strategy. You would opt for an ordered strategy instead. > What about if one participant says heuristic-commit, another heuristic- > rollback, another heuristic-mixed, another heuristic-hazard and > another commit? What's the state (and it's not commit BTW.) That depends on the configured response coordination strategy: * ordered strategy accepts the response of the "master", regardless of how it responded, and deactivate the others. * optimistic strategy would commit, accepting the commit and heuristic-commit responses, forget the heuristically committed branch, and deactivate the others. * pessimistic strategy would respond with heuristic-rollback, and deactivate the others. * paranoid strategy would stop the cluster and require manual resolution. > >> You're > >> saying that you mask a heuristic database and take it out of the > >> group > >> so it won't cause problems. Later it'll be fixed by having the state > >> overwritten by some good copy. But you are ignoring the fact that you > >> do not have recovery capabilities built in to your XAResource as far > >> as I can tell and are assuming that calling forget and/or > >> deactivating > >> the database will limit the problem. > > > > What recovery capabilities do you think are missing? > > Knowing what caused the heuristic would be a good place to start. Then > knowing what else has been happening from the perspective of other > clients/users in the system. > > Maybe this does come down to some fundamental assumptions that you're > making about the way in which the replicated database will be used. > However, if you are assuming that HA-JDBC is a drop-in replacement for > something like Oracle RAC or even just a single database instance, > then I think your assumptions are flawed. See below. > >> In reality, particularly with > >> heuristics that cause the resource to commit or roll back, there may > >> be systems and services tied in to those database at the back end > >> which are triggered on commit/rollback and go and do something else > >> (e.g., on rollback trigger the sending of a bad credit report on the > >> user). > > > > This is not a valid use case for HA-JDBC. > > Ah, OK. So what are your assumptions. 1. All access to the databases is done via the HA-JDBC driver, i.e. only Java clients are allowed. 2. If multiple JVMs access the same HA-JDBC cluster, then: a. The HA-JDBC configuration for each process must be identical b. The HA-JDBC cluster is designated as distributed (i.e. jgroups enabled), enabling the replication of cluster state. > > One of HA-JDBC's requirements is that all access to the databases of > > the > > cluster go through the HA-JDBC driver. > > But what about concurrent clients on *different* VMs. Today they'd all > use different JDBC driver instances. If they were running within > transactions then we'd get data consistency. Heuristic outcomes would > be resolved by sys admins (typically). What's the equivalent? 
The equivalent for: * driver - Each VM uses it's own instance of the HA-JDBC driver, which manages a JDBC driver instances to N locations. The cluster state, i.e. set of active databases, is replicated across each instance. * data consistency - No additional mechanism is needed to ensure consistency across databases. Data consistency across databases is maintained via local transaction isolation, combined with consistent execution order across the individual databases in the cluster. Typical database writes will execute against the 1st database in cluster, establishing exclusivity to that data, then the remaining databases in the cluster in parallel. Any subsequent requests to modify the same data will block against the 1st database, removing the possibility of lock contention that might cause data inconsistency across databases. * heuristic resolution - Heuristic outcomes are still resolved by sys admins, but only need to happen those databases considered active. If a heuristic outcome exists on one active database, it will exist on all active databases, as maintained by the response coordination mechanism. > > A trigger like this is > > unsupported since it would: > > a) have to exist on every database - and would result it N bad credit > > reports - BAD. > > b) Given that any database can very well become inactive at any time, > > you cannot rely on a trigger on a single database cluster member. > > Unfortunate, yes, but there's only so much I can support from the > > client > > side. > > OK, so as long as we are clear on the use cases that can be supported > we can make some progress. I'm beginning to think that we may have > come at this discussion from different perspectives ;-) Inevitably... > >> In the real world, the TM would propagate the heuristic back to > >> the end user and the sys admin would then go and either stop the > >> credit report from being sent, or nullify it someone. > > > > Like I said earlier, because HA-JDBC also has the results from other > > identical resources available, it can reasonably determine the > > correctness of a heuristic decision. > > That's still not possible though ;-) > > Your heuristic conflict/resolution matrix is flawed. I get it - you're not a fan of my optimistic response coordination strategy. Would you say the same for the equivalent (and more trivial, i.e. always use column A) matrix for the ordered response coordination strategy? > >> How can he do > >> that? Because he understands the semantics of the operation that was > >> being executed. Does simply making the table entry the same as in one > >> of the copies that didn't throw the heuristic help here? No of course > >> it doesn't. It's just part of the solution (and if you were the > >> person > >> with the bad credit report, I think you'd want the whole compensation > >> not just a part.) > >> > >>> Before the resync - we > >>> recover and forget any prepared/heuristically completed branches. > >> > >> See above. It doesn't help. Believe me: I've been in your position > >> before, back in the early 1990's ;-) > > > > I'll need to read your responses to the above before I'm convinced. > > ;-) > > > > > > >>> If > >>> the XAResource proxy itself throws an exception indicating an > >>> invalid > >>> heuristic decision, only then do we let on to the transaction > >>> manager. > >>> In this case, we are forced to resolve the heuristic, either > >>> automatically (by the transaction manager) > >> > >> The transaction manager will not do this automatically in most cases. 
> > > > That's fine. The main side effect of an unresolved heuristic decision > > is the inability to activate/synchronize any inactive databases. > > The whole database or just the affected table? Let me clarify - if any *active* databases have any heuristically completed branches, then any attempt to activate an inactive database will be rejected. > >>> Transaction recovery should work fine. > >> > >> Confidence ;-) > >> > >> Have you tried it? > > > > No, actually. The durability logic is still in a development branch > > and > > has yet to be tested thoroughly. > > That's the 20% of code that takes 80% of the time. > > >> We have a very thorough test suite for JBossTS > >> that's been built up over the past 12 years (in Java). More than 50% > >> of it is aimed at crash recovery tests, so it would be good to try it > >> out. > > > > This would be an invaluable resource. > > Check out the DTF. It's an open source project on labs. Thanks. > >> OK, so I'm even more confused than I was given what you said way back > >> (that made me ask the unicast/multicast question). > > > > Oh - sorry - I misunderstood your original question... > > > > By "group-like abstraction for the servers", you meant: does the HA- > > JDBC > > driver use multicast for statement execution? No. > > OK. > > >> Well I'm more concerned with the case where we have concurrent > >> clients > >> who see different replica groups because of network partitioning. So > >> let's assume we have 3 replicas and they're not co-located with any > >> application server. Then assume that we have a partition that splits > >> them 2:1 so that one client sees 1 replica and assumes 2 failures and > >> vice versa. > > > > If your network was asymmetric in such a way, i.e. one client + 2 > > database on one subnet, and 2 clients + 1 databases on another subnet, > > you would still use the local="true|false", where local implies > > local to > > subnet - so this scenario can still be detected. > > Only the clients in the 2 database cluster would be allowed to work? Correct. > >> No you can't. Not unless you've got durability in the XAResource > >> proxy. Do you? If not, can you explain how you can return a > >> definitive > >> answer when the proxy fails? > > > > Yes - there is durability in both the XAResource and Connection > > proxies. > > The log for this is maintained on each client node, and also > > replicated > > via jgroups so that remote nodes can recover from partial commits > > instead of waiting for the crashed node to recover itself. > > When does the durability happen and what is recorded? Depends on the use case and the configured durability level. 
durability="fine":
* JTA, 2PC:
  - Record Xid and database id before prepare(Xid), per database
  - Record prepare response after prepare(Xid), per database
  - Record Xid and database id before commit(Xid, false) or rollback(Xid), per database
  - Record commit/rollback response per database
  - Clear above records after proxy processes commit/rollback responses
* JTA, 1PC:
  - Record Xid and database id before commit(Xid, true), per database
  - Record commit response per database
  - Clear above records after proxy processes commit responses
* non-JTA, autocommit=false:
  - Record local tx id and database id before commit(), per database
  - Record commit response per database
  - Clear above records after proxy processes commit responses
* non-JTA, autocommit=true:
  - Record local tx id and database id before statement execution, per database
  - Record statement execution response per database
  - Clear above records after proxy processes statement execution responses

durability="coarse":
* JTA, 2PC:
  - Record Xid before prepare(Xid)
  - Clear above records after proxy processes commit(Xid, false)/rollback(Xid) responses
* JTA, 1PC:
  - Record Xid before commit(Xid, true)
  - Clear record after proxy processes commit responses
* non-JTA, autocommit=false:
  - Record local tx id before commit()
  - Clear record after proxy processes commit responses
* non-JTA, autocommit=true:
  - Record local tx id before statement execution
  - Clear record after proxy processes statement execution responses

durability="none":
* Nothing recorded

> >> I may have more to say on this once we get over the question of > >> durability in your XAResource ;-) > > > > What other questions do you have about durability in HA-JDBC? > > I'll tell you once I hear where the durability happens and what it > saves. Hopefully, I've answered that already. Paul > >> I don't think we're quite finished yet ;-) > > > > Me neither. It occurs to me now that the current durability > > behavior is > > not yet sufficient for XA transactions, since the Xid is only recorded > > before commit/rollback. This will only detect crashes during > > commit/rollback, will not detect heuristic decisions made if the > > client > > node crashes after prepare() but *before* commit/rollback. To address > > this, I really ought to record Xids in prepare(). When the unfinished > > tx is detected (either locally after restart or remotely after jgroups > > membership change), I'll need to perform a XAResource.recover() + > > rollback() for each recorded Xid, and resolve and conflicting results. > > Anything else? > > > Mark. > > --- > Mark Little > ml...@re... > > JBoss, a Division of Red Hat > Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod > Street, Windsor, Berkshire, SI4 1TE, United Kingdom. > Registered in UK and Wales under Company Registration No. 3798903 > Directors: Michael Cunningham (USA), Charlie Peters (USA), Matt > Parsons (USA) and Brendan Lane (Ireland).
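To make the "fine" durability book-keeping above concrete, here is a minimal sketch of the kind of per-database record it implies. The class and method names are illustrative assumptions, not HA-JDBC's actual API, and the map is held in memory purely for brevity; as described in the thread, the real log lives on each client node and is replicated via JGroups in the distributed case.

import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

// Hypothetical sketch of "fine" durability records: one entry per (transaction, database),
// created before a phase is invoked and overwritten with the response once it arrives.
class FineDurabilityLog {
    enum Phase { PREPARE, COMMIT, ROLLBACK }

    // txId -> databaseId -> phase in flight, replaced by the response when it arrives
    private final Map<Object, Map<String, Object>> records = new ConcurrentHashMap<>();

    void beforeInvocation(Object txId, String databaseId, Phase phase) {
        records.computeIfAbsent(txId, k -> new ConcurrentHashMap<>()).put(databaseId, phase);
    }

    void afterInvocation(Object txId, String databaseId, Object response) {
        records.get(txId).put(databaseId, response);
    }

    void clear(Object txId) {   // once all per-database responses have been coordinated
        records.remove(txId);
    }

    // Databases that were invoked but never answered - candidates for recovery or deactivation
    Set<String> inFlight(Object txId) {
        return records.getOrDefault(txId, Map.of()).entrySet().stream()
                .filter(e -> e.getValue() instanceof Phase)
                .map(Map.Entry::getKey)
                .collect(Collectors.toSet());
    }
}

Roughly, the "coarse" level described above collapses this to a single record per transaction, keeping only the Xid or local tx id and dropping the per-database detail.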
|
From: Mark L. <ml...@re...> - 2009-03-12 15:53:28
|
Hi Paul. Here's my response for this quarter ;-) On 10 Feb 2009, at 22:05, Paul Ferraro wrote: > On Fri, 2009-01-30 at 12:20 +0000, Mark Little wrote: >> Hi Paul. Sorry for the delay in replying ... > > No worries. Thanks for the thought and time you've put into this > conversation so far. It's interesting. Although I think it would be a lot easier to thrash this out face-to-face: the intermittent responses mean I have a lot of context switching to do ;-) I'm sure the same is true for you. >> >> On the whole database or just on the data that needs updating? The >> former is obviously easier but leads to more false locking, whereas >> the latter can be harder to ensure but does allow for continuous use >> of the remaining database tables (cf page level locking versus object >> level locking.) >> >>> New transactions (both local and XA) >>> acquire a local read locks on begin and release on commit/rollback. >> >> Locking what? > > A single ReadWriteLock (per logical cluster) within the HA-JDBC > driver. OK, but is this on the entire data stored within the cluster instance? > > >>> The >>> acquisition of a distributed write lock effectively blocks any >>> database >>> writes across the whole cluster, while still allowing database >>> reads. >> >> OK, so this sounds like it's a lock on the database itself rather >> than >> any specific data items. However, it also seems like you're not using >> the traditional exclusive writer, concurrent readers. So how do you >> prevent dirty reads of data in that case? Are you assuming optimistic >> locking? > > I'm talking about a client-side RWLock, not a database lock. > Remember, > HA-JDBC is entirely client-side. Good point. I did remember that in some of the initial discussions. > The purpose of the lock is to isolate > transaction boundaries, so that the synchronization process (through > acquisition of the write side of this lock) can ensure that no > transactions are in progress (including local auto-commits). This is > the traditional single writer, concurrent readers - however, in this > case the "writer" is the synchronization process, and the readers are > transaction blocks. Do you have any diagrams or text on how this is supposed to work with multiple clients running on disparate VMs? Or maybe you assume the database will sort that out? > > >>> >>> The activation process is: >>> 1. Acquire distributed write lock >>> 2. Synchronize against an active database using a specified >>> synchronization strategy >>> 3. Broadcast an activation message to all client nodes >> >> I don't quite get the need for 3, but then this may have been covered >> before. Why doesn't the newly activated replica simply rejoin the >> group? Aren't we just using the groupid to communicate and not >> sending >> messages from clients to specific replicas, i.e., multicast versus >> multiple unicasts? (BTW, I'm assuming "client" here means database >> user.) > > Because the replica is not a group member. > A few points of architectural clarification... > I use the term "client node" to refer to a DatabaseCluster object (the > highest abstraction in HA-JDBC). This is the object to which all > end-user JDBC facades interact. The DatabaseCluster manages 1 or more > Database objects - a simple encapsulation for the connectivity details > for a specific database. 
The Database object comes in 4 flavors: > Driver, DataSource, ConnectionPoolDataSource, and XADataSource Yes, that much you can assume to be taken for granted ;-) > > When the client requests a connection from the HA-JDBC driver, HA-JDBC > DataSource, or app server DataSource backed by an HA-JDBC > ConnectionPoolDataSource or XADataSource, the object returned is a > dynamic proxy backed by the DatabaseCluster, containing a map of > Connections to each active database in the cluster. The Connection > proxy begets Statement proxies, etc. > One of the prime duties of the DatabaseCluster is managing the cluster > state, i.e. the set of active databases. This is represented as a set > of string identifiers. > In the distributed use case, i.e. where the HA-JDBC config file > defines > the cluster as <distributable/> (a la web.xml), the cluster state is > replicated all participating application servers via jgroups. The > jgroups membership is the set of app servers accessing a given HA-JDBC > cluster (not the replicas themselves). The activation process I > described above occurs on a given node of the application server > cluster. The activation message to which I referred tells each group > member to update its state, e.g. from [db1, db2] -> [db1, db2, db3] OK. One further clarification: in step 1 this is a lock on the entire cluster (all database instances) so any clients running anywhere else are blocked until the group view membership change has completed? >>> >> >> No, you misunderstand. If you are proxying as you mentioned, then >> you're effectively acting as a subordinate coordinator. Your >> XAResource is hiding 1..N different XAResource instances. In the case >> of the transaction manager, it sees only your XAResource. This means >> that if there are no other XAResources involved in the transaction, >> the coordinator is not going to drive your XAResource through 2PC: >> it'll call commit(true) and trigger the one-phase optimization. Now >> your XAResource knows that there are 1..N instances in reality that >> need coordinating. Let's assume it's 2..N because 1..1 is the >> degenerate case and is trivial to manage. This means that your >> XAResource is a subordinate coordinator and must run 2PC on behalf of >> the root coordinator. But unfortunately the route through the 2PC >> protocol allows for a number of different error conditions than the >> route through the 1PC termination protocol. So your XAResource will >> have to translate 2PC errors, for example, into 1PC (and sometimes >> that mapping is not one-to-one.) Do you handle that? > > I see what you're saying now - but I disagree with your assertion that > HA-JDBC's XAResource proxy "must run 2PC on behalf of the root > coordinator" in the single phase optimization scenario. > The point of 2PC is to ensure consensus that a unit of work can be > committed across multiple resources. I think you can assume I know the point of 2PC ;-) > In HA-JDBC, since the resources > manged by the proxy are presumed to be identical by definition, the > prepare phase is unnecessary since we expect the response to be > unanimous one way or the other. Therefore, the single phase > optimization is entirely appropriate. Nope, you're wrong. You can make all of the assumptions you want, but in reality they are worthless. Failures may happen for a number of reasons and not result in crashes. 
You cannot assume that because all of the replicas have the same initial state and receive the same set of messages in the same order that they will always come to the same final state. That's an assumption we make in many active replication protocols and it's fine in a very select environment. But throw in time, subatomic/high energy particles, and the 3rd law of thermodynamics, and it falls down. Which is why primary copy/ coordinator-cohort replication is the dominant replica consistency protocol in the world. Take a look at some of the papers I referenced in initial emails. Many of them talk about integrating transactions with active replication. If your approach were valid then we wouldn't need transactions at all, now would we ;-) ? > > >>> >>> >>>> As a slight aside: what do you do about non-deterministic SQL >>>> expressions such as may be involved with the current time? >>> >>> HA-JDBC can be configured to replace non-deterministic SQL >>> expressions >>> with client generated values. This behavior is enabled per >>> expression >>> type (e.g. eval-current-timestamp="true"): >>> http://ha-jdbc.sourceforge.net/doc.html#cluster >>> Of course, to avoid the cost of regular expression parsing, it's >>> best to >>> avoid these functions altogether. >> >> Yes, but if we want to opaquely replace JDBC with HA-JDBC then we >> need >> to assume that the client won't be doing anything to help us. > > The client never needs to provide runtime hints on behalf of HA-JDBC. > Non-deterministic expression evaluation is enabled/disabled in the > HA-JDBC configuration file. How does it know which statements to look for? >>> >> >> What are the performance implications so far and with your future >> plans? > > Currently, a given "next value for sequence" requires a jgroups rpc > call > to acquire an exclusive lock (per sequence) on each client node, and > an > subsequent rpc call to release the lock. > The future approach doesn't improve performance (makes it slightly > worse > actually, by requiring an additional SQL statement), but rather > reliability, since the first approach doesn't verify that the next > sequence value was actually the same on each database, but trusts that > it is, given its exclusive access. As I write this, it occurs to me > that Statement.getGeneratedKeys() can still be leveraged to verify > consistency without requiring separate ALTER SEQUENCE statements on > the > "slave" databases. So at this point, I need to think about this more. OK. I'm interested. >>> >> >> This should mean more than deactivating the database. If one resource >> thinks that there was no work done on the data it controls whereas >> the >> other does, that's a serious issue and you should rollback! commit >> isn't always the right way forward! This is different to some of the >> cases below because there are no errors reported from the resource. >> As >> such the problem is at the application level, or somewhere in the >> middleware stack. That fact alone means that anything the committable >> resource has done is suspect immediately. > > Good point. #3 is really an exception condition, since it violates > HA-JDBC's assumption that all databases are identical. See what I meant above ;-) ? > There should > really be a more flexible mechanism for handling this. Perhaps a > configuration option to determine how to HA-JDBC should resolve > conflicts. 
Possible values for this would be: > * optimistic: Assume that user transaction expects success (current > behavior) > * pessimistic: Rollback if possible, otherwise halt cluster and log > conflict. > * paranoid: Halt cluster, log conflict. This would worry me if I were a client. "I'm using transactions for a reason. Please don't break the ACID semantics without informing me." ;) > > * chain-of-command: Assume the result of the 1st available database is > the authoritative response - deactivate any deviants Have you looked at Available Copies replication. > > > A "chain-of-command" strategy is really a more appropriate default for > HA-JDBC, as opposed to the current optimistic behavior. Maybe. > > >>> >>> 4 XA_OK [XA_RB*] XA_OK, rollback and deactivate db2 >>> 5 XA_RDONLY [XA_RB*] XA_RDONLY, rollback and deactivate db2 >>> 6 [XA_RB*] [XA_RB*] [XA_RB*] (only if error code matches) >>> 7 [XA_RBBASE] [XA_RBEND] [XA_RBEND] (prefer higher error code), >>> rollback and deactivate db1 >>> 8 XA_OK [XAER_*] XA_OK, deactivate db2 >>> 9 XA_RDONLY [XAER_*] XA_RDONLY, deactivate db2 >>> 10 [XA_RB*] [XAER_*] [XA_RB*], deactivate db2 >>> 11 [XAER_*] [XAER_*] [XAER_*], chain exceptions if errors not the >>> same >> >> Hmmm, well that's not good either, because XAER_* errors can mean >> very >> different things (in the standard as well as in the implementations). >> So simply assuming that all of the databases are still in a position >> where they can be treated as a single instance is not necessarily a >> good idea. > > Good point. If the XAER_* error codes are not the same, I should > respond using the configured conflict resolution strategy. > Optimistic strategy would keep the database(s) with the highest error > code (i.e. least severe error), and deactivate the others. > Pessimistic/Paranoid strategy would halt the cluster. > Chain-of-command strategy would return the XAER_* from db1 and > deactivate db2. Yes. > > >> I understand all of what you're trying to do (it's just using the >> available copies replication protocol, which has been around for >> several decades.) But even that required a level of durability at the >> recipient in order to deal with failures between it and the client. >> The recipient in this case is your XAResource proxy. > > This durability does exists both in the XAResource proxy and the > Connection proxy (for non-JTA transactions) - see the section on > protected commit mode. Isn't that sufficient? Not if you're not running 2PC between the XAResource instances within the proxy. Consider failures of the proxy as well as failures of the participants. >> > > If forget fails, the error is logged and db2 is deactivated. > If the proxy crashes, then we follow the "protected-commit" logic > discussed earlier. Can you take me through a scenario where you have a proxy crash while interacting with the participants? > >>> >>> 6 void [XAER_*] void, deactivate db2 >>> 7 [XA_HEURCOM] [XA_HEURCOM] [XA_HEURCOM] >>> 8 [XA_HEURCOM] [XA_HEURRB] [XA_HEURCOM], forget and deactivate db2 >> >>> >>> 9 [XA_HEURCOM] [XA_HEURMIX] [XA_HEURCOM], forget and deactivate >>> db2 >> >> This is more concerning because you've got a mixed heuristic outcome >> which means that this resource was probably a subordinate coordinator >> itself or modifying multiple entries in the database. That'll make >> automatic recovery interesting unless you expect to copy one entire >> database image over the top of another? 
> > During synchronization prior to database activation, all prepared and > heuristically completed branches on the target database are forgotten. But you can't just forget autonomously. Have you looked at how heuristic resolution works in the real world? If we could simply call forget as soon as we get a heuristic then it's be easy ;-) Unfortunately heuristic resolution often requires semantic information about the application and data, which typically means manual intervention. A heuristic commit is not the same as a commit, even if commit was called by the application. Likewise for a heuristic rollback. > > Synchronization will not occur (i.e. not be allowed to start, since it > won't be able to acquire the DatabaseCluster's write lock described > earlier) if the source database has any heuristically completed > branches. See above. >>> >> >> Erm, not true as I mentioned above. If your XAResource proxy is >> driven >> through 1PC and there is more than one database then you cannot >> simply >> drive them through 1PC - you must run 2PC and act as a subordinate. >> If >> you do not then you are risking more heuristic outcomes and still >> your >> XAResource proxy will require durability. > > As I replied earlier - it's OK, I think, to drive the databases > through 1PC. > And, yes, durability does exist in XAResource proxy. I disagree. Maybe we need a whiteboard ;-) >>> >> >> Nope. And this is what concerns me. You need to understand what >> heuristics are about and why they have been developed over the past 4 >> decades. > > I thought the whole point of making a heuristic decision was to > improve > resource availability by allow the resource to proceed with the second > phase of a transaction branch according to the its response to first > phase, in the event that the transaction manager fails to initiate the > second phase within some pre-determined time - so that resources are > not > blocked forever. What other motivations are there? No. A heuristic decision doesn't have to match the original prepare response. That's why commit can throw heuristic rollback (and that's not just from commit-one-phase). The traditional (strict) 2PC is a blocking protocol. Heuristics were introduced to cater for coordinator failure originally. A participant that has prepared can make any autonomous choice it wants to after prepare. So it can rollback just as easily as it can commit. There are no rules apart from the fact that the participant must remember that decision durably until told to forget. > > >> The reason that transaction managers don't ignore them and >> try to "make things right" (or hide them) is because we don't have >> the >> necessary semantic information to resolve them automatically. > > On the contrary, HA-JDBC does have one vital piece of information that > the transaction manager does not - the results from other identical > resources to the same action. On the assumption that each of the replicas has identical state, correct? > If one database made a heuristic decision > to commit a transaction, while another database returns a successful > commit (scenario #2), then I think HA-JDBC *does* have sufficient > information to forget the heuristically completed branch and return > normally. No because you do not know what else happened to cause the heuristic. Tell me: what do you think happens to generate a heuristic outcome? What's going on at the participant? What about other clients that may be looking at the same database? Have you read up on cascading rollbacks as well? 
> A transaction manager could not safely make this decision. A transaction manager with replication support cannot safely make that decision either ;-) > > Likewise, if one database made a heuristic decision to rollback a > transaction, while another database successfully commit (scenario #3), > then I think HA-JDBC has sufficient information to conclude that the > heuristic decision was incorrect and can forget the heuristically > completed branch on the first database, deactivate the first database, > and return normally. Nope. What about if one participant says heuristic-commit, another heuristic- rollback, another heuristic-mixed, another heuristic-hazard and another commit? What's the state (and it's not commit BTW.) > > >> You're >> saying that you mask a heuristic database and take it out of the >> group >> so it won't cause problems. Later it'll be fixed by having the state >> overwritten by some good copy. But you are ignoring the fact that you >> do not have recovery capabilities built in to your XAResource as far >> as I can tell and are assuming that calling forget and/or >> deactivating >> the database will limit the problem. > > What recovery capabilities do you think are missing? Knowing what caused the heuristic would be a good place to start. Then knowing what else has been happening from the perspective of other clients/users in the system. Maybe this does come down to some fundamental assumptions that you're making about the way in which the replicated database will be used. However, if you are assuming that HA-JDBC is a drop-in replacement for something like Oracle RAC or even just a single database instance, then I think your assumptions are flawed. > > >> In reality, particularly with >> heuristics that cause the resource to commit or roll back, there may >> be systems and services tied in to those database at the back end >> which are triggered on commit/rollback and go and do something else >> (e.g., on rollback trigger the sending of a bad credit report on the >> user). > > This is not a valid use case for HA-JDBC. Ah, OK. So what are your assumptions. > > One of HA-JDBC's requirements is that all access to the databases of > the > cluster go through the HA-JDBC driver. But what about concurrent clients on *different* VMs. Today they'd all use different JDBC driver instances. If they were running within transactions then we'd get data consistency. Heuristic outcomes would be resolved by sys admins (typically). What's the equivalent? > A trigger like this is > unsupported since it would: > a) have to exist on every database - and would result it N bad credit > reports - BAD. > b) Given that any database can very well become inactive at any time, > you cannot rely on a trigger on a single database cluster member. > Unfortunate, yes, but there's only so much I can support from the > client > side. OK, so as long as we are clear on the use cases that can be supported we can make some progress. I'm beginning to think that we may have come at this discussion from different perspectives ;-) > > >> In the real world, the TM would propagate the heuristic back to >> the end user and the sys admin would then go and either stop the >> credit report from being sent, or nullify it someone. > > Like I said earlier, because HA-JDBC also has the results from other > identical resources available, it can reasonably determine the > correctness of a heuristic decision. That's still not possible though ;-) Your heuristic conflict/resolution matrix is flawed. 
> > >> How can he do >> that? Because he understands the semantics of the operation that was >> being executed. Does simply making the table entry the same as in one >> of the copies that didn't throw the heuristic help here? No of course >> it doesn't. It's just part of the solution (and if you were the >> person >> with the bad credit report, I think you'd want the whole compensation >> not just a part.) >> >>> Before the resync - we >>> recover and forget any prepared/heuristically completed branches. >> >> See above. It doesn't help. Believe me: I've been in your position >> before, back in the early 1990's ;-) > > I'll need to read your responses to the above before I'm convinced. ;-) > > >>> If >>> the XAResource proxy itself throws an exception indicating an >>> invalid >>> heuristic decision, only then do we let on to the transaction >>> manager. >>> In this case, we are forced to resolve the heuristic, either >>> automatically (by the transaction manager) >> >> The transaction manager will not do this automatically in most cases. > > That's fine. The main side effect of an unresolved heuristic decision > is the inability to activate/synchronize any inactive databases. The whole database or just the affected table? >>> >>> Transaction recovery should work fine. >> >> Confidence ;-) >> >> Have you tried it? > > No, actually. The durability logic is still in a development branch > and > has yet to be tested thoroughly. That's the 20% of code that takes 80% of the time. > > >> We have a very thorough test suite for JBossTS >> that's been built up over the past 12 years (in Java). More than 50% >> of it is aimed at crash recovery tests, so it would be good to try it >> out. > > This would be an invaluable resource. Check out the DTF. It's an open source project on labs. >>> >> >> OK, so I'm even more confused than I was given what you said way back >> (that made me ask the unicast/multicast question). > > Oh - sorry - I misunderstood your original question... > > By "group-like abstraction for the servers", you meant: does the HA- > JDBC > driver use multicast for statement execution? No. OK. >>> >> >> Well I'm more concerned with the case where we have concurrent >> clients >> who see different replica groups because of network partitioning. So >> let's assume we have 3 replicas and they're not co-located with any >> application server. Then assume that we have a partition that splits >> them 2:1 so that one client sees 1 replica and assumes 2 failures and >> vice versa. > > If your network was asymmetric in such a way, i.e. one client + 2 > database on one subnet, and 2 clients + 1 databases on another subnet, > you would still use the local="true|false", where local implies > local to > subnet - so this scenario can still be detected. Only the clients in the 2 database cluster would be allowed to work? >>> >> >> No you can't. Not unless you've got durability in the XAResource >> proxy. Do you? If not, can you explain how you can return a >> definitive >> answer when the proxy fails? > > Yes - there is durability in both the XAResource and Connection > proxies. > The log for this is maintained on each client node, and also > replicated > via jgroups so that remote nodes can recover from partial commits > instead of waiting for the crashed node to recover itself. When does the durability happen and what is recorded? 
>>> >> >> I may have more to say on this once we get over the question of >> durability in your XAResource ;-) > > What other questions do you have about durability in HA-JDBC? I'll tell you once I hear where the durability happens and what it saves. >>> >>> >> >> I don't think we're quite finished yet ;-) > > Me neither. It occurs to me now that the current durability > behavior is > not yet sufficient for XA transactions, since the Xid is only recorded > before commit/rollback. This will only detect crashes during > commit/rollback, will not detect heuristic decisions made if the > client > node crashes after prepare() but *before* commit/rollback. To address > this, I really ought to record Xids in prepare(). When the unfinished > tx is detected (either locally after restart or remotely after jgroups > membership change), I'll need to perform a XAResource.recover() + > rollback() for each recorded Xid, and resolve and conflicting results. > Anything else? Mark. --- Mark Little ml...@re... JBoss, a Division of Red Hat Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod Street, Windsor, Berkshire, SI4 1TE, United Kingdom. Registered in UK and Wales under Company Registration No. 3798903 Directors: Michael Cunningham (USA), Charlie Peters (USA), Matt Parsons (USA) and Brendan Lane (Ireland). |
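For the recovery step proposed in the quoted text above (record Xids at prepare, then recover() and roll back whatever is still in doubt once an unfinished transaction is detected), a minimal sketch against the standard javax.transaction.xa API follows. The surrounding pieces - the recorded Xid set, the per-database XAResource map, and the deactivate callback - are assumptions made for the example, not HA-JDBC code.

import javax.transaction.xa.XAException;
import javax.transaction.xa.XAResource;
import javax.transaction.xa.Xid;
import java.util.Arrays;
import java.util.Map;
import java.util.Set;
import java.util.function.Consumer;

class PreparedBranchRecovery {
    // Roll back, on every database, any branch that was recorded at prepare() time but
    // never completed (e.g. the client node crashed before the second phase). Databases
    // whose recovery fails are deactivated rather than left out of step.
    void recoverUnfinished(Map<String, XAResource> resources, Set<Xid> recordedXids,
                           Consumer<String> deactivate) {
        for (Map.Entry<String, XAResource> entry : resources.entrySet()) {
            String databaseId = entry.getKey();
            XAResource xar = entry.getValue();
            try {
                // Ask the resource manager which branches it still holds in doubt
                Xid[] inDoubt = xar.recover(XAResource.TMSTARTRSCAN | XAResource.TMENDRSCAN);
                for (Xid xid : inDoubt) {
                    if (containsXid(recordedXids, xid)) {
                        xar.rollback(xid);       // resolve the branch we recorded at prepare()
                    }
                }
            } catch (XAException e) {
                deactivate.accept(databaseId);   // conflicting or failed recovery: drop the database
            }
        }
    }

    // Xid implementations rarely define equals(); compare the three components instead
    private boolean containsXid(Set<Xid> xids, Xid candidate) {
        return xids.stream().anyMatch(x ->
                x.getFormatId() == candidate.getFormatId()
                && Arrays.equals(x.getGlobalTransactionId(), candidate.getGlobalTransactionId())
                && Arrays.equals(x.getBranchQualifier(), candidate.getBranchQualifier()));
    }
}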
|
From: Paul F. <pau...@re...> - 2009-02-10 22:46:38
|
On Fri, 2009-01-30 at 12:20 +0000, Mark Little wrote: > Hi Paul. Sorry for the delay in replying ... No worries. Thanks for the thought and time you've put into this conversation so far. > On 9 Dec 2008, at 18:33, Paul Ferraro wrote: > > >> > >> OK, so failures are agreed across all clients and all members of the > >> replica group? What's the protocol for rejoining the group? Does > >> there > >> have to be a quiescent period, for instance? > > > > Database deactivations (i.e., the action taken in response to singular > > failures) are broadcast to all client nodes. This is not so much an > > agreement as it is a decree - it cannot be vetoed. A replica group > > member (i.e. a database) has no knowledge of the clients, nor other > > replica group members. > > Rejoining the replica group (i.e. for a database to become active > > again), does indeed require a quiescent period. This is enforced by a > > distributed read-write lock. > > On the whole database or just on the data that needs updating? The > former is obviously easier but leads to more false locking, whereas > the latter can be harder to ensure but does allow for continuous use > of the remaining database tables (cf page level locking versus object > level locking.) > > > New transactions (both local and XA) > > acquire a local read locks on begin and release on commit/rollback. > > Locking what? A single ReadWriteLock (per logical cluster) within the HA-JDBC driver. > > The > > acquisition of a distributed write lock effectively blocks any > > database > > writes across the whole cluster, while still allowing database reads. > > OK, so this sounds like it's a lock on the database itself rather than > any specific data items. However, it also seems like you're not using > the traditional exclusive writer, concurrent readers. So how do you > prevent dirty reads of data in that case? Are you assuming optimistic > locking? I'm talking about a client-side RWLock, not a database lock. Remember, HA-JDBC is entirely client-side. The purpose of the lock is to isolate transaction boundaries, so that the synchronization process (through acquisition of the write side of this lock) can ensure that no transactions are in progress (including local auto-commits). This is the traditional single writer, concurrent readers - however, in this case the "writer" is the synchronization process, and the readers are transaction blocks. > > > > The activation process is: > > 1. Acquire distributed write lock > > 2. Synchronize against an active database using a specified > > synchronization strategy > > 3. Broadcast an activation message to all client nodes > > I don't quite get the need for 3, but then this may have been covered > before. Why doesn't the newly activated replica simply rejoin the > group? Aren't we just using the groupid to communicate and not sending > messages from clients to specific replicas, i.e., multicast versus > multiple unicasts? (BTW, I'm assuming "client" here means database > user.) Because the replica is not a group member. A few points of architectural clarification... I use the term "client node" to refer to a DatabaseCluster object (the highest abstraction in HA-JDBC). This is the object to which all end-user JDBC facades interact. The DatabaseCluster manages 1 or more Database objects - a simple encapsulation for the connectivity details for a specific database. 
The Database object comes in 4 flavors: Driver, DataSource, ConnectionPoolDataSource, and XADataSource When the client requests a connection from the HA-JDBC driver, HA-JDBC DataSource, or app server DataSource backed by an HA-JDBC ConnectionPoolDataSource or XADataSource, the object returned is a dynamic proxy backed by the DatabaseCluster, containing a map of Connections to each active database in the cluster. The Connection proxy begets Statement proxies, etc. One of the prime duties of the DatabaseCluster is managing the cluster state, i.e. the set of active databases. This is represented as a set of string identifiers. In the distributed use case, i.e. where the HA-JDBC config file defines the cluster as <distributable/> (a la web.xml), the cluster state is replicated all participating application servers via jgroups. The jgroups membership is the set of app servers accessing a given HA-JDBC cluster (not the replicas themselves). The activation process I described above occurs on a given node of the application server cluster. The activation message to which I referred tells each group member to update its state, e.g. from [db1, db2] -> [db1, db2, db3] > > > > 4. Release distributed write lock > > > > N.B. Suspended transactions will block the activation process, since a > > newly joined replica will have missed the prepare phase. > > Heuristically completed transactions (as defined by the XAResource > > proxy > > (discussed below) will also block the activation process. > > OK, that sounds good. > > >>> > >>> This is not really an issue in HA-JDBC. In the XA scenario, HA-JDBC > >>> exposes a single XAResource for the whole database cluster to the > >>> transaction manager, while internally it manages an XAResource per > >>> database. > >> > >> OK, so this single XAResource acts like a transaction coordinator > >> then? With a log? > > > > Nope. The XAResource HA-JDBC exposes to the transaction manager is > > merely a proxy to a set of underlying XAResources (one per database) > > that implements HA behavior. It has no log. I'll go into detail > > about > > how this HA logic handles heuristic decisions later... > > OK, I'll wait ;-) > > > > > > >> Plus, the TM will not necessarily go through 2PC, > >> but may in fact use a 1PC protocol. The problem here is that there > >> are > >> actually multiple resources registered (c.f. interposition or > >> subordinate coordinators). The failures that can be returned from a > >> 1PC interaction on an XAResource are actually different than those > >> that can be returned during the full 2PC. What this means is that if > >> you run into these failures behind your XAResource veneer, you will > >> be > >> incapable of giving the right information (right error codes and/or > >> exceptions) back to the transaction manager and hence to the > >> application. > > > > 1PC is any different from the perspective of HA-JDBC's XAResource > > proxy. > > For a given method invocation, HA-JDBC's XAResource proxy coordinates > > the results from the individual XAResources it manages, deactivating > > the > > corresponding database if it responds anomalously (including > > exceptions > > with anomalous error codes). Yes - 1PC commits can throw additional > > XA_RB* exceptions, but the corresponding HA logic is algorithmically > > the > > same. > > No, you misunderstand. If you are proxying as you mentioned, then > you're effectively acting as a subordinate coordinator. Your > XAResource is hiding 1..N different XAResource instances. 
In the case > of the transaction manager, it sees only your XAResource. This means > that if there are no other XAResources involved in the transaction, > the coordinator is not going to drive your XAResource through 2PC: > it'll call commit(true) and trigger the one-phase optimization. Now > your XAResource knows that there are 1..N instances in reality that > need coordinating. Let's assume it's 2..N because 1..1 is the > degenerate case and is trivial to manage. This means that your > XAResource is a subordinate coordinator and must run 2PC on behalf of > the root coordinator. But unfortunately the route through the 2PC > protocol allows for a number of different error conditions than the > route through the 1PC termination protocol. So your XAResource will > have to translate 2PC errors, for example, into 1PC (and sometimes > that mapping is not one-to-one.) Do you handle that? I see what you're saying now - but I disagree with your assertion that HA-JDBC's XAResource proxy "must run 2PC on behalf of the root coordinator" in the single phase optimization scenario. The point of 2PC is to ensure consensus that a unit of work can be committed across multiple resources. In HA-JDBC, since the resources manged by the proxy are presumed to be identical by definition, the prepare phase is unnecessary since we expect the response to be unanimous one way or the other. Therefore, the single phase optimization is entirely appropriate. > > > > > >> As a slight aside: what do you do about non-deterministic SQL > >> expressions such as may be involved with the current time? > > > > HA-JDBC can be configured to replace non-deterministic SQL expressions > > with client generated values. This behavior is enabled per expression > > type (e.g. eval-current-timestamp="true"): > > http://ha-jdbc.sourceforge.net/doc.html#cluster > > Of course, to avoid the cost of regular expression parsing, it's > > best to > > avoid these functions altogether. > > Yes, but if we want to opaquely replace JDBC with HA-JDBC then we need > to assume that the client won't be doing anything to help us. The client never needs to provide runtime hints on behalf of HA-JDBC. Non-deterministic expression evaluation is enabled/disabled in the HA-JDBC configuration file. > > > > > >> Plus have > >> you thought about serializability or linearizability issues across > >> the > >> replicas when multiple clients are involved? > > > > Yes. The biggest headaches are sequences and identity columns. > > Currently, sequences next value functions and inserts into tables > > containing identity columns require distributed synchronous access > > (per > > sequence name, or per table name). This is very cumbersome, so I > > generally encourage HA-JDBC users to use alternative, more > > cluster-friendly identifier generation mechanisms. The cost for > > sequences can be mitigated somewhat using a hi-lo algorithm (e.g. > > SequenceHiLoGenerator in Hibernate). > > For databases where DatabaseMetaData.supportsGetGeneratedKeys() = > > true, > > I plan to implement a master-slave approach, where the sequence next > > value is executed against the master database, and alter sequence > > statement are executed against the slaves. This is a cleaner than the > > current approach, though vendor support for > > Statement.getGeneratedKeys() > > is still fairly sparse. > > What are the performance implications so far and with your future plans? 
Currently, a given "next value for sequence" requires a jgroups rpc call to acquire an exclusive lock (per sequence) on each client node, and an subsequent rpc call to release the lock. The future approach doesn't improve performance (makes it slightly worse actually, by requiring an additional SQL statement), but rather reliability, since the first approach doesn't verify that the next sequence value was actually the same on each database, but trusts that it is, given its exclusive access. As I write this, it occurs to me that Statement.getGeneratedKeys() can still be leveraged to verify consistency without requiring separate ALTER SEQUENCE statements on the "slave" databases. So at this point, I need to think about this more. > >>> Consequently, HA-JDBC can tolerate asymmetric failures in any > >>> phase of a transaction (i.e. a phase fails against some databases, > >>> but > >>> succeeds against others) without forcing the whole transaction to > >>> fail, > >>> or become in-doubt, by pessimistically deactivating the failed > >>> databases. > >> > >> I'm not so convinced that the impact on the transaction is so opaque, > >> as I mentioned above. Obviously what we need to do is make a replica > >> group appear to be identical to a single instance only more > >> available. > >> Simply gluing together multiple instances behind a single XAResource > >> abstraction doesn't do it. Plus what about transaction recovery? By > >> hiding the fact that some replicas may have generated heuristic > >> outcomes whereas others haven't (because of failures), the > >> transaction > >> manager may go ahead and delete the log because it thinks that the > >> transaction has completed successfully. That would be a very bad > >> thing > >> to do (think of crossing the streams in Ghostbusters ;-) Simply > >> overwriting the failed heuristic replica db with a fresh copy from a > >> successful db replica would also be a very bad thing to do (think > >> letting the Keymaster and Gozer meet in Ghostbusters) because > >> heuristics need to be resolved by the sys admin and the transaction > >> log should not vanish until that has happened. So some kinds of > >> failures of dbs need to be treated differently than just deactivating > >> them. Can you cope with that? > > > > Perhaps I should go into more detail about what happens behind the > > XAResource proxy. One of the primary responsibilities of the > > XAResource > > proxy is to coordinate the return values and/or exceptions from a > > given > > method invocation - and present them to the caller (in this case, the > > transaction manager) as a single return value or exception. > > Yes, I kind of figured that, since it's a subordinate coordinator (see > above). But if you're not recording these outcomes durably then your > proxy is going to cause problems itself for transaction integrity. Or > can you explain to me what happens in the situation where the proxy > has obtained some outcomes from the 1..N XAResources but not all and > then it crashes? I've already described this (i.e. protected commit mode) at some other point in the thread. We record the result of the commits for each database as it completes. In the single client node case (i.e. non-distributed), this situation (i.e. partial commit completion) is detected when HA-JDBC restarts. In the distributed case, the commit results are also replicated to remote client nodes via jgroups. The remote nodes will detect the incomplete commits when the they detect the group membership change. 
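A small sketch of the detection step just described, i.e. spotting incomplete commits when the group membership changes. Only the idea of checking replicated commit records on a view change comes from the text; the record shape, the member-set argument, and the repair callback are illustrative assumptions.

import java.util.Collection;
import java.util.Set;
import java.util.function.Consumer;

class PartialCommitDetector {
    // One replicated record per in-flight commit: which client node owns it and
    // which databases have already acknowledged their commit.
    record CommitRecord(String ownerNode, Object txId,
                        Set<String> invokedDatabases, Set<String> committedDatabases) { }

    // Invoked from the group membership callback (e.g. a JGroups view change).
    // Any record owned by a node that has left the group, whose commit did not
    // complete on every database, is handed to the repair action (finish the
    // commit, or deactivate the databases that are now out of step).
    void onMembershipChange(Collection<CommitRecord> replicatedRecords,
                            Set<String> currentMembers,
                            Consumer<CommitRecord> repair) {
        for (CommitRecord r : replicatedRecords) {
            boolean ownerGone = !currentMembers.contains(r.ownerNode());
            boolean incomplete = !r.committedDatabases().containsAll(r.invokedDatabases());
            if (ownerGone && incomplete) {
                repair.accept(r);
            }
        }
    }
}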
> > Let's look at a couple of result matrices of return values and error > > codes for each XAResource for a replica group of 2 databases and the > > coordinated result from the HA-JDBC XAResource proxy: > > N.B. I've numbered each scenario for discussion purposes. > > I've used wildcard (*) in some error codes for brevity > > Exception codes are distinguished from return values using [] > > > > prepare(Xid): > > > > REF DB1 DB2 HA-JDBC > > 1 XA_OK XA_OK XA_OK > > 2 XA_RDONLY XA_RDONLY XA_RDONLY > > 3 XA_OK XA_RDONLY XA_OK, deactivate db2 > > This should mean more than deactivating the database. If one resource > thinks that there was no work done on the data it controls whereas the > other does, that's a serious issue and you should rollback! commit > isn't always the right way forward! This is different to some of the > cases below because there are no errors reported from the resource. As > such the problem is at the application level, or somewhere in the > middleware stack. That fact alone means that anything the committable > resource has done is suspect immediately. Good point. #3 is really an exception condition, since it violates HA-JDBC's assumption that all databases are identical. There should really be a more flexible mechanism for handling this. Perhaps a configuration option to determine how to HA-JDBC should resolve conflicts. Possible values for this would be: * optimistic: Assume that user transaction expects success (current behavior) * pessimistic: Rollback if possible, otherwise halt cluster and log conflict. * paranoid: Halt cluster, log conflict. * chain-of-command: Assume the result of the 1st available database is the authoritative response - deactivate any deviants A "chain-of-command" strategy is really a more appropriate default for HA-JDBC, as opposed to the current optimistic behavior. > > > > 4 XA_OK [XA_RB*] XA_OK, rollback and deactivate db2 > > 5 XA_RDONLY [XA_RB*] XA_RDONLY, rollback and deactivate db2 > > 6 [XA_RB*] [XA_RB*] [XA_RB*] (only if error code matches) > > 7 [XA_RBBASE] [XA_RBEND] [XA_RBEND] (prefer higher error code), > > rollback and deactivate db1 > > 8 XA_OK [XAER_*] XA_OK, deactivate db2 > > 9 XA_RDONLY [XAER_*] XA_RDONLY, deactivate db2 > > 10 [XA_RB*] [XAER_*] [XA_RB*], deactivate db2 > > 11 [XAER_*] [XAER_*] [XAER_*], chain exceptions if errors not the > > same > > Hmmm, well that's not good either, because XAER_* errors can mean very > different things (in the standard as well as in the implementations). > So simply assuming that all of the databases are still in a position > where they can be treated as a single instance is not necessarily a > good idea. Good point. If the XAER_* error codes are not the same, I should respond using the configured conflict resolution strategy. Optimistic strategy would keep the database(s) with the highest error code (i.e. least severe error), and deactivate the others. Pessimistic/Paranoid strategy would halt the cluster. Chain-of-command strategy would return the XAER_* from db1 and deactivate db2. > I understand all of what you're trying to do (it's just using the > available copies replication protocol, which has been around for > several decades.) But even that required a level of durability at the > recipient in order to deal with failures between it and the client. > The recipient in this case is your XAResource proxy. This durability does exists both in the XAResource proxy and the Connection proxy (for non-JTA transactions) - see the section on protected commit mode. Isn't that sufficient? 
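To illustrate the conflict resolution strategies sketched in this message, here is a hypothetical fragment showing how divergent per-database error codes might be resolved. The interface, the cluster callback, and the exception are assumptions for the example; only the optimistic rule (keep the highest, i.e. least severe, error code and deactivate the rest) and the paranoid rule (halt the cluster for manual resolution) are taken from the discussion.

import java.util.Comparator;
import java.util.Map;

// Hypothetical conflict resolution over per-database XA error codes.
interface ClusterControl {
    void deactivate(String databaseId);  // drop a database from the active set
    void halt();                         // stop the cluster, require manual resolution
}

interface ConflictResolutionStrategy {
    // resultsByDatabase maps databaseId -> the XA error code that database returned;
    // the return value is the single code reported back to the transaction manager.
    int resolve(Map<String, Integer> resultsByDatabase, ClusterControl cluster);
}

class OptimisticStrategy implements ConflictResolutionStrategy {
    public int resolve(Map<String, Integer> resultsByDatabase, ClusterControl cluster) {
        // Keep the highest (least severe) code, deactivate every database that deviated.
        int best = resultsByDatabase.values().stream().max(Comparator.naturalOrder()).orElseThrow();
        resultsByDatabase.forEach((db, code) -> {
            if (code != best) {
                cluster.deactivate(db);
            }
        });
        return best;
    }
}

class ParanoidStrategy implements ConflictResolutionStrategy {
    public int resolve(Map<String, Integer> resultsByDatabase, ClusterControl cluster) {
        cluster.halt();  // conflicting results: stop and wait for a human
        throw new IllegalStateException("Conflicting XA results: " + resultsByDatabase);
    }
}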
> > > > > > In summary, when the return values/exceptions from each database do > > not > > match, HA-JDBC makes a heuristic decision of it's own. HA-JDBC makes > > the reasonable assumption that the JDBC client (i.e. application) > > expects a given database transaction to succeed, therefore, an XA_OK > > is > > preferable to a rollback request (i.e. XA_RB* exception). > > +1 > > > > > > > commit(Xid, false): > > > > REF DB1 DB2 HA-JDBC > > 1 void void void > > 2 void [XA_HEURCOM] void, forget db2 > > What happens if the call to forget fails, or your proxy crashes before > it can be made? If forget fails, the error is logged and db2 is deactivated. If the proxy crashes, then we follow the "protected-commit" logic discussed earlier. > > > > > 3 void [XA_HEURRB] void, forget and deactivate db2 > > 4 void [XA_HEURMIX] void, forget and deactivate db2 > > 5 void [XA_HEURHAZ] void, forget and deactivate db2 > > Same question as for 2. If forget fails, we log the failure and proceed to deactivate db2. > > > > 6 void [XAER_*] void, deactivate db2 > > 7 [XA_HEURCOM] [XA_HEURCOM] [XA_HEURCOM] > > 8 [XA_HEURCOM] [XA_HEURRB] [XA_HEURCOM], forget and deactivate db2 > > > > > 9 [XA_HEURCOM] [XA_HEURMIX] [XA_HEURCOM], forget and deactivate db2 > > This is more concerning because you've got a mixed heuristic outcome > which means that this resource was probably a subordinate coordinator > itself or modifying multiple entries in the database. That'll make > automatic recovery interesting unless you expect to copy one entire > database image over the top of another? During synchronization prior to database activation, all prepared and heuristically completed branches on the target database are forgotten. Synchronization will not occur (i.e. not be allowed to start, since it won't be able to acquire the DatabaseCluster's write lock described earlier) if the source database has any heuristically completed branches. > > > > 10 [XA_HEURCOM] [XA_HEURHAZ] [XA_HEURCOM], forget and deactivate db2 > > As above, though this time we don't know what state db2 is in. > > > > > 11 [XA_HEURCOM] [XAER_*] [XA_HEURCOM], deactivate db2 > > 12 [XA_HEURRB] [XA_HEURRB] [XA_HEURRB] > > 13 [XA_HEURRB] [XA_HEURMIX] [XA_HEURRB], forget and deactivate db2 > > 14 [XA_HEURRB] [XA_HEURHAZ] [XA_HEURRB], forget and deactivate db2 > > 15 [XA_HEURRB] [XAER_*] [XA_HEURRB], deactivate db2 > > 16 [XA_HEURMIX] [XA_HEURMIX] [XA_HEURMIX], forget and deactivate > > all but one db > > 17 [XA_HEURMIX] [XA_HEURHAZ] [XA_HEURMIX], forget and deactivate db2 > > 18 [XA_HEURMIX] [XAER_*] [XA_HEURMIX], deactivate db2 > > 19 [XA_HEURHAZ] [XA_HEURHAZ] [XA_HEURHAZ]. forget and deactivate > > all but one db > > 20 [XA_HEURHAZ] [XAER_*] [XA_HEURHAZ], deactivate db2 > > 21 [XAER_*] [XAER_*] [XAER_*], chain exceptions if errors not the > > same > > > > commit(Xid, true): > > N.B. There are no heuristic outcomes in the 1PC case, since there is > > no > > prepare. > > Erm, not true as I mentioned above. If your XAResource proxy is driven > through 1PC and there is more than one database then you cannot simply > drive them through 1PC - you must run 2PC and act as a subordinate. If > you do not then you are risking more heuristic outcomes and still your > XAResource proxy will require durability. As I replied earlier - it's OK, I think, to drive the databases through 1PC. And, yes, durability does exist in XAResource proxy. 
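As a sketch of the 1PC fan-out being debated here (an illustration of the behaviour described, not HA-JDBC source): the proxy drives commit(xid, true) against each underlying XAResource, deactivates any database whose outcome deviates, and rethrows only when every database fails.

import javax.transaction.xa.XAException;
import javax.transaction.xa.XAResource;
import javax.transaction.xa.Xid;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;

class OnePhaseCommitFanOut {
    // Drives the one-phase commit against every underlying XAResource. If at least one
    // database commits, the databases that threw are deactivated and the commit is reported
    // as successful; if every database threw, the first exception is rethrown. (Simplified:
    // the matrix above also compares XA_RB* error codes when all databases roll back.)
    void commitOnePhase(Xid xid, Map<String, XAResource> resources,
                        Consumer<String> deactivate) throws XAException {
        Map<String, XAException> failures = new LinkedHashMap<>();
        for (Map.Entry<String, XAResource> entry : resources.entrySet()) {
            try {
                entry.getValue().commit(xid, true);    // one-phase optimization
            } catch (XAException e) {
                failures.put(entry.getKey(), e);
            }
        }
        if (!failures.isEmpty() && failures.size() == resources.size()) {
            throw failures.values().iterator().next(); // nobody committed: propagate
        }
        failures.keySet().forEach(deactivate);          // drop the deviants
    }
}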
> > > > > > DB1 DB2 HA-JDBC > > void void void > > void XA_RB* void, deactivate db2 > > XA_RB* XA_RB* XA_RB* (only if error code matches) > > XA_RBBASE XA_RBEND XA_RBEND (prefer higher error code), rollback > > and deactivate db1 > > > > In general, the HA logic picks the best outcomes, auto-forgets > > heuristic > > decisions that turn out to be correct, aggressively deactivates the > > deviants from the replica group, and tolerates at most 1 > > incorrect/mixed/hazard heuristic decision to bubble back to the > > transaction manager. > > > > I don't think that your Ghostbusters analogy is very accurate. Hiding > > heuristic decisions (e.g. #2-5) from the transaction manager is not > > dangerous in the realm of HA-JDBC. If a heuristic decision turns > > out to > > be incorrect, then we can deactivate that database. It will get > > re-sync'ed before it is reactivated, so the fact that a heuristic > > decision was made at all is no longer relevant. > > Nope. And this is what concerns me. You need to understand what > heuristics are about and why they have been developed over the past 4 > decades. I thought the whole point of making a heuristic decision was to improve resource availability by allowing the resource to proceed with the second phase of a transaction branch according to its response to the first phase, in the event that the transaction manager fails to initiate the second phase within some pre-determined time - so that resources are not blocked forever. What other motivations are there? > The reason that transaction managers don't ignore them and > try to "make things right" (or hide them) is because we don't have the > necessary semantic information to resolve them automatically. On the contrary, HA-JDBC does have one vital piece of information that the transaction manager does not - the results from other identical resources to the same action. If one database made a heuristic decision to commit a transaction, while another database returned a successful commit (scenario #2), then I think HA-JDBC *does* have sufficient information to forget the heuristically completed branch and return normally. A transaction manager could not safely make this decision. Likewise, if one database made a heuristic decision to rollback a transaction, while another database successfully committed (scenario #3), then I think HA-JDBC has sufficient information to conclude that the heuristic decision was incorrect and can forget the heuristically completed branch on the first database, deactivate the first database, and return normally. > You're > saying that you mask a heuristic database and take it out of the group > so it won't cause problems. Later it'll be fixed by having the state > overwritten by some good copy. But you are ignoring the fact that you > do not have recovery capabilities built in to your XAResource as far > as I can tell and are assuming that calling forget and/or deactivating > the database will limit the problem. What recovery capabilities do you think are missing? > In reality, particularly with > heuristics that cause the resource to commit or roll back, there may > be systems and services tied in to those database at the back end > which are triggered on commit/rollback and go and do something else > (e.g., on rollback trigger the sending of a bad credit report on the > user). This is not a valid use case for HA-JDBC. One of HA-JDBC's requirements is that all access to the databases of the cluster go through the HA-JDBC driver.
A trigger like this is unsupported since it would: a) have to exist on every database - and would result it N bad credit reports - BAD. b) Given that any database can very well become inactive at any time, you cannot rely on a trigger on a single database cluster member. Unfortunate, yes, but there's only so much I can support from the client side. > In the real world, the TM would propagate the heuristic back to > the end user and the sys admin would then go and either stop the > credit report from being sent, or nullify it someone. Like I said earlier, because HA-JDBC also has the results from other identical resources available, it can reasonably determine the correctness of a heuristic decision. > How can he do > that? Because he understands the semantics of the operation that was > being executed. Does simply making the table entry the same as in one > of the copies that didn't throw the heuristic help here? No of course > it doesn't. It's just part of the solution (and if you were the person > with the bad credit report, I think you'd want the whole compensation > not just a part.) > > > Before the resync - we > > recover and forget any prepared/heuristically completed branches. > > See above. It doesn't help. Believe me: I've been in your position > before, back in the early 1990's ;-) I'll need to read your responses to the above before I'm convinced. > > If > > the XAResource proxy itself throws an exception indicating an invalid > > heuristic decision, only then do we let on to the transaction manager. > > In this case, we are forced to resolve the heuristic, either > > automatically (by the transaction manager) > > The transaction manager will not do this automatically in most cases. That's fine. The main side effect of an unresolved heuristic decision is the inability to activate/synchronize any inactive databases. > > or manually (by and admin) > > before any deactivated databases are reactivated - to protect data > > consistency. > > Remember the end of Ghostbusters - crossing the streams has its uses > > and > > is often the only way to send Gozer back to her dimension. > > Yes, but you'd also be sending most of New York back with her ;) > > > > > > > Transaction recovery should work fine. > > Confidence ;-) > > Have you tried it? No, actually. The durability logic is still in a development branch and has yet to be tested thoroughly. > We have a very thorough test suite for JBossTS > that's been built up over the past 12 years (in Java). More than 50% > of it is aimed at crash recovery tests, so it would be good to try it > out. This would be an invaluable resource. > > HA-JDBC's > > XAResource.recover(int) will return the aggregated set of Xids from > > each > > resource manager. > > What happens if some replicas are unavailable during recovery? Do you > exclude them? Yes - they would be deactivated. > > HA-JDBC currently assumes that the list of prepared > > and heuristically completed branches is the same for each active > > database. This is only true because of the protected-commit > > functionality, and the single tolerance policy of HA-JDBC's XAResource > > for incorrect/mixed/hazard heuristic outcomes. > > > >>> > >>> > >>>>> > >>>>> HA-JDBC also considers failure of the JDBC client itself, e.g. the > >>>>> application server. A client that fails - then comes back online > >>>>> must > >>>>> get its state (i.e. 
which databases are presently active) either > >>>>> from > >>>>> one of the other client nodes (in the distributed case) or from a > >>>>> local > >>>>> store (in the non-distributed case). > >>>> > >>>> > >>>> OK, but what if it fails during an update? > >>> > >>> During an update of what? > >> > >> The client is sending an update request to the db and fails. How does > >> it figure out the final outcome of the request? What about partial > >> updates? How does it know? > >> > > > > I think I already answered this question below when discussing failure > > of a client node in the during transaction commit/rollback. > > > >>> > >>> > >>>> How are you achieving consensus across the replicas in this case? > >>> > >>> By replica, I assume you mean database? > >> > >> Yes. > > > > Again, discussed below. Consensus is achieved via distributed > > acknowledgements. > > Well hopefully we all know that consensus is impossible to achieve in > an asynchronous environment with failures, correct :-) ? I think we're talking apples and oranges... By consensus, I'm referring to the conflict resolution strategy. As I mentioned earlier, currently, this uses an optimistic approach, though as you pointed out, and I agreed, this should change. > >>> There is no communication > >>> between replicas - only between HA-JDBC clients. When HA-JDBC > >>> starts - > >>> the current database cluster state (i.e. which database are > >>> active) is > >>> fetched from the JGroups group coordinator. If it's the first > >>> member, > >>> or if not running in distributed mode, then the state is fetched > >>> from a > >>> local store (currently uses java.util.prefs, though this will be > >>> pluggable in the future). > >>> > >>>> For instance, the client says "commit" to the replica group (I > >>>> assume > >>>> that the client has a group-like abstraction for the servers so > >>>> it's > >>>> not dealing with them all directly?) > >>> > >>> Which client are you talking about? > >> > >> The user of the JDBC connection. > > > > Yes - there's a group-like abstraction for the servers. > > OK, so I'm even more confused than I was given what you said way back > (that made me ask the unicast/multicast question). Oh - sorry - I misunderstood your original question... By "group-like abstraction for the servers", you meant: does the HA-JDBC driver use multicast for statement execution? No. > >>> The application client doesn't know > >>> (or care) whether its accessing a single database (via normal JDBC) > >>> or a > >>> cluster of databases (via HA-JDBC). > >> > >> Well I agree in theory, but as I've pointed out so far it may very > >> well have to care if the information we give back is not the same as > >> in the single case. > >> > >>> HA-JDBC is instrumented completely > >>> behind the standard JDBC interfaces. > >>> > >>>> and there is a failure of the client and some of the replicas > >>>> during > >>>> the commit operation. What replica consistency protocol are you > >>>> using > >>>> to enforce the state updates? > >>> > >>> Ah - this is discussed below. > >>> > >>>> Plus how many replicas do you need to tolerate N failures (N+1, 2N > >>>> +1, > >>>> 3N+1)? > >>>> > >>> > >>> N+1. If only 1 active database remains in the cluster, then HA-JDBC > >>> behaves like a normal JDBC driver. > >> > >> So definitely no network partition failures then. > > > > I do attempt to handle them as best as I can. > > There are two scenarios we consider: > > 1. Network partition affects connectivity to all databases > > 2. 
Network partition affects connectivity to all but one database > > (e.g. > > one database is local) > > > > The first is actually easier, because the affect to all replicas is > > the > > same. In general, HA-JDBC doesn't ever deactivate all databases. The > > client would end up seeing SQLExceptions until the network is > > resolved. > > What about multiple clients? So far you've really been talking about > one client using a replica group for JDBC. That's pretty easy to > solve, relative to concurrent clients. Especially when you throw in > network partitions. Hopefully I cleared up this misconception earlier. I *have* been talking about multiple clients. Multicast is not used for statement execution, but rather for cluster state management. Sorry for the confusion. > > The seconds scenario is problematic. Consider an environment of two > > application server nodes, > > No, let's talk about more than 2. If you only do 2 then it's better to > consider a primary copy or coordinator-cohort approach rather than > true active replication. > > > and an HA-JDBC cluster of 2 databases, where > > the databases are co-located on each application server machine. > > During > > a network partition, each application server sees that its local > > database is alive, but connectivity to the other database is down. > > Rather than allowing each server to proceed writing to its local > > database, creating a synchronization nightmare, > > (split brain syndrome is the term :-) > > > the HA-JDBC driver goes > > into cluster panic, and requires manual resolution. Luckily, this > > isn't > > that common, since a network partition likely affects connectivity to > > the application client as well. We are really only concerned with > > database write requests in process when the partition occurs. > > Well I'm more concerned with the case where we have concurrent clients > who see different replica groups because of network partitioning. So > let's assume we have 3 replicas and they're not co-located with any > application server. Then assume that we have a partition that splits > them 2:1 so that one client sees 1 replica and assumes 2 failures and > vice versa. If your network was asymmetric in such a way, i.e. one client + 2 database on one subnet, and 2 clients + 1 databases on another subnet, you would still use the local="true|false", where local implies local to subnet - so this scenario can still be detected. > >>>> There's the application client, but the important "client" for this > >>>> failure is the coordinator. How are you assuming this failure mode > >>>> will be dealt with by the coordinator if it can't determine who did > >>>> what when? > >>> > >>> Let me elaborate... > >>> If protected-commit mode is enabled, then then commits and rollbacks > >>> will trigger a 2-phase acknowledgement broadcast to the other > >>> nodes of > >>> the JGroups application cluster. > >> > >> Great. Who is maintaining the log for that then? Seems to me that > >> this > >> is starting to look more and more like what a transaction coordinator > >> would do ;-) > > > > Essentially, that's what the XAResource is doing - coordinating the > > atomic behavior resource managers that it proxies. The advantage is - > > instead of reporting mixed/hazard heuristic outcomes due to in-doubt > > transactions - we can report a definitive outcome to the transaction > > manager, since the in-doubt resource managers can be deactivated. > > No you can't. Not unless you've got durability in the XAResource > proxy. Do you? 
If not, can you explain how you can return a definitive > answer when the proxy fails? Yes - there is durability in both the XAResource and Connection proxies. The log for this is maintained on each client node, and also replicated via jgroups so that remote nodes can recover from partial commits instead of waiting for the crashed node to recover itself. > >>> If transaction-mode="serial" then within the HA-JDBC driver, > >>> commits/rollbacks will execute against each database > >>> sequentially. In > >>> this case, we send a series of notifications prior to the > >>> commit/rollback invocation on each database; and a second series of > >>> acknowledgements upon completion. If the application server > >>> executing > >>> the transaction fails in the middle, the other nodes will detect the > >>> view change and check to see if there are any unacknowledged > >>> commits/rollbacks. If so, the unacknowledged databases will be > >>> deactivated. > >> > >> OK, but what about concurrent and conflicting updates on the > >> databases > >> by multiple clients? Is it possible that two clients start at either > >> end of the replica list and meet in the middle and find that they > >> have > >> already committed conflicting updates because they did not know about > >> each other? (A white board would be a lot easier to explain this.) > > > > This doesn't happen as the replica list uses consistent ordered across > > multiple clients. > > What ordering protocol do you use? Causal, atomic, global? I didn't mean to imply a broadcast protocol - the commits will occur in alphanumeric order of their ids (when transaction-mode="serial"). > > > > > >>> If transaction-mode="parallel" then within the HA-JDBC driver, > >>> commits/rollbacks will execute against each database in parallel. > >>> In > >>> this case, we send a single notification prior to the invocations, > >>> and a > >>> second acknowledgement after the invocations complete. If, on view > >>> change, the other nodes find an unacknowledged commit/rollback, then > >>> the > >>> database cluster enters a panic state and requires manual > >>> resolution. > >> > >> I assume parallel is non-deterministic ordering of updates too across > >> replicas? > > > > Yes - this mode is useful (i.e. faster) when the likelihood of > > concurrent updates of the same data is small. > > What do you do in the case of conflicts? If there are concurrent updates of the same data and transaction-mode="parallel", and transaction 1 happens to get to resource 1 first, but transaction 2 happens to get to resource 2 first, then each will wait for the other until their transactions time out triggering a rollback. Again, this mode is an optimization for systems with a very low likelihood of concurrent updates. > >>> [OT] I'm considering adding a third "hybrid" transaction mode, where > >>> transactions execute on first on one database, then the rest in > >>> parallel. This mode seeks to address the risk of deadlocks with > >>> concurrent updates affecting parallel mode, while enabling better > >>> performance than serial mode, for clusters containing more than 2 > >>> databases. This mode would benefit from fine-grained transaction > >>> acknowledgements, while requiring less overhead than serial mode > >>> when > >>> cluster size > 2. > >> > >> What is the benefit of this over using a transaction coordinator to > >> do > >> all of this coordination for you in the first place? > > > > HA-JDBC is designed to function in both JTA and non-JTA environments. 
> > The above logic will guard against in-doubt transactions in both > > environments. > > I may have more to say on this once we get over the question of > durability in your XAResource ;-) What other questions do you have about durability in HA-JDBC? > >>>> What about a complete failure of every node in the system, i.e., > >>>> the > >>>> application, coordinator and all of the JDBC replicas fails? In the > >>>> wonderful world of transactions, we still need to guarantee > >>>> consistency in this case. > >>>> > >>> > >>> This is also handled. > >>> Just as cluster state changes are both broadcast to other nodes and > >>> recorded locally, so are commit/rollback acknowledgements. > >> > >> And heuristic failure? In fact, all responses from each XAResource > >> operation? > > > > > > Hopefully, I already answered this question above. If not, can you > > elaborate? > > > > I don't think we're quite finished yet ;-) Me neither. It occurs to me now that the current durability behavior is not yet sufficient for XA transactions, since the Xid is only recorded before commit/rollback. This will only detect crashes during commit/rollback, but will not detect heuristic decisions made if the client node crashes after prepare() but *before* commit/rollback. To address this, I really ought to record Xids in prepare(). When the unfinished tx is detected (either locally after restart or remotely after jgroups membership change), I'll need to perform an XAResource.recover() + rollback() for each recorded Xid, and resolve any conflicting results. Anything else? Paul |
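A minimal sketch of the recovery pass Paul describes at the end of this message - recording each Xid durably at prepare time and, after a restart or a JGroups membership change, reconciling the recorded Xids against XAResource.recover() - might look like the following. The XidLog interface and the class names are illustrative stand-ins, not HA-JDBC API, and Xid set membership would in practice need a comparison wrapper since Xid implementations do not always define equals().

    import javax.transaction.xa.XAException;
    import javax.transaction.xa.XAResource;
    import javax.transaction.xa.Xid;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Illustrative only: remember Xids durably at prepare time, then on restart
    // (or on a view change that drops a client node) roll back any branch that
    // was prepared but never acknowledged as committed or rolled back.
    public class PreparedXidRecovery {

        private final XidLog log;                  // hypothetical durable log of prepared Xids
        private final List<XAResource> databases;  // one XAResource per database replica

        public PreparedXidRecovery(XidLog log, List<XAResource> databases) {
            this.log = log;
            this.databases = databases;
        }

        public int prepare(Xid xid, XAResource resource) throws XAException {
            this.log.record(xid);                  // record before forwarding the prepare
            return resource.prepare(xid);
        }

        // Called locally after a restart, or remotely after a membership change.
        public void recoverUnfinished() throws XAException {
            Set<Xid> unfinished = new HashSet<>(this.log.unacknowledged());
            for (XAResource resource : this.databases) {
                for (Xid pending : resource.recover(XAResource.TMSTARTRSCAN | XAResource.TMENDRSCAN)) {
                    if (unfinished.contains(pending)) {
                        resource.rollback(pending); // prepared but never completed
                    }
                }
            }
            // Conflicting per-database outcomes (e.g. one replica already committed)
            // would still have to be resolved, e.g. by deactivating the deviant replica.
        }

        public interface XidLog {
            void record(Xid xid);
            Set<Xid> unacknowledged();
        }
    }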
|
From: Mark L. <ml...@re...> - 2009-01-30 12:46:39
|
Hi Paul. Sorry for the delay in replying ... On 9 Dec 2008, at 18:33, Paul Ferraro wrote: >> >> OK, so failures are agreed across all clients and all members of the >> replica group? What's the protocol for rejoining the group? Does >> there >> have to be a quiescent period, for instance? > > Database deactivations (i.e., the action taken in response to singular > failures) are broadcast to all client nodes. This is not so much an > agreement as it is a decree - it cannot be vetoed. A replica group > member (i.e. a database) has no knowledge of the clients, nor other > replica group members. > Rejoining the replica group (i.e. for a database to become active > again), does indeed require a quiescent period. This is enforced by a > distributed read-write lock. On the whole database or just on the data that needs updating? The former is obviously easier but leads to more false locking, whereas the latter can be harder to ensure but does allow for continuous use of the remaining database tables (cf page level locking versus object level locking.) > New transactions (both local and XA) > acquire a local read locks on begin and release on commit/rollback. Locking what? > The > acquisition of a distributed write lock effectively blocks any > database > writes across the whole cluster, while still allowing database reads. OK, so this sounds like it's a lock on the database itself rather than any specific data items. However, it also seems like you're not using the traditional exclusive writer, concurrent readers. So how do you prevent dirty reads of data in that case? Are you assuming optimistic locking? > > The activation process is: > 1. Acquire distributed write lock > 2. Synchronize against an active database using a specified > synchronization strategy > 3. Broadcast an activation message to all client nodes I don't quite get the need for 3, but then this may have been covered before. Why doesn't the newly activated replica simply rejoin the group? Aren't we just using the groupid to communicate and not sending messages from clients to specific replicas, i.e., multicast versus multiple unicasts? (BTW, I'm assuming "client" here means database user.) > > 4. Release distributed write lock > > N.B. Suspended transactions will block the activation process, since a > newly joined replica will have missed the prepare phase. > Heuristically completed transactions (as defined by the XAResource > proxy > (discussed below) will also block the activation process. OK, that sounds good. >>> >>> This is not really an issue in HA-JDBC. In the XA scenario, HA-JDBC >>> exposes a single XAResource for the whole database cluster to the >>> transaction manager, while internally it manages an XAResource per >>> database. >> >> OK, so this single XAResource acts like a transaction coordinator >> then? With a log? > > Nope. The XAResource HA-JDBC exposes to the transaction manager is > merely a proxy to a set of underlying XAResources (one per database) > that implements HA behavior. It has no log. I'll go into detail > about > how this HA logic handles heuristic decisions later... OK, I'll wait ;-) > > >> Plus, the TM will not necessarily go through 2PC, >> but may in fact use a 1PC protocol. The problem here is that there >> are >> actually multiple resources registered (c.f. interposition or >> subordinate coordinators). The failures that can be returned from a >> 1PC interaction on an XAResource are actually different than those >> that can be returned during the full 2PC. 
What this means is that if >> you run into these failures behind your XAResource veneer, you will >> be >> incapable of giving the right information (right error codes and/or >> exceptions) back to the transaction manager and hence to the >> application. > > 1PC is any different from the perspective of HA-JDBC's XAResource > proxy. > For a given method invocation, HA-JDBC's XAResource proxy coordinates > the results from the individual XAResources it manages, deactivating > the > corresponding database if it responds anomalously (including > exceptions > with anomalous error codes). Yes - 1PC commits can throw additional > XA_RB* exceptions, but the corresponding HA logic is algorithmically > the > same. No, you misunderstand. If you are proxying as you mentioned, then you're effectively acting as a subordinate coordinator. Your XAResource is hiding 1..N different XAResource instances. In the case of the transaction manager, it sees only your XAResource. This means that if there are no other XAResources involved in the transaction, the coordinator is not going to drive your XAResource through 2PC: it'll call commit(true) and trigger the one-phase optimization. Now your XAResource knows that there are 1..N instances in reality that need coordinating. Let's assume it's 2..N because 1..1 is the degenerate case and is trivial to manage. This means that your XAResource is a subordinate coordinator and must run 2PC on behalf of the root coordinator. But unfortunately the route through the 2PC protocol allows for a number of different error conditions than the route through the 1PC termination protocol. So your XAResource will have to translate 2PC errors, for example, into 1PC (and sometimes that mapping is not one-to-one.) Do you handle that? > > >> As a slight aside: what do you do about non-deterministic SQL >> expressions such as may be involved with the current time? > > HA-JDBC can be configured to replace non-deterministic SQL expressions > with client generated values. This behavior is enabled per expression > type (e.g. eval-current-timestamp="true"): > http://ha-jdbc.sourceforge.net/doc.html#cluster > Of course, to avoid the cost of regular expression parsing, it's > best to > avoid these functions altogether. Yes, but if we want to opaquely replace JDBC with HA-JDBC then we need to assume that the client won't be doing anything to help us. > > >> Plus have >> you thought about serializability or linearizability issues across >> the >> replicas when multiple clients are involved? > > Yes. The biggest headaches are sequences and identity columns. > Currently, sequences next value functions and inserts into tables > containing identity columns require distributed synchronous access > (per > sequence name, or per table name). This is very cumbersome, so I > generally encourage HA-JDBC users to use alternative, more > cluster-friendly identifier generation mechanisms. The cost for > sequences can be mitigated somewhat using a hi-lo algorithm (e.g. > SequenceHiLoGenerator in Hibernate). > For databases where DatabaseMetaData.supportsGetGeneratedKeys() = > true, > I plan to implement a master-slave approach, where the sequence next > value is executed against the master database, and alter sequence > statement are executed against the slaves. This is a cleaner than the > current approach, though vendor support for > Statement.getGeneratedKeys() > is still fairly sparse. What are the performance implications so far and with your future plans? 
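On the hi-lo mitigation quoted just above: the point of the algorithm is that only one clustered sequence fetch is needed per block of identifiers, with the remaining values assigned locally, which is what softens the distributed-access cost Mark is asking about. A rough sketch, where SequenceSource stands in for whatever clustered next-value call is used (the names are illustrative, not HA-JDBC or Hibernate API):

    // Illustrative hi-lo sketch: one shared sequence fetch per block of ids.
    public class HiLoIdGenerator {

        private final int blockSize;
        private final SequenceSource source; // hypothetical clustered "next value" call

        private long hi = -1;  // last block fetched from the shared sequence
        private int lo;        // next offset within the current block

        public HiLoIdGenerator(SequenceSource source, int blockSize) {
            this.source = source;
            this.blockSize = blockSize;
            this.lo = blockSize; // force a fetch on first use
        }

        public synchronized long nextId() {
            if (this.lo >= this.blockSize) {
                this.hi = this.source.nextValue(); // the only clustered operation
                this.lo = 0;
            }
            return this.hi * this.blockSize + this.lo++;
        }

        public interface SequenceSource {
            long nextValue();
        }
    }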
> > >>> Consequently, HA-JDBC can tolerate asymmetric failures in any >>> phase of a transaction (i.e. a phase fails against some databases, >>> but >>> succeeds against others) without forcing the whole transaction to >>> fail, >>> or become in-doubt, by pessimistically deactivating the failed >>> databases. >> >> I'm not so convinced that the impact on the transaction is so opaque, >> as I mentioned above. Obviously what we need to do is make a replica >> group appear to be identical to a single instance only more >> available. >> Simply gluing together multiple instances behind a single XAResource >> abstraction doesn't do it. Plus what about transaction recovery? By >> hiding the fact that some replicas may have generated heuristic >> outcomes whereas others haven't (because of failures), the >> transaction >> manager may go ahead and delete the log because it thinks that the >> transaction has completed successfully. That would be a very bad >> thing >> to do (think of crossing the streams in Ghostbusters ;-) Simply >> overwriting the failed heuristic replica db with a fresh copy from a >> successful db replica would also be a very bad thing to do (think >> letting the Keymaster and Gozer meet in Ghostbusters) because >> heuristics need to be resolved by the sys admin and the transaction >> log should not vanish until that has happened. So some kinds of >> failures of dbs need to be treated differently than just deactivating >> them. Can you cope with that? > > Perhaps I should go into more detail about what happens behind the > XAResource proxy. One of the primary responsibilities of the > XAResource > proxy is to coordinate the return values and/or exceptions from a > given > method invocation - and present them to the caller (in this case, the > transaction manager) as a single return value or exception. Yes, I kind of figured that, since it's a subordinate coordinator (see above). But if you're not recording these outcomes durably then your proxy is going to cause problems itself for transaction integrity. Or can you explain to me what happens in the situation where the proxy has obtained some outcomes from the 1..N XAResources but not all and then it crashes? > > > Let's look at a couple of result matrices of return values and error > codes for each XAResource for a replica group of 2 databases and the > coordinated result from the HA-JDBC XAResource proxy: > N.B. I've numbered each scenario for discussion purposes. > I've used wildcard (*) in some error codes for brevity > Exception codes are distinguished from return values using [] > > prepare(Xid): > > REF DB1 DB2 HA-JDBC > 1 XA_OK XA_OK XA_OK > 2 XA_RDONLY XA_RDONLY XA_RDONLY > 3 XA_OK XA_RDONLY XA_OK, deactivate db2 This should mean more than deactivating the database. If one resource thinks that there was no work done on the data it controls whereas the other does, that's a serious issue and you should rollback! commit isn't always the right way forward! This is different to some of the cases below because there are no errors reported from the resource. As such the problem is at the application level, or somewhere in the middleware stack. That fact alone means that anything the committable resource has done is suspect immediately. 
> > 4 XA_OK [XA_RB*] XA_OK, rollback and deactivate db2 > 5 XA_RDONLY [XA_RB*] XA_RDONLY, rollback and deactivate db2 > 6 [XA_RB*] [XA_RB*] [XA_RB*] (only if error code matches) > 7 [XA_RBBASE] [XA_RBEND] [XA_RBEND] (prefer higher error code), > rollback and deactivate db1 > 8 XA_OK [XAER_*] XA_OK, deactivate db2 > 9 XA_RDONLY [XAER_*] XA_RDONLY, deactivate db2 > 10 [XA_RB*] [XAER_*] [XA_RB*], deactivate db2 > 11 [XAER_*] [XAER_*] [XAER_*], chain exceptions if errors not the > same Hmmm, well that's not good either, because XAER_* errors can mean very different things (in the standard as well as in the implementations). So simply assuming that all of the databased are still in a position where they can be treated as a single instance is not necessarily a good idea. I understand all of what you're trying to do (it's just using the available copies replication protocol, which has been around for several decades.) But even that required a level of durability at the recipient in order to deal with failures between it and the client. The recipient in this case is your XAResource proxy. > > > In summary, when the return values/exceptions from each database do > not > match, HA-JDBC makes a heuristic decision of it's own. HA-JDBC makes > the reasonable assumption that the JDBC client (i.e. application) > expects a given database transaction to succeed, therefore, an XA_OK > is > preferable to a rollback request (i.e. XA_RB* exception). +1 > > > commit(Xid, false): > > REF DB1 DB2 HA-JDBC > 1 void void void > 2 void [XA_HEURCOM] void, forget db2 What happens if the call to forget fails, or your proxy crashes before it can be made? > > 3 void [XA_HEURRB] void, forget and deactivate db2 > 4 void [XA_HEURMIX] void, forget and deactivate db2 > 5 void [XA_HEURHAZ] void, forget and deactivate db2 Same question as for 2. > > 6 void [XAER_*] void, deactivate db2 > 7 [XA_HEURCOM] [XA_HEURCOM] [XA_HEURCOM] > 8 [XA_HEURCOM] [XA_HEURRB] [XA_HEURCOM], forget and deactivate db2 > > 9 [XA_HEURCOM] [XA_HEURMIX] [XA_HEURCOM], forget and deactivate db2 This is more concerning because you've got a mixed heuristic outcome which means that this resource was probably a subordinate coordinator itself or modifying multiple entries in the database. That'll make automatic recovery interesting unless you expect to copy one entire database image over the top of another? > > 10 [XA_HEURCOM] [XA_HEURHAZ] [XA_HEURCOM], forget and deactivate db2 As above, though this time we don't know what state db2 is in. > > 11 [XA_HEURCOM] [XAER_*] [XA_HEURCOM], deactivate db2 > 12 [XA_HEURRB] [XA_HEURRB] [XA_HEURRB] > 13 [XA_HEURRB] [XA_HEURMIX] [XA_HEURRB], forget and deactivate db2 > 14 [XA_HEURRB] [XA_HEURHAZ] [XA_HEURRB], forget and deactivate db2 > 15 [XA_HEURRB] [XAER_*] [XA_HEURRB], deactivate db2 > 16 [XA_HEURMIX] [XA_HEURMIX] [XA_HEURMIX], forget and deactivate > all but one db > 17 [XA_HEURMIX] [XA_HEURHAZ] [XA_HEURMIX], forget and deactivate db2 > 18 [XA_HEURMIX] [XAER_*] [XA_HEURMIX], deactivate db2 > 19 [XA_HEURHAZ] [XA_HEURHAZ] [XA_HEURHAZ]. forget and deactivate > all but one db > 20 [XA_HEURHAZ] [XAER_*] [XA_HEURHAZ], deactivate db2 > 21 [XAER_*] [XAER_*] [XAER_*], chain exceptions if errors not the > same > > commit(Xid, true): > N.B. There are no heuristic outcomes in the 1PC case, since there is > no > prepare. Erm, not true as I mentioned above. If your XAResource proxy is driven through 1PC and there is more than one database then you cannot simply drive them through 1PC - you must run 2PC and act as a subordinate. 
If you do not then you are risking more heuristic outcomes and still your XAResource proxy will require durability. > > > DB1 DB2 HA-JDBC > void void void > void XA_RB* void, deactivate db2 > XA_RB* XA_RB* XA_RB* (only if error code matches) > XA_RBBASE XA_RBEND XA_RBEND (prefer higher error code), rollback > and deactivate db1 > > In general, the HA logic picks the best outcomes, auto-forgets > heuristic > decisions that turn out to be correct, aggressively deactivates the > deviants from the replica group, and tolerates at most 1 > incorrect/mixed/hazard heuristic decision to bubble back to the > transaction manager. > > I don't think that your Ghostbusters analogy is very accurate. Hiding > heuristic decisions (e.g. #2-5) from the transaction manager is not > dangerous in the realm of HA-JDBC. If a heuristic decision turns > out to > be incorrect, then we can deactivate that database. It will get > re-sync'ed before it is reactivated, so the fact that a heuristic > decision was made at all is no longer relevant. Nope. And this is what concerns me. You need to understand what heuristics are about and why they have been developed over the past 4 decades. The reason that transaction managers don't ignore them and try to "make things right" (or hide them) is because we don't have the necessary semantic information to resolve them automatically. You're saying that you mask a heuristic database and take it out of the group so it won't cause problems. Later it'll be fixed by having the state overwritten by some good copy. But you are ignoring the fact that you do not have recovery capabilities built in to your XAResource as far as I can tell and are assuming that calling forget and/or deactivating the database will limit the problem. In reality, particularly with heuristics that cause the resource to commit or roll back, there may be systems and services tied in to those database at the back end which are triggered on commit/rollback and go and do something else (e.g., on rollback trigger the sending of a bad credit report on the user). In the real world, the TM would propagate the heuristic back to the end user and the sys admin would then go and either stop the credit report from being sent, or nullify it someone. How can he do that? Because he understands the semantics of the operation that was being executed. Does simply making the table entry the same as in one of the copies that didn't throw the heuristic help here? No of course it doesn't. It's just part of the solution (and if you were the person with the bad credit report, I think you'd want the whole compensation not just a part.) > Before the resync - we > recover and forget any prepared/heuristically completed branches. See above. It doesn't help. Believe me: I've been in your position before, back in the early 1990's ;-) > If > the XAResource proxy itself throws an exception indicating an invalid > heuristic decision, only then do we let on to the transaction manager. > In this case, we are forced to resolve the heuristic, either > automatically (by the transaction manager) The transaction manager will not do this automatically in most cases. > or manually (by and admin) > before any deactivated databases are reactivated - to protect data > consistency. > Remember the end of Ghostbusters - crossing the streams has its uses > and > is often the only way to send Gozer back to her dimension. Yes, but you'd also be sending most of New York back with her ;) > > > Transaction recovery should work fine. Confidence ;-) Have you tried it? 
We have a very thorough test suite for JBossTS that's been built up over the past 12 years (in Java). More than 50% of it is aimed at crash recovery tests, so it would be good to try it out. > HA-JDBC's > XAResource.recover(int) will return the aggregated set of Xids from > each > resource manager. What happens if some replicas are unavailable during recovery? Do you exclude them? > HA-JDBC currently assumes that the list of prepared > and heuristically completed branches is the same for each active > database. This is only true because of the protected-commit > functionality, and the single tolerance policy of HA-JDBC's XAResource > for incorrect/mixed/hazard heuristic outcomes. > >>> >>> >>>>> >>>>> HA-JDBC also considers failure of the JDBC client itself, e.g. the >>>>> application server. A client that fails - then comes back online >>>>> must >>>>> get its state (i.e. which databases are presently active) either >>>>> from >>>>> one of the other client nodes (in the distributed case) or from a >>>>> local >>>>> store (in the non-distributed case). >>>> >>>> >>>> OK, but what if it fails during an update? >>> >>> During an update of what? >> >> The client is sending an update request to the db and fails. How does >> it figure out the final outcome of the request? What about partial >> updates? How does it know? >> > > I think I already answered this question below when discussing failure > of a client node in the during transaction commit/rollback. > >>> >>> >>>> How are you achieving consensus across the replicas in this case? >>> >>> By replica, I assume you mean database? >> >> Yes. > > Again, discussed below. Consensus is achieved via distributed > acknowledgements. Well hopefully we all know that consensus is impossible to achieve in an asynchronous environment with failures, correct :-) ? > > >>> There is no communication >>> between replicas - only between HA-JDBC clients. When HA-JDBC >>> starts - >>> the current database cluster state (i.e. which database are >>> active) is >>> fetched from the JGroups group coordinator. If it's the first >>> member, >>> or if not running in distributed mode, then the state is fetched >>> from a >>> local store (currently uses java.util.prefs, though this will be >>> pluggable in the future). >>> >>>> For instance, the client says "commit" to the replica group (I >>>> assume >>>> that the client has a group-like abstraction for the servers so >>>> it's >>>> not dealing with them all directly?) >>> >>> Which client are you talking about? >> >> The user of the JDBC connection. > > Yes - there's a group-like abstraction for the servers. OK, so I'm even more confused than I was given what you said way back (that made me ask the unicast/multicast question). > > >>> The application client doesn't know >>> (or care) whether its accessing a single database (via normal JDBC) >>> or a >>> cluster of databases (via HA-JDBC). >> >> Well I agree in theory, but as I've pointed out so far it may very >> well have to care if the information we give back is not the same as >> in the single case. >> >>> HA-JDBC is instrumented completely >>> behind the standard JDBC interfaces. >>> >>>> and there is a failure of the client and some of the replicas >>>> during >>>> the commit operation. What replica consistency protocol are you >>>> using >>>> to enforce the state updates? >>> >>> Ah - this is discussed below. >>> >>>> Plus how many replicas do you need to tolerate N failures (N+1, 2N >>>> +1, >>>> 3N+1)? >>>> >>> >>> N+1. 
If only 1 active database remains in the cluster, then HA-JDBC >>> behaves like a normal JDBC driver. >> >> So definitely no network partition failures then. > > I do attempt to handle them as best as I can. > There are two scenarios we consider: > 1. Network partition affects connectivity to all databases > 2. Network partition affects connectivity to all but one database > (e.g. > one database is local) > > The first is actually easier, because the affect to all replicas is > the > same. In general, HA-JDBC doesn't ever deactivate all databases. The > client would end up seeing SQLExceptions until the network is > resolved. What about multiple clients? So far you've really been talking about one client using a replica group for JDBC. That's pretty easy to solve, relative to concurrent clients. Especially when you throw in network partitions. > > > The seconds scenario is problematic. Consider an environment of two > application server nodes, No, let's talk about more than 2. If you only do 2 then it's better to consider a primary copy or coordinator-cohort approach rather than true active replication. > and an HA-JDBC cluster of 2 databases, where > the databases are co-located on each application server machine. > During > a network partition, each application server sees that its local > database is alive, but connectivity to the other database is down. > Rather than allowing each server to proceed writing to its local > database, creating a synchronization nightmare, (split brain syndrome is the term :-) > the HA-JDBC driver goes > into cluster panic, and requires manual resolution. Luckily, this > isn't > that common, since a network partition likely affects connectivity to > the application client as well. We are really only concerned with > database write requests in process when the partition occurs. Well I'm more concerned with the case where we have concurrent clients who see different replica groups because of network partitioning. So let's assume we have 3 replicas and they're not co-located with any application server. Then assume that we have a partition that splits them 2:1 so that one client sees 1 replica and assumes 2 failures and vice versa. > > >>>> There's the application client, but the important "client" for this >>>> failure is the coordinator. How are you assuming this failure mode >>>> will be dealt with by the coordinator if it can't determine who did >>>> what when? >>> >>> Let me elaborate... >>> If protected-commit mode is enabled, then then commits and rollbacks >>> will trigger a 2-phase acknowledgement broadcast to the other >>> nodes of >>> the JGroups application cluster. >> >> Great. Who is maintaining the log for that then? Seems to me that >> this >> is starting to look more and more like what a transaction coordinator >> would do ;-) > > Essentially, that's what the XAResource is doing - coordinating the > atomic behavior resource managers that it proxies. The advantage is - > instead of reporting mixed/hazard heuristic outcomes due to in-doubt > transactions - we can report a definitive outcome to the transaction > manager, since the in-doubt resource managers can be deactivated. No you can't. Not unless you've got durability in the XAResource proxy. Do you? If not, can you explain how you can return a definitive answer when the proxy fails? > > >> >>> >>> If transaction-mode="serial" then within the HA-JDBC driver, >>> commits/rollbacks will execute against each database >>> sequentially. 
In >>> this case, we send a series of notifications prior to the >>> commit/rollback invocation on each database; and a second series of >>> acknowledgements upon completion. If the application server >>> executing >>> the transaction fails in the middle, the other nodes will detect the >>> view change and check to see if there are any unacknowledged >>> commits/rollbacks. If so, the unacknowledged databases will be >>> deactivated. >> >> OK, but what about concurrent and conflicting updates on the >> databases >> by multiple clients? Is it possible that two clients start at either >> end of the replica list and meet in the middle and find that they >> have >> already committed conflicting updates because they did not know about >> each other? (A white board would be a lot easier to explain this.) > > This doesn't happen as the replica list uses consistent ordered across > multiple clients. What ordering protocol do you use? Causal, atomic, global? > > >>> If transaction-mode="parallel" then within the HA-JDBC driver, >>> commits/rollbacks will execute against each database in parallel. >>> In >>> this case, we send a single notification prior to the invocations, >>> and a >>> second acknowledgement after the invocations complete. If, on view >>> change, the other nodes find an unacknowledged commit/rollback, then >>> the >>> database cluster enters a panic state and requires manual >>> resolution. >> >> I assume parallel is non-deterministic ordering of updates too across >> replicas? > > Yes - this mode is useful (i.e. faster) when the likelihood of > concurrent updates of the same data is small. What do you do in the case of conflicts? > > >>> [OT] I'm considering adding a third "hybrid" transaction mode, where >>> transactions execute on first on one database, then the rest in >>> parallel. This mode seeks to address the risk of deadlocks with >>> concurrent updates affecting parallel mode, while enabling better >>> performance than serial mode, for clusters containing more than 2 >>> databases. This mode would benefit from fine-grained transaction >>> acknowledgements, while requiring less overhead than serial mode >>> when >>> cluster size > 2. >> >> What is the benefit of this over using a transaction coordinator to >> do >> all of this coordination for you in the first place? > > HA-JDBC is designed to function in both JTA and non-JTA environments. > The above logic will guard against in-doubt transactions in both > environments. I may have more to say on this once we get over the question of durability in your XAResource ;-) > > >>> >>> >>>> What about a complete failure of every node in the system, i.e., >>>> the >>>> application, coordinator and all of the JDBC replicas fails? In the >>>> wonderful world of transactions, we still need to guarantee >>>> consistency in this case. >>>> >>> >>> This is also handled. >>> Just as cluster state changes are both broadcast to other nodes and >>> recorded locally, so are commit/rollback acknowledgements. >> >> And heuristic failure? In fact, all responses from each XAResource >> operation? > > > Hopefully, I already answered this question above. If not, can you > elaborate? > I don't think we're quite finished yet ;-) Mark. --- Mark Little ml...@re... JBoss, a Division of Red Hat Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod Street, Windsor, Berkshire, SI4 1TE, United Kingdom. Registered in UK and Wales under Company Registration No. 
3798903 Directors: Michael Cunningham (USA), Charlie Peters (USA), Matt Parsons (USA) and Brendan Lane (Ireland). |
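To make the one-phase-commit point in this message concrete: when the transaction manager calls commit(xid, true) on the proxy but the proxy hides several XAResources, the proxy itself has to run a prepare/commit round and map the 2PC outcomes onto codes a one-phase caller is allowed to see. A hedged sketch of that shape (exception handling and heuristic translation largely elided; names are illustrative, not HA-JDBC code):

    import javax.transaction.xa.XAException;
    import javax.transaction.xa.XAResource;
    import javax.transaction.xa.Xid;
    import java.util.ArrayList;
    import java.util.List;

    // Sketch: the proxy receives the one-phase optimization but, hiding several
    // resources, must act as a subordinate coordinator internally.
    public class OnePhaseAsSubordinate {

        private final List<XAResource> delegates; // one XAResource per database replica

        public OnePhaseAsSubordinate(List<XAResource> delegates) {
            this.delegates = delegates;
        }

        public void commitOnePhase(Xid xid) throws XAException {
            if (this.delegates.size() == 1) {
                this.delegates.get(0).commit(xid, true); // degenerate case: pass the optimization through
                return;
            }
            List<XAResource> toCommit = new ArrayList<>();
            for (XAResource delegate : this.delegates) {
                int vote = delegate.prepare(xid); // XA_RB* exceptions from prepare are not handled in this sketch
                if (vote == XAResource.XA_OK) {
                    toCommit.add(delegate);
                } else if (vote != XAResource.XA_RDONLY) {
                    rollbackAll(xid);
                    throw new XAException(XAException.XA_RBROLLBACK); // a rollback code a 1PC caller may legally see
                }
            }
            for (XAResource delegate : toCommit) {
                // XA_HEUR* exceptions raised here have no direct 1PC equivalent and
                // would need a durable record plus translation before being reported.
                delegate.commit(xid, false);
            }
        }

        private void rollbackAll(Xid xid) throws XAException {
            for (XAResource delegate : this.delegates) {
                delegate.rollback(xid);
            }
        }
    }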
|
From: Paul F. <pau...@re...> - 2008-12-09 19:45:28
|
On Thu, 2008-11-27 at 14:40 +0000, Mark Little wrote: > On 12 Nov 2008, at 22:41, Paul Ferraro wrote: > >> > >> Obviously the first thing that raises is how you detect failures, or > >> more accurately how you suspect failures and enforce the uniform > >> agreement across the entire system that a replica should be taken as > >> failed even if it is seen as available by some other set of the > >> distributed system. See http://arxiv.org/abs/cs.DC/0309026 > >> > > > > Failures are suspected anytime a JDBC invocation that hits the > > database > > throws a SQLException. Suspects are verified by executing a simple > > query (e.g. CALL CURRENT_TIMESTAMP) using a fresh connection to the > > database that threw the exception - or by comparing the response with > > responses from other databases executing the same invocation. Failed > > database get deactivated, i.e. removed from the set of active database > > that compose the database cluster state. HA-JDBC uses JGroups to > > replicate cluster state changes (i.e. database > > activations/deactivations) to the other nodes in your application > > cluster. > > OK, so failures are agreed across all clients and all members of the > replica group? What's the protocol for rejoining the group? Does there > have to be a quiescent period, for instance? Database deactivations (i.e., the action taken in response to singular failures) are broadcast to all client nodes. This is not so much an agreement as it is a decree - it cannot be vetoed. A replica group member (i.e. a database) has no knowledge of the clients, nor other replica group members. Rejoining the replica group (i.e. for a database to become active again), does indeed require a quiescent period. This is enforced by a distributed read-write lock. New transactions (both local and XA) acquire a local read locks on begin and release on commit/rollback. The acquisition of a distributed write lock effectively blocks any database writes across the whole cluster, while still allowing database reads. The activation process is: 1. Acquire distributed write lock 2. Synchronize against an active database using a specified synchronization strategy 3. Broadcast an activation message to all client nodes 4. Release distributed write lock N.B. Suspended transactions will block the activation process, since a newly joined replica will have missed the prepare phase. Heuristically completed transactions (as defined by the XAResource proxy (discussed below) will also block the activation process. > >> Another problem that this can introduce is heuristic management: if > >> the failure of a replica occurs at the right time (aka 'wrong time') > >> it's entirely possible that the work is done there when it's rolled > >> back at the others. This causes a heuristic. In fact, the resource > >> may > >> generate a heuristic at any time after prepare is called and > >> resolution of heuristics is a manual and often complex process. At > >> the > >> worst case, we could end up in a cascading rollback situation. > >> > > > > This is not really an issue in HA-JDBC. In the XA scenario, HA-JDBC > > exposes a single XAResource for the whole database cluster to the > > transaction manager, while internally it manages an XAResource per > > database. > > OK, so this single XAResource acts like a transaction coordinator > then? With a log? Nope. The XAResource HA-JDBC exposes to the transaction manager is merely a proxy to a set of underlying XAResources (one per database) that implements HA behavior. It has no log. 
I'll go into detail about how this HA logic handles heuristic decisions later... > Plus, the TM will not necessarily go through 2PC, > but may in fact use a 1PC protocol. The problem here is that there are > actually multiple resources registered (c.f. interposition or > subordinate coordinators). The failures that can be returned from a > 1PC interaction on an XAResource are actually different than those > that can be returned during the full 2PC. What this means is that if > you run into these failures behind your XAResource veneer, you will be > incapable of giving the right information (right error codes and/or > exceptions) back to the transaction manager and hence to the > application. 1PC is not any different from the perspective of HA-JDBC's XAResource proxy. For a given method invocation, HA-JDBC's XAResource proxy coordinates the results from the individual XAResources it manages, deactivating the corresponding database if it responds anomalously (including exceptions with anomalous error codes). Yes - 1PC commits can throw additional XA_RB* exceptions, but the corresponding HA logic is algorithmically the same. > As a slight aside: what do you do about non-deterministic SQL > expressions such as may be involved with the current time? HA-JDBC can be configured to replace non-deterministic SQL expressions with client-generated values. This behavior is enabled per expression type (e.g. eval-current-timestamp="true"): http://ha-jdbc.sourceforge.net/doc.html#cluster Of course, to avoid the cost of regular expression parsing, it's best to avoid these functions altogether. > Plus have > you thought about serializability or linearizability issues across the > replicas when multiple clients are involved? Yes. The biggest headaches are sequences and identity columns. Currently, sequence next-value functions and inserts into tables containing identity columns require distributed synchronous access (per sequence name, or per table name). This is very cumbersome, so I generally encourage HA-JDBC users to use alternative, more cluster-friendly identifier generation mechanisms. The cost for sequences can be mitigated somewhat using a hi-lo algorithm (e.g. SequenceHiLoGenerator in Hibernate). For databases where DatabaseMetaData.supportsGetGeneratedKeys() = true, I plan to implement a master-slave approach, where the sequence next value is executed against the master database, and alter sequence statements are executed against the slaves. This is cleaner than the current approach, though vendor support for Statement.getGeneratedKeys() is still fairly sparse.
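The non-deterministic expression handling mentioned above reduces to evaluating the expression once on the client and substituting a literal before the statement is replayed on every replica. A hypothetical sketch (the pattern and literal format are illustrative, not HA-JDBC's actual dialect rules):

    import java.sql.Timestamp;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Illustrative sketch: replace CURRENT_TIMESTAMP with one client-generated
    // literal so that every replica applies the same value.
    public final class TimestampRewriter {

        private static final Pattern CURRENT_TIMESTAMP =
                Pattern.compile("CURRENT_TIMESTAMP(\\s*\\(\\s*\\))?", Pattern.CASE_INSENSITIVE);

        private TimestampRewriter() {
        }

        public static String rewrite(String sql) {
            Matcher matcher = CURRENT_TIMESTAMP.matcher(sql);
            if (!matcher.find()) {
                return sql; // nothing to evaluate
            }
            // JDBC timestamp escape syntax, evaluated once on the client.
            String literal = "{ts '" + new Timestamp(System.currentTimeMillis()) + "'}";
            return matcher.replaceAll(Matcher.quoteReplacement(literal));
        }
    }

With this, rewrite("UPDATE t SET updated = CURRENT_TIMESTAMP WHERE id = ?") would carry the same timestamp literal to every replica.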
That would be a very bad thing > to do (think of crossing the streams in Ghostbusters ;-) Simply > overwriting the failed heuristic replica db with a fresh copy from a > successful db replica would also be a very bad thing to do (think > letting the Keymaster and Gozer meet in Ghostbusters) because > heuristics need to be resolved by the sys admin and the transaction > log should not vanish until that has happened. So some kinds of > failures of dbs need to be treated differently than just deactivating > them. Can you cope with that? Perhaps I should go into more detail about what happens behind the XAResource proxy. One of the primary responsibilities of the XAResource proxy is to coordinate the return values and/or exceptions from a given method invocation - and present them to the caller (in this case, the transaction manager) as a single return value or exception. Let's look at a couple of result matrices of return values and error codes for each XAResource for a replica group of 2 databases and the coordinated result from the HA-JDBC XAResource proxy: N.B. I've numbered each scenario for discussion purposes. I've used wildcard (*) in some error codes for brevity Exception codes are distinguished from return values using [] prepare(Xid): REF DB1 DB2 HA-JDBC 1 XA_OK XA_OK XA_OK 2 XA_RDONLY XA_RDONLY XA_RDONLY 3 XA_OK XA_RDONLY XA_OK, deactivate db2 4 XA_OK [XA_RB*] XA_OK, rollback and deactivate db2 5 XA_RDONLY [XA_RB*] XA_RDONLY, rollback and deactivate db2 6 [XA_RB*] [XA_RB*] [XA_RB*] (only if error code matches) 7 [XA_RBBASE] [XA_RBEND] [XA_RBEND] (prefer higher error code), rollback and deactivate db1 8 XA_OK [XAER_*] XA_OK, deactivate db2 9 XA_RDONLY [XAER_*] XA_RDONLY, deactivate db2 10 [XA_RB*] [XAER_*] [XA_RB*], deactivate db2 11 [XAER_*] [XAER_*] [XAER_*], chain exceptions if errors not the same In summary, when the return values/exceptions from each database do not match, HA-JDBC makes a heuristic decision of it's own. HA-JDBC makes the reasonable assumption that the JDBC client (i.e. application) expects a given database transaction to succeed, therefore, an XA_OK is preferable to a rollback request (i.e. XA_RB* exception). commit(Xid, false): REF DB1 DB2 HA-JDBC 1 void void void 2 void [XA_HEURCOM] void, forget db2 3 void [XA_HEURRB] void, forget and deactivate db2 4 void [XA_HEURMIX] void, forget and deactivate db2 5 void [XA_HEURHAZ] void, forget and deactivate db2 6 void [XAER_*] void, deactivate db2 7 [XA_HEURCOM] [XA_HEURCOM] [XA_HEURCOM] 8 [XA_HEURCOM] [XA_HEURRB] [XA_HEURCOM], forget and deactivate db2 9 [XA_HEURCOM] [XA_HEURMIX] [XA_HEURCOM], forget and deactivate db2 10 [XA_HEURCOM] [XA_HEURHAZ] [XA_HEURCOM], forget and deactivate db2 11 [XA_HEURCOM] [XAER_*] [XA_HEURCOM], deactivate db2 12 [XA_HEURRB] [XA_HEURRB] [XA_HEURRB] 13 [XA_HEURRB] [XA_HEURMIX] [XA_HEURRB], forget and deactivate db2 14 [XA_HEURRB] [XA_HEURHAZ] [XA_HEURRB], forget and deactivate db2 15 [XA_HEURRB] [XAER_*] [XA_HEURRB], deactivate db2 16 [XA_HEURMIX] [XA_HEURMIX] [XA_HEURMIX], forget and deactivate all but one db 17 [XA_HEURMIX] [XA_HEURHAZ] [XA_HEURMIX], forget and deactivate db2 18 [XA_HEURMIX] [XAER_*] [XA_HEURMIX], deactivate db2 19 [XA_HEURHAZ] [XA_HEURHAZ] [XA_HEURHAZ]. forget and deactivate all but one db 20 [XA_HEURHAZ] [XAER_*] [XA_HEURHAZ], deactivate db2 21 [XAER_*] [XAER_*] [XAER_*], chain exceptions if errors not the same commit(Xid, true): N.B. There are no heuristic outcomes in the 1PC case, since there is no prepare. 
DB1 DB2 HA-JDBC void void void void XA_RB* void, deactivate db2 XA_RB* XA_RB* XA_RB* (only if error code matches) XA_RBBASE XA_RBEND XA_RBEND (prefer higher error code), rollback and deactivate db1 In general, the HA logic picks the best outcomes, auto-forgets heuristic decisions that turn out to be correct, aggressively deactivates the deviants from the replica group, and tolerates at most 1 incorrect/mixed/hazard heuristic decision to bubble back to the transaction manager. I don't think that your Ghostbusters analogy is very accurate. Hiding heuristic decisions (e.g. #2-5) from the transaction manager is not dangerous in the realm of HA-JDBC. If a heuristic decision turns out to be incorrect, then we can deactivate that database. It will get re-sync'ed before it is reactivated, so the fact that a heuristic decision was made at all is no longer relevant. Before the resync - we recover and forget any prepared/heuristically completed branches. If the XAResource proxy itself throws an exception indicating an invalid heuristic decision, only then do we let on to the transaction manager. In this case, we are forced to resolve the heuristic, either automatically (by the transaction manager) or manually (by and admin) before any deactivated databases are reactivated - to protect data consistency. Remember the end of Ghostbusters - crossing the streams has its uses and is often the only way to send Gozer back to her dimension. Transaction recovery should work fine. HA-JDBC's XAResource.recover(int) will return the aggregated set of Xids from each resource manager. HA-JDBC currently assumes that the list of prepared and heuristically completed branches is the same for each active database. This is only true because of the protected-commit functionality, and the single tolerance policy of HA-JDBC's XAResource for incorrect/mixed/hazard heuristic outcomes. > > > > > >>> > >>> HA-JDBC also considers failure of the JDBC client itself, e.g. the > >>> application server. A client that fails - then comes back online > >>> must > >>> get its state (i.e. which databases are presently active) either > >>> from > >>> one of the other client nodes (in the distributed case) or from a > >>> local > >>> store (in the non-distributed case). > >> > >> > >> OK, but what if it fails during an update? > > > > During an update of what? > > The client is sending an update request to the db and fails. How does > it figure out the final outcome of the request? What about partial > updates? How does it know? > I think I already answered this question below when discussing failure of a client node in the during transaction commit/rollback. > > > > > >> How are you achieving consensus across the replicas in this case? > > > > By replica, I assume you mean database? > > Yes. Again, discussed below. Consensus is achieved via distributed acknowledgements. > > There is no communication > > between replicas - only between HA-JDBC clients. When HA-JDBC > > starts - > > the current database cluster state (i.e. which database are active) is > > fetched from the JGroups group coordinator. If it's the first member, > > or if not running in distributed mode, then the state is fetched > > from a > > local store (currently uses java.util.prefs, though this will be > > pluggable in the future). > > > >> For instance, the client says "commit" to the replica group (I assume > >> that the client has a group-like abstraction for the servers so it's > >> not dealing with them all directly?) > > > > Which client are you talking about? 
> > The user of the JDBC connection. Yes - there's a group-like abstraction for the servers. > > The application client doesn't know > > (or care) whether its accessing a single database (via normal JDBC) > > or a > > cluster of databases (via HA-JDBC). > > Well I agree in theory, but as I've pointed out so far it may very > well have to care if the information we give back is not the same as > in the single case. > > > HA-JDBC is instrumented completely > > behind the standard JDBC interfaces. > > > >> and there is a failure of the client and some of the replicas during > >> the commit operation. What replica consistency protocol are you using > >> to enforce the state updates? > > > > Ah - this is discussed below. > > > >> Plus how many replicas do you need to tolerate N failures (N+1, 2N+1, > >> 3N+1)? > >> > > > > N+1. If only 1 active database remains in the cluster, then HA-JDBC > > behaves like a normal JDBC driver. > > So definitely no network partition failures then. I do attempt to handle them as best as I can. There are two scenarios we consider: 1. Network partition affects connectivity to all databases 2. Network partition affects connectivity to all but one database (e.g. one database is local) The first is actually easier, because the affect to all replicas is the same. In general, HA-JDBC doesn't ever deactivate all databases. The client would end up seeing SQLExceptions until the network is resolved. The seconds scenario is problematic. Consider an environment of two application server nodes, and an HA-JDBC cluster of 2 databases, where the databases are co-located on each application server machine. During a network partition, each application server sees that its local database is alive, but connectivity to the other database is down. Rather than allowing each server to proceed writing to its local database, creating a synchronization nightmare, the HA-JDBC driver goes into cluster panic, and requires manual resolution. Luckily, this isn't that common, since a network partition likely affects connectivity to the application client as well. We are really only concerned with database write requests in process when the partition occurs. > >> There's the application client, but the important "client" for this > >> failure is the coordinator. How are you assuming this failure mode > >> will be dealt with by the coordinator if it can't determine who did > >> what when? > > > > Let me elaborate... > > If protected-commit mode is enabled, then then commits and rollbacks > > will trigger a 2-phase acknowledgement broadcast to the other nodes of > > the JGroups application cluster. > > Great. Who is maintaining the log for that then? Seems to me that this > is starting to look more and more like what a transaction coordinator > would do ;-) Essentially, that's what the XAResource is doing - coordinating the atomic behavior resource managers that it proxies. The advantage is - instead of reporting mixed/hazard heuristic outcomes due to in-doubt transactions - we can report a definitive outcome to the transaction manager, since the in-doubt resource managers can be deactivated. > > > > > If transaction-mode="serial" then within the HA-JDBC driver, > > commits/rollbacks will execute against each database sequentially. In > > this case, we send a series of notifications prior to the > > commit/rollback invocation on each database; and a second series of > > acknowledgements upon completion. 
If the application server executing > > the transaction fails in the middle, the other nodes will detect the > > view change and check to see if there are any unacknowledged > > commits/rollbacks. If so, the unacknowledged databases will be > > deactivated. > > OK, but what about concurrent and conflicting updates on the databases > by multiple clients? Is it possible that two clients start at either > end of the replica list and meet in the middle and find that they have > already committed conflicting updates because they did not know about > each other? (A white board would be a lot easier to explain this.) This doesn't happen as the replica list uses consistent ordered across multiple clients. > > If transaction-mode="parallel" then within the HA-JDBC driver, > > commits/rollbacks will execute against each database in parallel. In > > this case, we send a single notification prior to the invocations, > > and a > > second acknowledgement after the invocations complete. If, on view > > change, the other nodes find an unacknowledged commit/rollback, then > > the > > database cluster enters a panic state and requires manual resolution. > > I assume parallel is non-deterministic ordering of updates too across > replicas? Yes - this mode is useful (i.e. faster) when the likelihood of concurrent updates of the same data is small. > > [OT] I'm considering adding a third "hybrid" transaction mode, where > > transactions execute on first on one database, then the rest in > > parallel. This mode seeks to address the risk of deadlocks with > > concurrent updates affecting parallel mode, while enabling better > > performance than serial mode, for clusters containing more than 2 > > databases. This mode would benefit from fine-grained transaction > > acknowledgements, while requiring less overhead than serial mode when > > cluster size > 2. > > What is the benefit of this over using a transaction coordinator to do > all of this coordination for you in the first place? HA-JDBC is designed to function in both JTA and non-JTA environments. The above logic will guard against in-doubt transactions in both environments. > > > > > >> What about a complete failure of every node in the system, i.e., the > >> application, coordinator and all of the JDBC replicas fails? In the > >> wonderful world of transactions, we still need to guarantee > >> consistency in this case. > >> > > > > This is also handled. > > Just as cluster state changes are both broadcast to other nodes and > > recorded locally, so are commit/rollback acknowledgements. > > And heuristic failure? In fact, all responses from each XAResource > operation? Hopefully, I already answered this question above. If not, can you elaborate? Paul > > Mark. > > --- > Mark Little > ml...@re... > > JBoss, a Division of Red Hat > Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod > Street, Windsor, Berkshire, SI4 1TE, United Kingdom. > Registered in UK and Wales under Company Registration No. 3798903 > Directors: Michael Cunningham (USA), Charlie Peters (USA), Matt > Parsons (USA) and Brendan Lane (Ireland). 
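As a minimal sketch of the outcome merging the table at the top of this message describes - assuming a hypothetical Database abstraction with getXAResource() and deactivate(), not HA-JDBC's actual internals - the logic might look roughly like this:

import java.util.LinkedHashMap;
import java.util.Map;
import javax.transaction.xa.XAException;
import javax.transaction.xa.XAResource;
import javax.transaction.xa.Xid;

// Hypothetical stand-ins for HA-JDBC internals - illustrative names only.
interface Database {
    XAResource getXAResource();
    void deactivate();
}

final class MergingCommit {
    // Commit on every active database and merge the per-database outcomes roughly as
    // the table above describes: successes win, matching XA_RB* codes are rethrown,
    // and databases whose outcome deviates are deactivated.
    static void commitAll(Iterable<Database> active, Xid xid, boolean onePhase) throws XAException {
        Map<Database, Integer> rolledBack = new LinkedHashMap<>(); // database -> XA_RB* error code
        int successes = 0;
        for (Database database : active) {
            try {
                database.getXAResource().commit(xid, onePhase);
                successes++;
            } catch (XAException e) {
                if (e.errorCode >= XAException.XA_RBBASE && e.errorCode <= XAException.XA_RBEND) {
                    rolledBack.put(database, e.errorCode); // heuristic/rollback outcome
                } else {
                    database.deactivate(); // connectivity or protocol failure: drop the deviant
                }
            }
        }
        if (rolledBack.isEmpty()) {
            return; // row 1: every database reports void
        }
        if (successes > 0) {
            // row 2: mixed outcome -> report void, deactivate the rolled-back databases
            rolledBack.keySet().forEach(Database::deactivate);
            return;
        }
        // rows 3-4: everything rolled back -> report a single XA_RB* code (the highest),
        // deactivating any database whose code differs from the one reported upward.
        int reported = rolledBack.values().stream().max(Integer::compare).get();
        rolledBack.forEach((database, code) -> {
            if (code != reported) {
                database.deactivate();
            }
        });
        throw new XAException(reported);
    }
}

Row 4's "rollback and deactivate db1" corresponds here to deactivating whichever database reported a lower XA_RB* code than the one passed back to the transaction manager.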
|
|
From: Mark L. <ml...@re...> - 2008-11-27 14:40:24
|
On 12 Nov 2008, at 22:41, Paul Ferraro wrote: >> >> >> Obviously the first thing that raises is how you detect failures, or >> more accurately how you suspect failures and enforce the uniform >> agreement across the entire system that a replica should be taken as >> failed even if it is seen as available by some other set of the >> distributed system. See http://arxiv.org/abs/cs.DC/0309026 >> > > Failures are suspected anytime a JDBC invocation that hits the > database > throws a SQLException. Suspects are verified by executing a simple > query (e.g. CALL CURRENT_TIMESTAMP) using a fresh connection to the > database that threw the exception - or by comparing the response with > responses from other databases executing the same invocation. Failed > database get deactivated, i.e. removed from the set of active database > that compose the database cluster state. HA-JDBC uses JGroups to > replicate cluster state changes (i.e. database > activations/deactivations) to the other nodes in your application > cluster. OK, so failures are agreed across all clients and all members of the replica group? What's the protocol for rejoining the group? Does there have to be a quiescent period, for instance? > > >> >> Another problem that this can introduce is heuristic management: if >> the failure of a replica occurs at the right time (aka 'wrong time') >> it's entirely possible that the work is done there when it's rolled >> back at the others. This causes a heuristic. In fact, the resource >> may >> generate a heuristic at any time after prepare is called and >> resolution of heuristics is a manual and often complex process. At >> the >> worst case, we could end up in a cascading rollback situation. >> > > This is not really an issue in HA-JDBC. In the XA scenario, HA-JDBC > exposes a single XAResource for the whole database cluster to the > transaction manager, while internally it manages an XAResource per > database. OK, so this single XAResource acts like a transaction coordinator then? With a log? Plus, the TM will not necessarily go through 2PC, but may in fact use a 1PC protocol. The problem here is that there are actually multiple resources registered (c.f. interposition or subordinate coordinators). The failures that can be returned from a 1PC interaction on an XAResource are actually different than those that can be returned during the full 2PC. What this means is that if you run into these failures behind your XAResource veneer, you will be incapable of giving the right information (right error codes and/or exceptions) back to the transaction manager and hence to the application. As a slight aside: what do you do about non-deterministic SQL expressions such as may be involved with the current time? Plus have you thought about serializability or linearizability issues across the replicas when multiple clients are involved? > Consequently, HA-JDBC can tolerate asymmetric failures in any > phase of a transaction (i.e. a phase fails against some databases, but > succeeds against others) without forcing the whole transaction to > fail, > or become in-doubt, by pessimistically deactivating the failed > databases. I'm not so convinced that the impact on the transaction is so opaque, as I mentioned above. Obviously what we need to do is make a replica group appear to be identical to a single instance only more available. Simply gluing together multiple instances behind a single XAResource abstraction doesn't do it. Plus what about transaction recovery? 
By hiding the fact that some replicas may have generated heuristic outcomes whereas others haven't (because of failures), the transaction manager may go ahead and delete the log because it thinks that the transaction has completed successfully. That would be a very bad thing to do (think of crossing the streams in Ghostbusters ;-) Simply overwriting the failed heuristic replica db with a fresh copy from a successful db replica would also be a very bad thing to do (think letting the Keymaster and Gozer meet in Ghostbusters) because heuristics need to be resolved by the sys admin and the transaction log should not vanish until that has happened. So some kinds of failures of dbs need to be treated differently than just deactivating them. Can you cope with that? > > >>> >>> HA-JDBC also considers failure of the JDBC client itself, e.g. the >>> application server. A client that fails - then comes back online >>> must >>> get its state (i.e. which databases are presently active) either >>> from >>> one of the other client nodes (in the distributed case) or from a >>> local >>> store (in the non-distributed case). >> >> >> OK, but what if it fails during an update? > > During an update of what? The client is sending an update request to the db and fails. How does it figure out the final outcome of the request? What about partial updates? How does it know? > > >> How are you achieving consensus across the replicas in this case? > > By replica, I assume you mean database? Yes. > There is no communication > between replicas - only between HA-JDBC clients. When HA-JDBC > starts - > the current database cluster state (i.e. which database are active) is > fetched from the JGroups group coordinator. If it's the first member, > or if not running in distributed mode, then the state is fetched > from a > local store (currently uses java.util.prefs, though this will be > pluggable in the future). > >> For instance, the client says "commit" to the replica group (I assume >> that the client has a group-like abstraction for the servers so it's >> not dealing with them all directly?) > > Which client are you talking about? The user of the JDBC connection. > The application client doesn't know > (or care) whether its accessing a single database (via normal JDBC) > or a > cluster of databases (via HA-JDBC). Well I agree in theory, but as I've pointed out so far it may very well have to care if the information we give back is not the same as in the single case. > HA-JDBC is instrumented completely > behind the standard JDBC interfaces. > >> and there is a failure of the client and some of the replicas during >> the commit operation. What replica consistency protocol are you using >> to enforce the state updates? > > Ah - this is discussed below. > >> Plus how many replicas do you need to tolerate N failures (N+1, 2N+1, >> 3N+1)? >> > > N+1. If only 1 active database remains in the cluster, then HA-JDBC > behaves like a normal JDBC driver. So definitely no network partition failures then. >> >> There's the application client, but the important "client" for this >> failure is the coordinator. How are you assuming this failure mode >> will be dealt with by the coordinator if it can't determine who did >> what when? > > Let me elaborate... > If protected-commit mode is enabled, then then commits and rollbacks > will trigger a 2-phase acknowledgement broadcast to the other nodes of > the JGroups application cluster. Great. Who is maintaining the log for that then? 
Seems to me that this is starting to look more and more like what a transaction coordinator would do ;-) > > If transaction-mode="serial" then within the HA-JDBC driver, > commits/rollbacks will execute against each database sequentially. In > this case, we send a series of notifications prior to the > commit/rollback invocation on each database; and a second series of > acknowledgements upon completion. If the application server executing > the transaction fails in the middle, the other nodes will detect the > view change and check to see if there are any unacknowledged > commits/rollbacks. If so, the unacknowledged databases will be > deactivated. OK, but what about concurrent and conflicting updates on the databases by multiple clients? Is it possible that two clients start at either end of the replica list and meet in the middle and find that they have already committed conflicting updates because they did not know about each other? (A white board would be a lot easier to explain this.) > > If transaction-mode="parallel" then within the HA-JDBC driver, > commits/rollbacks will execute against each database in parallel. In > this case, we send a single notification prior to the invocations, > and a > second acknowledgement after the invocations complete. If, on view > change, the other nodes find an unacknowledged commit/rollback, then > the > database cluster enters a panic state and requires manual resolution. I assume parallel is non-deterministic ordering of updates too across replicas? > > [OT] I'm considering adding a third "hybrid" transaction mode, where > transactions execute on first on one database, then the rest in > parallel. This mode seeks to address the risk of deadlocks with > concurrent updates affecting parallel mode, while enabling better > performance than serial mode, for clusters containing more than 2 > databases. This mode would benefit from fine-grained transaction > acknowledgements, while requiring less overhead than serial mode when > cluster size > 2. What is the benefit of this over using a transaction coordinator to do all of this coordination for you in the first place? > > >> What about a complete failure of every node in the system, i.e., the >> application, coordinator and all of the JDBC replicas fails? In the >> wonderful world of transactions, we still need to guarantee >> consistency in this case. >> > > This is also handled. > Just as cluster state changes are both broadcast to other nodes and > recorded locally, so are commit/rollback acknowledgements. And heuristic failure? In fact, all responses from each XAResource operation? Mark. --- Mark Little ml...@re... JBoss, a Division of Red Hat Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod Street, Windsor, Berkshire, SI4 1TE, United Kingdom. Registered in UK and Wales under Company Registration No. 3798903 Directors: Michael Cunningham (USA), Charlie Peters (USA), Matt Parsons (USA) and Brendan Lane (Ireland). |
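On the non-deterministic SQL question raised above: with statement-based replication, a statement that embeds an expression such as CURRENT_TIMESTAMP can evaluate differently on every replica. A common mitigation - shown here as a general technique, not as a claim about what HA-JDBC itself does - is to evaluate the value once at the client and bind it as a parameter, so every replica stores the identical literal. The audit_log table is purely illustrative.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;

final class DeterministicInsert {
    // Replicating this literally can diverge, because each replica evaluates the expression itself:
    //   INSERT INTO audit_log (created) VALUES (CURRENT_TIMESTAMP)
    // Evaluating once at the client and binding the value keeps the replicas identical.
    static void insertAuditRow(Connection connection) throws SQLException {
        Timestamp now = new Timestamp(System.currentTimeMillis()); // single client-side evaluation
        try (PreparedStatement ps =
                 connection.prepareStatement("INSERT INTO audit_log (created) VALUES (?)")) {
            ps.setTimestamp(1, now);
            ps.executeUpdate();
        }
    }
}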
|
From: Paul F. <pfe...@re...> - 2008-11-12 22:41:39
|
On Sun, 2008-11-09 at 15:03 +0000, Mark Little wrote: > On 6 Nov 2008, at 22:18, Paul Ferraro wrote: > > On Mon, 2008-10-20 at 11:54 +0100, Mark Little wrote: > > > Probably a discussion to be had on a separate list, but I'd be > > > interested in knowing what the failure assumptions are for a > > > start. > > > > > > Mark. > > > > > For starters, I'd like to clarify some potential confusion about > > what > > HA-JDBC is. HA solutions for the data layer generally fall into one > > of > > six categories: > > * Shared disk failover (e.g. Oracle RAC) > > * File system replication (e.g. DRDB) > > * Warm standby (e.g. PostgreSQL PITR) > > * Single master (aka master/slave) replication (e.g. Slony-I) > > * Sync/async multi-master replication (e.g. DBReplicator) > > * Statement-based replication (e.g. HA-JDBC, Sequoia, pgpool-II) > > > Agreed and understood. However, have you taken a look at > > > http://www.springerlink.com/content/4xuj61j6ewnmadfp/ > http://dblp.uni-trier.de/rec/bibtex/conf/dais/MorganSEL99 > http://www.informatik.uni-trier.de/~ley/db/conf/pos/LittleS98.html > http://dblp.uni-trier.de/rec/bibtex/conf/icdcs/LittleMS93 > http://dblp.uni-trier.de/rec/bibtex/conf/wmrd/LittleS90 > http://old.cs.ncl.ac.uk/research/pubs/authors/byType.php?id=16 > I have not. I'll check them out. > > So HA-JDBC is a middleware solution - not a data store solution. It > > operates entirely within the JVM of the JDBC client (e.g. JBoss AS). > > Statement-based replication solutions have some general > > advantages/disadvantages over other replication types. We can > > discuss > > this separately if you'd like, as well as advantages/disadvantages > > of > > HA-JDBC over other statement-based replication solutions. > > > Definitely. As you can probably figure out from some of the things > above, this is an area I've spent a fair bit of time on over the past > 20+ years so I'm definitely interested. And, likewise, I'm interested in your feedback... ;) > > Regarding failure assumptions, HA-JDBC assumes that connectivity to > > one > > or more databases can be lost at any point during the lifecycle of a > > connection, statement, or resultset. This includes during: > > * connection creation > > * read statement execution > > * write statement execution > > * statement batch execution > > * query execution > > * transaction commit/rollback (local or XA) > > * connection close > > In general, a SQLException determined to be due to connectivity > > failure > > will result in pessimistic deactivation of that database. A > > deactivated > > database will not receive database requests until it is synchronized > > and > > re-activated either manually, or automatically (via the > > auto-activate > > scheduling mechanism). > > > Obviously the first thing that raises is how you detect failures, or > more accurately how you suspect failures and enforce the uniform > agreement across the entire system that a replica should be taken as > failed even if it is seen as available by some other set of the > distributed system. See http://arxiv.org/abs/cs.DC/0309026 > Failures are suspected anytime a JDBC invocation that hits the database throws a SQLException. Suspects are verified by executing a simple query (e.g. CALL CURRENT_TIMESTAMP) using a fresh connection to the database that threw the exception - or by comparing the response with responses from other databases executing the same invocation. Failed database get deactivated, i.e. removed from the set of active database that compose the database cluster state. 
HA-JDBC uses JGroups to replicate cluster state changes (i.e. database activations/deactivations) to the other nodes in your application cluster. > > Another problem that this can introduce is heuristic management: if > the failure of a replica occurs at the right time (aka 'wrong time') > it's entirely possible that the work is done there when it's rolled > back at the others. This causes a heuristic. In fact, the resource may > generate a heuristic at any time after prepare is called and > resolution of heuristics is a manual and often complex process. At the > worst case, we could end up in a cascading rollback situation. > This is not really an issue in HA-JDBC. In the XA scenario, HA-JDBC exposes a single XAResource for the whole database cluster to the transaction manager, while internally it manages an XAResource per database. Consequently, HA-JDBC can tolerate asymmetric failures in any phase of a transaction (i.e. a phase fails against some databases, but succeeds against others) without forcing the whole transaction to fail, or become in-doubt, by pessimistically deactivating the failed databases. > > > > HA-JDBC also considers failure of the JDBC client itself, e.g. the > > application server. A client that fails - then comes back online > > must > > get its state (i.e. which databases are presently active) either > > from > > one of the other client nodes (in the distributed case) or from a > > local > > store (in the non-distributed case). > > > OK, but what if it fails during an update? During an update of what? > How are you achieving consensus across the replicas in this case? By replica, I assume you mean database? There is no communication between replicas - only between HA-JDBC clients. When HA-JDBC starts - the current database cluster state (i.e. which database are active) is fetched from the JGroups group coordinator. If it's the first member, or if not running in distributed mode, then the state is fetched from a local store (currently uses java.util.prefs, though this will be pluggable in the future). > For instance, the client says "commit" to the replica group (I assume > that the client has a group-like abstraction for the servers so it's > not dealing with them all directly?) Which client are you talking about? The application client doesn't know (or care) whether its accessing a single database (via normal JDBC) or a cluster of databases (via HA-JDBC). HA-JDBC is instrumented completely behind the standard JDBC interfaces. > and there is a failure of the client and some of the replicas during > the commit operation. What replica consistency protocol are you using > to enforce the state updates? Ah - this is discussed below. > Plus how many replicas do you need to tolerate N failures (N+1, 2N+1, > 3N+1)? > N+1. If only 1 active database remains in the cluster, then HA-JDBC behaves like a normal JDBC driver. > > > > A special consideration is failure of a client during transaction > > commit > > (local or XA). > > > Yes. > > > This situation threatens data synchronicity > > > OK, so here's where we get into what I mentioned above: data > consistency. > > > since the > > commit may have succeeded on some nodes and not others, but the > > client > > is no longer available to detect or handle the partial failure. > > This > > condition needs to be detected by peer client nodes (notified of the > > crash through JGroups membership change), and by the failed node > > when it > > restarts. > > > So which client are we talking about here? The JDBC client. 
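A bare-bones illustration of the suspect-verification step described above, assuming the surrounding cluster code supplies a DataSource per database; the class and method names are hypothetical:

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import javax.sql.DataSource;

final class FailureVerifier {
    // Re-check a suspected database over a brand new connection with a trivial query
    // (the message cites "CALL CURRENT_TIMESTAMP" as an example).
    static boolean isAlive(DataSource suspect, String validationQuery) {
        try (Connection connection = suspect.getConnection();
             Statement statement = connection.createStatement()) {
            statement.executeQuery(validationQuery);
            return true;  // the database answered: treat the original error as statement-level
        } catch (SQLException e) {
            return false; // confirmed failure: the caller deactivates this database
        }
    }
}

A caller would deactivate the database only when this re-check also fails, which avoids deactivating on transient, statement-level errors.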
> There's the application client, but the important "client" for this > failure is the coordinator. How are you assuming this failure mode > will be dealt with by the coordinator if it can't determine who did > what when?

Let me elaborate...
If protected-commit mode is enabled, then commits and rollbacks will trigger a 2-phase acknowledgement broadcast to the other nodes of the JGroups application cluster.

If transaction-mode="serial" then within the HA-JDBC driver, commits/rollbacks will execute against each database sequentially. In this case, we send a series of notifications prior to the commit/rollback invocation on each database; and a second series of acknowledgements upon completion. If the application server executing the transaction fails in the middle, the other nodes will detect the view change and check to see if there are any unacknowledged commits/rollbacks. If so, the unacknowledged databases will be deactivated.

If transaction-mode="parallel" then within the HA-JDBC driver, commits/rollbacks will execute against each database in parallel. In this case, we send a single notification prior to the invocations, and a second acknowledgement after the invocations complete. If, on view change, the other nodes find an unacknowledged commit/rollback, then the database cluster enters a panic state and requires manual resolution.

[OT] I'm considering adding a third "hybrid" transaction mode, where transactions execute first on one database, then the rest in parallel. This mode seeks to address the risk of deadlocks with concurrent updates affecting parallel mode, while enabling better performance than serial mode, for clusters containing more than 2 databases. This mode would benefit from fine-grained transaction acknowledgements, while requiring less overhead than serial mode when cluster size > 2.

> What about a complete failure of every node in the system, i.e., the > application, coordinator and all of the JDBC replicas fails? In the > wonderful world of transactions, we still need to guarantee > consistency in this case. >

This is also handled. Just as cluster state changes are both broadcast to other nodes and recorded locally, so are commit/rollback acknowledgements. When a server starts up, it consults its local acknowledgement store. If unacknowledged commits/rollbacks are found, then we respond as described above.

> > Synchronously executed commits can be handled automatically. > > Is this related to the previous sentence? > Yes - meaning, if the commit/rollback was executed synchronously on each active database in the cluster (i.e. transaction-mode="serial"). > > > > Asynchronously executed commits will trigger a state of "cluster panic" > > and must be resolved manually. > > > > > > > I'd need more information on that to understand precisely how it would impact transactional guarantees. > > Mark. |
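To make the serial protected-commit flow above concrete, here is a rough sketch; the Notifier interface stands in for the JGroups broadcast, and every name in it is hypothetical rather than HA-JDBC's real API:

import java.sql.Connection;
import java.sql.SQLException;
import java.util.List;

// Hypothetical stand-ins for the JGroups-backed broadcast and cluster management.
interface Notifier {
    void broadcast(String event, String databaseId);
}

interface ClusterDatabase {
    String getId();
    Connection getConnection();
}

final class SerialProtectedCommit {
    // transaction-mode="serial": commit on each database in turn, bracketing every commit
    // with a pre-notification and a post-acknowledgement. If this node dies mid-sequence,
    // peers that observe the view change deactivate any database that was notified but
    // never acknowledged.
    static void commit(List<ClusterDatabase> activeDatabases, Notifier notifier) throws SQLException {
        for (ClusterDatabase database : activeDatabases) {
            notifier.broadcast("COMMIT_STARTED", database.getId());   // peers record the in-flight commit
            database.getConnection().commit();
            notifier.broadcast("COMMIT_COMPLETED", database.getId()); // peers clear the record
        }
    }
}

On a view change, the surviving members would compare COMMIT_STARTED events against COMMIT_COMPLETED events and deactivate any database left unacknowledged; parallel mode would instead send one notification before and one acknowledgement after the whole batch.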