Menu

#403 XA rollback exception with stale transaction

JDBC
closed
None
5
2020-11-17
2018-03-05
No

We are running an XA transaction in which an AS400 DB2 stored procedure is called, running on a JBoss EAP which is the XA coordinator. We have set the query timeout to 10 seconds and the query timeout mechanism in the AS400 data source to have the value "cancel", so that an actual attempt to cancel the operation is made at 10 seconds after Statement::execute() is called. Additionally the JBoss JTA runtime is set so that the "reaper" mechanism that is used, kicks in after 2 minutes, finds any threads that are still working on a pending transaction and marks the thread itself as a zombie and the transaction as rollback-only. This means that "transactional" threads that take more than 2 minutes can only issue a rollback.

In all cases the stored procedure gets stoped in a MSGW condition in IBM-i system (the data is such that an abnormal condition is triggered in the back end). The problem - as I will describe below - is that in certain situations where we expect a rollback to take place, it appears that the XA rollback command never reaches the back end system.

We get the following stack trace in the log file of JBoss:

10:27:35,529 WARN  [com.arjuna.ats.jta] (Transaction Reaper Worker 1) ARJUNA016045: attempted rollback of < formatId=131077, gtrid_length=29, bqual_length=36, tx_uid=0:ffff0a00f0e2:72a300d8:5a990693:9b, node_name=1, branch_uid=0:ffff0a00f0e2:72a300d8:5a990693:9e, subordinatenodename=null, eis_name=unknown eis name > (org.jboss.jca.adapters.jdbc.xa.XAManagedConnection@49fde3d8) failed with exception code -: java.lang.NullPointerException
    at java.lang.Throwable.printStackTrace(Throwable.java:653) [rt.jar:1.8.0_102]
    at java.lang.Throwable.printStackTrace(Throwable.java:721) [rt.jar:1.8.0_102]
    at com.ibm.as400.access.AS400JDBCXAResource.throwXAException(AS400JDBCXAResource.java:967)
    at com.ibm.as400.access.AS400JDBCXAResource.end(AS400JDBCXAResource.java:368)
    at org.jboss.jca.adapters.jdbc.xa.XAManagedConnection.end(XAManagedConnection.java:295)
    at com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord.topLevelAbort(XAResourceRecord.java:320) [jbossjts-jacorb-4.17.34.Final-redhat-1.jar:4.17.34.Final-redhat-1]
    at com.arjuna.ats.arjuna.coordinator.BasicAction.doAbort(BasicAction.java:2900) [jbossjts-jacorb-4.17.34.Final-redhat-1.jar:4.17.34.Final-redhat-1]
    at com.arjuna.ats.arjuna.coordinator.BasicAction.doAbort(BasicAction.java:2879) [jbossjts-jacorb-4.17.34.Final-redhat-1.jar:4.17.34.Final-redhat-1]
    at com.arjuna.ats.arjuna.coordinator.BasicAction.Abort(BasicAction.java:1645) [jbossjts-jacorb-4.17.34.Final-redhat-1.jar:4.17.34.Final-redhat-1]
    at com.arjuna.ats.arjuna.coordinator.TwoPhaseCoordinator.cancel(TwoPhaseCoordinator.java:118) [jbossjts-jacorb-4.17.34.Final-redhat-1.jar:4.17.34.Final-redhat-1]
    at com.arjuna.ats.arjuna.AtomicAction.cancel(AtomicAction.java:215) [jbossjts-jacorb-4.17.34.Final-redhat-1.jar:4.17.34.Final-redhat-1]
    at com.arjuna.ats.arjuna.coordinator.TransactionReaper.doCancellations(TransactionReaper.java:370) [jbossjts-jacorb-4.17.34.Final-redhat-1.jar:4.17.34.Final-redhat-1]
    at com.arjuna.ats.internal.arjuna.coordinator.ReaperWorkerThread.run(ReaperWorkerThread.java:78) [jbossjts-jacorb-4.17.34.Final-redhat-1.jar:4.17.34.Final-redhat-1]

And the following stack trace in the trace file for the JDBC driver:

java.lang.Throwable
    at com.ibm.as400.access.Trace.log(Trace.java:1164)
    at com.ibm.as400.access.Trace.log(Trace.java:1092)
    at com.ibm.as400.access.AS400JDBCConnection.sendAndReceive(AS400JDBCConnection.java:3046)
    at com.ibm.as400.access.AS400JDBCConnection.sendAndReceive(AS400JDBCConnection.java:2941)
    at com.ibm.as400.access.AS400JDBCXAResource.end(AS400JDBCXAResource.java:343)
    at org.jboss.jca.adapters.jdbc.xa.XAManagedConnection.end(XAManagedConnection.java:295)
    at com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord.topLevelAbort(XAResourceRecord.java:320)
    at com.arjuna.ats.arjuna.coordinator.BasicAction.doAbort(BasicAction.java:2900)
    at com.arjuna.ats.arjuna.coordinator.BasicAction.doAbort(BasicAction.java:2879)
    at com.arjuna.ats.arjuna.coordinator.BasicAction.Abort(BasicAction.java:1645)
    at com.arjuna.ats.arjuna.coordinator.TwoPhaseCoordinator.cancel(TwoPhaseCoordinator.java:118)
    at com.arjuna.ats.arjuna.AtomicAction.cancel(AtomicAction.java:215)
    at com.arjuna.ats.arjuna.coordinator.TransactionReaper.doCancellations(TransactionReaper.java:370)
    at com.arjuna.ats.internal.arjuna.coordinator.ReaperWorkerThread.run(ReaperWorkerThread.java:78)
Thread[Transaction Reaper Worker 1,5,main]  Παρ Μαρ 2 10:27:35:529 EET 2018  as400: Connection LOGIASP (616916623) : Throwing exception. Original exception: .java.lang.InterruptedException
    at java.lang.Object.wait(Native Method)
    at java.lang.Object.wait(Object.java:502)
    at com.ibm.as400.access.AS400ThreadedServer.receive(AS400ThreadedServer.java:339)
    at com.ibm.as400.access.AS400ThreadedServer.sendAndReceive(AS400ThreadedServer.java:492)
    at com.ibm.as400.access.AS400JDBCConnection.sendAndReceive(AS400JDBCConnection.java:3005)
    at com.ibm.as400.access.AS400JDBCConnection.sendAndReceive(AS400JDBCConnection.java:2941)
    at com.ibm.as400.access.AS400JDBCXAResource.end(AS400JDBCXAResource.java:343)
    at org.jboss.jca.adapters.jdbc.xa.XAManagedConnection.end(XAManagedConnection.java:295)
    at com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord.topLevelAbort(XAResourceRecord.java:320)
    at com.arjuna.ats.arjuna.coordinator.BasicAction.doAbort(BasicAction.java:2900)
    at com.arjuna.ats.arjuna.coordinator.BasicAction.doAbort(BasicAction.java:2879)
    at com.arjuna.ats.arjuna.coordinator.BasicAction.Abort(BasicAction.java:1645)
    at com.arjuna.ats.arjuna.coordinator.TwoPhaseCoordinator.cancel(TwoPhaseCoordinator.java:118)
    at com.arjuna.ats.arjuna.AtomicAction.cancel(AtomicAction.java:215)
    at com.arjuna.ats.arjuna.coordinator.TransactionReaper.doCancellations(TransactionReaper.java:370)
    at com.arjuna.ats.internal.arjuna.coordinator.ReaperWorkerThread.run(ReaperWorkerThread.java:78)
Thread[Transaction Reaper Worker 1,5,main]  Παρ Μαρ 2 10:27:35:529 EET 2018  as400: Connection LOGIASP (616916623) : Throwing exception.  Actual exception: Internal driver error. (class java.lang.InterruptedException) sqlState: HY000 vendor code -99999.java.sql.SQLException: Internal driver error. (class java.lang.InterruptedException)
    at com.ibm.as400.access.JDError.createSQLExceptionSubClass(JDError.java:928)
    at com.ibm.as400.access.JDError.throwSQLException(JDError.java:559)
    at com.ibm.as400.access.AS400JDBCConnection.sendAndReceive(AS400JDBCConnection.java:3052)
    at com.ibm.as400.access.AS400JDBCConnection.sendAndReceive(AS400JDBCConnection.java:2941)
    at com.ibm.as400.access.AS400JDBCXAResource.end(AS400JDBCXAResource.java:343)
    at org.jboss.jca.adapters.jdbc.xa.XAManagedConnection.end(XAManagedConnection.java:295)
    at com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord.topLevelAbort(XAResourceRecord.java:320)
    at com.arjuna.ats.arjuna.coordinator.BasicAction.doAbort(BasicAction.java:2900)
    at com.arjuna.ats.arjuna.coordinator.BasicAction.doAbort(BasicAction.java:2879)
    at com.arjuna.ats.arjuna.coordinator.BasicAction.Abort(BasicAction.java:1645)
    at com.arjuna.ats.arjuna.coordinator.TwoPhaseCoordinator.cancel(TwoPhaseCoordinator.java:118)
    at com.arjuna.ats.arjuna.AtomicAction.cancel(AtomicAction.java:215)
    at com.arjuna.ats.arjuna.coordinator.TransactionReaper.doCancellations(TransactionReaper.java:370)
    at com.arjuna.ats.internal.arjuna.coordinator.ReaperWorkerThread.run(ReaperWorkerThread.java:78)

We are running the following scenaria:
- run1: The transaction is started, a call is made to the DB2 stored procedure, it gets blocked in the MSGW, and the operator issues a "cancel" on the back end job before the 10 seconds have expired.
- run2: The transaction is started, a call is made to the DB2 stored procedure, it gets blocked in the MSGW, and the operator issues a "cancel" on the back end job after the 10 seconds have expired, but before the 2 minute reaper timeout.
- run3: The transaction is started, a call is made to the DB2 stored procedure, it gets blocked in the MSGW, and the operator issues a "cancel" on the back end job after the 2 minute reaper timeout has expired.
- run4: The transaction is started, a call is made to the DB2 stored procedure, it gets blocked in the MSGW, and the operator issues a "go" on the back end job after the 2 minute reaper timeout has expired.

We would expect all scenaria to result in a rollback for the transaction. This is indeed the case, but in the last two scenaria, the rollback command issued by the JBoss JTA runtime never gets delivered to the AS400 and resources are apparenly blocked. You can see this in the attached log files of JBoss and the Toolbox trace.

Please advise.

1 Attachments

Related

Bugs: #403

Discussion

  • John Eberhard

    John Eberhard - 2018-03-05
    • status: open --> closed-server-problem
    • assigned_to: John Eberhard
     
    • Pafsanias Ftakas

      Hello,

      I don't think I understand why you dismissed this problem as a server issue. We use the MSGW as a means to get into the situation where the transaction is taking more than 2 minutes, thereby causing the container application server to mark the transaction as rollback only. We do not care about the MSGW itself. The same situation might have arisen because of lock or contention in the back end that causes the transaction to exceed the 2 minute limit.
      Whatever the reason, when the container attempts to issue an XA rollback, this should reach the server, correct? But it does not and the driver call fails with the following:
      "as400: Connection LOGIASP (616916623) : Throwing exception. Actual exception: Internal driver error. (class java.lang.InterruptedException) sqlState: HY000 vendor code -99999.java.sql.SQLException: Internal driver error. (class java.lang.InterruptedException)"
      Are you implying that this is a server error? How can I support that, so that I can make it IBM's problem?

      Please help.
      Kind regards,
      Pafsanias Ftakas

      Pafsanias Ftakas
      salagator@gmail.com (https://link.getmailspring.com/link/local-65eff3d8-39df-v1.1.4-22d9f20d@pafsanias-XPS-15-9560/0?redirect=mailto%3Asalagator%40gmail.com&recipient=403%40bugs.jt400.p.re.sourceforge.net)

      On Mar 5 2018, at 2:18 pm, John Eberhard jeberhard@users.sourceforge.net wrote:

      status: open --> closed-server-problem
      assigned_to: John Eberhard

      Comment:

      This is an application problem because the JOB ends up in MSGW. The procedure on the server will need to be fixed so that it never enters the MSGW state.

      The SQL Cancel mechanism to used to attempt to interrup the current SQL work in the job (in order to implement the query timeout). However, the SQL Cancel mechanism cannot interrupt a JOB that is in MSGW.
      [bugs:#403] (https://link.getmailspring.com/link/local-65eff3d8-39df-v1.1.4-22d9f20d@pafsanias-XPS-15-9560/1?redirect=https%3A%2F%2Fsourceforge.net%2Fp%2Fjt400%2Fbugs%2F403%2F&recipient=403%40bugs.jt400.p.re.sourceforge.net) XA rollback exception with stale transaction

      Status: closed-server-problem
      Group: JDBC
      Created: Mon Mar 05, 2018 06:51 AM UTC by Pafsanias Ftakas
      Last Updated: Mon Mar 05, 2018 06:51 AM UTC
      Owner: John Eberhard
      Attachments:

      rollback-problem.zip (https://link.getmailspring.com/link/local-65eff3d8-39df-v1.1.4-22d9f20d@pafsanias-XPS-15-9560/2?redirect=https%3A%2F%2Fsourceforge.net%2Fp%2Fjt400%2Fbugs%2F403%2Fattachment%2Frollback-problem.zip&recipient=403%40bugs.jt400.p.re.sourceforge.net) (219.0 kB; application/x-zip-compressed)
      We are running an XA transaction in which an AS400 DB2 stored procedure is called, running on a JBoss EAP which is the XA coordinator. We have set the query timeout to 10 seconds and the query timeout mechanism in the AS400 data source to have the value "cancel", so that an actual attempt to cancel the operation is made at 10 seconds after Statement::execute() is called. Additionally the JBoss JTA runtime is set so that the "reaper" mechanism that is used, kicks in after 2 minutes, finds any threads that are still working on a pending transaction and marks the thread itself as a zombie and the transaction as rollback-only. This means that "transactional" threads that take more than 2 minutes can only issue a rollback.

      In all cases the stored procedure gets stoped in a MSGW condition in IBM-i system (the data is such that an abnormal condition is triggered in the back end). The problem - as I will describe below - is that in certain situations where we expect a rollback to take place, it appears that the XA rollback command never reaches the back end system.
      We get the following stack trace in the log file of JBoss:
      10:27:35,529 WARN [com.arjuna.ats.jta] (Transaction Reaper Worker 1) ARJUNA016045: attempted rollback of < formatId=131077, gtrid_length=29, bqual_length=36, tx_uid=0:ffff0a00f0e2:72a300d8:5a990693:9b, node_name=1, branch_uid=0:ffff0a00f0e2:72a300d8:5a990693:9e, subordinatenodename=null, eis_name=unknown eis name > (org.jboss.jca.adapters.jdbc.xa.XAManagedConnection@49fde3d8) failed with exception code -: java.lang.NullPointerException
      at java.lang.Throwable.printStackTrace(Throwable.java:653) [rt.jar:1.8.0_102]
      at java.lang.Throwable.printStackTrace(Throwable.java:721) [rt.jar:1.8.0_102]
      at com.ibm.as400.access.AS400JDBCXAResource.throwXAException(AS400JDBCXAResource.java:967)
      at com.ibm.as400.access.AS400JDBCXAResource.end(AS400JDBCXAResource.java:368)
      at org.jboss.jca.adapters.jdbc.xa.XAManagedConnection.end(XAManagedConnection.java:295)
      at com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord.topLevelAbort(XAResourceRecord.java:320) [jbossjts-jacorb-4.17.34.Final-redhat-1.jar:4.17.34.Final-redhat-1]
      at com.arjuna.ats.arjuna.coordinator.BasicAction.doAbort(BasicAction.java:2900) [jbossjts-jacorb-4.17.34.Final-redhat-1.jar:4.17.34.Final-redhat-1]
      at com.arjuna.ats.arjuna.coordinator.BasicAction.doAbort(BasicAction.java:2879) [jbossjts-jacorb-4.17.34.Final-redhat-1.jar:4.17.34.Final-redhat-1]
      at com.arjuna.ats.arjuna.coordinator.BasicAction.Abort(BasicAction.java:1645) [jbossjts-jacorb-4.17.34.Final-redhat-1.jar:4.17.34.Final-redhat-1]
      at com.arjuna.ats.arjuna.coordinator.TwoPhaseCoordinator.cancel(TwoPhaseCoordinator.java:118) [jbossjts-jacorb-4.17.34.Final-redhat-1.jar:4.17.34.Final-redhat-1]
      at com.arjuna.ats.arjuna.AtomicAction.cancel(AtomicAction.java:215) [jbossjts-jacorb-4.17.34.Final-redhat-1.jar:4.17.34.Final-redhat-1]
      at com.arjuna.ats.arjuna.coordinator.TransactionReaper.doCancellations(TransactionReaper.java:370) [jbossjts-jacorb-4.17.34.Final-redhat-1.jar:4.17.34.Final-redhat-1]
      at com.arjuna.ats.internal.arjuna.coordinator.ReaperWorkerThread.run(ReaperWorkerThread.java:78) [jbossjts-jacorb-4.17.34.Final-redhat-1.jar:4.17.34.Final-redhat-1]
      And the following stack trace in the trace file for the JDBC driver:

      java.lang.Throwable
      at com.ibm.as400.access.Trace.log(Trace.java:1164)
      at com.ibm.as400.access.Trace.log(Trace.java:1092)
      at com.ibm.as400.access.AS400JDBCConnection.sendAndReceive(AS400JDBCConnection.java:3046)
      at com.ibm.as400.access.AS400JDBCConnection.sendAndReceive(AS400JDBCConnection.java:2941)
      at com.ibm.as400.access.AS400JDBCXAResource.end(AS400JDBCXAResource.java:343)
      at org.jboss.jca.adapters.jdbc.xa.XAManagedConnection.end(XAManagedConnection.java:295)
      at com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord.topLevelAbort(XAResourceRecord.java:320)
      at com.arjuna.ats.arjuna.coordinator.BasicAction.doAbort(BasicAction.java:2900)
      at com.arjuna.ats.arjuna.coordinator.BasicAction.doAbort(BasicAction.java:2879)
      at com.arjuna.ats.arjuna.coordinator.BasicAction.Abort(BasicAction.java:1645)
      at com.arjuna.ats.arjuna.coordinator.TwoPhaseCoordinator.cancel(TwoPhaseCoordinator.java:118)
      at com.arjuna.ats.arjuna.AtomicAction.cancel(AtomicAction.java:215)
      at com.arjuna.ats.arjuna.coordinator.TransactionReaper.doCancellations(TransactionReaper.java:370)
      at com.arjuna.ats.internal.arjuna.coordinator.ReaperWorkerThread.run(ReaperWorkerThread.java:78)
      Thread[Transaction Reaper Worker 1,5,main] Παρ Μαρ 2 10:27:35:529 EET 2018 as400: Connection LOGIASP (616916623) : Throwing exception. Original exception: .java.lang.InterruptedException
      at java.lang.Object.wait(Native Method)
      at java.lang.Object.wait(Object.java:502)
      at com.ibm.as400.access.AS400ThreadedServer.receive(AS400ThreadedServer.java:339)
      at com.ibm.as400.access.AS400ThreadedServer.sendAndReceive(AS400ThreadedServer.java:492)
      at com.ibm.as400.access.AS400JDBCConnection.sendAndReceive(AS400JDBCConnection.java:3005)
      at com.ibm.as400.access.AS400JDBCConnection.sendAndReceive(AS400JDBCConnection.java:2941)
      at com.ibm.as400.access.AS400JDBCXAResource.end(AS400JDBCXAResource.java:343)
      at org.jboss.jca.adapters.jdbc.xa.XAManagedConnection.end(XAManagedConnection.java:295)
      at com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord.topLevelAbort(XAResourceRecord.java:320)
      at com.arjuna.ats.arjuna.coordinator.BasicAction.doAbort(BasicAction.java:2900)
      at com.arjuna.ats.arjuna.coordinator.BasicAction.doAbort(BasicAction.java:2879)
      at com.arjuna.ats.arjuna.coordinator.BasicAction.Abort(BasicAction.java:1645)
      at com.arjuna.ats.arjuna.coordinator.TwoPhaseCoordinator.cancel(TwoPhaseCoordinator.java:118)
      at com.arjuna.ats.arjuna.AtomicAction.cancel(AtomicAction.java:215)
      at com.arjuna.ats.arjuna.coordinator.TransactionReaper.doCancellations(TransactionReaper.java:370)
      at com.arjuna.ats.internal.arjuna.coordinator.ReaperWorkerThread.run(ReaperWorkerThread.java:78)
      Thread[Transaction Reaper Worker 1,5,main] Παρ Μαρ 2 10:27:35:529 EET 2018 as400: Connection LOGIASP (616916623) : Throwing exception. Actual exception: Internal driver error. (class java.lang.InterruptedException) sqlState: HY000 vendor code -99999.java.sql.SQLException: Internal driver error. (class java.lang.InterruptedException)
      at com.ibm.as400.access.JDError.createSQLExceptionSubClass(JDError.java:928)
      at com.ibm.as400.access.JDError.throwSQLException(JDError.java:559)
      at com.ibm.as400.access.AS400JDBCConnection.sendAndReceive(AS400JDBCConnection.java:3052)
      at com.ibm.as400.access.AS400JDBCConnection.sendAndReceive(AS400JDBCConnection.java:2941)
      at com.ibm.as400.access.AS400JDBCXAResource.end(AS400JDBCXAResource.java:343)
      at org.jboss.jca.adapters.jdbc.xa.XAManagedConnection.end(XAManagedConnection.java:295)
      at com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord.topLevelAbort(XAResourceRecord.java:320)
      at com.arjuna.ats.arjuna.coordinator.BasicAction.doAbort(BasicAction.java:2900)
      at com.arjuna.ats.arjuna.coordinator.BasicAction.doAbort(BasicAction.java:2879)
      at com.arjuna.ats.arjuna.coordinator.BasicAction.Abort(BasicAction.java:1645)
      at com.arjuna.ats.arjuna.coordinator.TwoPhaseCoordinator.cancel(TwoPhaseCoordinator.java:118)
      at com.arjuna.ats.arjuna.AtomicAction.cancel(AtomicAction.java:215)
      at com.arjuna.ats.arjuna.coordinator.TransactionReaper.doCancellations(TransactionReaper.java:370)
      at com.arjuna.ats.internal.arjuna.coordinator.ReaperWorkerThread.run(ReaperWorkerThread.java:78)
      We are running the following scenaria:
      - run1: The transaction is started, a call is made to the DB2 stored procedure, it gets blocked in the MSGW, and the operator issues a "cancel" on the back end job before the 10 seconds have expired.
      - run2: The transaction is started, a call is made to the DB2 stored procedure, it gets blocked in the MSGW, and the operator issues a "cancel" on the back end job after the 10 seconds have expired, but before the 2 minute reaper timeout.
      - run3: The transaction is started, a call is made to the DB2 stored procedure, it gets blocked in the MSGW, and the operator issues a "cancel" on the back end job after the 2 minute reaper timeout has expired.
      - run4: The transaction is started, a call is made to the DB2 stored procedure, it gets blocked in the MSGW, and the operator issues a "go" on the back end job after the 2 minute reaper timeout has expired.

      We would expect all scenaria to result in a rollback for the transaction. This is indeed the case, but in the last two scenaria, the rollback command issued by the JBoss JTA runtime never gets delivered to the AS400 and resources are apparenly blocked. You can see this in the attached log files of JBoss and the Toolbox trace.
      Please advise.
      Sent from sourceforge.net because you indicated interest in https://sourceforge.net/p/jt400/bugs/403/ (https://link.getmailspring.com/link/local-65eff3d8-39df-v1.1.4-22d9f20d@pafsanias-XPS-15-9560/3?redirect=https%3A%2F%2Fsourceforge.net%2Fp%2Fjt400%2Fbugs%2F403%2F&recipient=403%40bugs.jt400.p.re.sourceforge.net)

      To unsubscribe from further messages, please visit https://sourceforge.net/auth/subscriptions/ (https://link.getmailspring.com/link/local-65eff3d8-39df-v1.1.4-22d9f20d@pafsanias-XPS-15-9560/4?redirect=https%3A%2F%2Fsourceforge.net%2Fauth%2Fsubscriptions%2F&recipient=403%40bugs.jt400.p.re.sourceforge.net)

       

      Related

      Bugs: #403

      • John Eberhard

        John Eberhard - 2018-03-05

        If you want to demonstate how the system behaves when the job is delayed, you will need use another mechanism. The SQL system on the server cannot deal with a job that is in MSGW. If the job is really stuck on a lock, then the SQL system could handle that.

         
  • John Eberhard

    John Eberhard - 2018-03-05

    This is an application problem because the JOB ends up in MSGW. The procedure on the server will need to be fixed so that it never enters the MSGW state.

    The SQL Cancel mechanism to used to attempt to interrup the current SQL work in the job (in order to implement the query timeout). However, the SQL Cancel mechanism cannot interrupt a JOB that is in MSGW.

     
  • John Eberhard

    John Eberhard - 2018-03-06

    I have a few more comments. The problem here is that the cancel mechanism doesn't work because the job is in MSGW. This then causes the connection to be unusable, which causes the InernalDriverError that you are seeing. At this point, I don't think the application is closing the connection, which leaves the XA transaction open. This connection will need to be closed by the application before it is possible for the XA transaction to be rolled back.

    Again, the key problem here is that the execution of an SQL statement should not result in a job going into a state where it cannot accept an EVENT to abort the current operation. In this case, the MSGW does not allow the EVENT to be delivered and the connection ends up in this weird state.

     
  • Pafsanias Ftakas

    Hello again,

    Thank you for your comments. We are indeed trying to eliminate the MSGW as you suggest.

    However, following your advise, we tried to emulate a long running call to DB2 by means of locking (no MSGW condition involved). We lowered the transaction timeout on the part of the J2EE container down to 30 seconds. There was no query timeout involved. We removed the locking condition @ about 40 seconds past the beginning of the transaction.
    The behaviour was exactly the same as before. The JBoss JTA runtime marked the transaction as rollback only @ 30 seconds. When the DB2 call returned (successfully) @ 40 seconds, the runtime tried to issue a rollback, which resulted in the same error:

    17:07:57,179 WARN  [com.arjuna.ats.jta] (Transaction Reaper Worker 0) ARJUNA016045: attempted rollback of < formatId=131077, gtrid_length=29, bqual_length=36, tx_uid=0:ffff0a00f0e2:-6b080f80:5a9e9d72:205, node_name=1, branch_uid=0:ffff0a00f0e2:-6b080f80:5a9e9d72:207, subordinatenodename=null, eis_name=unknown eis name > (org.jboss.jca.adapters.jdbc.xa.XAManagedConnection@56329afa) failed with exception code -: java.lang.NullPointerException
        at java.lang.Throwable.printStackTrace(Throwable.java:653) [rt.jar:1.8.0_102]
        at java.lang.Throwable.printStackTrace(Throwable.java:721) [rt.jar:1.8.0_102]
        at com.ibm.as400.access.AS400JDBCXAResource.throwXAException(AS400JDBCXAResource.java:967)
        at com.ibm.as400.access.AS400JDBCXAResource.end(AS400JDBCXAResource.java:368)
        at org.jboss.jca.adapters.jdbc.xa.XAManagedConnection.end(XAManagedConnection.java:295)
        at com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord.topLevelAbort(XAResourceRecord.java:320) [jbossjts-jacorb-4.17.34.Final-redhat-1.jar:4.17.34.Final-redhat-1]
        at com.arjuna.ats.arjuna.coordinator.BasicAction.doAbort(BasicAction.java:2900) [jbossjts-jacorb-4.17.34.Final-redhat-1.jar:4.17.34.Final-redhat-1]
        at com.arjuna.ats.arjuna.coordinator.BasicAction.doAbort(BasicAction.java:2879) [jbossjts-jacorb-4.17.34.Final-redhat-1.jar:4.17.34.Final-redhat-1]
        at com.arjuna.ats.arjuna.coordinator.BasicAction.Abort(BasicAction.java:1645) [jbossjts-jacorb-4.17.34.Final-redhat-1.jar:4.17.34.Final-redhat-1]
        at com.arjuna.ats.arjuna.coordinator.TwoPhaseCoordinator.cancel(TwoPhaseCoordinator.java:118) [jbossjts-jacorb-4.17.34.Final-redhat-1.jar:4.17.34.Final-redhat-1]
        at com.arjuna.ats.arjuna.AtomicAction.cancel(AtomicAction.java:215) [jbossjts-jacorb-4.17.34.Final-redhat-1.jar:4.17.34.Final-redhat-1]
        at com.arjuna.ats.arjuna.coordinator.TransactionReaper.doCancellations(TransactionReaper.java:370) [jbossjts-jacorb-4.17.34.Final-redhat-1.jar:4.17.34.Final-redhat-1]
        at com.arjuna.ats.internal.arjuna.coordinator.ReaperWorkerThread.run(ReaperWorkerThread.java:78) [jbossjts-jacorb-4.17.34.Final-redhat-1.jar:4.17.34.Final-redhat-1]
    

    Colleagues that know of the back end system inform me that this particular transaction left "XA orphaned transaction" in IBM-i.
    I am attaching the logs from JBoss and the driver traces. Please take a look to see if this is indeed a problem.

     
  • John Eberhard

    John Eberhard - 2018-03-06

    Thanks for the traces, that really helps. It looks like the application has both Thread[HTTP-22,9,uq-thread-pool] and Thread[Transaction Reaper Worker 0,5,main] using the same connection. Connections are single threaded and should only be used by one connection at a time.

    It looks like the two threads are trying to use the connection at the same time.

    Thread[HTTP-22,9,uq-thread-pool] Τρι Μαρ 6 17:07:57:100 EET 2018 Data stream sent (connID=644049971) ...
    00 00 00 5E 00 00 E0 04 00 00 00 00 00 00 00 A2
    00 14 18 16 84 04 00 00 00 00 00 00 00 01 00 01
    00 00 00 01 00 00 00 06 00 00 00 0A 38 18 00 00
    06 00 00 00 00 0A 38 19 00 00 06 C6 00 00 00 0A
    38 1A 00 00 00 00 00 00 00 07 38 1B F0 00 00 00
    07 38 21 F1 00 00 00 0A 38 28 00 00 00 03

    Thread[Transaction Reaper Worker 0,5,main] Τρι Μαρ 6 17:07:57:101 EET 2018 Data stream sent (connID=644049971) ...
    00 00 00 CE 00 00 E0 0A 00 00 00 00 00 00 00 A3
    00 14 18 A3 80 04 00 00 00 00 00 00 00 00 00 00

    Thread[HTTP-22,9,uq-thread-pool] Τρι Μαρ 6 17:07:57:101 EET 2018 Data stream sent (connID=644049971) ...
    00 00 00 28 00 00 E0 04 00 00 00 00 00 00 00 A4
    00 14 1E 01 00 04 00 00 00 00 00 00 00 01 00 01
    00 00 00 01 00 01 00 00

    For some reason, the Reaper thread is being interrupted.

    Thread[Transaction Reaper Worker 0,5,main] Τρι Μαρ 6 17:07:57:101 EET 2018 Exception does not contain a message.
    java.lang.InterruptedException
    at java.lang.Object.wait(Native Method)
    at java.lang.Object.wait(Object.java:502)
    at com.ibm.as400.access.AS400ThreadedServer.receive(AS400ThreadedServer.java:339)
    at com.ibm.as400.access.AS400ThreadedServer.sendAndReceive(AS400ThreadedServer.java:492)
    at com.ibm.as400.access.AS400JDBCConnection.sendAndReceive(AS400JDBCConnection.java:3005)
    at com.ibm.as400.access.AS400JDBCConnection.sendAndReceive(AS400JDBCConnection.java:2941)
    at com.ibm.as400.access.AS400JDBCXAResource.end(AS400JDBCXAResource.java:343)
    at org.jboss.jca.adapters.jdbc.xa.XAManagedConnection.end(XAManagedConnection.java:295)

    Are you sure the application is configured correctly? It seems odd that both threads are using the same connection.

    Do you have any idea why some other thread is issuing Thread.interrupt() against the reaper thread?

    Even though the reaper thread was interrupted, I do see the xa_end completion being returned to the client.

    Data Requesting Function: 0x18A3 - XA_End
    ORS Requesting Function: 0x18A3 - XA_End
    Error Class: 0x0009 - MTS Condition
    Error Return Code: 100 - The rollback was caused by an unspecified reason.

    I didn't see any attempt to rollback the xa transaction. I wonder if the occurred because the reaper thread was interrupted. Any idea why the transaction manager never rolled back the transaction?

    So here are the problems that I see.

    1. Two threads are using the same connection at the same time.
    2. There is some thread that is interrupting the reaper thread
    3. The rollback for the xa transaction is never issued, resulting in the orphaned transaction.
     
  • Pafsanias Ftakas

    It is indeed the case that the XA rollback is issued by a different thread. Here is what RedHat has to say on the subject: https://access.redhat.com/solutions/167033 (in case it helps). It seems that this is the way that Arjuna (RedHat's implementation of the JTA) is trying to deal with long running transactions and graceful release of resources in such cases. I will try to compare with logs where the transaction rolls back successfully (although as you say this happens on the same thread as the one starting the transaction), in case I can figure out what is different.

     
  • John Eberhard

    John Eberhard - 2018-03-06

    Have you tried the case where the "query timeout mechanism=cancel" and you set a query timeout on the statement objects and the query timeout is much less than the transaction timeout.

     
  • Pafsanias Ftakas

    Yes, this works fine, because everything happens on the same thread, as you pointed out. However, since we are orchastrating multiple DB2 statement executions within the same transaction, it is possible that the whole transaction might exceed the J2EE transaction timeout, even though individual calls have not exceeded the query timeout threshold.

     
  • John Eberhard

    John Eberhard - 2018-03-06

    I've discovered there is an bug in the tracing mechanism that causes the following exception.

    java.lang.NullPointerException
    at java.lang.Throwable.printStackTrace(Throwable.java:653) [rt.jar:1.8.0_102]
    at java.lang.Throwable.printStackTrace(Throwable.java:721) [rt.jar:1.8.0_102]
    at com.ibm.as400.access.AS400JDBCXAResource.throwXAException(AS400JDBCXAResource.java:967)
    at com.ibm.as400.access.AS400JDBCXAResource.end(AS400JDBCXAResource.java:368)
    at org.jboss.jca.adapters.jdbc.xa.XAManagedConnection.end(XAManagedConnection.java:295)

    What error do you see in the JBossTrace when JDBC tracing is not enabled?
    I'm wondering if JBoss does different recovery if it sees a different exception.

     
  • Pafsanias Ftakas

    I have to try not having JDBC tracing enabled to answer your question. I turned all tracing that I could think off "on", since I wanted to find out what was happening. Bear with us, we will test this tomorrow and I will get back to you....

     
  • Pafsanias Ftakas

    It turns out that turning off tracing did make a difference in the stack trace that we get:

    12:34:39,676 WARN [org.jboss.jca.core.connectionmanager.listener.TxConnectionListener] (Transaction Reaper Worker 4) IJ000305: Connection error occured: org.jboss.jca.core.connectionmanager.listener.TxConnectionListener@263d0f5a[state=NORMAL managed connection=org.jboss.jca.adapters.jdbc.xa.XAManagedConnection@743c0faf connection handles=1 lastUse=1520418825099 trackByTx=true pool=org.jboss.jca.core.connectionmanager.pool.strategy.OnePool@1adc1f34 pool internal context=SemaphoreArrayListManagedConnectionPool@51131444[pool=as400xa-pool] xaResource=org.jboss.jca.adapters.jdbc.xa.XAManagedConnection@743c0faf txSync=null]: javax.transaction.xa.XAException
    at com.ibm.as400.access.AS400JDBCXAResource.throwXAException(AS400JDBCXAResource.java:970)
    at com.ibm.as400.access.AS400JDBCXAResource.end(AS400JDBCXAResource.java:368)
    at org.jboss.jca.adapters.jdbc.xa.XAManagedConnection.end(XAManagedConnection.java:295)
    at com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord.topLevelAbort(XAResourceRecord.java:320) [jbossjts-jacorb-4.17.34.Final-redhat-1.jar:4.17.34.Final-redhat-1]
    at com.arjuna.ats.arjuna.coordinator.BasicAction.doAbort(BasicAction.java:2900) [jbossjts-jacorb-4.17.34.Final-redhat-1.jar:4.17.34.Final-redhat-1]
    at com.arjuna.ats.arjuna.coordinator.BasicAction.doAbort(BasicAction.java:2879) [jbossjts-jacorb-4.17.34.Final-redhat-1.jar:4.17.34.Final-redhat-1]
    at com.arjuna.ats.arjuna.coordinator.BasicAction.Abort(BasicAction.java:1645) [jbossjts-jacorb-4.17.34.Final-redhat-1.jar:4.17.34.Final-redhat-1]
    at com.arjuna.ats.arjuna.coordinator.TwoPhaseCoordinator.cancel(TwoPhaseCoordinator.java:118) [jbossjts-jacorb-4.17.34.Final-redhat-1.jar:4.17.34.Final-redhat-1]
    at com.arjuna.ats.arjuna.AtomicAction.cancel(AtomicAction.java:215) [jbossjts-jacorb-4.17.34.Final-redhat-1.jar:4.17.34.Final-redhat-1]
    at com.arjuna.ats.arjuna.coordinator.TransactionReaper.doCancellations(TransactionReaper.java:370) [jbossjts-jacorb-4.17.34.Final-redhat-1.jar:4.17.34.Final-redhat-1]
    at com.arjuna.ats.internal.arjuna.coordinator.ReaperWorkerThread.run(ReaperWorkerThread.java:78) [jbossjts-jacorb-4.17.34.Final-redhat-1.jar:4.17.34.Final-redhat-1]

    12:34:39,896 WARN [com.arjuna.ats.jta] (Transaction Reaper Worker 4) ARJUNA016045: attempted rollback of < formatId=131077, gtrid_length=29, bqual_length=36, tx_uid=0:ffff0a00f0e2:-50cbf11f:5a9fa3f4:2e9, node_name=1, branch_uid=0:ffff0a00f0e2:-50cbf11f:5a9fa3f4:2ec, subordinatenodename=null, eis_name=unknown eis name > (org.jboss.jca.adapters.jdbc.xa.XAManagedConnection@743c0faf) failed with exception code XAException.XAER_RMFAIL: javax.transaction.xa.XAException
    at com.ibm.as400.access.AS400JDBCXAResource.throwXAException(AS400JDBCXAResource.java:970)
    at com.ibm.as400.access.AS400JDBCXAResource.rollback(AS400JDBCXAResource.java:805)
    at org.jboss.jca.adapters.jdbc.xa.XAManagedConnection.rollback(XAManagedConnection.java:346)
    at com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord.topLevelAbort(XAResourceRecord.java:338) [jbossjts-jacorb-4.17.34.Final-redhat-1.jar:4.17.34.Final-redhat-1]
    at com.arjuna.ats.arjuna.coordinator.BasicAction.doAbort(BasicAction.java:2900) [jbossjts-jacorb-4.17.34.Final-redhat-1.jar:4.17.34.Final-redhat-1]
    at com.arjuna.ats.arjuna.coordinator.BasicAction.doAbort(BasicAction.java:2879) [jbossjts-jacorb-4.17.34.Final-redhat-1.jar:4.17.34.Final-redhat-1]
    at com.arjuna.ats.arjuna.coordinator.BasicAction.Abort(BasicAction.java:1645) [jbossjts-jacorb-4.17.34.Final-redhat-1.jar:4.17.34.Final-redhat-1]
    at com.arjuna.ats.arjuna.coordinator.TwoPhaseCoordinator.cancel(TwoPhaseCoordinator.java:118) [jbossjts-jacorb-4.17.34.Final-redhat-1.jar:4.17.34.Final-redhat-1]
    at com.arjuna.ats.arjuna.AtomicAction.cancel(AtomicAction.java:215) [jbossjts-jacorb-4.17.34.Final-redhat-1.jar:4.17.34.Final-redhat-1]
    at com.arjuna.ats.arjuna.coordinator.TransactionReaper.doCancellations(TransactionReaper.java:370) [jbossjts-jacorb-4.17.34.Final-redhat-1.jar:4.17.34.Final-redhat-1]
    at com.arjuna.ats.internal.arjuna.coordinator.ReaperWorkerThread.run(ReaperWorkerThread.java:78) [jbossjts-jacorb-4.17.34.Final-redhat-1.jar:4.17.34.Final-redhat-1]

    12:34:39,896 WARN [com.arjuna.ats.arjuna] (Transaction Reaper Worker 4) ARJUNA012091: Top-level abort of action 0:ffff0a00f0e2:-50cbf11f:5a9fa3f4:2e9 received TwoPhaseOutcome.FINISH_ERROR from com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord
    12:34:39,896 WARN [com.arjuna.ats.arjuna] (Transaction Reaper Worker 4) ARJUNA012113: TransactionReaper::doCancellations worker Thread[Transaction Reaper Worker 4,5,main] missed interrupt when cancelling TX 0:ffff0a00f0e2:-50cbf11f:5a9fa3f4:2e9 -- exiting as zombie (zombie count decremented to 0)

    Also, the DB2 execution returns with an exception: java.sql.SQLException: Communication link failure. (Disconnect request received, connection terminated.)

    However the transaction leaves "XA orphaned transactions" in the back end (we see changes made during the transaction after the J2EE server had terminated the transaction on the JBoss side). I am attaching the logs from JBoss in case you can gleam some information out of it.

     
  • John Eberhard

    John Eberhard - 2018-03-07

    Now we see the rollback actually being attempted and failing. However, without the trace, we don't know what is going on. What level of JDK is the application using? Once I know that, I can provide a new jt400.jar file with corrected trace code.

     
  • John Eberhard

    John Eberhard - 2018-03-07

    Can you try using the appropriate jar file from https://sourceforge.net/projects/jt400/files/Beta/beta20180307.zip/download ?

    Archive: beta20180307.zip
    Length Date Time Name
    -------- ---- ---- ----
    4844178 03-07-18 01:10 jt400.jar
    4789470 03-07-18 01:23 java6/jt400.jar
    4804083 03-07-18 01:32 java8/jt400.jar
    4155663 03-07-18 01:38 java9/jt400.jar
    -------- -------

    Use the jt400.jar file corresponding to the level of Java you are running.

    Please run with all JTOpen tracing enabled.

     
  • Pafsanias Ftakas

    Please find attached the JBoss and Toolbox traces for the same scenario, but with the modified JDBC driver to bypass the issue with the tracing turned on.

    Also, I would like to inform you that we have opened an issue with RedHat on the same problem to see what the opinion from the side of the coordinator is. The people from RedHat have seen this thread. Unfortunately their reply needs a login in order to access it, so I wanted to share it with you in case it gives you some insight into the problem and how to find a means around it. Here goes:

    _BEGIN COPIED MESSAGE_
    Hi Pafsanias

    This looks to me like a classic case of a problem falling through the cracks between different vendor's interpretation of an interoperability specification, in this instance XA.

    Since you've already read https://access.redhat.com/solutions/167033' (which I vaguely recall having a hand in writing) you're probably up to speed on the key points already, but to recap:

    The transaction reaper is a background process that handles client-side rollback of transactions that have exceeded their configured lifetime. It does this on a separate thread, since it's risky and largely ineffective to interrupt the business logic thread. In cases where the XAResource supplied by the driver is sending the XA control logic down the same blocking TCP/IP connection as the business logic (SQL), even the separate thread approach is not always effective, since the XA termination gets queued, usually on a driver lock, pending the completion of the SQL. At scale this causes resource (thread) starvation that can bring down the app server, a problem which we alleviate by interrupting the stuck rollback threads after a configurable interval.

    For such failure cases there is no retry mechanism. That's because of XA's presumed-abort design, under which the db server should simply rollback the transaction without explicit instruction from the transaction manager, if it hits the timeout. That is, the db server should have its own equivalent of the reaper process that will operate independently to provide some redundancy. It's necessary to deal with connection loss cases, but also helps here.

    There are a number of points where this model can go wrong if the parties don't agree on the finer points of how it works. Many drivers make the decision to synchronize operations on an XAResource, a limitation that makes their life easier but reduces their ability to handle client side timeouts correctly. A better XAResource implementation would wrap an asynchronous wire protocol that would allow the XA 'end' to be sent to the server immediately, even whilst the SQL operation is still outstanding, or equivalently to send XA control operations over a separate TCP/IP connection if the wire protocol for the SQL is synchronous blocking. But the world is not an ideal place and most drivers don't do that. Fortunately it usually doesn't matter, because the server they are connecting to should timeout the transaction autonomously. However, servers occasionally override or ignore the timeout configuration sent from the client. Or it can interpret the timeout as applying only in connection loss cases, which is incorrect. When these problems are combined, this can lead to transactions which will never be terminated.

    The timestamps from the trace file you provided are particularly instructive. It's clear that the driver is holding up the rollback because it's busy with the SQL. Note that whilst the driver can (and does) choose not to support concurrent operations, it's not an error for the server to use them - the spec doesn't disallow it, the vendor implementation of the driver does. There is no reasonable way the server can detect this and the only viable workaround is for it to use a separate XAResource (i.e. connection) to send the control traffic. There is scope for an a product RFE here, but frankly it's not something that's likely to have much priority on the feature backlog.

    Next we have the interrupt of the stuck rollback. That's a safety feature that we recommend keeping in place to prevent the server running out of threads, but you can tune the interval or simply set it to something so long it's not relevant. Note that you may want to increase the thread pool size and server thread limits to prevent resource shortages if you go down that path. In such configurations, the reaper's rollback operation will eventually be processed once the long running SQL completes. That's great from a point of view of ensuring you don't get abandoned transactions left in the database, but doesn't help if the tx is holding locks that need to be released in a more timely manner.

    The more robust solution, is to configure the database to respect the transaction timeout and interrupt the SQL itself. In such cases the reaper and database will both terminate the transaction, which in the worst situation can lead to race conditions, but which does guarantee the transaction will rollback and release any locks or other resources promptly. From the sounds of things the use of MSGW may complicate this for some of your use cases, but it should still be a good solution for some cases. We've recently seen a similar support issue that was successfully resolved by the user changing their AS400 DB2 server defaults, which unfortunately ignore the timeout setting the client provides at transaction start.

    So, if we revisit John Eberhard's 'problems':

    1.Two threads are using the same connection at the same time.

    Concurrent XAResource operations are not disallowed by the spec, since the driver need not necessarily implement them over the same blocking connection. The driver is free to regard it as a problem, but ideally should not do so unless it has some better alternative to offer.

    2.There is some thread that is interrupting the reaper thread

    Again, that's intended and not a problem. The alternative of NOT interrupting the thread may be at least as bad, as discussed above.

    3.The rollback for the xa transaction is never issued, resulting in the orphaned transaction.

    It is issued, but the interrupt later effectively cancels it if the driver hasn't acted on it promptly. It's not re-issued after that failure, because of presumed abort.

    The other side of the story is the points the driver vendor is not making:

    a) The driver doesn't handle client-side timeouts promptly, instead forcing them to wait on long-running SQL operations. That's perhaps not the best design decision, but alternative options may be limited by the wire protocol.

    b) The server doesn't independently rollback the transaction on timeout, though to be fair that may simply be a configuration issue.

    So to summarise, I'd advice:

    • Validate the AS400's DB2 timeout configuration and behavior, since that should at least deal with the non-MSGW cases.

    • If desired, raise the reaper cancellation period to prevent the interrupt behavior so that the rollback eventually gets sent, but take care that your app server has enough capacity to tolerate the thread pool pressure that may result.

    • Optionally, also request an RFE to implement alternative handling for this scenario in the server that is better suited to the specific limitations of the driver you've selected.

    Regards

    Jonathan Halliday
    middleware engineering.
    _END COPIED MESSAGE_

     
  • John Eberhard

    John Eberhard - 2018-03-13
    • status: closed-server-problem --> open
     
  • John Eberhard

    John Eberhard - 2018-03-13

    Ok, I checked into the transaction timeout that the JTOpen driver sets. It turns out that the server will mark the transaction as rollback only after the timeout. The server will also attempt to rollback the transaction, but it is only able to do that if the transaction is not in use. If you want to pursue this from the server side, you will need to submit a request for a server enhancement for the lock timeout to interrupt current SQL work before attempting to rollback the transaction.

    From what I understand there are the following settings.

    1. Transaction timeout t
    2. Statement timeout s
    3. Reaper cleanup timeout r

    From what I understand, when the transaction timeout occurs, a reaper thread is started after the transaction timeout t occurs. This thread tries to use the connection to do work, but the thread is being used by a statement with statement timeout s. If s is greater than r, then the reaper thread is unable to do anything before reaper cleanup begins. When repeater cleanup begins, it sends a thread interrupt to the reaper thread. Unfortunately, the architecture of the Java thread interrupt is such that it is only delivered at certain times. When the reaper threads is actually able to use the connection -- either because of the statement timeout or the release of the lock -- it then uses the connection. At this point, the thread interrupt is delivered and the connection become unusable because the connection is confused.

    Can you try the scenario where the statement timeout is several seconds less than the reaper cleanup timeout? In this case, the reaper thread will be able to do its work before the reaper cleanup timeout will interrupt the reaper thead.

    Another option is to enhance the JTOpen driver to keep track of the transaction timeout and set the statement timeout so that a statement in a transaction will timeout with the transaction. I still need to put a little more though into that, but I would like to know if setting the statement timeout to be less thatn the reaper cleanup timeout will fix the problem.

     
  • Pafsanias Ftakas

    It has been a long time, but this problem has reared it head again. I figured out how to stop the reaper thread from effectively performing an interrupt at all. This has the result that on the reaper working thread, both xa_end and xa_rollback are called, and they are successful (ie, the communication reached the IBM-i and therefore no "orphaned transactions" are left on the DB2 side). This is all good.

    However, when a subsequent call comes along, something in the connection (?) is in an incorrect state, and the next xa_start that is issued by the server is met with an error from the server (error class 9, return code = -3). This reaches the J2EE transaction handling code as an SQL exception and the operation fails. Calls after this work OK, so the connectivity to the server appears to recover, but this behaviour is problematic. Is there some way to use a connection validator to find that the connection is in such a state, and if so what actions should it perform? It seems that the JBoss connection validator which simply performs a select is not enough to catch this condition. Or, is there some action we can take on the transaction reaper thread to make sure that the state of the connection/server object on the driver's side are OK after the rollback?

    The offending communication with the server is as follows:

    Thread[http-0.0.0.0:8443-1,5,main] Πεμ Ιουλ 19 16:36:54:461 EEST 2018 as400: XAConnection com.ibm.as400.access.AS400JDBCXAConnection@78ba74bf (2025485503) : A new AS400JDBCPooledConnection was created.
    Thread[http-0.0.0.0:8443-1,5,main] Πεμ Ιουλ 19 16:36:54:462 EEST 2018 as400: XAConnection com.ibm.as400.access.AS400JDBCXAConnection@78ba74bf (2025485503) : AS400JDBCPooledConnection.getConnection() was called.
    Thread[http-0.0.0.0:8443-1,5,main] Πεμ Ιουλ 19 16:36:54:462 EEST 2018 Getting implementation VRM.
    Thread[http-0.0.0.0:8443-1,5,main] Πεμ Ιουλ 19 16:36:54:462 EEST 2018 Implementation VRM:
    00 07 03 00
    Thread[http-0.0.0.0:8443-1,5,main] Πεμ Ιουλ 19 16:36:54:462 EEST 2018 as400: XAResource LOGIASP-XA:RMID#49155 (869482078) : xa_start.
    Thread[http-0.0.0.0:8443-1,5,main] Πεμ Ιουλ 19 16:36:54:462 EEST 2018 send and receive(): ...
    Thread[http-0.0.0.0:8443-1,5,main] Πεμ Ιουλ 19 16:36:54:462 EEST 2018 send(): send request...
    Thread[http-0.0.0.0:8443-1,5,main] Πεμ Ιουλ 19 16:36:54:462 EEST 2018 Data stream sent (connID=378318245) ...
    00 00 00 D8 00 00 E0 0A 00 00 00 00 00 00 00 07
    00 14 18 A9 80 04 00 00 00 00 00 00 00 00 00 00
    00 00 00 00 00 00 00 04 00 00 00 0A 38 A0 00 00
    C0 03 00 00 00 92 38 A2 00 02 00 05 00 00 00 1D
    00 00 00 24 00 00 00 00 00 00 00 00 00 00 FF FF
    0A 03 17 30 61 DD E8 E7 5B 50 93 4E 00 00 00 39
    31 00 00 00 00 00 00 00 00 00 00 FF FF 0A 03 17
    30 61 DD E8 E7 5B 50 93 4E 00 00 00 3B 00 00 00
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00 00 00 00 00 00 00 0A 38 A5 00 00 00 00 00 00
    00 0A 38 A7 00 00 00 0A
    Thread[http-0.0.0.0:8443-1,5,main] Πεμ Ιουλ 19 16:36:54:462 EEST 2018 AS400Server.receive
    Thread[http-0.0.0.0:8443-1,5,main] Πεμ Ιουλ 19 16:36:54:462 EEST 2018 receive(): Reply not found. Waiting...
    Thread[AS400 Read Daemon [system:10.29.70.57;job:599772/QUSER/QZDASSINIT],5,main] Πεμ Ιουλ 19 16:36:54:480 EEST 2018 Data stream data received (connID=378318245) ...
    00 00 00 28 00 00 E0 0A 00 00 00 00 00 00 00 07
    00 14 28 00
    Thread[AS400 Read Daemon [system:10.29.70.57;job:599772/QUSER/QZDASSINIT],5,main] Πεμ Ιουλ 19 16:36:54:480 EEST 2018 Data stream data received (connID=378318245) ...
    80 04 00 00 00 00 00 00 00 00 18 A9 18 A9 00 09
    FF FF FF FD
    Thread[AS400 Read Daemon [system:10.29.70.57;job:599772/QUSER/QZDASSINIT],5,main] Πεμ Ιουλ 19 16:36:54:480 EEST 2018 run(): Adding reply: 7
    Thread[AS400 Read Daemon [system:10.29.70.57;job:599772/QUSER/QZDASSINIT],5,main] Πεμ Ιουλ 19 16:36:54:480 EEST 2018 run(): Notifying threads.
    Thread[AS400 Read Daemon [system:10.29.70.57;job:599772/QUSER/QZDASSINIT],5,main] Πεμ Ιουλ 19 16:36:54:480 EEST 2018 run(): Threads notified.
    Thread[AS400 Read Daemon [system:10.29.70.57;job:599772/QUSER/QZDASSINIT],5,main] Πεμ Ιουλ 19 16:36:54:480 EEST 2018 run(): Waiting for reply...
    Thread[http-0.0.0.0:8443-1,5,main] Πεμ Ιουλ 19 16:36:54:480 EEST 2018 receive(): Valid reply found: 7
    Thread[http-0.0.0.0:8443-1,5,main] Πεμ Ιουλ 19 16:36:54:480 EEST 2018 as400: XAResource LOGIASP-XA:RMID#49155 (869482078) : xa error class = 9, return code = -3.

     
    • John Eberhard

      John Eberhard - 2018-07-23

      Can you attach the complete trace?

       
  • Pafsanias Ftakas

    I attach two runs.

    In both cases I run the same scenario: a transaction is started, which calls a long running operation (it takes about 20 seconds) on IBM-i. The transaction timeout is set to 10 seconds. After 10 seconds the transaction reaper worker thread kicks in and attempts to stop the long running transaction. It fails to do so, so the jboss transaction runtime marks the transaction as rollback only and the thread it is running on as a zombie. Later on, on the thread that is executing the transaction, the IBM i call returns. The reaper worker thread wakes up, executes the xa_end and xa_rollback and finishes. Meanwhile the thread that is executing the transaction throws exceptions, as the transaction is rollback only and all subsequent operations fail.

    Then the second transaction comes along, the JBoss transaction runtime attempts to start an XA transaction on the connection, calls xa_start and we get an error back from the server.

    In an effort to see if starting on a "clean" connection might bypass the problem, I modified the rollback code to also close the connection, in order to force the next operation to get a fresh one. This seems to work (this is my second run) as far as getting a new connection when the JBoss transaction runtime realizes that the connection is not good during validation. However, I also get the same response from the server on the different connection. Is it possible that the IBM-i job that handles the connections is the same in both cases and is somehow corrupt?

     
  • John Eberhard

    John Eberhard - 2018-07-24

    Sounds like the original problem where the reaper thread is timing out. If the repear thread cannot do its work, the state of the connection is unknown and the connection will need to be discarded. It looks like you need to make sure that the query timeout value is less than the reaper timeout so that the reaper thread never times out. The error from xa_start occurs because because the xa_end for the previous xa_start was not processed on that same thread / connection.

     
  • Pafsanias Ftakas

    It is interesting that you should say that. I wrote a little test program that uses the driver directly (no JCA), but uses the strategy that IronJacamar seems to take: when a (simulated) transaction timeout occurs, the xa_end and xa_rollback will take place on a different thread than the one where the xa_start was initiated from (essentially the thread on which the RollbackRunnable is running on, is the equivalent of the transaction reaper thread). Whenever I run this test, all connections and calls work fine. No errors are returned by the xa_start on subsequent operations after "off thread" rollbacks occur. Which leads me to believe that there is nothing wrong with calling the xa_end and xa_rollback on a different thread as long as one is consistent with the connection object. Reading through the trace files, it seems that JBoss is making the same calls, but I am not profficient in making this determination.

    Which makes me wonder if there is something else that the JBoss transaction infrastructure is doing that "tickles" the connection in the wrong way.

    Please let me know what you think of all this.

     

    Last edit: Pafsanias Ftakas 2018-07-24
  • John Eberhard

    John Eberhard - 2020-11-17
    • status: open --> closed
     
  • John Eberhard

    John Eberhard - 2020-11-17

    Closing old issue.

     

Log in to post a comment.