
Grails - Can't get HA-JDBC to properly fallback if "primary DB fails"

Forum: Help | Created: 2014-03-10 | Last updated: 2014-04-27
  • wholenewstrain - 2014-03-10

    I've got everything working. I can start DB1, DB2, and DB3 with:

    java -cp h2-1.3.175.jar org.h2.tools.Server -tcp -tcpPort 910X -baseDir DBX_test

    Then in Grails I use:

    driverClassName = "net.sf.hajdbc.sql.Driver"
    url = "jdbc:ha-jdbc:clustertest"

    I also use the uuid key generation algorithm (Hibernate).
    I can save domain classes with relationships, etc., and I can verify that all three DBs have the same data.

    Issue:
    When I kill DB1, the cluster does not fall back to DB2 or DB3, even though it shows all three as active. When I kill DB2 or DB3 instead of the first defined DB, all is fine.

    2014-03-09 21:01:53,510 [http-bio-8080-exec-8] ERROR util.JDBCExceptionReporter - Connection is broken: "java.net.ConnectException: Connection refused: connect: localhost:9101" [90067-173]
    2014-03-09 21:01:53,563 [http-bio-8080-exec-8] ERROR transaction.JDBCTransaction - Could not toggle autocommit Message: Database is already closed (to disable automatic closing at VM shutdown, add ";DB_CLOSE_ON_EXIT=FALSE" to the db URL)

    Questions:

    1. Is it hard to add better synchronization for the H2 DB? H2 can fully synchronize when in cluster mode, and it's all Java, so I presume it would be fairly easy to implement Dump/Restore.

      This will synchronize two H2 instances:
      java -cp h2-1.3.175.jar org.h2.tools.CreateCluster -urlSource jdbc:h2:tcp://localhost:9103/test -urlTarget jdbc:h2:tcp://localhost:9101/test -user sa

    2. Why does diff synchronization require a primary key? There won't be primary keys in the join tables generated by Grails.

    3. jdbc:h2:/ha-jdbc-cluster-state-db/clustertest is empty all the time. It has all the data structures, but it's empty.

    Please help. :)

    ha-jdbc-clustertest.xml contents below:

    <?xml version="1.0"?>
    <ha-jdbc xmlns="urn:ha-jdbc:cluster:3.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="urn:ha-jdbc:cluster:3.0 http://ha-jdbc.github.io/xsd/ha-jdbc-3.0.0.xsd">
        <distributable id="jgroups"/>
        <sync id="full">
            <property name="fetchSize">1000</property>
        </sync>
        <state id="sql">
            <property name="urlPattern">jdbc:h2:/ha-jdbc-cluster-state-db/{0};MVCC=TRUE;AUTO_RECONNECT=TRUE;DB_CLOSE_ON_EXIT=FALSE</property>
            <property name="user">sa</property>
            <property name="password"></property>
        </state>
        <cluster meta-data-cache="shared-eager" dialect="h2" default-sync="full"
                 balancer="load" durability="fine" detect-sequences="false"
                 detect-identity-columns="false">
            <database id="db1" weight="1"
                      location="jdbc:h2:tcp://localhost:9101/test;MVCC=TRUE;AUTO_RECONNECT=TRUE;DB_CLOSE_ON_EXIT=FALSE">
                <user>sa</user>
                <password></password>
            </database>
            <database id="db2" weight="1"
                      location="jdbc:h2:tcp://localhost:9102/test;MVCC=TRUE;AUTO_RECONNECT=TRUE;DB_CLOSE_ON_EXIT=FALSE">
                <user>sa</user>
                <password></password>
            </database>
            <database id="db3" weight="1"
                      location="jdbc:h2:tcp://localhost:9103/test;MVCC=TRUE;AUTO_RECONNECT=TRUE;DB_CLOSE_ON_EXIT=FALSE">
                <user>sa</user>
                <password></password>
            </database>
        </cluster>
    </ha-jdbc>
    
     

    Last edit: wholenewstrain 2014-03-10
  • Paul Ferraro - 2014-03-10

    Re: the failover issue: HA-JDBC leverages the Dialect.indicatesFailure(SQLException) method to determine whether a given SQLException is due to a database failure.
      The default implementation looks like this:
      https://github.com/ha-jdbc/ha-jdbc/blob/3.0.0/src/main/java/net/sf/hajdbc/dialect/StandardDialect.java#L513

    Currently, the H2 dialect does not override this behavior. At the very least, we should override indicatesFailure(int) to return true when the error code is 90067. What other codes should we interpret as a failure?
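
    To make that concrete, here is a minimal sketch of such an override (only 90067 is known to matter at this point in the thread; whether indicatesFailure(int) can be overridden exactly like this in a dialect subclass is an assumption, not verified code):

    public class H2Dialect extends StandardDialect
    {
        // H2's CONNECTION_BROKEN error code, taken from the log above.
        private static final int CONNECTION_BROKEN = 90067;

        @Override
        protected boolean indicatesFailure(int code)
        {
            return (code == CONNECTION_BROKEN) || super.indicatesFailure(code);
        }
    }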

    1. Implementing dump/restore support for H2 is just a matter of implementing H2Dialect.getDumpRestoreSupport(). Would you be willing to take a crack at implementing this?

    2. The diff strategy requires a primary key because it needs to determine whether the data in a given row has changed. Doing this requires that each row be identifiable, though the primary key does not have to be a single column. Perhaps this can be achieved using JDBC4 RowIDs - I haven't investigated this yet.
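
    As an illustration of why rows must be identifiable: with a primary-key-ordered result set on each side, a diff can walk source and target in lock-step and pair rows up. A hypothetical sketch (table and column names are made up; source and target are plain JDBC connections):

    String query = "SELECT id, data FROM example ORDER BY id";
    Statement srcStat = source.createStatement(), dstStat = target.createStatement();
    try (ResultSet src = srcStat.executeQuery(query);
         ResultSet dst = dstStat.executeQuery(query))
    {
        boolean hasSrc = src.next(), hasDst = dst.next();
        while (hasSrc && hasDst)
        {
            long srcId = src.getLong("id"), dstId = dst.getLong("id");
            if (srcId < dstId)      { /* missing on target: INSERT */ hasSrc = src.next(); }
            else if (srcId > dstId) { /* extra on target: DELETE */   hasDst = dst.next(); }
            else
            {
                if (!src.getString("data").equals(dst.getString("data")))
                {
                    /* row changed: UPDATE */
                }
                hasSrc = src.next();
                hasDst = dst.next();
            }
        }
        // leftover source rows: INSERT; leftover target rows: DELETE
    }

    Without a key (as in Grails join tables), there is no way to pair a source row with a target row, so per-row comparison breaks down.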

    3. Re: the empty cluster state database. Someone else has reported that as well, but I'm still uncertain why this is. My hunch is that H2 is storing the database contents in memory (the contents of this file would be very small), and only flushes to disk when the appropriate buffers are full. Can H2 be tuned with a more aggressive fsync policy?
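
    One knob worth checking here (an untested assumption, not something verified in this thread): H2 batches disk writes with a default WRITE_DELAY of 500 ms, so opening the state database with the delay forced to zero should rule the buffering theory in or out:

    Connection state = DriverManager.getConnection(
            "jdbc:h2:/ha-jdbc-cluster-state-db/clustertest;WRITE_DELAY=0", "sa", "");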

     
  • wholenewstrain - 2014-03-11

    Wow man that was quick!
    Thank you!!!

    I will test your fix and will let you know the outcome.

    1. I can take a stab at this and maybe I won't break anything :)
    2. If the diff PK requirement can be worked out, it would be a very useful piece of the puzzle.
    3. I don't think it's an fsync issue, because I'm accessing the cluster state DB from the same JVM HA-JDBC runs in; therefore, I HAVE to see the same data as the cluster state manager. I use the H2 web console: it grabs a JDBC connection, I run a select, and it shows all the tables as empty. I presume that as long as HA-JDBC uses JDBC it has to see the same data, so something is probably not being persisted, but I might be wrong. Someone mentioned it helped when they moved the jars to the global classloader. If that were the case here, I would expect to see something in the logs, since I've configured log4j to DEBUG.
     
  • Paul Ferraro - 2014-03-11

    Great. I wonder if HSQLDB implements dump/restore in a similar fashion.

    I'll look into the potential of using RowIDs - we'd only need to use them for tables w/out primary keys. I don't think it would be too hard.

    As far as the sql state manager w/H2, I'm at a bit of a loss as to why this is happening. The code for this is in SQLStateManager.
    https://github.com/ha-jdbc/ha-jdbc/blob/3.0.0/src/main/java/net/sf/hajdbc/state/sql/SQLStateManager.java

    All access is JDBC-based, where each operation uses a separate transaction composed of a batch of statements executed with default isolation. Why this data isn't then visible to other threads is a bit mysterious. Let me know if you see any obvious problems.
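
    Paraphrased as a sketch (the CLUSTER_STATE table and DATABASE_ID column names come from later in this thread; the exact SQL here is illustrative, not the real SQLStateManager code):

    Connection c = DriverManager.getConnection(stateDbUrl, "sa", "");
    c.setAutoCommit(false);
    try (PreparedStatement ps = c.prepareStatement("DELETE FROM CLUSTER_STATE WHERE DATABASE_ID = ?"))
    {
        ps.setString(1, "db1"); // the database being deactivated
        ps.addBatch();
        ps.executeBatch();
        c.commit();
    }
    catch (SQLException e)
    {
        c.rollback();
        throw e;
    }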

     
  • wholenewstrain - 2014-03-13

    Tested ha-jdbc-3.1.0-alpha-1-SNAPSHOT.jar, and I'm still getting the error after killing the first defined DB.
    I also looked at H2 synchronization, and it seems easy; however, the current interface requires ProcessBuilders etc., which might be unnecessary in this scenario.

    Basically, to synchronize, the H2 code does this.
    On the source DB:

    Connection connSource = DriverManager.getConnection(urlSource + ";CLUSTER=''", user, password);
    Statement statSource = connSource.createStatement();
    statSource.execute("SET EXCLUSIVE 2");                // set exclusive mode
    ResultSet script = statSource.executeQuery("SCRIPT"); // get the CREATE script
    

    Then on the target:

    Connection connTarget = DriverManager.getConnection(urlTarget + ";CLUSTER=''", user, password);
    Statement statTarget = connTarget.createStatement();
    statTarget.execute("DROP ALL OBJECTS DELETE FILES");
    // execute the SQL script created from the source
    

    And finally:

    statSource.execute("SET EXCLUSIVE FALSE");
    // close both statements and connections
    

    This could be worked out as a script, but that requires finding the classpath and the Java process, plus the SQL might be too long. I believe it would be safer to do it over JDBC, especially since there is an H2 class, CreateCluster, with a public execute() method that will do all we need :)

    # take the script from the source DB (#3)
    java -cp h2-1.3.175.jar org.h2.tools.Shell -user sa -password "" -url jdbc:h2:tcp://localhost:9103/test;CLUSTER='' -sql "SET EXCLUSIVE 2; SCRIPT; SET EXCLUSIVE FALSE;"

    # restore on the target DB (#1)
    java -cp h2-1.3.175.jar org.h2.tools.Shell -user sa -password "" -url jdbc:h2:tcp://localhost:9101/test;CLUSTER='' -sql "long SQL"
    

    The ;CLUSTER='' part might not be needed, but I need to check that.
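
    For the JDBC route, here is a hedged sketch of driving org.h2.tools.CreateCluster directly (the execute(...) signature should be double-checked against the h2-1.3.175 sources; the last argument is H2's cluster server list):

    new org.h2.tools.CreateCluster().execute(
            "jdbc:h2:tcp://localhost:9103/test", // source (DB3)
            "jdbc:h2:tcp://localhost:9101/test", // target (DB1)
            "sa", "",                            // user, password
            "localhost:9101,localhost:9103");    // serverList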
    Below is the error I'm getting when I shut down the first DB (DB1):

    ~~~~~~
    2014-03-13 00:17:39,423 [http-bio-8080-exec-10] ERROR invocation.InvokeOnManyInvocationStrategy - A result [org.h2.jdbc.JdbcSQLException: Connection is broken: "java.net.ConnectException: Connection refused: connect: localhost:9101" [90067-173]] of an operation on database db2 in cluster net.sf.hajdbc.sql.DatabaseClusterImpl@684e89b2 did not match the expected result [null] Error |
    2014-03-13 00:17:39,424 [http-bio-8080-exec-10] ERROR invocation.InvokeOnManyInvocationStrategy - A result [org.h2.jdbc.JdbcSQLException: Connection is broken: "java.net.ConnectException: Connection refused: connect: localhost:9101" [90067-173]] of an operation on database db3 in cluster net.sf.hajdbc.sql.DatabaseClusterImpl@684e89b2 did not match the expected result [null]
    Error |
    2014-03-13 00:17:39,425 [http-bio-8080-exec-10] ERROR util.JDBCExceptionReporter - Connection is broken: "java.net.ConnectException: Connection refused: connect: localhost:9101" [90067-173]
    Error |
    2014-03-13 00:17:39,490 [http-bio-8080-exec-10] ERROR transaction.JDBCTransaction - Could not toggle autocommit
    Message: Database is already closed (to disable automatic closing at VM shutdown, add ";DB_CLOSE_ON_EXIT=FALSE" to the db URL) [90121-173]
    ~~~~~~

     
  • Paul Ferraro - 2014-03-13

    You'll want to test against the 3.0 branch, not master. I haven't pulled these fixes into master yet.

     
  • Paul Ferraro - 2014-03-13

    Most databases rely on an external executable script to handle this stuff. We can still use JDBC to do this for H2 while conforming to the DumpRestoreSupport interface.

    e.g.

    public ProcessBuilder createDumpProcess(final ConnectionProperties properties, final File file)
    {
        return new ProcessBuilder()
        {
            @Override
            public Process start()
            {
                // Put JDBC code here
                // direct output of dump into file

                return new Process()
                {
                    @Override
                    public int exitValue()
                    {
                        return 0;
                    }

                    @Override
                    public int waitFor()
                    {
                        return this.exitValue();
                    }

                    // Add no-op implementations of the remaining abstract methods
                };
            }
        };
    }
    

    I'll create these "dummy" Process/ProcessBuilder implementations for your use in net.sf.hajdbc.util.

     
  • Paul Ferraro - 2014-03-13

    Ah crap. That won't work. ProcessBuilder is final! I'll have to generalize DumpRestoreSupport.

     
  • Paul Ferraro - 2014-03-13

    Ok - I've committed changes to DumpRestoreSupport that should allow you to more easily implement dump/restore support for H2.

    DumpRestoreSupport.java now looks like:

    public interface DumpRestoreSupport
    {
        <Z, D extends Database<Z>> void dump(D database, Decoder decoder, File file) throws Exception;
    
        <Z, D extends Database<Z>> void restore(D database, Decoder decoder, File file) throws Exception;
    }
    

    You can obtain a connection to the database via:

    Connection c = database.connect(database.getConnectionSource(), database.decodePassword(decoder));
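
    For example, here is a rough sketch of an H2 implementation of the new interface (not committed code: the class name is made up, the import paths are assumptions, and H2's SCRIPT TO / RUNSCRIPT FROM statements do the heavy lifting):

    import java.io.File;
    import java.sql.Connection;
    import java.sql.Statement;

    import net.sf.hajdbc.Database;
    import net.sf.hajdbc.codec.Decoder;

    public class H2DumpRestoreSupport implements DumpRestoreSupport
    {
        @Override
        public <Z, D extends Database<Z>> void dump(D database, Decoder decoder, File file) throws Exception
        {
            try (Connection c = database.connect(database.getConnectionSource(), database.decodePassword(decoder));
                 Statement s = c.createStatement())
            {
                // SCRIPT TO writes the full schema and data as SQL to the file.
                s.execute("SCRIPT TO '" + file.getPath() + "'");
            }
        }

        @Override
        public <Z, D extends Database<Z>> void restore(D database, Decoder decoder, File file) throws Exception
        {
            try (Connection c = database.connect(database.getConnectionSource(), database.decodePassword(decoder));
                 Statement s = c.createStatement())
            {
                // Wipe the target, then replay the dumped script.
                s.execute("DROP ALL OBJECTS");
                s.execute("RUNSCRIPT FROM '" + file.getPath() + "'");
            }
        }
    }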
    

    When you're ready, just submit your changes as a pull request.

     
  • wholenewstrain - 2014-03-15

    Hi Paul,
    I spent about half the night yesterday/today trying to debug the fallback behaviour on the H2 DB, using the ha-jdbc-3.0.1-SNAPSHOT version. I wasn't able to fix it 100%, as there are many proxies flying around, but I can describe what's going on, and you can refactor and find the proper place for the fix.

    When the app starts, all three H2 instances are running. I read some records, then I kill DB1, the first DB defined in the config. When I refresh the page, the connection throws the CONNECTION_BROKEN_1 = 90067 error code inside InvokeOnAnyInvocationStrategy. With your fix, the code detects that this is a failure and deactivates DB1. However, any further invocations still use the DB1 connection, which is now closed (DB1 is shut down). So the next time InvokeOnAnyInvocationStrategy executes, it tries DB1 again, and this time it throws a DATABASE_CALLED_AT_SHUTDOWN = 90121 exception, which was not on the failure-codes list, so it is thrown out to the app.

    I added 90121 as a failure code, plus URL_FORMAT_ERROR_2 = 90046 for good measure, and tried again. This time fallback worked; however, the loop inside InvokeOnAnyInvocationStrategy always starts with DB1 and throws 90121, which slowed the app down a little, but it worked OK since the exception was recognized as a failure and effectively ignored. But when I then killed DB2, so that only DB3 was alive, the statements that started from DB1 and threw 90121 failures prevented DB3 from ever being invoked, since cluster.getBalancer().size() > 1 was now false (only DB3 is active). I added:

    if (!database.isActive()) {
       continue;
    }
    

    inside InvokeOnAnyInvocationStrategy, right after you get the Database from the ConnectionProxyFactory (line #66), so we don't try to execute anything on an inactive DB. This helped: everything was fast again and fell over to the next available DB.

    The remaining problem is that after you activate DB1 (it uses the full strategy, and activeDatabases in the MBean includes db1), executing anything will again use the same connection in InvokeOnAnyInvocationStrategy. Since DB1 is active, my check allows execution, but the underlying connection is not the one used for the restore; it is still the one that was killed, and it throws 90121 again. I think that when you deactivate a DB, you should recycle all the connections to that DB. I wasn't sure where the best place to do that is. As a POC, I tried to close the connection inside InvokeOnAnyInvocationStrategy: when we catch the exception and exceptionFactory creates the actual SQL exception, I added:

    if (exceptionFactory.indicatesFailure(exception, dialect) && map instanceof ConnectionProxyFactory) {
        ((ConnectionProxyFactory) map).getTransactionContext().close();
        ((ConnectionProxyFactory) map).remove();
        Resources.close((Connection)entry.getValue());
    }
    

    But it did not work, and it's hard for me to say where it should go; I would suggest closing connections to the failed DB on deactivation and reopening them on activation. Otherwise, the H2 DB is not able to recover, since the connection in use is down and has to be recycled. When I forced many requests after DB activation, the H2 connection finally went live; I guess with many requests another connection was somehow created. I won't be doing a pull request, as this was more of a debugging session than actual coding, so please modify the repo if you find my suggestions logical, or feel free to achieve the same logic with more appropriate code.

    One last thing: I also looked at the issue of SQLStateManager not preserving cluster state. I put breakpoints inside activated(), deactivated(), and setActiveDatabases(). It turned out they were never invoked; I specifically checked during startup and afterwards.

    To summarize:
    - Please add two failure codes to H2Dialect: 90121 and 90046
    - Make sure databases used in InvokeOnAnyInvocationStrategy are active
    - Make sure connections to a failed DB are recycled and re-created on activation
    - Make sure the state manager is invoked on activate, deactivate, etc.

    I will work on the H2 restore next; hopefully I'll be able to create a pull request :)
    Thanks.

     
  • wholenewstrain - 2014-03-16

    Ok, tested your fixes.

    1. Failure codes work as expected. - OK
    2. Fallback works as expected even if single DB is left. Inactive databases are not being called anymore. - OK
    3. State manager is persisting cluster state in DB - OK
    4. After a failed H2 DB is activated, any SQL invoked will fail with error 90121 - FAIL. Somehow HA-JDBC is reusing the same connection??? When I disable the active-DB check and force many calls through JDBC, it finally connects to the activated instance and works OK from then on. With the check enabled, it runs only one SQL statement on the freshly activated DB before deactivating it again, so to observe this I disabled the active/inactive check temporarily.

    Re: #3

    SELECT * FROM CLUSTER_STATE;
    DATABASE_ID  
    db3
    db2
    

    Re: #4

    2014-03-16 16:59:43,011 [RMI TCP Connection(4)-192.168.0.5] INFO  sql.DatabaseClusterImpl  - Finished synchronization of database clustertest of cluster db1
    2014-03-16 16:59:43,047 [RMI TCP Connection(4)-192.168.0.5] INFO  sql.DatabaseClusterImpl  - Database clustertest from cluster db1 was activated
    Error |
    2014-03-16 16:59:51,004 [http-bio-8080-exec-6] ERROR invocation.InvokeOnManyInvocationStrategy  - Database db1 from cluster clustertest was deactivated
    Message: Database is already closed (to disable automatic closing at VM shutdown, add ";DB_CLOSE_ON_EXIT=FALSE" to the db URL) [90121-173]
    Line | Method
    ->>  331 | getJdbcSQLException in org.h2.message.DbException
    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
    |    171 | get                 in     ''
    |    148 | get . . . . . . . . in     ''
    |    137 | get                 in     ''
    |   1413 | checkClosed . . . . in org.h2.jdbc.JdbcConnection
    |   1388 | checkClosed         in     ''
    |    525 | setReadOnly . . . . in     ''
    |     49 | invoke              in net.sf.hajdbc.util.reflect.Methods
    |     53 | invoke . . . . . .  in net.sf.hajdbc.invocation.SimpleInvoker
    |    145 | call                in net.sf.hajdbc.invocation.AllResultsCollector$Invocation
    
     

    Last edit: wholenewstrain 2014-03-17
  • Paul Ferraro - 2014-03-17

    I tried reproducing this locally, but I didn't see this failure.
    Here's what I did:

    • Create a HA-JDBC connection from 2 H2 database connections
    • Insert/commit some data
    • Deactivate one database
    • Insert/commit some more data
    • Reactivate the database
    • Call connection.setReadOnly(...)

    The last operation succeeds. I must be missing something. Can you post your test code?
    I think the key difference is that I'm deactivating the database manually, whereas you are killing the database, right?
    Question - are you doing anything with the HA-JDBC connection between the time it is deactivated and reactivated? The connection should get pruned, but I think that only happens if you try to access the connection once the database is deactivated.
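
    For reference, the sequence above sketched as code (the MBean object name and the activate/deactivate operation signatures are assumptions based on the jconsole usage in this thread, not verified API):

    import java.lang.management.ManagementFactory;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;
    import javax.management.MBeanServer;
    import javax.management.ObjectName;

    public class FailbackRepro
    {
        public static void main(String[] args) throws Exception
        {
            Connection c = DriverManager.getConnection("jdbc:ha-jdbc:clustertest", "sa", "");
            c.setAutoCommit(false);
            Statement s = c.createStatement();
            s.execute("CREATE TABLE IF NOT EXISTS t (id INT PRIMARY KEY)");
            s.execute("INSERT INTO t VALUES (1)");
            c.commit();

            MBeanServer jmx = ManagementFactory.getPlatformMBeanServer();
            // Assumed object name; check what jconsole actually shows.
            ObjectName cluster = new ObjectName("net.sf.hajdbc:cluster=clustertest");

            jmx.invoke(cluster, "deactivate", new Object[] { "db1" },
                    new String[] { String.class.getName() });
            s.execute("INSERT INTO t VALUES (2)");
            c.commit();

            jmx.invoke(cluster, "activate", new Object[] { "db1" },
                    new String[] { String.class.getName() });
            c.setReadOnly(false); // the call that throws 90121 in the reports above
        }
    }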

     

    Last edit: Paul Ferraro 2014-03-17
  • Paul Ferraro - 2014-03-17

    OK - I've tracked down the issue. Essentially, the previous strategy of closing and removing connections/statements/etc. for inactive databases during invocation doesn't catch the case where a database is reactivated before any subsequent invocation.
    To fix this, I register the root proxies (i.e. DataSource, Driver, etc.) to listen for deactivation notifications from the cluster, and eagerly close and remove stale objects from the proxy tree. Please verify:
    https://github.com/ha-jdbc/ha-jdbc/commit/dd5ee81581fe12b3f5b1a9845dca15dccad67f30
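
    Conceptually, the idea looks like the sketch below (all type names here are illustrative, not HA-JDBC's actual classes; the real change is in the commit above):

    import java.util.Queue;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.ConcurrentMap;

    interface DeactivationListener
    {
        void deactivated(String databaseId);
    }

    class RootProxy implements DeactivationListener
    {
        // Child objects (connections, statements, ...) created per database.
        private final ConcurrentMap<String, Queue<AutoCloseable>> children =
                new ConcurrentHashMap<String, Queue<AutoCloseable>>();

        void register(String databaseId, AutoCloseable child)
        {
            Queue<AutoCloseable> queue = this.children.get(databaseId);
            if (queue == null)
            {
                queue = new ConcurrentLinkedQueue<AutoCloseable>();
                Queue<AutoCloseable> existing = this.children.putIfAbsent(databaseId, queue);
                if (existing != null) queue = existing;
            }
            queue.add(child);
        }

        @Override
        public void deactivated(String databaseId)
        {
            // Eagerly close and drop everything tied to the dead database, so a
            // later reactivation starts from fresh connections instead of the
            // stale ones behind error 90121.
            Queue<AutoCloseable> stale = this.children.remove(databaseId);
            if (stale == null) return;
            for (AutoCloseable child : stale)
            {
                try { child.close(); } catch (Exception e) { /* already closed/broken */ }
            }
        }
    }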

     
  • wholenewstrain - 2014-03-20

    It works better, almost there :)
    I start 3 H2 instances.

    java -cp h2-1.3.175.jar org.h2.tools.Server -tcp -tcpPort 9101 -baseDir DB1_test
    java -cp h2-1.3.175.jar org.h2.tools.Server -tcp -tcpPort 9102 -baseDir DB2_test
    java -cp h2-1.3.175.jar org.h2.tools.Server -tcp -tcpPort 9103 -baseDir DB3_test
    

    I start the Grails app and list some records. I kill DB1 (Ctrl+C) on the console; the DB1 JVM exits.
    I refresh the list and DB1 gets deactivated. I go to jconsole and activate DB1: all good.
    I refresh the page and all is good, DB1 is active.
    I kill DB1 the same way again, add some data, start DB1 back up, and activate it: all good, DB1 is active. Then I refresh the page with the list of records and get the exception again.

    2014-03-19 20:56:07,496 [http-bio-8080-exec-3] ERROR invocation.InvokeOnManyInvocationStrategy  - Database db1 from cluster clustertest was deactivated
    Message: Database is already closed (to disable automatic closing at VM shutdown, add ";DB_CLOSE_ON_EXIT=FALSE" to the db URL) [90121-173]
        Line | Method
    ->>  331 | getJdbcSQLException in org.h2.message.DbException
    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
    |    171 | get                 in     ''
    |    148 | get . . . . . . . . in     ''
    |    137 | get                 in     ''
    |   1413 | checkClosed . . . . in org.h2.jdbc.JdbcConnection
    |   1388 | checkClosed         in     ''
    |    525 | setReadOnly . . . . in     ''
    |     49 | invoke              in net.sf.hajdbc.util.reflect.Methods
    |     53 | invoke . . . . . .  in net.sf.hajdbc.invocation.SimpleInvoker
    |    145 | call                in net.sf.hajdbc.invocation.AllResultsCollector$Invocation
    

    I'm using only the full synchronization strategy. I tried several times. Activation works if no data changed, but it fails if some records have been added. Once it throws 90121 after being reactivated, it keeps doing so: I reactivate the DB (it was deactivated after the exception), and it throws again and gets deactivated on the next statement. It seems like either it only works after the first activation, or activation after data was modified breaks it.

     
  • Paul Ferraro - 2014-03-24

    Hmmm. I couldn't reproduce this issue locally. Are you using connection pooling by any chance?

     
  • wholenewstrain - 2014-03-26

    Yes, I presume Grails is using connection pooling on top of HA-JDBC. Other than that, my ha-jdbc-clustertest.xml is as posted at the top. I will try to debug again; I haven't been able to spend any time on this recently.

     
  • wholenewstrain - 2014-04-27

    I just wanted to let you know that I tested this scenario on the latest alpha SNAPSHOT today, and it seems to be working GR8. Good job, man!
    Next I will implement Dump/Restore for H2 as described above.
    Thanks!!

     
