
Cached Table Corruption - HSQL 2.3.2

  • Adam Markowitz

    Adam Markowitz - 2017-05-05

    Summary:

    We have a large number of AWS deployments of a jar file that embeds HSQL. After switching a subset of our tables from MEMORY tables to CACHED tables, we've noticed a small percentage of those tables becoming corrupted. Sometimes it is the index itself that is corrupted; sometimes it is the data within a record. Once a cached table or index is corrupted, HSQL throws exceptions whenever the data is touched.

    Our setup (sketched in code after this list):

    • HSQL 2.3.2
    • AWS EC2 systems running ext4 as the file system
    • HSQL embedded in our jar file
    • write delay set to 0 (via connection string hsqldb.write_delay=false, verified by reviewing db.script)
    • NIO settings are at the default (256 MB); some of the corrupt databases were < 256 MB and some were larger, so corruption is presumed to have occurred under both NIO and non-NIO access
    • occasional hardware failures (AWS) and kill -9s occur
    • SET FILES CACHE SIZE 100000
    • SET FILES CACHE ROWS 100000
    • SET FILES DEFRAG 50
    • We have no external processes manually touching the files, but we do occasionally issue a command like BACKUP DATABASE TO 'name/db.tar.gz' NOT BLOCKING
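
    For reference, here is a minimal sketch of this setup in code; the database path is hypothetical, while the URL property and SET FILES statements are the ones listed above:

        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.Statement;

        public class EmbeddedHsqlSetup {
            public static void main(String[] args) throws Exception {
                // hsqldb.write_delay=false on the URL disables the delayed
                // (asynchronous) flush of the .log file after each commit.
                // The file path is a hypothetical placeholder.
                String url = "jdbc:hsqldb:file:/data/db/db;hsqldb.write_delay=false";
                try (Connection conn = DriverManager.getConnection(url, "SA", "");
                     Statement st = conn.createStatement()) {
                    st.execute("SET FILES CACHE SIZE 100000");
                    st.execute("SET FILES CACHE ROWS 100000");
                    st.execute("SET FILES DEFRAG 50");
                }
            }
        }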

    Symptoms:

    • Our rollout strategy for cached tables was to stop the application, issue SET TABLE NNN TYPE CACHED along with the SET commands listed under Setup above, and then run a CHECKPOINT DEFRAG (see the sketch after this list). The application was restarted at that point, and every deployment came back up successfully.
    • After a few weeks with our entire hosted fleet migrated to cached tables (still a small subset of our total number of tables), a small percentage of the fleet failed to start on application restart: HSQL attempted to load data from a corrupted cached table on disk and failed.
    • The corruption was different in each affected database: some had corrupt indexes, some corrupt UTF-8 bytes, some bad length headers, sometimes bad EOF markers, etc.
    • Looking at some of our hourly backups, we noticed that the corrupted data had actually been latent for quite a while; because those records hadn't been touched, there were no failures. This may or may not rule out kills and hardware failures as the culprit.
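
    As a concrete illustration, a minimal sketch of that rollout sequence, assuming a hypothetical table name and run while the application is stopped:

        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.Statement;

        public class CachedTableMigration {
            // EVENT_FACTS is a hypothetical table name
            public static void migrate(String url) throws Exception {
                try (Connection conn = DriverManager.getConnection(url, "SA", "");
                     Statement st = conn.createStatement()) {
                    st.execute("SET TABLE EVENT_FACTS TYPE CACHED");
                    st.execute("SET FILES CACHE SIZE 100000");
                    st.execute("SET FILES CACHE ROWS 100000");
                    st.execute("SET FILES DEFRAG 50");
                    st.execute("CHECKPOINT DEFRAG"); // rewrites db.data compacted
                }
            }
        }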

    Not all of our corruptions fit the same pattern, but here is a stack trace for one of the problems we've seen:

    Stack trace:

    SEVERE  db/db.data getFromFile failed 8016203
    org.hsqldb.HsqlException: IO error: RowInputBinary
        at org.hsqldb.error.Error.error(Unknown Source)
        at org.hsqldb.rowio.RowInputBinary.readString(Unknown Source)
        at org.hsqldb.rowio.RowInputBinary.readChar(Unknown Source)
        at org.hsqldb.rowio.RowInputBase.readData(Unknown Source)
        at org.hsqldb.rowio.RowInputBinary.readData(Unknown Source)
        at org.hsqldb.rowio.RowInputBase.readData(Unknown Source)
        at org.hsqldb.rowio.RowInputBinary.readData(Unknown Source)
        at org.hsqldb.rowio.RowInputBinaryDecode.readData(Unknown Source)
        at org.hsqldb.RowAVLDisk.<init>(Unknown Source)
        at org.hsqldb.persist.RowStoreAVLDisk.get(Unknown Source)
        at org.hsqldb.persist.DataFileCache.getFromFile(Unknown Source)
        at org.hsqldb.persist.DataFileCache.get(Unknown Source)
        at org.hsqldb.persist.RowStoreAVLDisk.get(Unknown Source)
        at org.hsqldb.index.NodeAVLDisk.findNode(Unknown Source)
        at org.hsqldb.index.NodeAVLDisk.getLeft(Unknown Source)
        at org.hsqldb.index.IndexAVL.next(Unknown Source)
        at org.hsqldb.index.IndexAVL.next(Unknown Source)
        at org.hsqldb.index.IndexAVL$IndexRowIterator.getNextRow(Unknown Source)
        at org.hsqldb.persist.RowStoreAVLDisk.moveDataToSpace(Unknown Source)
        at org.hsqldb.persist.DataFileDefrag.writeTableToDataFile(Unknown Source)
        at org.hsqldb.persist.DataFileDefrag.process(Unknown Source)
        at org.hsqldb.persist.DataFileCache.defrag(Unknown Source)
        at org.hsqldb.persist.Log.defrag(Unknown Source)
        at org.hsqldb.persist.Log.checkpoint(Unknown Source)
        at org.hsqldb.persist.Logger.checkpointInternal(Unknown Source)
        at org.hsqldb.persist.Logger.checkpoint(Unknown Source)
        at org.hsqldb.StatementCommand.getResult(Unknown Source)
        at org.hsqldb.StatementCommand.execute(Unknown Source)
        at org.hsqldb.Session.executeCompiledStatement(Unknown Source)
        at org.hsqldb.Session.executeDirectStatement(Unknown Source)
        at org.hsqldb.Session.execute(Unknown Source)
        at org.hsqldb.jdbc.JDBCStatement.fetchResult(Unknown Source)
        at org.hsqldb.jdbc.JDBCStatement.execute(Unknown Source)
        at org.hsqldb.cmdline.SqlFile.processSQL(Unknown Source)
        at org.hsqldb.cmdline.SqlFile.scanpass(Unknown Source)
        at org.hsqldb.cmdline.SqlFile.execute(Unknown Source)
        at org.hsqldb.cmdline.SqlTool.objectMain(Unknown Source)
        at org.hsqldb.cmdline.SqlTool.main(Unknown Source)
    Caused by: java.io.UTFDataFormatException
        at org.hsqldb.lib.StringConverter.readUTF(Unknown Source)
        at org.hsqldb.lib.StringConverter.readUTF(Unknown Source)
        ... 37 more
    
    2017-04-24T17:05:22.679-0700  SEVERE  defrag failed 
    org.hsqldb.HsqlException: IO error: RowInputBinary
        [stack trace and cause identical to the trace above]
    

    Corrupt Data:
    Here are the corrupt bytes from one of our examples; as you can see, they are definitely not valid UTF-8. The "skipping corrupt record" section below shows the actual byte values, and the "filling with" section shows the ASCII representation of those bytes (this record is expected to contain no multibyte UTF-8 character data, only bytes below ordinal 128).

    skipping corrupt record:

    7b 22 65 76     65 6e 74 5f     66 61 63 74     73 2e 52 65 
    63 6f 6d 6d     65 6e 64 61     74 69 6f 6e     5f 31 22 3a 
    22 72 72 25     22 2c 22 65     76 65 6e 74     5f 66 61 63 
    74 73 2e 52     65 63 6f 6d     6d 65 6e 64     61 74 69 6f 
    6e 5f 32 22     3a 22 22 2c     22 76 69 73     69 74 5f 66 
    61 63 74 73     2e 64 65 76     69 63 65 5f     74 79 00 00 
    00 e0 00 00     00 00 00 00     24 70 00 0a     b1 92 00 00 
    1e bc ff ff     ff ff 00 7a     4b 25 00 7a     52 4e 00 7a 
    42 61 ff ff     ff ff 00 7a     50 1d 00 7a     52 1b 00 7a 
    4b 25 00 00     00 01 00 7a     3e c8 00 7a     d6 0d 00 7b 
    1f f9 00 00     00 00 00 7a     50 1d 00 7a     52 4e 00 7a 
    4b 25 00 00     00 00 00 00     00 00 00 00     00 00 00 b3 
    8f 09 00 00     00 00 00 00     00 00 00 00     00 00 00 0a 
    b1 54 66 61     63 74 73 2e     73 65 73 73     69 6f 6e 5f 
    73 74 61 72     74 5f 64 61     74 65 22 3a     22 37 20 64 
    61 79 73 20     61 67 6f 20     66 6f 72 20     37 20 64 61 
    79 73 22 2c     22 76 69 73     69 74 5f 66     61 63 74 73 
    2e 75 74 6d     5f 63 6f 6e     74 65 6e 74     22 3a 22 22 
    7d 
    

    filling with: (the ?s replace the invalid bytes from above so we can see exactly what is corrupt)

    { " e v     e n t _     f a c t     s . R e 
    c o m m     e n d a     t i o n     _ 1 " : 
    " r r %     " , " e     v e n t     _ f a c 
    t s . R     e c o m     m e n d     a t i o 
    n _ 2 "     : " " ,     " v i s     i t _ f 
    a c t s     . d e v     i c e _     t y ? ?
    ? ? ? ?     ? ? ? ?     $ p ? ?     ? ? ? ? 
    ? ? ? ?     ? ? ? z     K % ? z     R N ? z 
    B a ? ?     ? ? ? z     P ? ? z     R ? ? z 
    K % ? ?     ? ? ? z     > ? ? z     ? ? ? { 
    ? ? ? ?     ? ? ? z     P ? ? z     R N ? z 
    K % ? ?     ? ? ? ?     ? ? ? ?     ? ? ? ? 
    ? ? ? ?     ? ? ? ?     ? ? ? ?     ? ? ? ? 
    ? T f a     c t s .     s e s s     i o n _ 
    s t a r     t _ d a     t e " :     " 7   d 
    a y s       a g o       f o r       7   d a 
    y s " ,     " v i s     i t _ f     a c t s 
    . u t m     _ c o n     t e n t     " : " " 
    } 
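
    The UTFDataFormatException in the stack traces is consistent with these bytes: 0xE0 opens a three-byte UTF-8 sequence, but the bytes that follow it are not valid continuation bytes. As an analogy (not HSQL's exact code path, which appears to read a DataInput-style modified UTF-8 given the java.io.UTFDataFormatException), a minimal sketch with the standard java.nio decoder fails on the same bytes:

        import java.nio.ByteBuffer;
        import java.nio.charset.CharacterCodingException;
        import java.nio.charset.CodingErrorAction;
        import java.nio.charset.StandardCharsets;

        public class Utf8Check {
            public static void main(String[] args) {
                // "t y" followed by the first corrupt bytes from the dump above
                byte[] bytes = { 0x74, 0x79, 0x00, 0x00, 0x00, (byte) 0xE0, 0x00 };
                try {
                    StandardCharsets.UTF_8.newDecoder()
                            .onMalformedInput(CodingErrorAction.REPORT)
                            .onUnmappableCharacter(CodingErrorAction.REPORT)
                            .decode(ByteBuffer.wrap(bytes));
                } catch (CharacterCodingException e) {
                    // 0xE0 promises two continuation bytes (10xxxxxx); 0x00 is
                    // not one, so decoding fails here just as readUTF does
                    System.out.println("malformed UTF-8: " + e);
                }
            }
        }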
    

    Questions:

    1. The 2.3.3/2.3.4 release notes seemed to indicate that some persistence issues with cached tables were resolved. Where can we find more information on what was resolved and what the symptoms were? Do you anticipate that this issue is resolved in 2.3.4+?
    2. We ran various continuous test suites with thousands of iterations that randomly kill -9 HSQL while performing many concurrent CRUD operations on HSQL cached tables, but we were unable to reproduce any corruption with 2.3.2 in this controlled environment (a sketch of such a harness follows these questions). Are there known circumstances under which 2.3.2 corrupted data in cached tables, and if so, what are they?
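
    For context, a minimal sketch of the kind of harness we used; CrudLoop and app.jar are hypothetical placeholders for a child process that performs concurrent CRUD against the cached tables:

        import java.util.concurrent.ThreadLocalRandom;

        public class KillNineHarness {
            public static void main(String[] args) throws Exception {
                for (int i = 0; i < 1000; i++) {
                    // Start a child JVM that hammers the cached tables
                    Process db = new ProcessBuilder("java", "-cp", "app.jar", "CrudLoop")
                            .inheritIO()
                            .start();
                    // Let it run for a random interval, then kill it hard
                    Thread.sleep(ThreadLocalRandom.current().nextLong(1_000, 30_000));
                    db.destroyForcibly(); // SIGKILL on Linux, like kill -9
                    db.waitFor();
                    // The next iteration reopens the database; a failure to
                    // load it here would indicate corruption
                }
            }
        }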
     

    Last edit: Adam Markowitz 2017-05-05
  • Fred Toussi

    Fred Toussi - 2017-05-05
    1. Please see http://hsqldb.org/web/supportware.html "Premier Corporate subscription ... for corporation with extensive use or distribution of HyperSQL." This allows us to develop and improve persistence as well as other capabilities.
    2. The best approach is to use version 2.4.0 (or 2.3.5 for Java 6 and 7).
    3. There were several issues that were resolved or improved by changes in locking and other changes since 2.3.2. For example, CHECKPOINT DEFRAG would cause problems with tables that had both PK and non-PK indexes. In any case, you don't need to perform it after changing tables to CACHED (see the sketch below).
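
    As a sketch of the simpler sequence implied here (table name hypothetical; the plain CHECKPOINT is optional and shown only to persist the change promptly):

        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.Statement;

        public class SimpleCachedConversion {
            public static void convert(String url) throws Exception {
                try (Connection conn = DriverManager.getConnection(url, "SA", "");
                     Statement st = conn.createStatement()) {
                    st.execute("SET TABLE EVENT_FACTS TYPE CACHED");
                    st.execute("CHECKPOINT"); // plain checkpoint; no DEFRAG needed
                }
            }
        }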
     

    Last edit: Fred Toussi 2017-05-05
  • Adam Markowitz

    Adam Markowitz - 2017-05-05
    1. Thanks! We will look into it.
    2. We are currently testing against 2.3.4, but will put upgrading to 2.4.0 on our list after 2.3.4.
    3. "resolved or improved"... To clarify, does this mean that there were corruption symptoms, similar to what I outlined above, that were resolved in 2.3.3+? When you mention cause problems, can you clarify what the symptoms were? corruption? Something else?

    Thanks!
    -Adam

     
  • Fred Toussi

    Fred Toussi - 2017-05-05
    1. Test with 2.4.0 instead, as it improves on 2.3.4.
    2. Yes, all the symptoms you mentioned have been seen sporadically; they have been reduced or eliminated in customer tests and deployments.
     

    Last edit: Fred Toussi 2017-05-05
