
#401 Column metadata being retrieved from database multiple times

Milestone: v2.5.*
Status: open-accepted
Owner: None
Fixed Release: (not fixed)
Priority: 5
Updated: 2019-03-22
Created: 2017-06-22
Creator: Joe Kearns
Private: No

ResultSetTableMetaData.createMetaData loops through all of the columns in the table and tries to configure each one using, by preference, createColumnFromDbMetaData. The problem is that createColumnFromDbMetaData closes the ResultSetMetaData every time, forcing it to be reloaded from the database for the next column.

This is a serious performance issue for "wide" tables. I worked around it in my own code by implementing a caching MetadataHandler and the execution time for the test method went from 30 minutes down to under a minute. It made all the difference between DBUnit being a viable test solution for us or not.

I would like to fix the issue properly in ResultSetTableMetaData itself, probably by loading the metadata into a collection before looping through the columns. However I'm conscious of being a DBUnit newbie, so before I start changing code, does anyone know of a reason why it shouldn't be changed?
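The fix described above, loading the metadata into a collection before looping over the columns, can be sketched roughly as below. This is an illustrative sketch only: ColumnInfo and MetadataLoader are hypothetical stand-ins, not dbUnit classes, and the call counter simply makes the difference in database round trips visible.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the proposed fix: read all column metadata in one pass and keep
// it in a collection, instead of re-querying the database for every column.
// ColumnInfo and MetadataLoader are hypothetical, not dbUnit API.
public class MetadataCachingSketch {

    public static class ColumnInfo {
        public final String name;
        public final int sqlType;
        public ColumnInfo(String name, int sqlType) {
            this.name = name;
            this.sqlType = sqlType;
        }
    }

    // Simulates an expensive metadata source; counts how often it is hit.
    public static class MetadataLoader {
        public int calls = 0;
        public List<ColumnInfo> loadAll() {
            calls++; // in real code this is a round trip to the database
            List<ColumnInfo> cols = new ArrayList<ColumnInfo>();
            cols.add(new ColumnInfo("ID", java.sql.Types.INTEGER));
            cols.add(new ColumnInfo("NAME", java.sql.Types.VARCHAR));
            return cols;
        }
    }

    // Before the fix: the metadata is reloaded for each column (N+1 trips).
    public static int naiveCallCount(MetadataLoader loader) {
        int n = loader.loadAll().size();
        for (int i = 0; i < n; i++) {
            loader.loadAll().get(i); // metadata reloaded for every column
        }
        return loader.calls;
    }

    // After the fix: load once into a collection, then iterate over the cache.
    public static int cachedCallCount(MetadataLoader loader) {
        List<ColumnInfo> cached = loader.loadAll();
        for (ColumnInfo col : cached) {
            // configure the column from the cached entry; no extra round trip
        }
        return loader.calls;
    }
}
```

With two columns, the naive path hits the loader three times while the cached path hits it once; with 200 columns and 8-10 seconds per trip, that difference is exactly the 30-minutes-versus-seconds gap reported below.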

Related

Feature Requests: #242

Discussion

  • Jeff Jensen

    Jeff Jensen - 2017-06-23

    I'm not familiar with that code and don't know of a reason not to improve it.

    I'm glad you have found this performance issue and are improving it. I'm also interested if the change affects any tests.

    I'm curious, how many columns do these "wide" tables have?

     
  • Jeff Jensen

    Jeff Jensen - 2017-06-23
    • assigned_to: Jeff Jensen
     
  • Joe Kearns

    Joe Kearns - 2017-06-23

    The case in point is a data warehouse table on Teradata with about 200 columns (not my design, if you're wondering). The test loads a validation data set from an Excel spreadsheet, and when it came to finding metadata for that it was making ~200 separate calls to the database, each taking 8-10 seconds.

    I'll have a look at the tests before changing anything, and see if there are any meaningful before-and-after comparisons I can do.

     
  • Joe Kearns

    Joe Kearns - 2017-06-26

    I have a fix for this in my fork (u/jkearns). Should I go ahead and create a merge request or does someone need to look at it before that?

    Regarding the tests, I was seeing 21 failures before making my changes and still the same after. I added a couple of tests for the changes, which are passing.

    The 21 failing tests are actually all in just four distinct failure points, each with multiple hits:
    BiDirectionalEdgesDepthFirstSearchTest.testDisconnectedCycles
    DepthFirstSearchTest.testDisconnectedCycles
    DatabaseSequenceFilterTest.testGetTableNamesCyclic
    XlsTableTest.testDifferentDatatypes
    I don't fully understand what's going on with the cyclic dependency search tests, but the problem with the Xls testDifferentDatatypes seems to be related to daylight-saving time.

     
  • Jeff Jensen

    Jeff Jensen - 2017-06-27

    Thanks for the update, glad you made progress! Obviously, this change significantly helps large tables such as in your situation, and I'm curious whether there is a detectable impact for smaller ones, and at what size.

    Regarding commits, I reviewed them (not in detail yet) and think the changes are good. I noted these minor items to help in the future and I hope you will update for them.

    • Please squash commits where possible into a final change per file, keeping commits separate only where the stages warrant it (smaller organized commits are best, but subsequent fix-up commits are best squashed into the prior one). This helps my review and the code history after the merge.
    • It looks like a couple of commits had formatting changes? It appears most of the file changed without the lines themselves changing, which usually means tabs vs. spaces or LF vs. CRLF (I didn't verify). This affects diffs and makes review more difficult, as well as making the files inconsistent. I'm wondering what your settings are. Can you fix those and squash too?
    • For code maintenance and readability, some code could use a little refactoring: some blocks with multiple nested blocks, a couple of methods with repeated gets, etc. Extracting methods enables focused unit tests on those critical sections and helps maintainability, especially if it's not you updating it in the future! Squash these updates into the prior commit(s) too.
    • Please also check the code coverage for changed and new classes needing tests. This feature is at the core for metadata and can easily break things. More good tests help minimize the risks.

    For test failures, how are you running the build? Which Java version, 7 or 8? Please see [feature-requests:#195]; we can't use Java 8. I'm using Java 7 for all dbUnit builds and have no problems (same for the CI server).

     

    Related

    Feature Requests: #195

  • Joe Kearns

    Joe Kearns - 2017-07-20

    Apologies, Jeff, for taking so long to get back to this. I think I've figured out the formatting issues and I will squash when I commit the changes following on from your feedback. I will also look for areas where test coverage can usefully be extended.

    For my guidance, could you please let me know where you think refactoring is needed? I have extracted a nested block in ResultSetTableMetaData.getAllColumnMetaData but I'm aware that there are others in existing code that I altered, and I was wary of making arbitrary changes to other people's code. One thing that contributes to block nesting in these classes is the recurrent pattern of using try...finally to ensure that connections are always closed.

    Am I right in thinking that "repeated gets" is referring to how ColumnMetaData is accessed in SQLHelper.createColumn? I'm thinking of extracting the two object-builds in that method as well, perhaps adding a new constructor for the Column class. Ideally, I would very much like to add a new method to IDataTypeFactory to create a DataType from a ColumnMetaData, but that would have consequences for any third parties with their own implementations of the interface.

    I had, as you suspected, been using Java 8 and most of my test problems disappeared on switching to 7. I do still get failures on XlsTableTest.testDifferentDatatypes, however, and I think it's something to do with daylight saving in my time zone (UK/BST). I'll try and get to the bottom of that if I can find the time.

    You ask at what size this bug would have an impact on performance, and the simple answer is that it starts to have an impact as soon as you have more than one column in a query-result data set, and it scales up linearly with the number of columns. The point at which the impact becomes painful depends on the speed of your systems. In my test environment it takes about 8-10 seconds to connect to the database and pull back the metadata, so it doesn't take many columns to make that very tiresome indeed. 6 columns ~ 1 minute; 20 columns ~ 3 minutes; 200 columns ~ 30 minutes. With the issue corrected it comes down to the flat 8-10 seconds overhead and the number of columns isn't really a consideration.

    I haven't been able to simulate latency in the HSQL unit-test database, but in the end I'm not convinced that would be worth doing. Performance-testing and unit-testing aren't the same thing, after all.
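    (For illustration of the nesting Joe mentions: the classic nested try...finally pattern, and the flatter Java 7 try-with-resources form that removes one level of nesting per resource. Resource here is a hypothetical stand-in for JDBC objects such as Connection or ResultSet, not dbUnit code.)

```java
// The nested try...finally pattern versus Java 7 try-with-resources.
// Resource is a stand-in for JDBC objects like Connection/ResultSet.
public class TryFinallySketch {

    public static class Resource implements AutoCloseable {
        public boolean closed = false;
        @Override public void close() { closed = true; }
    }

    // Classic pattern: each resource adds a nested try/finally block.
    public static Resource classicStyle() {
        Resource r = new Resource();
        try {
            // ... use the resource ...
        } finally {
            r.close(); // guaranteed to run even if the body throws
        }
        return r;
    }

    // Java 7+: try-with-resources closes automatically, with no extra nesting.
    public static Resource withResourcesStyle() {
        Resource r = new Resource();
        try (Resource managed = r) {
            // ... use the resource ...
        }
        return r;
    }
}
```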

     
    • Jeff Jensen

      Jeff Jensen - 2017-07-23

      Sounds good Joe, really appreciate your work on this improvement.

      please let me know where you think refactoring is needed?
      recurrent pattern of using try...finally

      Thanks for reviewing and improving. The one you changed was the main one that stood out. Squashed commits will be easier for me to check again too. Agreed on the try...finally!

      Am I right in thinking that "repeated gets" is referring to how ColumnMetaData is accessed in SQLHelper.createColumn?

      No, that's not it; it has to do with repeated code (vs. compound statements, which can also be unclear in maintenance): "getting" the same object/value multiple times in a method, usually in a chained/fluent-type syntax. I can't find the specific spot I remember, though. Thanks for asking.
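      (A generic illustration of the "repeated gets" pattern Jeff describes: the same lookup evaluated several times in one method, versus extracting it once into a named local. The config map and field names here are hypothetical, not dbUnit code.)

```java
import java.util.HashMap;
import java.util.Map;

// "Repeated gets": the same chained lookup done several times in one method,
// versus extracting it once into a local variable. Names are hypothetical.
public class RepeatedGetsSketch {

    public static Map<String, String> sampleConfig() {
        Map<String, String> cfg = new HashMap<String, String>();
        cfg.put("schema", "DW");
        cfg.put("table", "WIDE_TABLE");
        return cfg;
    }

    // Repeated gets: cfg.get("schema") is evaluated twice.
    public static String repeated(Map<String, String> cfg) {
        return cfg.get("schema") + "." + cfg.get("table")
                + " (in " + cfg.get("schema") + ")";
    }

    // Extracted once: a single lookup and a name that states the intent.
    public static String extracted(Map<String, String> cfg) {
        String schema = cfg.get("schema");
        String table = cfg.get("table");
        return schema + "." + table + " (in " + schema + ")";
    }
}
```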

      I'm thinking of extracting the two object-builds in that method as well, perhaps adding a new constructor for the Column class.

      If it simplifies code, makes methods smaller, I'm for it!

      Ideally, I would very much like to add a new method to IDataTypeFactory to create a DataType from a ColumnMetaData, but that would have consequences for any third parties with their own implementations of the interface.

      Yes, "breaking change", and we don't have a solid way to assess impact!
      If we can fix things and move to Java 8, we can use default methods....
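      (To illustrate Jeff's point: in Java 8 a new method can be added to an interface with a default body, so existing third-party implementations keep compiling unchanged. The interface and method names below are hypothetical stand-ins, not the actual IDataTypeFactory API.)

```java
// Java 8 default methods: extending an interface without breaking existing
// implementors. Names are hypothetical, not dbUnit API.
public class DefaultMethodSketch {

    public interface DataTypeFactory {
        String createDataType(int sqlType);

        // New method with a default body; pre-existing implementors
        // inherit it and still compile without changes.
        default String createDataTypeFromMetaData(int sqlType, String columnName) {
            return createDataType(sqlType) + ":" + columnName;
        }
    }

    // A "third-party" implementation that knows nothing about the new method.
    public static class LegacyFactory implements DataTypeFactory {
        @Override public String createDataType(int sqlType) {
            return "TYPE(" + sqlType + ")";
        }
    }
}
```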

       
  • Jeff Jensen

    Jeff Jensen - 2017-08-19

    Hi Joe, just wondering about this change and how things are with it. Anything remaining to do, can I help with something?

     
  • Jeff Jensen

    Jeff Jensen - 2017-08-19
    • status: open --> pending
     
  • Jeff Jensen

    Jeff Jensen - 2017-09-18

    Joe, I want to update dbUnit with your changes. Are you able to commit your latest changes you mentioned, squash into 1 commit, and create a merge request please? If so, it will make the code changes easier and cleaner to add.

     
  • Jeff Jensen

    Jeff Jensen - 2017-10-15
    • status: pending --> closed-accepted
    • Fixed Release: (not fixed) --> 2.5.5
     
  • Jeff Jensen

    Jeff Jensen - 2017-10-15

    Thank you Joe for the merge request!

     
  • Joe Kearns

    Joe Kearns - 2017-10-17

    My pleasure, Jeff. Let me know if there's anything else I can pitch in with.

     
  • Jeff Jensen

    Jeff Jensen - 2017-12-02
    • status: closed-accepted --> open-accepted
    • Fixed Release: 2.6.0 --> (not fixed)
     
  • Jeff Jensen

    Jeff Jensen - 2017-12-02

    The implementation caused Oracle and MS SQLServer failures. Reverted changes and reopening this ticket.

     
  • Jeff Jensen

    Jeff Jensen - 2017-12-03

    This is the exception encountered with Oracle. Note: it only occurs on a non-localhost database server; running localhost does not exhibit the problem:

    Caused by: java.sql.SQLException: Stream has already been closed
    at oracle.jdbc.driver.LongAccessor.getBytesInternal(LongAccessor.java:131)
    at oracle.jdbc.driver.Accessor.getBytes(Accessor.java:979)
    at oracle.jdbc.driver.LongAccessor.getString(LongAccessor.java:166)
    at oracle.jdbc.driver.GeneratedStatement.getString(GeneratedStatement.java:327)
    at oracle.jdbc.driver.GeneratedScrollableResultSet.getString(GeneratedScrollableResultSet.java:973)
    at org.dbunit.dataset.ColumnMetaData$StringMetaDataField.setCheckedValue(ColumnMetaData.java:497)
    at org.dbunit.dataset.ColumnMetaData$MetaDataField.setValue(ColumnMetaData.java:445)
    at org.dbunit.dataset.ColumnMetaData.<init>(ColumnMetaData.java:90)
    at org.dbunit.util.SQLHelper.createColumn(SQLHelper.java:475)
    at org.dbunit.database.DatabaseTableMetaData.getColumns(DatabaseTableMetaData.java:333)
    

    This blog post suggests a potential cause:
    Reading LONG or LONG RAW from JDBC causes a “Stream has already been closed” exception:
    https://blog.jooq.org/2015/12/30/oracle-long-and-long-raw-causing-stream-has-already-been-closed-exception/

    Assuming the blog post's noted cause applies to dbUnit as well, we need to add the metadata caching feature in a way that doesn't break Oracle and MS SQL Server. The first implementation changed how dbUnit reads the ResultSet; we probably need to leave that as-is and cache around it.
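    (The "cache around the existing read" idea can be sketched as below: copy each metadata row into a plain object while the cursor is still on it, then serve later lookups from the copy. This mirrors the jOOQ advice that Oracle LONG columns can only be streamed once. FakeRow simulates that one-shot behavior; none of these names are dbUnit or JDBC API.)

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Eagerly snapshot a metadata row before the cursor moves on, so values
// backed by one-shot streams (Oracle LONG) stay readable from the cache.
public class EagerCopySketch {

    // A row whose values can be read only while it is the "current" row.
    public static class FakeRow {
        private boolean current = true;
        private final Map<String, String> values;
        public FakeRow(Map<String, String> values) { this.values = values; }
        public String getString(String col) {
            if (!current) {
                throw new IllegalStateException("Stream has already been closed");
            }
            return values.get(col);
        }
        public void advance() { current = false; } // cursor moved on
    }

    // Copy the row into an ordinary map while it is still streamable.
    public static Map<String, String> snapshot(FakeRow row, List<String> cols) {
        Map<String, String> copy = new HashMap<String, String>();
        for (String col : cols) {
            copy.put(col, row.getString(col)); // read now, not lazily later
        }
        return copy;
    }

    // Demo: after advance(), row.getString would throw, but the snapshot
    // taken beforehand still holds the value.
    public static String demo() {
        Map<String, String> vals = new HashMap<String, String>();
        vals.put("REMARKS", "wide table");
        FakeRow row = new FakeRow(vals);
        Map<String, String> snap = snapshot(row, Collections.singletonList("REMARKS"));
        row.advance();
        return snap.get("REMARKS");
    }
}
```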

     
  • Jeff Jensen

    Jeff Jensen - 2019-03-22

    Hi Joe, wondering if you have interest in making this work, since it is such a good performance improvement for wide tables?

     
