I am currently using jtds for a couple of jobs involving a (relatively) great number of updates, so I've been particularly careful on using jdbc features related with batching and group fetching (the spring framework really helps, with this).
Some time ago I developed a benchmark application which I ran over three databases, and jtds seemed to perform fairly well: the transfer of various tens of thousands of lines ran in a significantly lesser portion of the time that the ms jdbc driver required (almost half of it), and the performance was acceptable to us.
But it turns out that the connection speed towards those databases was much faster than the one we have in production now, and with the same benchmark both jtds and the ms driver show now very slow performance. In this scenario the ms driver can be even 20 times slower than datadirect one: in the same situation jtds works a little bit better, but takes 15 times the time datadirect's driver needs, just the same.
For a single transfer (we need to do almost 20 of them) over a "relatively" fast connection (i.e. not the worst one), the aggregate times are as follows:
jtds 0.9: 40897 rows inserted in 524362ms, 12.8215275ms per row
ms jdbc: 40897 rows inserted in 692571ms, 16.934519ms per row
datadirect: 40897 rows inserted in 44328ms, 1.0838937ms per row
My question is: has anything changed in the batch implementation from jtds 0.8 to 0.9? Such a difference seems enormous, to me. I didn't have the time to look at the source code but a quick sniffing with ethereal seems to show that the driver does not do a real batching, but sends one line after the other (or at least the server answers after every line). Is this an optimization problem or maybe I missed some switch? Is there something I could to in order to help optimizing batch updates?
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Nothing has changed in the batch implementation performance-wise from 0.8 to 0.9. It's just that batch updates were never implemented as they should have been.
In order to obtain the kind of performance you see with the DataDirect driver, queries have to be queued and sent all at once, when executeBatch() is called, in one single request. Instead, in our implementation they are indeed sent only when executeBatch() is called, but each one in a separate request and the response has to be read before sending the next request. This means that with a batch of 100 queries, 100 (admittedly small) request-response operations have to be made instead of one large request and one large response. This means that if the network latency is large (i.e. it takes a lot of time for a request to reach the server and the response to come back) performance will be greatly affected.
As an example, let's say network latency is about 10 milliseconds in your case. This means that with jTDS client plus server processing takes 2.8 milliseconds per query, with the MS driver it takes 6.9 milliseconds (this is the real performance difference between jTDS and MS), while with the DataDirect driver latency doesn't affect performance very much, because those 10 milliseconds are divided between 100 queries so the overhead introduced by network latency is of about 0.1 millisecond per query. This is what batch updates tries to avoid.
Now jTDS is designed to only build and send packets consisting of single queries so it will need some work in order to be able to correctly support batch updates. We are going to do this when time allows, but right now we're trying to work around some bugs introduced by new character set handling in SQL Server 2000 (which we have just began to completely support in release 0.9). If you have the time and want to, you could try to implement the new functionality yourself and submit it as a patch or just look through the code (it's very well structured and documented) and suggest how it could be done. Otherwise I'll let you know when I start working on it.
Alin.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Ok, I see. The only problem is that I do not know anything about sql server protocol! Is there any resource I could read in order to have an idea of what could be needed to send batch updates?
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
It's not complicated at all from a protocol point of view. Basically there are two levels in the TDS protocol: a low, network level and a higher communication level. You can find (almost) all the information you need on the FreeTDS and jTDS web pages: http://www.freetds.org/tds.html and http://jtds.sourceforge.net/TDS.html
There's even more bits of information to be found if you search for "TDS" with Google.
I might be wrong here but the basic idea is that you can send as many high level requests in a single network level packet, i.e. all one needs to do is to not build and send a network packet after the first query, but rather queue all queries sending "not last" network packets until the next to last one and then in the end sending one "last" network packet. The server will then execute all queued requests and return the results just like the queries were all part of a single big query (i.e. a large concatenated string). I think that's all it is to it (I might be wrong with the details but this is the big picture).
I've seen this behavior using our very basic protocol monitoring tools (from the tools package) and you can use them too. Just get a client that you know supports batch updates, start our TDS proxy and direct the client to the proxy. It will log all communication and you will be able to see the details. BTW, a sample dump could help anyone who's going to implement this in the end.
Hope this helps,
Alin.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Ah, wonderful, I didn't know you have a protocol dissector. These days I'm furiously busy putting in production our new system (and I'm afraid we'll have to buy one of those 3200$ dd licenses... :|), but as soon as I'll have some rest I'll give an eye to this stuff: as you say it doesn't seem too complicate.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
FYI, on this same batch update issue, today I've tried all other commercial drivers I'm aware of. It seems that INET Merlia and NewAtlanta JTurbo both "fake" it exactly as jtds (and microsoft jdbc), only with performance much worse! The only other driver that states supporting "correct" batch updates, Jnetdirect JSQLConnect... was not able to execute my test, giving an obscure exception about prepared statement parameters not set (same exact code all other drivers were able to perform...). I think I'll try to take a look at the implementation soon: too bad we're pressed with a solution now.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
I think you might want to take another look at the i-net driver. As far as I know it actually implements batch updates correctly, maybe it needs a URL parameter for that, not sure. And according to my experience it's the next fastest driver after jTDS.
Alin.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Actually it's more than that: it's 3200, and it's a per-CPU license. I will soon explore other alternatives (inetdirect, aveconnect, and so on): too bad I work for a pretty traditionalist company, I'd like we could direclty fund jtds development and help you financially to improve the driver. I'll see what we can do in other ways.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
I am currently using jtds for a couple of jobs involving a (relatively) great number of updates, so I've been particularly careful on using jdbc features related with batching and group fetching (the spring framework really helps, with this).
Some time ago I developed a benchmark application which I ran over three databases, and jtds seemed to perform fairly well: the transfer of various tens of thousands of lines ran in a significantly lesser portion of the time that the ms jdbc driver required (almost half of it), and the performance was acceptable to us.
But it turns out that the connection speed towards those databases was much faster than the one we have in production now, and with the same benchmark both jtds and the ms driver show now very slow performance. In this scenario the ms driver can be even 20 times slower than datadirect one: in the same situation jtds works a little bit better, but takes 15 times the time datadirect's driver needs, just the same.
For a single transfer (we need to do almost 20 of them) over a "relatively" fast connection (i.e. not the worst one), the aggregate times are as follows:
jtds 0.9: 40897 rows inserted in 524362ms, 12.8215275ms per row
ms jdbc: 40897 rows inserted in 692571ms, 16.934519ms per row
datadirect: 40897 rows inserted in 44328ms, 1.0838937ms per row
My question is: has anything changed in the batch implementation from jtds 0.8 to 0.9? Such a difference seems enormous, to me. I didn't have the time to look at the source code but a quick sniffing with ethereal seems to show that the driver does not do a real batching, but sends one line after the other (or at least the server answers after every line). Is this an optimization problem or maybe I missed some switch? Is there something I could to in order to help optimizing batch updates?
Davide,
Nothing has changed in the batch implementation performance-wise from 0.8 to 0.9. It's just that batch updates were never implemented as they should have been.
In order to obtain the kind of performance you see with the DataDirect driver, queries have to be queued and sent all at once, when executeBatch() is called, in one single request. Instead, in our implementation they are indeed sent only when executeBatch() is called, but each one in a separate request and the response has to be read before sending the next request. This means that with a batch of 100 queries, 100 (admittedly small) request-response operations have to be made instead of one large request and one large response. This means that if the network latency is large (i.e. it takes a lot of time for a request to reach the server and the response to come back) performance will be greatly affected.
As an example, let's say network latency is about 10 milliseconds in your case. This means that with jTDS client plus server processing takes 2.8 milliseconds per query, with the MS driver it takes 6.9 milliseconds (this is the real performance difference between jTDS and MS), while with the DataDirect driver latency doesn't affect performance very much, because those 10 milliseconds are divided between 100 queries so the overhead introduced by network latency is of about 0.1 millisecond per query. This is what batch updates tries to avoid.
Now jTDS is designed to only build and send packets consisting of single queries so it will need some work in order to be able to correctly support batch updates. We are going to do this when time allows, but right now we're trying to work around some bugs introduced by new character set handling in SQL Server 2000 (which we have just began to completely support in release 0.9). If you have the time and want to, you could try to implement the new functionality yourself and submit it as a patch or just look through the code (it's very well structured and documented) and suggest how it could be done. Otherwise I'll let you know when I start working on it.
Alin.
Ok, I see. The only problem is that I do not know anything about sql server protocol! Is there any resource I could read in order to have an idea of what could be needed to send batch updates?
Davide,
It's not complicated at all from a protocol point of view. Basically there are two levels in the TDS protocol: a low, network level and a higher communication level. You can find (almost) all the information you need on the FreeTDS and jTDS web pages: http://www.freetds.org/tds.html and http://jtds.sourceforge.net/TDS.html
There's even more bits of information to be found if you search for "TDS" with Google.
I might be wrong here but the basic idea is that you can send as many high level requests in a single network level packet, i.e. all one needs to do is to not build and send a network packet after the first query, but rather queue all queries sending "not last" network packets until the next to last one and then in the end sending one "last" network packet. The server will then execute all queued requests and return the results just like the queries were all part of a single big query (i.e. a large concatenated string). I think that's all it is to it (I might be wrong with the details but this is the big picture).
I've seen this behavior using our very basic protocol monitoring tools (from the tools package) and you can use them too. Just get a client that you know supports batch updates, start our TDS proxy and direct the client to the proxy. It will log all communication and you will be able to see the details. BTW, a sample dump could help anyone who's going to implement this in the end.
Hope this helps,
Alin.
Ah, wonderful, I didn't know you have a protocol dissector. These days I'm furiously busy putting in production our new system (and I'm afraid we'll have to buy one of those 3200$ dd licenses... :|), but as soon as I'll have some rest I'll give an eye to this stuff: as you say it doesn't seem too complicate.
Davide,
I'm sorry to hear you'll have to spend that much, but unfortunately I have to agree that this something that jTDS needs more work on.
So, as soon as you have the time to look into it, let us know what you think.
Alin.
FYI, on this same batch update issue, today I've tried all other commercial drivers I'm aware of. It seems that INET Merlia and NewAtlanta JTurbo both "fake" it exactly as jtds (and microsoft jdbc), only with performance much worse! The only other driver that states supporting "correct" batch updates, Jnetdirect JSQLConnect... was not able to execute my test, giving an obscure exception about prepared statement parameters not set (same exact code all other drivers were able to perform...). I think I'll try to take a look at the implementation soon: too bad we're pressed with a solution now.
Davide,
I think you might want to take another look at the i-net driver. As far as I know it actually implements batch updates correctly, maybe it needs a URL parameter for that, not sure. And according to my experience it's the next fastest driver after jTDS.
Alin.
Actually it's more than that: it's 3200, and it's a per-CPU license. I will soon explore other alternatives (inetdirect, aveconnect, and so on): too bad I work for a pretty traditionalist company, I'd like we could direclty fund jtds development and help you financially to improve the driver. I'll see what we can do in other ways.