Thread: [Aoetools-discuss] GGAoEd request merging [was:GGAoEd - initial evaluation feedback]
From: Delian K. <kr...@kr...> - 2010-01-08 13:27:42
Hi,

>> On Wed, 6 Jan 2010 16:17:18 +0100 Gabor Gombas wrote:
>>> On Tue, Jan 05, 2010 at 07:04:31PM +0200, Delian Krustev wrote:
>>>
>>> > 1. I browsed the project on Google Code and then downloaded your 1.1
>>> > distribution. It was missing the handy "debian" directory which I'd
>>> > seen in the SVN repository. I don't know why that is, but it might be
>>> > handy to package it, at least for Debian users like me. (I've checked
>>> > out the tagged version and built the package from there.)
>>>
>>> Well, Debian developers currently disagree about whether it is a good
>>> idea to include the "debian" directory in the upstream tarball or not.
>>
>> If the package is included in Debian you might stop distributing it and
>> use the Debian package management facilities to manage the packaging
>> part.
>>
>> But since it's not, for now I thought it might be useful to others too.
>> Anyway, this was just a suggestion.
>>
>>> > 2. You might want to mention in the README the build dependency on
>>> > libblkid-dev. I didn't have it at first and the configuration step
>>> > failed.
>>>
>>> Thanks, I'll add that.
>>>
>>> > P.S. The motivation for testing ggaoed is a write performance issue
>>> > I've faced with vblade. In case you're interested you might look at:
>>> >
>>> > http://krustev.net/w/articles/Backup_service_and_software_block_devices_over_the_net/
>>>
>>> With ggaoed I expect you get much more even performance, since it uses
>>> direct I/O by default, and therefore avoids the read/modify/write
>>> cycles you get when using buffered I/O (like vblade does) and an MTU
>>> smaller than the page size.
>>
>> Unfortunately this is not the case. I've identified the bottleneck as
>> too many I/O transactions when using either vblade or ggaoed.
>
> IMHO the lack of jumbo frames is biting you.

That is for sure.

> Can't you borrow two jumbo-capable NICs for testing?

Unfortunately no. The servers are sitting in a data centre in a different
country, and changes to the hardware specification cannot easily be made.
The solution I need to implement is for this DC, so I need to find a
reasonable option.

> If the MTU is 1500, you have 2 sectors per request. If the MTU is 9000,
> you can have 17 sectors per request - that's a more than 8 times
> reduction in the number of I/O operations you're sending to the disk.

Yep. This is why I hoped to get the request merging working.

To illustrate the numbers, first the local test:

# dd if=/dev/zero of=/dev/mapper/vg0-nbd6.0 bs=1M count=1000 seek=100
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 8.18595 s, 128 MB/s

At the same time, on a nearby console, iostat shows:

Device:  tps       kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
sda      147.00    4.00       73420.00   4        73420
sdb      141.00    8.00       70724.00   8        70724
md8      38662.00  0.00       154648.00  0        154648
dm-2     38662.00  0.00       154648.00  0        154648

The physical devices (sda/sdb) do about 1/2 MB per transfer operation.

Then the AoE test:

# dd if=/dev/zero of=/dev/etherd/e6.0 bs=1M count=100 seek=100
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 4.36242 s, 24.0 MB/s

And the iostat results:

Device:  tps      kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
sda      2962.00  0.00       13145.00   0        13145
sdb      2918.00  0.00       12971.00   0        12971
md8      7658.00  0.00       25364.00   0        25364
dm-2     7658.00  0.00       25364.00   0        25364

So this time sda & sdb do about 4 kB per transfer operation.
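The sector counts Gabor quotes can be sanity-checked with a quick sketch. The ~24 bytes of AoE/ATA header overhead inside the MTU is an assumption here (the exact figure may differ by a few bytes, but it does not change the result):

```shell
# Sectors that fit in one AoE request for a given MTU.
# Assumption: ~24 bytes of AoE/ATA header overhead, 512-byte sectors.
for mtu in 1500 9000; do
    echo "MTU $mtu: $(( (mtu - 24) / 512 )) sectors per request"
done
```

This reproduces the 2 and 17 sectors mentioned above, and with it the roughly 8x difference in the number of I/O operations hitting the disk.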
# ggaoectl stats
# Statistics for device nbd6.0
read_cnt: 504
read_bytes: 516096
read_time: 4.79039
write_cnt: 204800
write_bytes: 209715200
write_time: 103.907
other_cnt: 34
other_time: 0.00017897
io_slots: 58789
io_runs: 58789
queue_length: 2128789
queue_stall: 0
queue_over: 0
ata_err: 0
proto_err: 0

# Statistics for interface eth1
rx_cnt: 205337
rx_bytes: 217120220
rx_runs: 58822
rx_buffers_full: 0
tx_cnt: 205338
tx_bytes: 12832088
tx_runs: 0
tx_buffers_full: 0
dropped: 0
ignored: 0
broadcast: 11

>> Then I decided to play with the ggaoed settings to see if I could get
>> this feature working:
>>
>>> Request merging: read/write requests for adjacent data blocks can
>>> be submitted as a single I/O request
>>
>>> You can also use "ggaoectl stats" and "ggaoectl monitor" to see how
>>> things are going; ggaoed has quite a few more knobs to tune than
>>> vblade.
>>
>> So I've tried various values for what I thought were the related
>> parameters:
>>
>> queue-length
>> max-delay
>> merge-delay
>>
>> (and the other params too)
>>
>> The number of I/O operations was always too high for decent
>> performance on a real block device. So I guess the request merging was
>> just not working in some case. I was not able to get more than 30 MB.
>
> You can check the output of "ggaoectl stats": for every exported
> device, the (read_cnt + write_cnt) / io_slots ratio gives how many
> requests could be merged on average.

From the numbers above: (504 + 204800) / 58789 = 3.49

Here goes my config:

# egrep -v '^(#|$)' /etc/ggaoed.conf
[defaults]
queue-length = 16
interfaces = eth1
direct-io = true
pid-file = /var/run/ggaoed.pid
control-socket = /var/run/ggaoed.sock
state-directory = /var/lib/ggaoed
ring-buffer-size = 0
send-buffer-size = 1024
receive-buffer-size = 1024
[acls]
[nbd6.0]
path = /dev/mapper/vg0-nbd6.0
shelf = 6
slot = 0

Please let me know if you have any comments on the numbers.

>> Disabling direct-io actually increased the performance in my case.
>>
>> I've also tried exporting an in-memory file, and this test easily
>> utilized around 900 Mbit/s of bandwidth.
>>
>> P.S. I was not able to find a public discussion board (a mailing
>> list?) for your project. Otherwise I would have posted there, since
>> the discussed information is not private in any way and could be of
>> interest to others.
>
> IMHO you can use the aoe...@li... list,
> especially if you're also testing vblade. The volume is quite low so I
> think there is no need to create a separate list for ggaoed.

Thanks. I'll post there.

I could conclude that I'm hitting a protocol limitation which you're
trying to work around with GGAoEd (request merging).

Cheers
--
Delian
From: Sam H. <sa...@co...> - 2010-01-08 14:59:32
> # Statistics for interface eth1
> rx_cnt: 205337
> rx_bytes: 217120220
> rx_runs: 58822
> rx_buffers_full: 0
> tx_cnt: 205338
> tx_bytes: 12832088
> tx_runs: 0
> tx_buffers_full: 0
> dropped: 0
> ignored: 0
> broadcast: 11

If rx_runs is an overrun condition (I'd guess), then you probably have a
flow control problem, with packets getting dropped at the NIC.

Sam
From: Gabor G. <go...@sz...> - 2010-01-15 09:37:28
On Fri, Jan 08, 2010 at 09:42:50AM -0500, Sam Hopkins wrote:

> > # Statistics for interface eth1
> > rx_cnt: 205337
> > rx_bytes: 217120220
> > rx_runs: 58822
> > rx_buffers_full: 0
> > tx_cnt: 205338
> > tx_bytes: 12832088
> > tx_runs: 0
> > tx_buffers_full: 0
> > dropped: 0
> > ignored: 0
> > broadcast: 11
>
> If rx_runs is an overrun condition (I'd guess), then you probably have
> a flow control problem with packets getting dropped at the nic.

No, rx_runs is roughly the number of times the daemon got woken up by
the kernel due to input data being available on the network socket. So
compared with rx_cnt, it says that the kernel queued roughly 3.5 packets
on average before ggaoed got to process them.

Gabor

--
---------------------------------------------------------
MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
---------------------------------------------------------
From: Gabor G. <go...@sz...> - 2010-01-08 15:50:53
On Fri, Jan 08, 2010 at 02:27:30PM +0200, Delian Krustev wrote:

> To illustrate the numbers, first the local test:
>
> # dd if=/dev/zero of=/dev/mapper/vg0-nbd6.0 bs=1M count=1000 seek=100
> 1000+0 records in
> 1000+0 records out
> 1048576000 bytes (1.0 GB) copied, 8.18595 s, 128 MB/s
>
> At the same time on the nearby console iostat shows:
>
> Device:  tps       kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
> sda      147.00    4.00       73420.00   4        73420
> sdb      141.00    8.00       70724.00   8        70724
> md8      38662.00  0.00       154648.00  0        154648
> dm-2     38662.00  0.00       154648.00  0        154648
>
> The physical devices (sda/sdb) do about 1/2 MB per transfer operation.

Yes, that's the default maximum request size for the disks, and since
"dd" gives nice big 1M requests to the kernel, it can easily submit such
large requests to the disks.

> Then do the AoE test:
>
> # dd if=/dev/zero of=/dev/etherd/e6.0 bs=1M count=100 seek=100
> 100+0 records in
> 100+0 records out
> 104857600 bytes (105 MB) copied, 4.36242 s, 24.0 MB/s
>
> And the iostat results:
>
> Device:  tps      kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
> sda      2962.00  0.00       13145.00   0        13145
> sdb      2918.00  0.00       12971.00   0        12971
> md8      7658.00  0.00       25364.00   0        25364
> dm-2     7658.00  0.00       25364.00   0        25364
>
> So this time sda & sdb do about 4 kB per transfer operation.

[...]

> # ggaoectl stats
> # Statistics for device nbd6.0
> read_cnt: 504
> read_bytes: 516096
> read_time: 4.79039
> write_cnt: 204800
> write_bytes: 209715200
> write_time: 103.907
> other_cnt: 34
> other_time: 0.00017897
> io_slots: 58789
> io_runs: 58789
> queue_length: 2128789
> queue_stall: 0
> queue_over: 0
> ata_err: 0
> proto_err: 0
>
> # Statistics for interface eth1
> rx_cnt: 205337
> rx_bytes: 217120220
> rx_runs: 58822
> rx_buffers_full: 0
> tx_cnt: 205338
> tx_bytes: 12832088
> tx_runs: 0
> tx_buffers_full: 0
> dropped: 0
> ignored: 0
> broadcast: 11

[...]

> From the numbers above: (504 + 204800) / 58789 = 3.49

That's basically the same ratio as rx_cnt/rx_runs. It means that on
average there were 3.49 packets in the memory mapped ring buffer
whenever ggaoed got woken up by the kernel, and ggaoed could almost
always merge them into a single I/O request.

So request merging works nicely; it's just that with an MTU of 1500 the
average request size is still only about 3.5 kB, less than a page size.
That's very small for a modern disk.

> I could conclude that I'm hitting a protocol limitation which you're
> trying to workaround with GGAoEd (request merging)

It's not a workaround but an optimization: request merging should happen
as high in the stack as possible, and it's certainly possible to do it
at the AoE daemon level. However, it's not a magic bullet.

Gabor

--
---------------------------------------------------------
MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
---------------------------------------------------------
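The two ratios Gabor compares can be reproduced directly from the stats quoted earlier in the thread:

```shell
# Average merge ratio: requests handled per I/O submission, and packets
# received per daemon wakeup (all numbers taken from "ggaoectl stats").
awk 'BEGIN {
    printf "device merge ratio: %.2f\n", (504 + 204800) / 58789
    printf "rx batching ratio:  %.2f\n", 205337 / 58822
}'
```

Both work out to 3.49, which is why the device-level merge ratio tracks rx_cnt/rx_runs: ggaoed merges essentially everything the kernel has batched for it, and no more.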
From: Gabor G. <go...@sz...> - 2010-01-08 15:59:14
On Fri, Jan 08, 2010 at 02:27:30PM +0200, Delian Krustev wrote:

> ring-buffer-size = 0
> send-buffer-size = 1024
> receive-buffer-size = 1024

Hmm, wait. You're using traditional send()/receive() instead of the
memory mapped ring buffer. Why?

Gabor

--
---------------------------------------------------------
MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
---------------------------------------------------------
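For reference, the setting in question lives in /etc/ggaoed.conf; a minimal sketch of the relevant fragment (the values are illustrative, not recommendations, and the KiB unit is inferred from the log message later in the thread):

```
[defaults]
interfaces = eth1
direct-io = true
# 0 selects plain send/receive buffers; a non-zero size (KiB assumed)
# enables the memory mapped ring buffer instead
ring-buffer-size = 2048
```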
From: Delian K. <kr...@kr...> - 2010-01-08 16:54:52
On Fri, 8 Jan 2010 16:59:04 +0100 Gabor Gombas wrote:

> On Fri, Jan 08, 2010 at 02:27:30PM +0200, Delian Krustev wrote:
>
> > ring-buffer-size = 0
> > send-buffer-size = 1024
> > receive-buffer-size = 1024
>
> Hmm, wait. You're using traditional send()/receive() instead of the
> memory mapped ring buffer. Why?

I disabled it due to:

Jan 6 15:38:24 ggaoed[1360]: net/eth1: Failed to set up the TX ring buffer: Protocol not available
Jan 6 15:38:24 ggaoed[1360]: net/eth1: Set up 2048 KiB ring buffer (1344 RX/0 TX packets)

It seems to me that, when it is enabled in the ggaoed config, it is only
enabled for received packets. I was not able to notice any performance
difference between the two, and disabled it in order to have the same
settings for RX/TX.

Cheers
--
Delian
From: Gabor G. <go...@sz...> - 2010-01-15 09:18:40
Hi,

On Fri, Jan 08, 2010 at 06:54:42PM +0200, Delian Krustev wrote:

> I've disabled it due to:
>
> Jan 6 15:38:24 ggaoed[1360]: net/eth1: Failed to set up the TX ring buffer: Protocol not available
> Jan 6 15:38:24 ggaoed[1360]: net/eth1: Set up 2048 KiB ring buffer (1344 RX/0 TX packets)

So you have a kernel older than 2.6.31.

> It seems to me that (when it is enabled in the ggaoed config) it is
> only enabled for the received packets.

Yes, the sending side needs a newer kernel.

> I was not able to notice any performance differences between the two,
> and disabled it in order to have the same settings for RX/TX.

The effect is less CPU time used if you're handling a large number of
packets.

Gabor

--
---------------------------------------------------------
MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
---------------------------------------------------------
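The kernel requirement can be checked before enabling the TX ring. A small sketch, where the 2.6.31 threshold comes from the message above and `sort -V` is assumed to be available (it is a GNU coreutils feature):

```shell
# Succeeds when the given kernel version is >= 2.6.31, the first
# release with the TX ring (PACKET_TX_RING) support ggaoed needs.
tx_ring_ok() {
    [ "$(printf '%s\n%s\n' 2.6.31 "$1" | sort -V | head -n1)" = "2.6.31" ]
}

if tx_ring_ok "$(uname -r | cut -d- -f1)"; then
    echo "kernel supports the TX ring buffer"
else
    echo "kernel too old for the TX ring buffer"
fi
```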
From: Delian K. <kr...@kr...> - 2010-01-08 17:01:43
On Fri, 8 Jan 2010 16:50:41 +0100 Gabor Gombas wrote:

> > I could conclude that I'm hitting a protocol limitation which you're
> > trying to workaround with GGAoEd (request merging)
>
> It's not a workaround, but an optimization: request merging should
> happen as high in the stack as possible, and it's certainly possible
> to do it at the AoE daemon level. However it's not a magic bullet.

I just hoped that more requests would be merged, and thus played with:

queue-length
max-delay
merge-delay

I was not able to get better performance than with the default values,
though.

Why is it not possible to have more requests merged? E.g. what seems
logical to me is to have a bigger queue, increase the delays, and have
more requests merged.

Cheers
--
Delian
From: Gabor G. <go...@sz...> - 2010-01-15 09:33:54
On Fri, Jan 08, 2010 at 07:01:34PM +0200, Delian Krustev wrote:

> I just hoped that more requests would be merged, and thus played with:
>
> queue-length
> max-delay
> merge-delay
>
> I was not able to get better performance than with the default values,
> though.
>
> Why is it not possible to have more requests merged? E.g. what seems
> logical to me is to have a bigger queue, increase the delays, and have
> more requests merged.

What ggaoed does is basically:

- read from the network until the kernel says "no more data"
- merge the requests if possible, and submit them in one go

From your stats, on average the kernel queued about 3.5 packets by the
time ggaoed got woken up, and it indeed could merge those requests
almost all of the time. If you want more merging, you have to increase
the number of requests queued by the kernel. The queue length does not
have a direct effect on this, although it certainly limits the _maximum_
merging that can be performed.

The 'merge-delay' parameter tells ggaoed not to start the I/O
immediately when no more incoming data is available, but to wait for
the specified time to allow receiving more packets and do more merging.
'merge-delay' is, however, directly added to the latency the clients
experience for a single request, so setting it too high will also kill
performance.

I consider 'merge-delay' experimental at this time, as I do not know
whether it really helps and I have no time to do extensive testing. But
if you want to play with it: given a queue size of N, measure/calculate
how much time it takes for the client to send N/2 packets over the
wire, and then set 'merge-delay' to this value. Then you can increase
or decrease it slightly to see if it has any effect.

The other option would be to tune your network driver to wait for more
incoming packets before notifying the operating system, if it has such
a capability.

Gabor

--
---------------------------------------------------------
MTA SZTAKI Computer and Automation Research Institute
Hungarian Academy of Sciences
---------------------------------------------------------
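Gabor's suggested starting point can be estimated with a quick sketch. The queue size (16, from the config earlier in the thread), the 1500-byte MTU, and a 1 Gbit/s link are assumptions here; 38 bytes is the standard Ethernet framing overhead per frame (preamble, headers, FCS, inter-frame gap):

```shell
# Time for the client to put queue-length/2 full-size frames on the
# wire, as a rough initial value for merge-delay.
awk 'BEGIN {
    queue = 16; mtu = 1500; link_bps = 1e9
    wire_bytes = mtu + 38     # payload plus Ethernet framing overhead
    usec = (queue / 2) * wire_bytes * 8 / link_bps * 1e6
    printf "merge-delay starting point: ~%.0f us\n", usec
}'
```

Under these assumptions that comes out to roughly 100 microseconds, a value to tune up or down experimentally as Gabor describes.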