Thread: Re: [Aoetools-discuss] vblade-22-rc1 is first release candidate for version 22
From: David L. <ta...@gm...> - 2014-07-13 22:24:03
So I do find it interesting to have a configuration option to limit the size of the read/write request, but it seems like it would be useful to understand the side effects and why someone would want to do this. Catalin suggested that reducing the size of the jumbo frames decreases latency and improves boot times, and said that the system "feels more responsive". This is where I have a problem, though, because something "feeling" more responsive is not very satisfying. It would be better to have some hard numbers behind what this change does.

AoE using normal Ethernet frames ends up having a protocol efficiency of only 89.82%, which on 1Gb Ethernet would give you a theoretical maximum throughput of ~112 MB/s. Going up to a 9000 byte frame bumps the efficiency to 98.68% and a theoretical max throughput of ~123 MB/s. Something interesting about jumbo frames, though, is that they end up being able to request 17 sectors of data per request.

Why is this interesting? Because on some Linux systems a page size is 4096 bytes, or 8 sectors, so the 17 sectors work out to 2 full pages plus touching into another page. If you are not using direct IO but instead letting Linux manage the underlying file system, then it would seem like you will end up making unaligned IO requests of the system, causing additional I/Os to be issued. This might be the reason for the latency effects, and it would be interesting to get the numbers that Catalin may have from his tests... I wouldn't mind seeing results for 17, 16, and 8 sector count requests.

But what I don't understand is that if the throughput is 80 MB/s and drops to 60 MB/s as Catalin suggests, then I don't get how a 20 MB/s drop in throughput would make the system more responsive... I also don't understand what the test setup would be to even measure the effects on latency and throughput and have them correlate to responsiveness?

David
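The frame arithmetic behind those numbers can be sketched in a few lines of Python. This is a rough model that only charges the data-carrying frame, assuming the standard header sizes (14-byte Ethernet header, 10-byte AoE header, 12-byte ATA command section) plus FCS, preamble and inter-frame gap; David's slightly lower efficiency figures presumably also account for the small command frame travelling in the other direction, so the exact percentages differ, but the 2-sector versus 17-sector payload split and the rough throughput ceilings come out similarly.

    # Rough AoE-over-Ethernet efficiency model (assumptions noted above).
    ETH_HDR, AOE_HDR, ATA_HDR = 14, 10, 12   # header bytes
    FCS, PREAMBLE, IFG = 4, 8, 12            # per-frame wire overhead
    SECTOR = 512

    def aoe_frame_stats(mtu, line_rate_mbit=1000):
        payload = mtu - AOE_HDR - ATA_HDR           # room left for sector data
        sectors = payload // SECTOR                 # whole sectors per AoE command
        data = sectors * SECTOR
        wire = PREAMBLE + ETH_HDR + AOE_HDR + ATA_HDR + data + FCS + IFG
        efficiency = data / wire
        mb_per_s = line_rate_mbit / 8 * efficiency  # decimal MB/s
        return sectors, efficiency, mb_per_s

    for mtu in (1500, 9000):
        s, e, t = aoe_frame_stats(mtu)
        print("MTU %5d: %2d sectors/frame, %5.2f%% efficiency, ~%.0f MB/s"
              % (mtu, s, e * 100, t))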
From: David & L. L. <dl...@po...> - 2014-07-15 02:16:22
Killer,

That confirms some of my suspicions. In my testing I can see requests for 1024 sectors (512K) of data from the hard drive, which the AoE client would have to carve up into individual read/write requests that can fit into an AoE packet. At the server, each of these requests would appear as an individual read/write request of the disk, so if you followed the optimal packet usage for Ethernet for AoE you would end up with a jumbo frame request for 17 sectors. The initial 1024 sector request at the client would start on an aligned boundary for the first two 4k "sectors" but then have a trailing 512 byte sector request, which will cause the next 7 requests to start off unaligned and end aligned... so a 1024 sector request from the host OS will result in only 1 out of 8 requests starting on an aligned boundary.

Since the AoE client driver is handling disk requests from the host OS, the host OS is going to assume certain things about the disk and try to issue properly aligned requests. I even think I've seen that if an application is going to write to sector 1, the host will read (or page) in the 4k chunk starting at sector 0 and then write out the 4k chunk at sector 0 with the modification of sector 1.

It seems like if we wanted to ensure alignment and support this configurable "max sector count" request size, the size we would want would be 16, to keep these large requests aligned and to ensure maximum efficiency for disk usage at the server. But this goes back to some of my original questions:

1) What is the test setup to determine the results of changing the max request size?
2) How does one measure latency and "responsiveness"?

David
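The 1-in-8 figure is easy to reproduce: carve a 1024-sector request that starts on a page boundary into fixed-size AoE commands and count how many of them start on a 4 KiB boundary. A small Python sketch, using the 1024-sector request and the 17-, 16- and 8-sector command sizes from the messages above:

    # Carve a 1024-sector (512 KiB) host request into fixed-size AoE commands
    # and count how many of them start on a 4 KiB (8-sector) boundary.
    REQUEST_SECTORS = 1024
    PAGE_SECTORS = 8            # 4096-byte page / 512-byte sector

    for chunk in (17, 16, 8):
        starts = range(0, REQUEST_SECTORS, chunk)
        aligned = sum(1 for s in starts if s % PAGE_SECTORS == 0)
        print("%2d-sector commands: %3d of %3d start 4 KiB-aligned"
              % (chunk, aligned, len(starts)))
    # 17-sector commands: 8 of 61 aligned (about 1 in 8);
    # 16- and 8-sector commands: all aligned.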
From: Ed C. <ed....@ac...> - 2014-07-16 01:12:35
For measuring latency and responsiveness, fio is a great tool. It's by the maintainer of the Linux kernel's block layer. You can even export its data easily to data analysis software like GNU R or a Python script that uses pandas.

There's also a feature of the Linux kernel that I've never tried, but you might be interested in: the block layer has a tracing feature in recent kernels, and there's a blktrace tool that works with it.
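As a concrete example of the fio-plus-pandas route, here is one possible sketch. It assumes fio was run with JSON output, for instance something like "fio --name=randread --filename=/dev/etherd/e0.0 --rw=randread --bs=8k --iodepth=4 --runtime=30 --output-format=json --output=result.json" (the device path and job parameters are only illustrative), and that the completion-latency fields follow the layout of reasonably recent fio releases; the field names (clat_ns versus clat) differ across versions, which the sketch allows for.

    # Minimal sketch: load a fio JSON result and tabulate per-job latency stats
    # with pandas.  Field names are based on recent fio releases and may need
    # adjusting for older ones.
    import json
    import pandas as pd

    with open("result.json") as f:
        result = json.load(f)

    rows = []
    for job in result["jobs"]:
        for direction in ("read", "write"):
            stats = job.get(direction, {})
            lat = stats.get("clat_ns") or stats.get("clat") or {}
            rows.append({
                "job": job.get("jobname"),
                "dir": direction,
                "iops": stats.get("iops"),
                "bw_KiB_s": stats.get("bw"),
                "lat_mean": lat.get("mean"),
                "lat_p99": (lat.get("percentile") or {}).get("99.000000"),
            })

    print(pd.DataFrame(rows))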
From: Killer{R} <su...@ki...> - 2014-07-14 19:29:59
Hello David,

IMHO the problem is caused not only by the page size, but also by the HDD's sector size. Nowadays HDDs have a 4K physical sector size. They still support access in 512 byte units, but this is inefficient: every unaligned read that doesn't fit into a 4K sector results in a 4K read, and every unaligned write causes the disk to read the sector's data, modify it internally in its buffers and then write it back. Sure, the firmware tries to do this in the fastest way possible, but my tests show about a 20..30% sequential write speed degradation (with O_DIRECT) when writing 4K blocks if the beginning of each block is not also aligned to 4K. So simply using jumbo frames is not enough to make the hardware work as fast as it can.

The AoE protocol doesn't support 4K sectors directly, because it has to support the 'normal' MTU and not only jumbo frames. However, it's theoretically possible to make the initiator report to the OS that it is a '4K sector drive', and a proper ('4K sector aware' :) ) OS will then access it in 4K-aligned portions, which together with some buffering at the target's side should make it all work faster :). But it all looks like a tricky workaround.

--
Best regards,
 Killer{R}                            mailto:su...@ki...
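A rough way to reproduce that kind of measurement is to time O_DIRECT writes of 4 KiB blocks that start on a 4 KiB boundary against the same writes shifted by one 512-byte sector. The sketch below is Linux-only; the scratch file path is a placeholder and must live on a real disk (tmpfs rejects O_DIRECT), and the absolute numbers depend heavily on the drive, filesystem and kernel. A fairer comparison would preallocate the file or add a warm-up pass, but even this rough version should be enough to expose the read-modify-write penalty described above on a drive with 4K physical sectors.

    # Compare aligned vs. 512-byte-shifted O_DIRECT writes of 4 KiB blocks.
    import mmap
    import os
    import time

    TEST_FILE = "/var/tmp/rmw-test.bin"    # placeholder: must be on a real disk
    BLOCK = 4096
    COUNT = 2048                           # 2048 x 4 KiB = 8 MiB per pass

    def timed_pass(start_offset):
        fd = os.open(TEST_FILE, os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o600)
        # O_DIRECT needs an aligned buffer; an anonymous mmap is page-aligned.
        buf = mmap.mmap(-1, BLOCK)
        buf.write(b"\xa5" * BLOCK)
        t0 = time.perf_counter()
        for i in range(COUNT):
            os.pwrite(fd, buf, start_offset + i * BLOCK)
        os.close(fd)
        return time.perf_counter() - t0

    print("4 KiB-aligned writes : %.2f s" % timed_pass(0))
    print("shifted by 512 bytes : %.2f s" % timed_pass(512))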
From: Ed C. <ed....@ac...> - 2014-07-15 01:47:10
On 07/13/2014 06:23 PM, David Leach wrote:
> It would be better to have some hard numbers behind what this change does.

Yes, I agree. If Catalin posts the patch here, then perhaps any interested parties would be able to gather some data.

[Leach correctly notes that some jumbos carry ...]
> 17 sectors of data per request.

There is often a lot going on there. For example, if the initiator host is using a filesystem, then writes will dirty pages of memory that are buffering the data from the AoE device. The virtual memory subsystem will flush that data when it gets around to it, using whatever chunks it likes, then the block layer will probably consolidate or split the I/O as it likes inside the I/O scheduler, and only then will the aoe initiator get the data.

But the aoe driver will set up network buffers (sk_buff structures) that point right into the memory associated with the I/O. The network card itself often does the transfer from RAM into the card and vice versa. I'm not sure there's a significant penalty paid for telling the NIC to DMA seventeen sectors. It would be a good test to do in the aoe driver with a few different representative NICs.

Further, on the target side, there's no guarantee that the target will do the I/O in exactly the same chunks that appear in the AoE packets. Even disk drives have elevator algorithms scheduling I/O from write buffers.

I agree that test results here would be interesting, but a big "Your Mileage May Vary" should accompany the results.

--
  Ed
From: David L. <ta...@gm...> - 2014-07-15 05:29:20
Ed,

I'm less concerned about the initiator side, as we don't really have direct control over what it requests. What I am suggesting is that the requests from the host on the initiator will likely be aligned, due to how its file system works to try to keep things efficient. If we then cause the resulting AoE requests to the server to be unaligned accesses, that will likely cause additional IO transactions against the file system, which in turn would likely add latency to the responses to these requests.

David
From: Ed C. <ed....@ac...> - 2014-07-16 01:28:36
On 07/15/2014 01:29 AM, David Leach wrote:
> If we then cause the resulting AoE requests to the server to be unaligned
> accesses, that will likely cause additional IO transactions against the
> file system, which in turn would likely add latency to the responses to
> these requests.

As long as you don't specify the sync or direct options, though, the vblade will write to a buffered backing store. Then the ultimate backing store (e.g., disk drive), the ultimate driver (e.g., the SCSI layer), the block layer, the middle layer (e.g., dm and md), the VM subsystem and (if it's a file) the filesystem will get a chance to merge and align the I/O.

--
  Ed