Thread: Re: [Aoetools-discuss] vblade-22-rc1 is first release candidate for version 22
From: David L. <ta...@gm...> - 2014-07-13 22:24:03
So I do find it interesting to have a configuration option to limit the size of the read/write request, but it seems like it would be useful to understand the side effects and why someone would want to do this. Catalin suggested that reducing the size of the jumbo frames decreases latency and improves boot times, and said that the system "feels more responsive". This is where I have a problem, though, because something "feeling" more responsive is not very satisfying. It would be better to have some hard numbers behind what this change does.

AoE using normal Ethernet frames ends up having a protocol efficiency of only 89.82%, which on 1Gb Ethernet would give you a theoretical maximum throughput of ~112 MB/s. Going up to a 9000 byte frame bumps the efficiency to 98.68% and a theoretical max throughput of ~123 MB/s. Something interesting about jumbo frames, though, is that they end up being able to request 17 sectors of data per request.

Why is this interesting? Because on some Linux systems a page size is 4096 bytes, or 8 sectors, so the 17 sectors work out to 2 full pages plus touching into another page. If you are not using direct IO but instead letting Linux manage the underlying file system, then it would seem like you will end up making unaligned IO requests of the system, causing additional I/Os to be issued. This might be the reason for the latency effects, and it would be interesting to get the numbers that Catalin may have from his tests... I wouldn't mind seeing results for 17, 16, and 8 sector count requests.

But what I don't understand is that if the throughput is 80 MB/s and drops to 60 MB/s as Catalin suggests, then I don't get how a 20 MB/s drop in throughput would make the system more responsive... I also don't understand what the test setup would be to even measure the effects on latency and throughput and have them correlate to responsiveness?

David
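The frame arithmetic behind those numbers can be sketched in a few lines of Python. This is a rough model that only charges the data-carrying frame, assuming the standard header sizes (14-byte Ethernet header, 10-byte AoE header, 12-byte ATA command section) plus FCS, preamble and inter-frame gap; David's slightly lower efficiency figures presumably also account for the small command frame travelling in the other direction, so the exact percentages differ, but the 2-sector versus 17-sector payload split and the rough throughput ceilings come out similarly.

    # Rough AoE-over-Ethernet efficiency model (assumptions noted above).
    ETH_HDR, AOE_HDR, ATA_HDR = 14, 10, 12   # header bytes
    FCS, PREAMBLE, IFG = 4, 8, 12            # per-frame wire overhead
    SECTOR = 512

    def aoe_frame_stats(mtu, line_rate_mbit=1000):
        payload = mtu - AOE_HDR - ATA_HDR           # room left for sector data
        sectors = payload // SECTOR                 # whole sectors per AoE command
        data = sectors * SECTOR
        wire = PREAMBLE + ETH_HDR + AOE_HDR + ATA_HDR + data + FCS + IFG
        efficiency = data / wire
        mb_per_s = line_rate_mbit / 8 * efficiency  # decimal MB/s
        return sectors, efficiency, mb_per_s

    for mtu in (1500, 9000):
        s, e, t = aoe_frame_stats(mtu)
        print("MTU %5d: %2d sectors/frame, %5.2f%% efficiency, ~%.0f MB/s"
              % (mtu, s, e * 100, t))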
From: David & L. L. <dl...@po...> - 2014-07-15 02:16:22
Killer,

That confirms some of my suspicions. In my testing I can see requests for 1024 sectors (512K) of data from the hard drive, which the AoE client would have to carve up into individual read/write requests that can fit into an AoE packet. At the server, each of these requests would appear as an individual read/write request of the disk, so if you followed the optimal packet usage for Ethernet for AoE you would end up with a jumbo frame request for 17 sectors. The initial 1024 sector request at the client would start on an aligned boundary for the first two 4k "sectors" but then have a trailing 512 byte sector request, which will cause the next 7 requests to start off unaligned and end aligned... so a 1024 sector request from the host OS will result in only 1 out of 8 requests starting on an aligned boundary.

Since the AoE client driver is handling disk requests from the host OS, the host OS is going to assume certain things about the disk and try to issue properly aligned requests. I even think I've seen that if an application is going to write to sector 1, the host will read (or page) in the 4k chunk starting at sector 0 and then write out the 4k chunk at sector 0 with the modification of sector 1.

It seems like if we wanted to ensure alignment and support this configurable "max sector count" request size, the size we would want would be 16, to keep these large requests aligned and to ensure maximum efficiency for disk usage at the server. But this goes back to some of my original questions:

1) What is the test setup to determine the results of changing the max request size?
2) How does one measure latency and "responsiveness"?

David
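The 1-in-8 figure is easy to reproduce: carve a 1024-sector request that starts on a page boundary into fixed-size AoE commands and count how many of them start on a 4 KiB boundary. A small Python sketch, using the 1024-sector request and the 17-, 16- and 8-sector command sizes from the messages above:

    # Carve a 1024-sector (512 KiB) host request into fixed-size AoE commands
    # and count how many of them start on a 4 KiB (8-sector) boundary.
    REQUEST_SECTORS = 1024
    PAGE_SECTORS = 8            # 4096-byte page / 512-byte sector

    for chunk in (17, 16, 8):
        starts = range(0, REQUEST_SECTORS, chunk)
        aligned = sum(1 for s in starts if s % PAGE_SECTORS == 0)
        print("%2d-sector commands: %3d of %3d start 4 KiB-aligned"
              % (chunk, aligned, len(starts)))
    # 17-sector commands: 8 of 61 aligned (about 1 in 8);
    # 16- and 8-sector commands: all aligned.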
From: Ed C. <ed....@ac...> - 2014-07-16 01:12:35
For measuring latency and responsiveness, fio is a great tool. It's by the maintainer of the Linux kernel's block layer. You can even export its data easily to data analysis software like GNU R or a Python script that uses pandas.

There's also a feature of the Linux kernel that I've never tried, but you might be interested in: the block layer has a tracing feature in recent kernels, and there's a blktrace tool that works with it.
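As a concrete example of the fio-plus-pandas route, here is one possible sketch. It assumes fio was run with JSON output, for instance something like "fio --name=randread --filename=/dev/etherd/e0.0 --rw=randread --bs=8k --iodepth=4 --runtime=30 --output-format=json --output=result.json" (the device path and job parameters are only illustrative), and that the completion-latency fields follow the layout of reasonably recent fio releases; the field names (clat_ns versus clat) differ across versions, which the sketch allows for.

    # Minimal sketch: load a fio JSON result and tabulate per-job latency stats
    # with pandas.  Field names are based on recent fio releases and may need
    # adjusting for older ones.
    import json
    import pandas as pd

    with open("result.json") as f:
        result = json.load(f)

    rows = []
    for job in result["jobs"]:
        for direction in ("read", "write"):
            stats = job.get(direction, {})
            lat = stats.get("clat_ns") or stats.get("clat") or {}
            rows.append({
                "job": job.get("jobname"),
                "dir": direction,
                "iops": stats.get("iops"),
                "bw_KiB_s": stats.get("bw"),
                "lat_mean": lat.get("mean"),
                "lat_p99": (lat.get("percentile") or {}).get("99.000000"),
            })

    print(pd.DataFrame(rows))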
From: Killer{R} <su...@ki...> - 2014-07-14 19:29:59
Hello David,

IMHO the problem is caused not only by the page size, but also by the HDD's sector size. Nowadays HDDs have a 4K physical sector size. They still support access in 512 byte units, but this is inefficient: every unaligned read that doesn't fit into a 4K sector results in a 4K read, and every unaligned write causes the disk to read the sector's data, modify it internally in its buffers and then write it back. Sure, the firmware tries to do this in the fastest way possible, but my tests show about a 20..30% sequential write speed degradation (with O_DIRECT) when writing 4K blocks if the beginning of each block is not also aligned to 4K. So simply using jumbo frames is not enough to make the hardware work as fast as it can.

The AoE protocol doesn't support 4K sectors directly, because it has to support the 'normal' MTU and not only jumbo frames. However, it's theoretically possible to make the initiator report to the OS that it is a '4K sector drive', and a proper ('4K sector aware' :) ) OS will then access it in 4K-aligned portions, which together with some buffering at the target's side should make it all work faster :). But it all looks like a tricky workaround.

--
Best regards,
 Killer{R}                            mailto:su...@ki...
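A rough way to reproduce that kind of measurement is to time O_DIRECT writes of 4 KiB blocks that start on a 4 KiB boundary against the same writes shifted by one 512-byte sector. The sketch below is Linux-only; the scratch file path is a placeholder and must live on a real disk (tmpfs rejects O_DIRECT), and the absolute numbers depend heavily on the drive, filesystem and kernel. A fairer comparison would preallocate the file or add a warm-up pass, but even this rough version should be enough to expose the read-modify-write penalty described above on a drive with 4K physical sectors.

    # Compare aligned vs. 512-byte-shifted O_DIRECT writes of 4 KiB blocks.
    import mmap
    import os
    import time

    TEST_FILE = "/var/tmp/rmw-test.bin"    # placeholder: must be on a real disk
    BLOCK = 4096
    COUNT = 2048                           # 2048 x 4 KiB = 8 MiB per pass

    def timed_pass(start_offset):
        fd = os.open(TEST_FILE, os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o600)
        # O_DIRECT needs an aligned buffer; an anonymous mmap is page-aligned.
        buf = mmap.mmap(-1, BLOCK)
        buf.write(b"\xa5" * BLOCK)
        t0 = time.perf_counter()
        for i in range(COUNT):
            os.pwrite(fd, buf, start_offset + i * BLOCK)
        os.close(fd)
        return time.perf_counter() - t0

    print("4 KiB-aligned writes : %.2f s" % timed_pass(0))
    print("shifted by 512 bytes : %.2f s" % timed_pass(512))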
From: Ed C. <ed....@ac...> - 2014-07-15 01:47:10
On 07/13/2014 06:23 PM, David Leach wrote:
> It would be better to have some hard numbers behind what this change does.

Yes, I agree. If Catalin posts the patch here, then perhaps any interested parties would be able to gather some data.

[Leach correctly notes that some jumbos carry ...]
> 17 sectors of data per request.

There is often a lot going on there. For example, if the initiator host is using a filesystem, then writes will dirty pages of memory that are buffering the data from the AoE device. The virtual memory subsystem will flush that data when it gets around to it, using whatever chunks it likes, then the block layer will probably consolidate or split the I/O as it likes inside the I/O scheduler, and only then will the aoe initiator get the data.

But the aoe driver will set up network buffers (sk_buff structures) that point right into the memory associated with the I/O. The network card itself often does the transfer from RAM into the card and vice versa. I'm not sure there's a significant penalty paid for telling the NIC to DMA seventeen sectors. It would be a good test to do in the aoe driver with a few different representative NICs.

Further, on the target side, there's no guarantee that the target will do the I/O in exactly the same chunks that appear in the AoE packets. Even disk drives have elevator algorithms scheduling I/O from write buffers.

I agree that test results here would be interesting, but a big "Your Mileage May Vary" should accompany the results.

--
  Ed
From: David L. <ta...@gm...> - 2014-07-15 05:29:20
Ed,

I'm less concerned about the initiator side, as we don't really have direct control over what it requests. What I am suggesting is that the requests from the host on the initiator will likely be aligned, due to how its file system works to try to keep things efficient. If we then cause the resulting AoE requests to the server to be unaligned accesses, that will likely cause additional IO transactions against the file system, which in turn would likely add latency to the responses to these requests.

David
From: Ed C. <ed....@ac...> - 2014-07-16 01:28:36
On 07/15/2014 01:29 AM, David Leach wrote:
> If we then cause the resulting AoE requests to the server to be unaligned
> accesses, that will likely cause additional IO transactions against the
> file system, which in turn would likely add latency to the responses to
> these requests.

As long as you don't specify the sync or direct options, though, the vblade will write to a buffered backing store. Then the ultimate backing store (e.g., disk drive), the ultimate driver (e.g., the SCSI layer), the block layer, the middle layer (e.g., dm and md), the VM subsystem and (if it's a file) the filesystem will get a chance to merge and align the I/O.

--
  Ed