Thread: [Aoetools-discuss] ggaoed and ro filesystem during heavy write
From: Lars T. <ta...@bb...> - 2014-01-15 16:09:40
Hi,

I am experiencing problems with the latest ggaoed version and a fresh Ubuntu 14.04 aoe client (from the daily snapshots).

http://code.google.com/p/ggaoed/source/list

The kernel version on the client side is 3.13.0-3-generic.

# modinfo aoe
filename:       /lib/modules/3.13.0-3-generic/kernel/drivers/block/aoe/aoe.ko
version:        85
description:    AoE block/char driver for 2.6.2 and newer 2.6 kernels
author:         Sam Hopkins <sa...@co...>
license:        GPL
srcversion:     5F0AC5D858A1164C5170585

The client is a test box, but the server has been in production for years, so I can't change the server config.

I ran tcpdump and saw that the server stops responding to the last write request in a series of write requests. After waiting 9 seconds without receiving any packet from the target, the client issues a "Query Config Information Request" and marks the device as read-only. This results in a read-only filesystem. The responses to the "Query Config Information Requests" appear right after the requests.

I can "repair" this with aoe-revalidate and a read-write remount, but the problem recurs with the next longer write operation.

I'm stuck here.

It seems the client doesn't resend unanswered requests. Is this on purpose?

Thanks
Lars
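The workaround described above (revalidate the device, then remount read-write) could be scripted roughly as follows. This is only a sketch: the device name e0.0 and the mount point /mnt/aoe are hypothetical placeholders, and the AOE_APPLY switch is an invention of this example. By default the function prints the commands instead of running them.

```shell
# Sketch of the workaround: revalidate the AoE device, then remount read-write.
# e0.0 and /mnt/aoe are placeholders, not values taken from the thread.
aoe_recover() {
    dev="$1"   # shelf.slot name as listed by aoe-stat, e.g. e0.0 (hypothetical)
    mnt="$2"   # mount point of the filesystem that went read-only (hypothetical)
    if [ "${AOE_APPLY:-0}" = "1" ]; then
        # Really run the commands (needs root and the aoetools package).
        aoe-revalidate "$dev" && mount -o remount,rw "$mnt"
    else
        # Dry run: print the commands instead of executing them.
        echo "aoe-revalidate $dev"
        echo "mount -o remount,rw $mnt"
    fi
}

aoe_recover e0.0 /mnt/aoe
```

Running it with AOE_APPLY=1 set in the environment would execute the two commands for real. As the message notes, this only clears the symptom until the next long write.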
From: James R. L. <jl...@in...> - 2014-01-15 16:35:48
We see a similar issue with vblade when it becomes CPU starved due to resource contention on our AoE server.

It would be nice if, in these situations, the AoE client would queue write blocks and resend unacknowledged writes.

--
James R. Leu | Director of Technology | INOC | Madison, WI, USA
O: +1-608-204-0203 | F: +1-608-663-4558 | jl...@in... | www.inoc.com
Service. Not Software.®
From: Ed C. <ec...@co...> - 2014-01-16 14:03:30
The AoE initiator (the side using the storage), called "aoe", does retransmit AoE write commands for aoe_deadsecs seconds. The virtual memory subsystem does buffer writes to filesystems. The aoe_deadsecs module parameter is configurable.

An issue that is possibly related to your problems is briefly described below.

Often the problem is not too little buffering of writes but too much of it. For writes to a filesystem, the data is first modified in RAM, and at some later point the dirty data in RAM is flushed out to persistent storage. If the system waits too long, things can get clogged up.

In a nutshell, the virtual memory subsystem's defaults were created before 64-bit systems and large amounts of RAM were common. You can use some VM settings to encourage dirty pages to be written out more quickly by the process generating the writes, so that performance is more consistent.

Some example settings are in the EtherDrive HOWTO FAQ:

http://support.coraid.com/support/linux/EtherDrive-2.6-HOWTO-5.html#ss5.19

Linux Weekly News article about this problem:

http://lwn.net/Articles/572911/
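To see which writeback thresholds a client is currently running with, something like the following could be used. It is a minimal sketch, not taken from the HOWTO: it assumes the settings in question are the standard vm.dirty_* sysctls under /proc/sys/vm, and the VM_DIR override is a hypothetical knob added here only so the function can be pointed at another directory.

```shell
# Print the current dirty-page writeback thresholds.
# VM_DIR is an assumption of this sketch; on a real system the files
# live under /proc/sys/vm.
show_dirty_settings() {
    vm_dir="${VM_DIR:-/proc/sys/vm}"
    for name in dirty_ratio dirty_background_ratio dirty_expire_centisecs; do
        # Skip settings that are not present or not readable.
        if [ -r "$vm_dir/$name" ]; then
            echo "vm.$name = $(cat "$vm_dir/$name")"
        fi
    done
}

show_dirty_settings

# To lower a threshold (as root), a command of this form could be used.
# The value 5 is a placeholder, not a recommendation from the HOWTO:
#   sysctl -w vm.dirty_background_ratio=5
```

Suitable values depend on RAM size and workload; the HOWTO FAQ entry linked above is the place to look for the suggested settings.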
From: Ed C. <ec...@co...> - 2014-01-17 15:18:02
On Jan 16, 2014, at 10:42 AM, Lars Täuber <ta...@bb...> wrote:

> Hi Ed,
>
> Thu, 16 Jan 2014 09:03:14 -0500
> Ed Cashin <ec...@co...> ==> <jl...@in...> :
>> The AoE initiator (the side using the storage) called "aoe" does
>> retransmit AoE write commands for aoe_deadsecs seconds.
>
> what's the default value for this parameter?

180.

When an old request gets a response after the AoE command has already been retransmitted, it is an "unexpected response". You can watch for those by doing a "cat" on /dev/etherd/err.

--
Ed
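The value in effect on a given client can be checked rather than assumed. The sketch below supposes the parameter is exposed through sysfs at /sys/module/aoe/parameters/aoe_deadsecs while the module is loaded; the PARAM_FILE override is a hypothetical addition of this example.

```shell
# Report the aoe driver's current retransmit window, if available.
# PARAM_FILE is an assumption of this sketch, used to point at another file.
show_deadsecs() {
    param_file="${PARAM_FILE:-/sys/module/aoe/parameters/aoe_deadsecs}"
    if [ -r "$param_file" ]; then
        echo "aoe_deadsecs = $(cat "$param_file")"
    else
        echo "aoe_deadsecs not readable (is the aoe module loaded?)"
    fi
}

show_deadsecs

# Watching for "unexpected response" messages, as suggested above
# (this blocks until interrupted):
#   cat /dev/etherd/err
```

If the reported value differs from 180, the module was presumably loaded with a non-default aoe_deadsecs.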
From: Lars T. <ta...@bb...> - 2014-01-20 07:56:31
Hi Ed,

Does aoe "accept" the unexpected responses?
What could cause aoe to issue Query Config Information Requests?

Lars

--
Informationstechnologie
Berlin-Brandenburgische Akademie der Wissenschaften
Jägerstrasse 22-23 10117 Berlin
Tel.: +49 30 20370-352 http://www.bbaw.de
From: Ed C. <ec...@co...> - 2014-01-21 21:17:24
The current version does use the response (since it's a real response) to update the round trip time statistics. It doesn't use it to complete an I/O request from the block layer.

It also decrements the "lost" counter associated with the remote MAC address when an unexpected response is received, because it was incremented when we decided to retransmit, based on the assumption that the packet was lost, and the unexpected response invalidates that assumption.

The aoe driver routinely issues AoE Query Config broadcasts, so that new targets that appear on the network will be detected even if they don't send an unsolicited AoE Config Query response to announce their presence. (Sometimes switches eat packets when a link comes up, for example, until a spanning tree algorithm runs for a while, so that these unsolicited announcements from targets can get lost.)

When the last user space or kernel user of an aoe-exported block device closes it, the aoe driver issues a directed AoE Query Config to that specific AoE target.

When the user runs aoe-revalidate or aoe-discover, it triggers the sending of AoE Query Config commands.