Thread: Re: [Aoetools-discuss] Trouble with vblade
From: Andrei L. <an...@la...> - 2006-11-22 09:06:17
Adi Kriegisch wrote:
> Hi Sam!
>
> Thank you for your quick reply!
>
>> Just for sanity, what version of vblade are you using? I didn't think
>> I broke anything with the latest release ...
> I first tried with v13; yesterday I repeated all the tests with v14 with
> exactly the same results (and thus decided to ask for help).
> Module aoe is version 39 running in a stock debian sid kernel with:
> "vermagic: 2.6.18-2-xen-686 SMP mod_unload 686 REGPARM gcc-4.1"
>
> [SNIP]
>>> This is reproducible in both directions. The funny thing is that it
>>> works like a charm via loopback on the same server.
>> The output of aoe-stat would be helpful to ensure you can actually see
>> what you think you can see. Also, if you
> output of aoe-stat (I always checked that and just left it out of my posting
> so as not to make it too long and unreadable):
> e0.0 5.009GB eth0 up
> At the beginning I did the testing just by killing vblade and removing and
> reinserting module aoe. After a while I decided to reboot the machines
> after each test run so that nothing was left over.
>
>> cat /dev/etherd/err
> Shows nothing throughout all the tests.
>
>> you might get an idea of whether communication is having to retransmit
>> a lot. It could be a cabling issue.
> This happened sometimes with my tests on the loopback device:
> (from /var/log/syslog)
> Nov 16 22:50:29 tritium kernel: aoe: e1.0: setting 1024 byte data frames on lo:000000000000
> Nov 16 22:50:29 tritium kernel: aoe: e1.0: setting 16384 byte data frames on lo:000000000000
> Nov 16 22:51:29 tritium kernel: aoe: e1.0: setting 1024 byte data frames on lo:000000000000
> Nov 16 22:52:29 tritium kernel: aoe: e1.0: setting 16384 byte data frames on lo:000000000000
> Nov 16 22:52:29 tritium kernel: aoe: e1.0: setting 1024 byte data frames on lo:000000000000
>
> But never ever on ethernet.
>
>>> My hardware configuration:
>>> both servers are dual Intel PIII 1400MHz with 3GB RAM
>>> Network adapter is Ethernet controller: Intel Corporation 82557/8/9
>>> [Ethernet Pro 100] (rev 0c) used with e100 driver. (yes, the network is
>>> 100MBit) Networking using several different protocols works like a charm
>>> with the performance one can expect from a 100MBit Network.
>> You're not perchance connecting the servers directly to each other
>> without an intervening switch, are you? I've never seen 100MbE that
>> did auto MDIX.
> No; there is a switch in between. Communication via other protocols works at
> full speed; ping -f isn't losing packets. I even tried to increase network
> buffer size as specified for GBit ethernet in the README file; but this also
> had no effect.

Try using a crossover cable.

> I had an strace running on the vblade process that just showed read and write
> operations and was identical to a session using the loopback device. So,
> nothing unusual there.
>
> Any automated tests I could run? Anything else I could check? May I provide
> you with access to the servers (just send me a private mail!)?
> From my point of view there are three things I am not sure about: first is
> that the machine is smp. Second: the kernel is not stock but runs with xen
> patches and third maybe there is an issue with the nics and their driver
> (e100)?!
>
> Any further hints highly appreciated! :-)

You may also try booting a LiveCD on your server.
BTW, what partition type are you using?

Andrei

--
Lan.Art s.r.l.
via Co' del Panico 36/1
35028 Piove di Sacco (PD)
tel. 049-7966424
fax 049-7966600
http://www.lanart.it
From: Adi K. <ad...@cg...> - 2006-11-22 20:49:38
Attachments:
aoe.tcpdump
Hi Sam!

Thank you very much for your help!

> Please rerun the vblade test 2, but use tcpdump to capture the
> packet flow on the server running vblade:
>
>   tcpdump -i eth0 -w aoe.tcpdump ether proto 0x88a2
>
> If you can send me the aoe.tcpdump file I can examine whether the
> offset occurs before the vblade gets the packet, or after. Be sure
> to run tcpdump before starting vblade so I can see the whole
> communication.

I did, find the file attached. I am very curious about the results! ;-)

-- Adi
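For readers who want to look inside a capture like aoe.tcpdump without tcpdump at hand, here is a minimal Python sketch that walks a capture file and keeps only the AoE frames (ethertype 0x88a2, the same value used in the filter above). It assumes the classic libpcap file layout (24-byte global header, 16-byte per-record header, microsecond timestamps) and an Ethernet link type; the function name `aoe_frames` is made up for illustration and is not part of aoetools.

```python
import struct

AOE_ETHERTYPE = 0x88A2  # same value as in the tcpdump filter

def aoe_frames(pcap_bytes):
    """Yield (timestamp, frame) for each AoE frame in a classic pcap capture.

    Assumption: Ethernet link type and the microsecond-resolution pcap format.
    """
    magic = pcap_bytes[:4]
    if magic == b"\xd4\xc3\xb2\xa1":      # written on a little-endian host
        endian = "<"
    elif magic == b"\xa1\xb2\xc3\xd4":    # written on a big-endian host
        endian = ">"
    else:
        raise ValueError("not a classic microsecond pcap file")

    pos = 24  # skip the 24-byte global header
    while pos + 16 <= len(pcap_bytes):
        # Per-record header: seconds, microseconds, captured length, wire length
        ts_sec, ts_usec, incl_len, _orig_len = struct.unpack_from(
            endian + "IIII", pcap_bytes, pos)
        pos += 16
        frame = pcap_bytes[pos:pos + incl_len]
        pos += incl_len
        # The ethertype sits at bytes 12..13 of the Ethernet header, big-endian
        if len(frame) >= 14 and struct.unpack_from("!H", frame, 12)[0] == AOE_ETHERTYPE:
            yield ts_sec + ts_usec / 1e6, frame

if __name__ == "__main__":
    import sys
    with open(sys.argv[1], "rb") as f:
        for ts, frame in aoe_frames(f.read()):
            print(f"{ts:.6f} {len(frame)} bytes captured")
```

Run as `python script.py aoe.tcpdump` (the script name is arbitrary). Note that it reports captured bytes, which can be fewer than the on-wire length if the capture used a short snapshot length.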
From: Sam H. <sa...@co...> - 2006-11-22 21:44:36
>> If you can send me the aoe.tcpdump file I can examine whether the
>> offset occurs before the vblade gets the packet, or after. Be sure
>> to run tcpdump before starting vblade so I can see the whole
>> communication.
>
> I did, find the file attached. I am very curious about the results! ;-)

The write is offset out of the initiator (i.e., before processing by
vblade):

15:43:53.006759 00:09:6b:b0:5b:cf (oui Unknown) > 00:09:6b:b0:16:db (oui Unknown), ethertype Unknown (0x88a2), length 1060:
        0x0000:  1000 0000 0000 0075 c56e 4100 0234 0000  .......u.nA..4..
        0x0010:  0000 0000 0000 *0000 0000 0000 0000 0000  ................
        0x0020:  0000 0000 0000 0000 0000 0000 0000 4144  ..............AD
        0x0030:  4941 4449 4144 490a 0000 0000 0000 0000  IADIADI.........
        0x0040:  0000 0000 0000 0000 0000 0000 0000 0000  ................
        0x0050:  0000                                     ..

The ATA data for the write starts at the asterisk.

Starting with aoe6-24 we began doing "zero copy writes" (ZCW). We fill
out an sk_buff structure so that, if the network card supports it, we
don't have to copy the data out of the block system (bio) buffer into
the packet. The network card can do that when it puts the packet into
its transmit FIFO. If this isn't supported by the network card, the
network subsystem is responsible for calling skb_linearize on the
sk_buff, which will copy the data into a linear buffer. I seem to
recall this being in the network driver, but that doesn't seem like it
could be right. Network drivers shouldn't have to accommodate
workarounds for every feature other cards support.

My only guess right now is that something in your kernel doesn't like
having to linearize the skb. You can test this theory by removing the
aoe module from your system and trying an old version of the aoe
driver without the ZCW feature:

  http://www.coraid.com/support/linux/aoe6-23.tar.gz

The aoe6-23 driver doesn't support jumbo frames, but right now, neither
do you. :)

If this driver makes the problem go away then either we have a bug in
our ZCW setup of the sk_buff, the network system has a bug in
processing it ... or both!

Cheers,

Sam
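To make the asterisk position in the dump above easier to check, here is a small Python sketch that decodes the bytes shown. It assumes the header layout from the published AoE specification (a 10-byte AoE header followed by a 12-byte ATA header, both after the Ethernet header, which tcpdump does not include in the hex output) and that the "ADIADIADI" test pattern was meant to sit at the very start of the written data. All variable names are illustrative.

```python
import struct

# Payload bytes copied from the hex dump above (asterisk annotation removed).
DUMP = """
1000 0000 0000 0075 c56e 4100 0234 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 4144
4941 4449 4144 490a 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000
"""
payload = bytes.fromhex("".join(DUMP.split()))

# Assumed layout, per the AoE specification:
AOE_HDR = struct.Struct("!BBHBBI")    # ver/flags, error, major, minor, command, tag
ATA_HDR = struct.Struct("!BBBB6s2s")  # aflags, err/feature, sector count,
                                      # cmd/status, lba0..lba5, reserved

ver_flags, error, major, minor, command, tag = AOE_HDR.unpack_from(payload, 0)
aflags, feature, sector_count, ata_cmd, lba, _reserved = ATA_HDR.unpack_from(
    payload, AOE_HDR.size)

data_start = AOE_HDR.size + ATA_HDR.size   # first byte of the sector data
pattern_at = payload.find(b"ADIADIADI\n")  # where the test pattern really is
shift = pattern_at - data_start            # how far the data was displaced

print(f"AoE version {ver_flags >> 4}, shelf {major}, slot {minor}, tag {tag:#010x}")
print(f"ATA cmd {ata_cmd:#04x}, {sector_count} sectors, "
      f"LBA {int.from_bytes(lba, 'little')}")
print(f"data starts at {data_start:#06x}, pattern at {pattern_at:#06x}, "
      f"shift {shift} bytes")
```

Under those assumptions the data begins at offset 0x0016, which is where the asterisk sits, and the pattern shows up 24 bytes later. The numbers are also consistent with the reported frame length: 14 bytes of Ethernet header, 22 bytes of AoE and ATA headers, and 2 sectors of 512 bytes add up to 1060.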