From: Jiaqing Du <ji...@gm...> - 2009-04-28 12:26:35
Hi Jesse,

Thank you for your reply. Here is some information about my experiment. What might be the possible bottleneck?

0. I'm playing with two Sun Fire X4600 M2 servers.

1. Both of the two NICs sit on PCIe x8 slots.

#lspci -v -v -s 82:00.0
82:00.0 Ethernet controller: Intel Corporation 10 Gigabit AT CX4 Network Connection (rev 01)
        Subsystem: Intel Corporation Device a01f
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin B routed to IRQ 43
        Region 0: Memory at fe9e0000 (32-bit, non-prefetchable) [size=128K]
        Region 1: Memory at fe980000 (32-bit, non-prefetchable) [size=256K]
        Region 2: I/O ports at cc00 [size=32]
        Region 3: Memory at fe9dc000 (32-bit, non-prefetchable) [size=16K]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
                Status: D0 PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [50] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable-
                Address: 0000000000000000  Data: 0000
        Capabilities: [60] MSI-X: Enable+ Mask- TabSize=18
                Vector table: BAR=3 offset=00000000
                PBA: BAR=3 offset=00002000
        Capabilities: [a0] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported-
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 128 bytes
                DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
                LnkCap: Port #1, Speed 2.5GT/s, Width x8, ASPM L0s L1, Latency L0 <4us, L1 <64us
                        ClockPM- Suprise- LLActRep- BwNot-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        Capabilities: [100] Advanced Error Reporting <?>
        Capabilities: [140] Device Serial Number d6-3c-2e-ff-ff-21-1b-00
        Kernel driver in use: ixgbe
        Kernel modules: ixgbe

2. Memory bandwidth is sufficient. I tested the memory bandwidth by running multiple netperf instances locally.

3. The kernel version is 2.6.29.1.

Default values of some network buffer sizes:

net.core.rmem_max = 131071
net.core.wmem_max = 131071
net.ipv4.tcp_wmem = 4096 16384 4194304
net.ipv4.tcp_rmem = 4096 87380 4194304
net.core.netdev_max_backlog = 1000

I also changed them as follows, but it did not really help the maximum throughput:

net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.core.netdev_max_backlog = 250000

4. One port corresponds to 16 TxRx queues. Since each machine has 16 cores, I set the IRQ smp_affinity so that each core handles one TxRx queue.

$ cat /proc/interrupts | grep eth
104: 395814      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0  PCI-MSI-edge  eth17-TxRx-0
105:  44422 379661      0      0      0      0      0      0      0      0      0      0      0      0      0      0  PCI-MSI-edge  eth17-TxRx-1
106:    348      0 565089      0      0      0      0      0      0      0      0      0      0      0      0      0  PCI-MSI-edge  eth17-TxRx-2
107:  39320      0      0 513652      0      0      0      0      0      0      0      0      0      0      0      0  PCI-MSI-edge  eth17-TxRx-3
108:    301      0      0      0 342316      0      0      0      0      0      0      0      0      0      0      0  PCI-MSI-edge  eth17-TxRx-4
109:  72490      0      0      0      0 266674      0      0      0      0      0      0      0      0      0      0  PCI-MSI-edge  eth17-TxRx-5
110:    247      0      0      0      0      0 294115      0      0      0      0      0      0      0      0      0  PCI-MSI-edge  eth17-TxRx-6
111:    244      0      0      0      0      0      0 249589      0      0      0      0      0      0      0      0  PCI-MSI-edge  eth17-TxRx-7
112:    254      0      0      0      0      0      0      0 534864      0      0      0      0      0      0      0  PCI-MSI-edge  eth17-TxRx-8
113:    248      0      0      0      0      0      0      0      0 391910      0      0      0      0      0      0  PCI-MSI-edge  eth17-TxRx-9
114:    240      0      0      0      0      0      0      0      0      0 612412      0      0      0      0      0  PCI-MSI-edge  eth17-TxRx-10
115:  69609      0      0      0      0      0      0      0      0      0      0  69917      0      0      0      0  PCI-MSI-edge  eth17-TxRx-11
116:  44886      0      0      0      0      0      0      0      0      0      0      0 337235      0      0      0  PCI-MSI-edge  eth17-TxRx-12
117:    404      0      0      0      0      0      0      0      0      0      0      0      0 344160      0      0  PCI-MSI-edge  eth17-TxRx-13
118:  44373      0      0      0      0      0      0      0      0      0      0      0      0      0 212061      0  PCI-MSI-edge  eth17-TxRx-14
119:  83013      0      0      0      0      0      0      0      0      0      0      0      0      0      0 127370  PCI-MSI-edge  eth17-TxRx-15
120:      4      0      0      0      0      0      0      0      0      0      0      0      0      0      0      0  PCI-MSI-edge  eth17:lsc

5. Output of netperf (based on default network buffer sizes)

1) mtu == 1500

For a single netperf instance, output of netperf on the transmission side:

$ netperf -H 192.168.0.6 -p 10000 -L 192.168.0.7 -l 10 -t TCP_STREAM -- -m 64k
TCP STREAM TEST from 192.168.0.7 (192.168.0.7) port 0 AF_INET to 192.168.0.6 (192.168.0.6) port 0 AF_INET : demo
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  64000    10.01    2661.88

Output of mpstat on the reception side (one core is saturated, as expected):

$ mpstat -P ALL 2 1
Linux 2.6.29.1 (labospc47)  04/28/2009  _x86_64_

01:59:23 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
01:59:25 PM  all    0.03    0.00    1.75    0.00    0.03    4.70    0.00   93.48   7883.08
01:59:25 PM    0    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      3.48
01:59:25 PM    1    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      1.00
01:59:25 PM    2    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   3436.32
01:59:25 PM    3    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      1.00
01:59:25 PM    4    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      1.00
01:59:25 PM    5    0.50    0.00   26.87    0.00    0.50   72.14    0.00    0.00   4112.94
01:59:25 PM    6    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      1.00
01:59:25 PM    7    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      1.00
01:59:25 PM    8    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      1.00
01:59:25 PM    9    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      1.00
01:59:25 PM   10    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      1.00
01:59:25 PM   11    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      1.00
01:59:25 PM   12    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      1.00
01:59:25 PM   13    0.00    0.00    0.50    0.00    0.00    0.00    0.00   99.50      1.00
01:59:25 PM   14    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      1.00
01:59:25 PM   15    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      1.00

For eight simultaneous netperf instances, the maximum throughput I got is around 8100 Mb/s.
The output of mpstat on the reception side is as follows (as you can see, the CPU is not a bottleneck now):

$ mpstat -P ALL 2 1
Linux 2.6.29.1 (labospc47)  04/28/2009  _x86_64_

02:13:03 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle     intr/s
02:13:05 PM  all    0.00    0.00    4.83    0.00    0.41   14.45    0.00   80.31  109575.00
02:13:05 PM    0    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00    7946.50
02:13:05 PM    1    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00       1.00
02:13:05 PM    2    0.00    0.00   10.40    0.00    0.99   22.28    0.00   66.34    6244.50
02:13:05 PM    3    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00       1.00
02:13:05 PM    4    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   13528.50
02:13:05 PM    5    0.00    0.00    7.18    0.00    0.00   17.44    0.00   75.38    4888.50
02:13:05 PM    6    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   12933.00
02:13:05 PM    7    0.00    0.00    8.49    0.00    0.47   21.23    0.00   69.81    5171.00
02:13:05 PM    8    0.00    0.00    9.22    0.00    0.97   26.70    0.00   63.11   13533.00
02:13:05 PM    9    0.00    0.00    6.77    0.00    0.00   58.85    0.00   34.38    4155.50
02:13:05 PM   10    0.00    0.00    6.35    0.00    0.53   22.22    0.00   70.90    5410.50
02:13:05 PM   11    0.00    0.00    0.45    0.00    0.00    0.91    0.00   98.64       1.00
02:13:05 PM   12    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   12707.00
02:13:05 PM   13    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00       1.00
02:13:05 PM   14    0.00    0.00   10.50    0.00    0.46   14.16    0.00   74.89    6047.00
02:13:05 PM   15    0.00    0.00   10.75    0.00    2.69   29.57    0.00   56.99    6272.50

2) mtu == 9000

For a single netperf instance, output of netperf on the transmission side:

$ netperf -H 192.168.0.6 -p 10000 -L 192.168.0.7 -l 10 -t TCP_STREAM -- -m 64k
TCP STREAM TEST from 192.168.0.7 (192.168.0.7) port 0 AF_INET to 192.168.0.6 (192.168.0.6) port 0 AF_INET : demo
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  64000    10.00    7024.52

Output of mpstat on the reception side
(one core is saturated, as expected):

$ mpstat -P ALL 2 1
Linux 2.6.29.1 (labospc47)  04/28/2009  _x86_64_

02:16:30 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
02:16:32 PM  all    0.00    0.00    3.47    0.00    0.02    1.49    0.00   95.02  18355.72
02:16:32 PM    0    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      2.99
02:16:32 PM    1    0.00    0.00    0.50    0.00    0.00    0.00    0.00   99.50      1.00
02:16:32 PM    2    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      1.00
02:16:32 PM    3    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      1.00
02:16:32 PM    4    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      1.00
02:16:32 PM    5    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      1.00
02:16:32 PM    6    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   9986.57
02:16:32 PM    7    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      1.00
02:16:32 PM    8    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      1.00
02:16:32 PM    9    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      1.00
02:16:32 PM   10    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      1.00
02:16:32 PM   11    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      1.00
02:16:32 PM   12    0.00    0.00   69.50    0.00    0.50   30.00    0.00    0.00   7973.63
02:16:32 PM   13    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      1.00
02:16:32 PM   14    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      1.00
02:16:32 PM   15    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      1.00

For eight simultaneous netperf instances, the maximum throughput I got is around 8800 Mb/s. The output of mpstat on the reception side is as follows (as you can see, the CPU is not a bottleneck now):
$ mpstat -P ALL 2 1
Linux 2.6.29.1 (labospc47)  04/28/2009  _x86_64_

02:19:16 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
02:19:18 PM  all    0.00    0.00    3.47    0.00    0.06    1.77    0.00   94.70  74183.58
02:19:18 PM    0    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      5.47
02:19:18 PM    1    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   6365.17
02:19:18 PM    2    0.00    0.00    0.50    0.00    0.00    0.00    0.00   99.50   7604.98
02:19:18 PM    3    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      1.00
02:19:18 PM    4    0.00    0.00   10.28    0.00    0.00    2.80    0.00   86.92   5264.18
02:19:18 PM    5    0.00    0.00    4.74    0.00    0.00    4.74    0.00   90.52   4534.33
02:19:18 PM    6    0.00    0.00    7.80    0.00    0.00    3.41    0.00   88.78   6463.18
02:19:18 PM    7    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      1.00
02:19:18 PM    8    0.00    0.00    7.08    0.00    0.47    6.13    0.00   86.32   6546.77
02:19:18 PM    9    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00  15397.01
02:19:18 PM   10    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      1.00
02:19:18 PM   11    0.00    0.00    7.39    0.00    0.00    2.46    0.00   90.15   4272.64
02:19:18 PM   12    0.00    0.00    7.92    0.00    0.00    4.95    0.00   87.13   5086.07
02:19:18 PM   13    0.00    0.00    7.33    0.00    0.52    3.66    0.00   88.48   4922.39
02:19:18 PM   14    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      1.00
02:19:18 PM   15    0.00    0.00    8.81    0.00    0.00    3.11    0.00   88.08   7460.70

2009/4/27 Brandeburg, Jesse <jes...@in...>
>
> On Sat, 25 Apr 2009, Jiaqing Du wrote:
> > Hi, List
> >
> > I'm playing with two Intel 10 Gigabit CX4 Dual Port Server Adapters located
> > at two different machines and connected by network cables directly. These
> > two cards sit on two x8 PCIe slots.
> >
> > The tuning I did includes setting interrupt affinity and process affinity,
> > using jumbo frames, and changing related kernel network parameters.
> > Basically, I referred to the following two documents.
> >
> > 1) 10 Gb Ethernet,
> > www.redhat.com/promo/summit/2008/downloads/pdf/Thursday/Mark_Wagner.pdf
> > 2) Linux kernel documentation,
> > http://www.mjmwired.net/kernel/Documentation/networking/ixgb.txt
> >
> > For one pair of ports located at two different machines, the maximum TCP
> > throughput I could get is about 8900Mb/s. This number just comes from
> > setting interrupt affinity and using jumbo frames. Changing other parts of
> > the system configuration could not give me a better number.
>
> as measured by the application? what application? how many threads were
> you running and was it TCP or UDP? What did you set your interrupt rate
> to? How many receive queues? What kernel? I am missing lots of data :-)
> Also, what kind of system, and how fast is your memory?
>
> netperf test to localhost is a decent indication of memory bandwidth.
> netserver
> netperf -T0,0 -C -c
> netperf -T0,1 -C -c
> netperf -T0,2 -C -c
>
> > Can I get a higher throughput with more tuning or this is what I can get
> > from CX4 adapters?
>
> you should always be able to achieve 9.4Gb/s (with 1500 MTU) using netperf
> and the result should be higher with jumbo frames enabled.
>
> The CX4 interconnect makes no difference to the throughput.
>
> a typical netperf run I use is
>
> remote: netserver
> local: netperf -H <remote> -l 30 -t TCP_MAERTS -C -c -- -m 64K
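P.S. A back-of-the-envelope bandwidth-delay-product estimate may explain why enlarging the socket buffers (point 3 above) did not change throughput. This is only a sketch: the ~0.1 ms RTT is an assumption for a short back-to-back CX4 link, not a value measured in this experiment.

```python
# Bandwidth-delay product: the number of bytes that must be in flight
# to keep the pipe full. If the socket buffer cap already exceeds the
# BDP, raising it further should not help a single TCP stream.
def bdp_bytes(rate_bits_per_s: float, rtt_s: float) -> float:
    """Return the bandwidth-delay product in bytes."""
    return rate_bits_per_s * rtt_s / 8

# Assumed values: 10 Gb/s line rate, ~0.1 ms RTT on a direct cable.
bdp = bdp_bytes(10e9, 100e-6)
print(bdp)  # -> 125000.0 bytes, i.e. ~122 KiB

# The default net.ipv4.tcp_rmem maximum above is 4194304 bytes (4 MB),
# already more than 30x the estimated BDP.
print(bdp < 4194304)  # -> True
```

Under this assumption the bottleneck would be per-packet receive processing on one core (visible in the single-instance mpstat output), not socket buffer sizing.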