Hi,
We have a scenario where two hosts are connected back-to-back via 10 Gbps Intel 82599 optical interfaces. In our test, one host (HOST1) generates a 9999 Mbps flow of 64-byte packets (using the Pktgen traffic generator) toward the other host (HOST2), which runs the packet_loopback.c OpenEM sample app. This app sends every packet back to Pktgen on HOST1. The throughput measured at Pktgen (RX) was only 6.1 Gbps, even though HOST2 received the full 10 Gbps. For comparison, we achieved 9999 Mbps both when running a DPDK app directly and in a Pktgen<-->Pktgen setup.
Questions:
1) Why is the performance obtained with OpenEM so much lower than with DPDK and the direct Pktgen connection?
2) Is it possible to dedicate cores (separate from other processing inside the OpenEM environment) to processing events related to packet reception at the interface level (polling the interface)?
We suspect that many packet drops are occurring due to a lack of cores configured to process packets at the interface level (i.e., a low polling frequency).
Hi,
Your results seem quite low. I ran a quick test and was able to reach over 90% of line rate using four cores (Intel Xeon E5-2697 v3 @ 2.60GHz). With direct dispatch enabled I got ~99% of line rate with just a single core.
Is PKTGEN generating packets from all source addresses (4) and ports (64)? If the test packets have identical addresses and ports they are handled as a single flow and end up in the same queue in the NIC, which will become a bottleneck.
Could you provide the used configuration options for the packet_loopback test so I could run the same test in our lab?
Regards,
Matias
Hi,
We have used multiple flows in our tests.
We got ~99% of line rate with 1 Gbps interfaces. However, with 10 Gbps interfaces and small packets (64 bytes), our tests could not exceed 7.2 Gbps.
These are the results of our tests:
test | packet size | direct dispatch | throughput
-----|-------------|-----------------|-----------
  1  | 64 bytes    | off             | 6.1 Gbps
  2  | 64 bytes    | on              | 7.2 Gbps
  3  | 128 bytes   | off             | 7.8 Gbps
  4  | 128 bytes   | on              | 9.9 Gbps
  5  | 256 bytes   | off             | 9.9 Gbps
  6  | 256 bytes   | on              | 9.9 Gbps
Could you tell me the line rate of your interfaces and the packet size that you used in your test?
Thanks for your reply.
Last edit: Henrique 2015-03-06
Hi,
I used a 10 Gbps NIC (82599) and 64-byte packets. Your configuration seems to be the same as mine, but I'll verify this on Monday when I get back to the lab.
Hi,
I'm still unable to reproduce your test results. Could you please provide the text output of the packet_loopback example, your PKTGEN configuration, and a description of your hardware setup?
-Matias
Hi Matias,
Below, you can find the information:
1) Hardware setup:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Thread(s) per core: 1
Core(s) per socket: 12
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 62
Model name: Intel(R) Xeon(R) CPU E5-2695 v2 @ 2.40GHz
Stepping: 4
CPU MHz: 2394.108
BogoMIPS: 4788.21
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 30720K
NUMA node0 CPU(s): 0-11
RAM: 64GBytes
INTERFACE: GBIC HP BLc 10Gb SR SFP+
HP Optical Ethernet 10Gb 2P 560SFP - 82599
2) PKTGEN Config:
./app/build/pktgen -c 0xff -n 4 --proc-type auto --file-prefix pg -- -p 0x1 -P -m "[1:3].0" -f config.pkt
config.pkt:
set mac 0 8c:dc:d4:a9:b5:3c
set ip dst 0 192.168.17.1
set ip src 0 192.168.17.2/24
range 0 enable
src.type ipv4 0
src.proto udp 0
dst.type ipv4 0
dst.proto udp 0
dst.mac start 0 8c:dc:d4:a9:b5:3c
dst.mac min 0 8c:dc:d4:a9:b5:3c
dst.mac max 0 8c:dc:d4:a9:b5:3c
dst.mac inc 0 00:00:00:00:00:00
src.mac start 0 8c:dc:d4:a9:b9:14
src.mac min 0 8c:dc:d4:a9:b9:14
src.mac max 0 8c:dc:d4:a9:b9:4f
src.mac inc 0 00:00:00:00:00:01
src.ip start 0 192.168.17.99
src.ip min 0 192.168.17.99
src.ip max 0 192.168.17.99
src.ip inc 0 0.0.0.0
dst.ip start 0 192.168.17.1
dst.ip min 0 192.168.17.1
dst.ip max 0 192.168.17.4
dst.ip inc 0 0.0.0.1
src.port start 0 5678
src.port min 0 5678
src.port max 0 5678
src.port inc 0 1
dst.port start 0 1234
dst.port min 0 1234
dst.port max 0 1297
dst.port inc 0 1
pkt.size start 0 64
pkt.size min 0 64
pkt.size max 0 64
pkt.size inc 0 0
Note: We modified PKTGEN's default transport protocol in order to generate UDP traffic, because even with "src.proto udp 0" and "dst.proto udp 0" the generated flows were TCP, as observed in Wireshark on the other host. File in PKTGEN: pktgen-cmds.c, function pktgen_port_defaults(): pkt->ipProto = PG_IPPROTO_UDP.
3) text output of packet_loopback:
Coremask: 0x3
Core Count: 10
Process-per-core mode selected!
EAL: Detected lcore 0 as core 0 on socket 0
EAL: Detected lcore 1 as core 1 on socket 0
EAL: Detected lcore 2 as core 2 on socket 0
EAL: Detected lcore 3 as core 3 on socket 0
EAL: Detected lcore 4 as core 4 on socket 0
EAL: Detected lcore 5 as core 5 on socket 0
EAL: Detected lcore 6 as core 8 on socket 0
EAL: Detected lcore 7 as core 9 on socket 0
EAL: Detected lcore 8 as core 10 on socket 0
EAL: Detected lcore 9 as core 11 on socket 0
EAL: Detected lcore 10 as core 12 on socket 0
EAL: Detected lcore 11 as core 13 on socket 0
EAL: Support maximum 64 logical core(s) by configuration.
EAL: Detected 12 lcore(s)
EAL: Auto-detected process type: PRIMARY
EAL: Setting up memory...
EAL: Ask a virtual area of 0x200000000 bytes
EAL: Virtual area found at 0x7fb2c0000000 (size = 0x200000000)
EAL: Requesting 8 pages of size 1024MB from socket 0
EAL: TSC frequency is ~2394231 KHz
EAL: Master core 0 is ready (tid=46b18a80)
EAL: PCI device 0000:03:00.0 on NUMA socket 0
EAL: probe driver: 8086:1521 rte_igb_pmd
EAL: 0000:03:00.0 not managed by VFIO driver, skipping
EAL: 0000:03:00.0 not managed by UIO driver, skipping
EAL: PCI device 0000:03:00.1 on NUMA socket 0
EAL: probe driver: 8086:1521 rte_igb_pmd
EAL: 0000:03:00.1 not managed by VFIO driver, skipping
EAL: 0000:03:00.1 not managed by UIO driver, skipping
EAL: PCI device 0000:03:00.2 on NUMA socket 0
EAL: probe driver: 8086:1521 rte_igb_pmd
EAL: 0000:03:00.2 not managed by VFIO driver, skipping
EAL: 0000:03:00.2 not managed by UIO driver, skipping
EAL: PCI device 0000:03:00.3 on NUMA socket 0
EAL: probe driver: 8086:1521 rte_igb_pmd
EAL: 0000:03:00.3 not managed by VFIO driver, skipping
EAL: 0000:03:00.3 not managed by UIO driver, skipping
EAL: PCI device 0000:07:00.0 on NUMA socket 0
EAL: probe driver: 8086:10fb rte_ixgbe_pmd
EAL: 0000:07:00.0 not managed by VFIO driver, skipping
EAL: PCI memory mapped at 0x7fb546a16000
EAL: PCI memory mapped at 0x7fb546b20000
EAL: PCI device 0000:07:00.1 on NUMA socket 0
EAL: probe driver: 8086:10fb rte_ixgbe_pmd
EAL: 0000:07:00.1 not managed by VFIO driver, skipping
EAL: 0000:07:00.1 not managed by UIO driver, skipping
EAL: PCI device 0000:03:00.0 on NUMA socket 0
EAL: probe driver: 8086:1521 rte_igb_pmd
EAL: 0000:03:00.0 not managed by VFIO driver, skipping
EAL: 0000:03:00.0 not managed by UIO driver, skipping
EAL: PCI device 0000:03:00.1 on NUMA socket 0
EAL: probe driver: 8086:1521 rte_igb_pmd
EAL: 0000:03:00.1 not managed by VFIO driver, skipping
EAL: 0000:03:00.1 not managed by UIO driver, skipping
EAL: PCI device 0000:03:00.2 on NUMA socket 0
EAL: probe driver: 8086:1521 rte_igb_pmd
EAL: 0000:03:00.2 not managed by VFIO driver, skipping
EAL: 0000:03:00.2 not managed by UIO driver, skipping
EAL: PCI device 0000:03:00.3 on NUMA socket 0
EAL: probe driver: 8086:1521 rte_igb_pmd
EAL: 0000:03:00.3 not managed by VFIO driver, skipping
EAL: 0000:03:00.3 not managed by UIO driver, skipping
EAL: PCI device 0000:07:00.1 on NUMA socket 0
EAL: probe driver: 8086:10fb rte_ixgbe_pmd
EAL: 0000:07:00.1 not managed by VFIO driver, skipping
EAL: 0000:07:00.1 not managed by UIO driver, skipping
em_init_global() on EM-core00 (lcore 00)
queue init
eo alloc init
event group init
atomic group init
sched_init_global_2():
Initialize SchedQs:
Atomic SchedQs... done.
Parallel SchedQs... done.
Parallel-Ordered SchedQs... done.
eth_init(): Eth ports:1 Cores:10
Eth dev info - port 0
driver_name = rte_ixgbe_pmd
min_rx_bufsize = 1024
max_rx_pktlen = 15872
max_rx_queues = 128
max_tx_queues = 128
Initializing Eth port 0 RxQs:11 TxQs:42 MAC:8C:DC:D4:A9:B5:3C done: Link Up - speed 10000 Mbps - full-duplex
em_init_local() on em-core 1
em_init_local() on em-core 8
em_init_local() on em-core 9
em_init_local() on em-core 0
em_init_local() on em-core 4
em_init_local() on em-core 7
em_init_local() on em-core 2
em_init_local() on em-core 6
em_init_local() on em-core 3
==========================
EM Info on target: OpenEM-Intel-DPDK
EM API version: v1.1, 64 bit
Cache Line size = 64 B
em_queue_element_t = 192 B
em_queue_element_t.lock = 64 B
==========================
Core mapping logic EM core -> phys core (lcore)
em_init_local() on em-core 5
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
Queue groups
id name mask
0 core00 0x1
1 core01 0x2
2 core02 0x4
3 core03 0x8
4 core04 0x10
5 core05 0x20
6 core06 0x40
7 core07 0x80
8 core08 0x100
9 core09 0x200
31 default 0x3ff
EM APPLICATION: 'packet_loopback' initializing:
packet_loopback.c: test_start() - EM-core:0
Application running on 10 EM-cores (procs:10, threads:10).
EO 1:packet_loopback global start.
IP:port -> Q 192.168.17.1:1234 -> 353
IP:port -> Q 192.168.17.4:1297 -> 608
EO 1 global start done.
Entering the event dispatch loop() on EM-core 0
EO 1:packet_loopback local start on EM-core0
Entering the event dispatch loop() on EM-core 5
EO 1:packet_loopback local start on EM-core5
Entering the event dispatch loop() on EM-core 1
Entering the event dispatch loop() on EM-core 8
Entering the event dispatch loop() on EM-core 2
Entering the event dispatch loop() on EM-core 3
Entering the event dispatch loop() on EM-core 7
EO 1:packet_loopback local start on EM-core3
Entering the event dispatch loop() on EM-core 6
EO 1:packet_loopback local start on EM-core1
Entering the event dispatch loop() on EM-core 9
Entering the event dispatch loop() on EM-core 4
EO 1:packet_loopback local start on EM-core8
EO 1:packet_loopback local start on EM-core2
EO 1:packet_loopback local start on EM-core7
EO 1:packet_loopback local start on EM-core6
EO 1:packet_loopback local start on EM-core9
EO 1:packet_loopback local start on EM-core4
Last edit: Henrique 2015-03-09
Hi,
Could you try modifying event_machine/intel/em_packet.c, em_init(), line 422?
It seems the 82599 NIC does not really like more than four RX queues per port in RSS mode. You run into this when using only one 10G interface with a high core count: e.g. (nb_lcores / nb_ports) + 1 with 14 cores gives 14/1 + 1 = 15 queues, which, at least in my setup, does not give good results.
With n_rx_queue = 4 I get 99.7% of line rate with 1x10GE and 14 cores.
/carl