[E1000-devel] Transmit queueing, resulting high latency and speed drops


Hello,

I have been investigating this one for weeks now. Yesterday I thought I had
finally figured it out, but it seems I was wrong, so here is my final try:
emailing the developers. :)

Kernel is vanilla 2.6.17.11 SMP. [I hope it's not something that has been fixed
in the kernel in the meantime... :-/]

Driver is 7.0.33-k2-NAPI.

The server is an Intel EtherExpress-based board with strong CPUs, plenty of memory,
etc. It contains 4 cards, of which two are relevant:
03:02.0 Ethernet controller: Intel Corporation 82545GM Gigabit Ethernet
Controller (rev 04)
06:01.0 Ethernet controller: Intel Corporation 82541GI/PI Gigabit Ethernet
Controller (rev 05)

The first one is an external PCI-X:133MHz:64-bit card; the other is integrated
onboard. lspci thinks it runs at the same speed, while the e1000 driver reports:
 e1000: 0000:06:01.0: e1000_probe: (PCI:66MHz:32-bit)

The traffic goes through these cards, mainly in through the first and out through
the second (200 Mbit/s), with slightly less traffic the other way (70 Mbit/s).

When traffic reaches approx. 32000 packets/sec (I fear it might be 32768
pkts/sec, which is always a bad omen), the first card starts to queue outgoing
traffic (while the second card does not). The queueing is visible in the Linux
qdisc statistics:

qdisc tbf 8007: dev eth0 rate 1000Mbit burst 32750b/8 mpu 0b lat 20.0ms 
 Sent 22893578821 bytes 68342382 pkt (dropped 1, overlimits 61 requeues 299611) 
 rate 85935Kbit 32380pps backlog 0b 449p requeues 299611 
qdisc tbf 8005: dev eth1 rate 1000Mbit burst 32750b/8 mpu 0b lat 20.0ms 
 Sent 681354341621 bytes 937627172 pkt (dropped 12523, overlimits 127208 requeues 1853808)
 rate 227450Kbit 34288pps backlog 0b 0p requeues 1853808 

(tbf was selected only to be able to measure the traffic, rate is same as link
capacity to prevent overlimits; the queueing happens with pfifo_fast too.)
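
The eth0 figures above can be turned into two derived numbers: the average packet size and a rough queueing delay for the 449-packet backlog. The following Python sketch is only an illustration. The function name parse_qdisc is made up here, it assumes that tc's "Kbit" means 1000 bit/s, and it assumes the backlog drains at the measured packet rate.

```python
import re

# eth0 entry copied from the `tc -s qdisc` output above.
SAMPLE = """qdisc tbf 8007: dev eth0 rate 1000Mbit burst 32750b/8 mpu 0b lat 20.0ms
 Sent 22893578821 bytes 68342382 pkt (dropped 1, overlimits 61 requeues 299611)
 rate 85935Kbit 32380pps backlog 0b 449p requeues 299611
"""


def parse_qdisc(text):
    """Extract the measured rate, packet rate and backlog of one qdisc entry."""
    dev = re.search(r"dev (\S+)", text).group(1)
    # Matches the measured-rate line only; the configured "rate 1000Mbit"
    # on the first line has no "Kbit ... pps" part and is skipped.
    m = re.search(r"rate (\d+)Kbit (\d+)pps backlog (\d+)b (\d+)p", text)
    rate_kbit, pps, _backlog_bytes, backlog_pkts = map(int, m.groups())
    return {
        "dev": dev,
        "rate_kbit": rate_kbit,
        "pps": pps,
        "backlog_pkts": backlog_pkts,
        # Assumption: 1 Kbit = 1000 bit.
        "avg_pkt_bytes": rate_kbit * 1000 / 8 / pps,
        # Assumption: the queue drains at the measured packet rate.
        "queue_delay_ms": backlog_pkts / pps * 1000,
    }


stats = parse_qdisc(SAMPLE)
print(stats)
```

For the eth0 sample this gives an average packet of roughly 332 bytes and a queueing delay of roughly 14 ms.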

At first we noticed the cards were generating 8k interrupts/sec and thought the
interrupt limiting was causing the queueing, so I adjusted the interrupt delay
parameters and set them to:
e1000 TxIntDelay=25,25,25 TxAbsIntDelay=256,256,256 RxIntDelay=64,64,64 RxAbsIntDelay=256,256,256

This reduced the interrupt rate to 2k-5k per second, but packets still queue up
at the same packet rate.
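
For reference, the e1000 documentation gives the unit of the *IntDelay parameters as 1.024 microseconds. A small Python sketch (the helper names are hypothetical) converts the values above and estimates the interrupt rate that would result if, under continuous traffic, only the absolute delay timer fired. It ignores InterruptThrottleRate and every other interrupt source, so it is a rough estimate rather than a model of the driver.

```python
TICK_US = 1.024  # unit of the e1000 interrupt delay parameters, in microseconds


def delay_us(ticks):
    """Convert a delay parameter value to microseconds."""
    return ticks * TICK_US


def irq_rate_from_abs_delay(abs_ticks):
    """Interrupts/sec if one interrupt fires per absolute-delay period.

    Assumption: traffic is continuous and nothing else triggers or
    throttles interrupts.
    """
    return 1_000_000 / delay_us(abs_ticks)


# Values taken from the module options above.
for name, ticks in [("TxIntDelay", 25), ("RxIntDelay", 64),
                    ("TxAbsIntDelay", 256), ("RxAbsIntDelay", 256)]:
    print(f"{name}={ticks}: {delay_us(ticks):.3f} us")

print(f"estimate: {irq_rate_from_abs_delay(256):.0f} interrupts/sec per timer")
```

With an absolute delay of 256 the estimate is about 3.8k interrupts/sec per timer, which is in the same range as the observed 2k-5k.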

I do not believe it's a hardware issue (I can't think of any that would cause
this), but I have run out of ideas about what could cause the packets to
queue up.

The stats don't tell me anything helpful either:
# ethtool -S eth0
NIC statistics:
     rx_packets: 1212685382
     tx_packets: 938711270
     rx_bytes: 750835435
     tx_bytes: 3250721930
     rx_errors: 0
     tx_errors: 0
     tx_dropped: 0
     multicast: 0
     collisions: 0
     rx_length_errors: 0
     rx_over_errors: 0
     rx_crc_errors: 0
     rx_frame_errors: 0
     rx_no_buffer_count: 8083597
     rx_missed_errors: 52373
     tx_aborted_errors: 0
     tx_carrier_errors: 0
     tx_fifo_errors: 0
     tx_heartbeat_errors: 0
     tx_window_errors: 0
     tx_abort_late_coll: 0
     tx_deferred_ok: 3991825
     tx_single_coll_ok: 0
     tx_multi_coll_ok: 0
     tx_timeout_count: 0
     rx_long_length_errors: 0
     rx_short_length_errors: 0
     rx_align_errors: 0
     tx_tcp_seg_good: 4
     tx_tcp_seg_failed: 0
     rx_flow_control_xon: 0
     rx_flow_control_xoff: 141102752
     tx_flow_control_xon: 129374
     tx_flow_control_xoff: 143188
     rx_long_byte_count: 825384556267
     rx_csum_offload_good: 1208095761
     rx_csum_offload_errors: 37281
     rx_header_split: 0
     alloc_rx_buff_failed: 0
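
A listing this long is easier to scan once the zero counters are filtered out. The Python sketch below is one possible way to do that on a subset of the output above. The function names are made up for illustration, and the choice of which counters to skip is an assumption. It also checks how the 32-bit rx_bytes value relates to rx_long_byte_count.

```python
# Subset of the `ethtool -S eth0` output above.
SAMPLE = """\
NIC statistics:
     rx_bytes: 750835435
     tx_errors: 0
     rx_no_buffer_count: 8083597
     rx_missed_errors: 52373
     tx_deferred_ok: 3991825
     rx_flow_control_xoff: 141102752
     tx_flow_control_xoff: 143188
     rx_long_byte_count: 825384556267
"""


def parse_ethtool_stats(text):
    """Parse 'name: value' lines into a dict of integers."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        # The "NIC statistics:" header has no numeric value and is skipped.
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def nonzero(stats, skip=("rx_bytes", "rx_long_byte_count")):
    """Return the non-zero counters, leaving out plain byte totals."""
    return {k: v for k, v in stats.items() if v and k not in skip}


stats = parse_ethtool_stats(SAMPLE)
flagged = nonzero(stats)
wrapped = stats["rx_long_byte_count"] % 2**32 == stats["rx_bytes"]

for name, value in flagged.items():
    print(f"{name}: {value}")
print("rx_bytes equals rx_long_byte_count modulo 2**32:", wrapped)
```

On this sample the check is true, which suggests rx_bytes has wrapped around its 32-bit range and rx_long_byte_count is the better figure for totals.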

So I'm clueless here. Any help would be very much appreciated. Emailing me
personally or CC'ing me would be appreciated as well. I can of course provide
EEPROM dumps, lspci -vv output, or anything else lengthy, if it helps.

Thanks,
Peter
