#7 Tx hang on 82599

Status: closed
Labels: ixgbe (40)
Group: in-kernel_driver
Priority: 5
Updated: 2015-01-11
Created: 2011-10-06
Creator: ilyaminkin
Private: No

The setup:

Intel 520X - dual 82599 used for custom GRE decapsulation. All GRE encapsulated packets are received on eth0, decapsulated and transmitted out of eth1. Linux ip_gre has been modified to 1) not require any GRE tunnel setup and 2) bypass all routing. Effectively, ip_gre decapsulates GRE packets and calls this:

skb->dev = dev_eth1;
dev_queue_xmit(skb);

eth0 is connected to a 1G port on a switch; eth1 is connected to either a 1G or a 10G port on a switch. The results were the same in both cases.

There is no reverse traffic, i.e. eth0 always receives and eth1 always transmits.

When running the test there are continuous Tx hang error messages:

[88052.218475] ixgbe 0000:02:00.0: eth1: Detected Tx Unit Hang
[88052.218479] Tx Queue <1>
[88052.218481] TDH, TDT <0>, <6>
[88052.218483] next_to_use <6>
[88052.218484] next_to_clean <0>
[88052.218491] ixgbe 0000:02:00.0: eth1: Detected Tx Unit Hang
[88052.218494] Tx Queue <4>
[88052.218495] TDH, TDT <81>, <5b>
[88052.218498] next_to_use <5b>
[88052.218500] next_to_clean <81>
[88052.218503] ixgbe 0000:02:00.0: eth1: tx_buffer_info[next_to_clean]
[88052.218505] time_stamp <14f3fa5>
[88052.218506] jiffies <14f43a0>
[88052.218509] ixgbe 0000:02:00.0: eth1: tx hang 1777 detected on queue 4, resetting adapter
[88052.218520] ixgbe 0000:02:00.0: eth1: Reset adapter
[88052.219469] ixgbe 0000:02:00.0: eth1: tx_buffer_info[next_to_clean]
[88052.219470] time_stamp <14f4059>
[88052.219471] jiffies <14f43a0>
[88052.219663] ixgbe 0000:02:00.0: eth1: tx hang 1778 detected on queue 1, resetting adapter
[88052.757126] ixgbe 0000:02:00.0: eth1: detected SFP+: 5
[88053.523999] ixgbe 0000:02:00.0: eth1: NIC Link is Up 1 Gbps, Flow Control: RX/TX
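The hang reports above can be read numerically. The following is a minimal sketch, assuming the bracketed values are hexadecimal (as `5b` suggests), that the Tx ring holds 512 descriptors (an assumption; the ring size is not stated in this report), and that each time_stamp line belongs to the "tx hang ... detected on queue N" line that follows it. The function names are illustrative only.

```python
# Assumption: the ring size is not given in the report; 512 is used for illustration.
ASSUMED_RING_SIZE = 512


def pending_descriptors(tdh_hex, tdt_hex, ring_size=ASSUMED_RING_SIZE):
    """Descriptors between head (TDH) and tail (TDT), allowing for ring wrap."""
    return (int(tdt_hex, 16) - int(tdh_hex, 16)) % ring_size


def stall_age(time_stamp_hex, jiffies_hex):
    """Jiffies elapsed since the buffer at next_to_clean was time-stamped."""
    return int(jiffies_hex, 16) - int(time_stamp_hex, 16)


# Values copied from the log excerpt above.
queue1_pending = pending_descriptors("0", "6")
queue4_pending = pending_descriptors("81", "5b")
queue4_age = stall_age("14f3fa5", "14f43a0")
queue1_age = stall_age("14f4059", "14f43a0")

print("queue 1:", queue1_pending, "descriptors pending,", queue1_age, "jiffies old")
print("queue 4:", queue4_pending, "descriptors pending,", queue4_age, "jiffies old")
```

Converting the jiffies difference to seconds requires the kernel's HZ value, which the report does not give.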

kernel version: 2.6.32
ixgbe driver version: 3.5.14 built with LRO disabled as per README
all offload is disabled

One last thing: the problem is easily reproducible in a 'live' setup at a relatively low ingress rate (~100 Mbps) consisting mostly of TCP/IP flows (SMB traffic between Windows machines and some random Windows noise traffic) encapsulated in GRE. The problem is not reproducible in a Spirent setup where UDP packets are encapsulated in GRE. The Spirent setup runs successfully all the way to 10G. I also tested random packet sizes, random IP/UDP payload IP addresses and ports, and random network utilization. All ran without problems.

ethtool output is below:

root@gre10g:~# uname -a
Linux gre10g 2.6.32-no-gre #1 SMP Sun Jun 5 16:43:14 PDT 2011 i686 GNU/Linux
root@gre10g:~# ethtool -i eth0
driver: ixgbe
version: 3.5.14-NAPI
firmware-version: 0x02088000
bus-info: 0000:02:00.1
root@gre10g:~# ethtool -i eth1
driver: ixgbe
version: 3.5.14-NAPI
firmware-version: 0x02088000
bus-info: 0000:02:00.0
root@gre10g:~#
root@gre10g:~# ethtool -k eth1
Offload parameters for eth1:
rx-checksumming: on
tx-checksumming: on
scatter-gather: off
tcp-segmentation-offload: off
udp-fragmentation-offload: off
generic-segmentation-offload: off
generic-receive-offload: off
large-receive-offload: off
ntuple-filters: off
receive-hashing: off
root@gre10g:~# ethtool -k eth0
Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: off
tcp-segmentation-offload: off
udp-fragmentation-offload: off
generic-segmentation-offload: off
generic-receive-offload: off
large-receive-offload: off
ntuple-filters: off
receive-hashing: off
root@gre10g:~#
root@gre10g:~#
root@gre10g:~# ethtool -S eth0
NIC statistics:
rx_packets: 2205663126
tx_packets: 1397
rx_bytes: 3849203840
tx_bytes: 93138
rx_errors: 80
tx_errors: 0
rx_dropped: 0
tx_dropped: 0
multicast: 3
collisions: 0
rx_over_errors: 0
rx_crc_errors: 0
rx_frame_errors: 0
rx_fifo_errors: 0
rx_missed_errors: 0
tx_aborted_errors: 0
tx_carrier_errors: 0
tx_fifo_errors: 0
tx_heartbeat_errors: 0
rx_pkts_nic: 2205650184
tx_pkts_nic: 1397
rx_bytes_nic: 1661929810120
tx_bytes_nic: 112748
lsc_int: 7
tx_busy: 0
non_eop_descs: 0
broadcast: 219602
rx_no_buffer_count: 0
tx_timeout_count: 0
tx_restart_queue: 0
rx_long_length_errors: 24
rx_short_length_errors: 0
tx_flow_control_xon: 1
rx_flow_control_xon: 4
tx_flow_control_xoff: 2
rx_flow_control_xoff: 5
rx_csum_offload_errors: 0
alloc_rx_page_failed: 0
alloc_rx_buff_failed: 0
rx_no_dma_resources: 0
hw_rsc_aggregated: 0
hw_rsc_flushed: 0
fdir_match: 365
fdir_miss: 561960060
fdir_overflow: 0
fcoe_bad_fccrc: 0
fcoe_last_errors: 0
rx_fcoe_dropped: 0
rx_fcoe_packets: 0
rx_fcoe_dwords: 0
tx_fcoe_packets: 0
tx_fcoe_dwords: 0
os2bmc_rx_by_bmc: 0
os2bmc_tx_by_bmc: 0
os2bmc_tx_by_host: 0
os2bmc_rx_by_host: 0
tx_queue_0_packets: 5
tx_queue_0_bytes: 1898
tx_queue_1_packets: 10
tx_queue_1_bytes: 2080
tx_queue_2_packets: 0
tx_queue_2_bytes: 0
tx_queue_3_packets: 0
tx_queue_3_bytes: 0
tx_queue_4_packets: 264
tx_queue_4_bytes: 15088
tx_queue_5_packets: 230
tx_queue_5_bytes: 14906
tx_queue_6_packets: 879
tx_queue_6_bytes: 58620
tx_queue_7_packets: 9
tx_queue_7_bytes: 546
rx_queue_0_packets: 1251215188
rx_queue_0_bytes: 808401561828
rx_queue_1_packets: 53710
rx_queue_1_bytes: 17425550
rx_queue_2_packets: 9440
rx_queue_2_bytes: 886794
rx_queue_3_packets: 2380
rx_queue_3_bytes: 280888
rx_queue_4_packets: 954359219
rx_queue_4_bytes: 844683330118
rx_queue_5_packets: 22288
rx_queue_5_bytes: 12999970
rx_queue_6_packets: 373
rx_queue_6_bytes: 36053
rx_queue_7_packets: 528
rx_queue_7_bytes: 124303
root@gre10g:~# ethtool -S eth1
NIC statistics:
rx_packets: 1754259
tx_packets: 14400227
rx_bytes: 1127845084
tx_bytes: 2250276478
rx_errors: 0
tx_errors: 0
rx_dropped: 0
tx_dropped: 0
multicast: 0
collisions: 0
rx_over_errors: 0
rx_crc_errors: 0
rx_frame_errors: 0
rx_fifo_errors: 0
rx_missed_errors: 0
tx_aborted_errors: 0
tx_carrier_errors: 0
tx_fifo_errors: 0
tx_heartbeat_errors: 0
rx_pkts_nic: 1753991
tx_pkts_nic: 14395336
rx_bytes_nic: 1134777667
tx_bytes_nic: 16502568810
lsc_int: 2315
tx_busy: 0
non_eop_descs: 0
broadcast: 11458
rx_no_buffer_count: 0
tx_timeout_count: 1740
tx_restart_queue: 63090
rx_long_length_errors: 0
rx_short_length_errors: 0
tx_flow_control_xon: 22
rx_flow_control_xon: 1187173
tx_flow_control_xoff: 44
rx_flow_control_xoff: 1206650
rx_csum_offload_errors: 0
alloc_rx_page_failed: 0
alloc_rx_buff_failed: 0
rx_no_dma_resources: 0
hw_rsc_aggregated: 0
hw_rsc_flushed: 0
fdir_match: 0
fdir_miss: 383318
fdir_overflow: 0
fcoe_bad_fccrc: 0
fcoe_last_errors: 0
rx_fcoe_dropped: 0
rx_fcoe_packets: 0
rx_fcoe_dwords: 0
tx_fcoe_packets: 0
tx_fcoe_dwords: 0
os2bmc_rx_by_bmc: 0
os2bmc_tx_by_bmc: 0
os2bmc_tx_by_host: 0
os2bmc_rx_by_host: 0
tx_queue_0_packets: 8150255
tx_queue_0_bytes: 8319741099
tx_queue_1_packets: 1845
tx_queue_1_bytes: 548950
tx_queue_2_packets: 0
tx_queue_2_bytes: 0
tx_queue_3_packets: 0
tx_queue_3_bytes: 0
tx_queue_4_packets: 6248064
tx_queue_4_bytes: 11110042933
tx_queue_5_packets: 79
tx_queue_5_bytes: 6342
tx_queue_6_packets: 0
tx_queue_6_bytes: 0
tx_queue_7_packets: 0
tx_queue_7_bytes: 0
rx_queue_0_packets: 1747764
rx_queue_0_bytes: 1124793474
rx_queue_1_packets: 315
rx_queue_1_bytes: 87261
rx_queue_2_packets: 506
rx_queue_2_bytes: 47307
rx_queue_3_packets: 114
rx_queue_3_bytes: 13928
rx_queue_4_packets: 4312
rx_queue_4_bytes: 2204062
rx_queue_5_packets: 1207
rx_queue_5_bytes: 690877
rx_queue_6_packets: 0
rx_queue_6_bytes: 0
rx_queue_7_packets: 41
rx_queue_7_bytes: 8175
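With dumps this long, it can help to pull out only the counters that relate to the hang. The sketch below parses `ethtool -S` text into a dictionary and reports the non-zero values from a small watch list. The watch list and function names are illustrative choices, and the excerpt is a subset of the eth1 counters shown above.

```python
# Illustrative watch list: counters from the dump that relate to Tx stalls and flow control.
WATCHED = (
    "tx_timeout_count",
    "tx_restart_queue",
    "lsc_int",
    "rx_flow_control_xon",
    "rx_flow_control_xoff",
)


def parse_ethtool_stats(text):
    """Turn 'name: value' lines into a dict, skipping headers and non-numeric lines."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def nonzero_watched(stats, watched=WATCHED):
    """Return only the watched counters that are present and non-zero."""
    return {name: stats[name] for name in watched if stats.get(name, 0) > 0}


# Subset of the eth1 counters from the dump above.
ETH1_EXCERPT = """\
NIC statistics:
tx_errors: 0
lsc_int: 2315
tx_timeout_count: 1740
tx_restart_queue: 63090
rx_flow_control_xon: 1187173
rx_flow_control_xoff: 1206650
"""

flagged = nonzero_watched(parse_ethtool_stats(ETH1_EXCERPT))
for name, value in flagged.items():
    print(name, value)
```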

I don't have the output of cat /proc/interrupts handy, but I observed that interrupt counts stop incrementing when a Tx hang is detected.
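One way to confirm this observation is to take two snapshots of /proc/interrupts and list the vectors whose totals did not change. The sketch below works on the file's text. The sample snapshots and the vector names (`eth1-TxRx-0`, `eth1-TxRx-1`) are hypothetical, not taken from the affected machine.

```python
def irq_totals(proc_interrupts_text, iface):
    """Sum the per-CPU counts of every IRQ line whose name starts with iface."""
    totals = {}
    for line in proc_interrupts_text.splitlines():
        fields = line.split()
        if len(fields) < 3 or not fields[0].endswith(":"):
            continue  # CPU header line or something unexpected
        name = fields[-1]
        if not name.startswith(iface):
            continue
        counts = []
        for field in fields[1:]:
            if not field.isdigit():
                break  # reached the interrupt type column
            counts.append(int(field))
        totals[name] = sum(counts)
    return totals


def stalled_vectors(before, after):
    """Names of vectors whose total count is unchanged between two snapshots."""
    return sorted(name for name, count in after.items() if count == before.get(name))


# Hypothetical snapshots for illustration only.
SAMPLE_BEFORE = """\
           CPU0       CPU1
 45:       1000       2000   PCI-MSI-edge      eth1-TxRx-0
 46:        500        700   PCI-MSI-edge      eth1-TxRx-1
"""
SAMPLE_AFTER = """\
           CPU0       CPU1
 45:       1000       2000   PCI-MSI-edge      eth1-TxRx-0
 46:        650        900   PCI-MSI-edge      eth1-TxRx-1
"""

stalled = stalled_vectors(irq_totals(SAMPLE_BEFORE, "eth1"), irq_totals(SAMPLE_AFTER, "eth1"))
print(stalled)
```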

I also tried with both the default smp_affinity (0xff) and core-per-queue affinity as set by the set_irq_affinity.sh script. The results were the same.
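For reference, the two affinity layouts can be expressed as hexadecimal CPU bitmasks. This is a minimal sketch, assuming 8 cores and 8 queues (consistent with the 0xff default and the queue numbering in the ethtool output) and a simple queue N to core N mapping. The actual set_irq_affinity.sh script may assign cores differently.

```python
def all_cores_mask(cpu_count):
    """Bitmask with one bit set for every core, as a hex string."""
    return format((1 << cpu_count) - 1, "x")


def per_queue_masks(queue_count):
    """One single-bit mask per queue: queue N is pinned to core N (assumed mapping)."""
    return [format(1 << queue, "x") for queue in range(queue_count)]


default_mask = all_cores_mask(8)
queue_masks = per_queue_masks(8)

print("default:", default_mask)
for queue, mask in enumerate(queue_masks):
    print("queue", queue, "mask", mask)
```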

Discussion

  • Don Skidmore

    Don Skidmore - 2011-10-06

    Thanks for the detailed problem report.
    It interests me that you only see the failure with (as you put it) “live” traffic. One thing that jumped to mind: could there be some non-GRE-encapsulated traffic? I imagine that might cause issues, with some packets being routed while others are forced through ip_gre. Could you send us a patch with the changes you made?
    Likewise, it would be helpful to see the ethtool -k output. I also noticed that you have Flow Director enabled, which probably won’t do you much good since you are primarily dealing with encapsulated flows.

     
  • ilyaminkin

    ilyaminkin - 2011-10-06

    Thank you for a quick response.

    ethtool -k output is there, just hiding in the middle :)

    routing - I doubt it: eth0 is not in promisc mode, routing is not enabled, and eth1 does not have an IP address assigned.

    live traffic - I'm getting a setup with Linux VMs instead of Windows VMs. I wonder if some of the Windows chatter is causing problems when forwarded and decapsulated. I will post an update when I have the data.

    Flow Director - since it is enabled, RX will hash ingress packets to determine ingress queues, right? I noticed that when I have multiple GRE tunnels, (same DIP different SIPs) I see multiple queues being used.
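How ingress traffic actually spreads across queues can be checked against the eth0 rx_queue_N_packets counters in the original report. The sketch below computes each queue's share of received packets. The function name is illustrative, and the numbers are copied from that first `ethtool -S eth0` dump.

```python
# rx_queue_N_packets for eth0, copied from the first ethtool -S dump in the report.
ETH0_RX_QUEUE_PACKETS = {
    0: 1251215188,
    1: 53710,
    2: 9440,
    3: 2380,
    4: 954359219,
    5: 22288,
    6: 373,
    7: 528,
}


def queue_shares(packets_by_queue):
    """Fraction of all received packets that landed on each queue."""
    total = sum(packets_by_queue.values())
    return {queue: count / total for queue, count in packets_by_queue.items()}


shares = queue_shares(ETH0_RX_QUEUE_PACKETS)
busiest = sorted(shares, key=shares.get, reverse=True)[:2]

for queue in sorted(shares):
    print("queue", queue, format(shares[queue], ".4%"))
print("busiest queues:", busiest)
```

In that dump, nearly all packets land on two of the eight queues.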

    GRE patch is attached.

     
  • Don Skidmore

    Don Skidmore - 2011-10-07

    Sorry, I didn’t notice the ethtool -k output on my first pass; in my defense, you made one very detailed report. :)

    I see your point about the routing: if you don’t even have an IP address, you can’t expect much in the way of traffic sneaking in. I was just concerned that there might be issues with circumventing other parts of the stack for non-GRE traffic, but it looks like you’re isolating it pretty well.

    The issue that keeps confusing me is that you don’t see the Tx hang with Spirent-produced traffic while you do with ‘live’ traffic. The driver shouldn’t care either way. Another item that jumped out at me is that there seems to be a fair amount of RX flow control (FC) on the eth1 adapter. This is surprising considering you’re only getting ~100 Mbps. Is the data going out in bursts?

     
  • ilyaminkin

    ilyaminkin - 2011-10-08

    Did one more test: connected eth1 to the identical 520X port in the second unit. I wanted to eliminate any possible weirdness due to a different SFP, etc. Same error.

    Below are the counters from a short run.

    Note FC counters are 0. I think previous counters were > 0 because I was messing with the setup for a while.

    ethtool -S eth0
    NIC statistics:
    rx_packets: 3023418
    tx_packets: 42
    rx_bytes: 3579009778
    tx_bytes: 4472
    rx_errors: 0
    tx_errors: 0
    rx_dropped: 0
    tx_dropped: 0
    multicast: 0
    collisions: 0
    rx_over_errors: 0
    rx_crc_errors: 0
    rx_frame_errors: 0
    rx_fifo_errors: 0
    rx_missed_errors: 0
    tx_aborted_errors: 0
    tx_carrier_errors: 0
    tx_fifo_errors: 0
    tx_heartbeat_errors: 0
    rx_pkts_nic: 3023418
    tx_pkts_nic: 42
    rx_bytes_nic: 3591103450
    tx_bytes_nic: 5036
    lsc_int: 1
    tx_busy: 0
    non_eop_descs: 0
    broadcast: 14133
    rx_no_buffer_count: 0
    tx_timeout_count: 0
    tx_restart_queue: 0
    rx_long_length_errors: 0
    rx_short_length_errors: 0
    tx_flow_control_xon: 0
    rx_flow_control_xon: 0
    tx_flow_control_xoff: 0
    rx_flow_control_xoff: 0
    rx_csum_offload_errors: 0
    alloc_rx_page_failed: 0
    alloc_rx_buff_failed: 0
    rx_no_dma_resources: 0
    hw_rsc_aggregated: 0
    hw_rsc_flushed: 0
    fdir_match: 0
    fdir_miss: 477472
    fdir_overflow: 0
    fcoe_bad_fccrc: 0
    fcoe_last_errors: 0
    rx_fcoe_dropped: 0
    rx_fcoe_packets: 0
    rx_fcoe_dwords: 0
    tx_fcoe_packets: 0
    tx_fcoe_dwords: 0
    os2bmc_rx_by_bmc: 0
    os2bmc_tx_by_bmc: 0
    os2bmc_tx_by_host: 0
    os2bmc_rx_by_host: 0
    tx_queue_0_packets: 0
    tx_queue_0_bytes: 0
    tx_queue_1_packets: 6
    tx_queue_1_bytes: 468
    tx_queue_2_packets: 0
    tx_queue_2_bytes: 0
    tx_queue_3_packets: 0
    tx_queue_3_bytes: 0
    tx_queue_4_packets: 0
    tx_queue_4_bytes: 0
    tx_queue_5_packets: 10
    tx_queue_5_bytes: 1606
    tx_queue_6_packets: 26
    tx_queue_6_bytes: 2398
    tx_queue_7_packets: 0
    tx_queue_7_bytes: 0
    rx_queue_0_packets: 65949
    rx_queue_0_bytes: 34181802
    rx_queue_1_packets: 0
    rx_queue_1_bytes: 0
    rx_queue_2_packets: 504
    rx_queue_2_bytes: 47274
    rx_queue_3_packets: 142
    rx_queue_3_bytes: 17329
    rx_queue_4_packets: 2955602
    rx_queue_4_bytes: 3544059244
    rx_queue_5_packets: 1193
    rx_queue_5_bytes: 697105
    rx_queue_6_packets: 0
    rx_queue_6_bytes: 0
    rx_queue_7_packets: 28
    rx_queue_7_bytes: 7024

    ethtool -S eth1

    NIC statistics:
    rx_packets: 9
    tx_packets: 270292
    rx_bytes: 1070
    tx_bytes: 504052274
    rx_errors: 0
    tx_errors: 0
    rx_dropped: 0
    tx_dropped: 0
    multicast: 0
    collisions: 0
    rx_over_errors: 0
    rx_crc_errors: 0
    rx_frame_errors: 0
    rx_fifo_errors: 0
    rx_missed_errors: 0
    tx_aborted_errors: 0
    tx_carrier_errors: 0
    tx_fifo_errors: 0
    tx_heartbeat_errors: 0
    rx_pkts_nic: 9
    tx_pkts_nic: 270292
    rx_bytes_nic: 1106
    tx_bytes_nic: 324843906
    lsc_int: 94
    tx_busy: 0
    non_eop_descs: 0
    broadcast: 9
    rx_no_buffer_count: 0
    tx_timeout_count: 72
    tx_restart_queue: 100
    rx_long_length_errors: 0
    rx_short_length_errors: 0
    tx_flow_control_xon: 0
    rx_flow_control_xon: 0
    tx_flow_control_xoff: 0
    rx_flow_control_xoff: 0
    rx_csum_offload_errors: 0
    alloc_rx_page_failed: 0
    alloc_rx_buff_failed: 0
    rx_no_dma_resources: 0
    hw_rsc_aggregated: 0
    hw_rsc_flushed: 0
    fdir_match: 0
    fdir_miss: 1
    fdir_overflow: 0
    fcoe_bad_fccrc: 0
    fcoe_last_errors: 0
    rx_fcoe_dropped: 0
    rx_fcoe_packets: 0
    rx_fcoe_dwords: 0
    tx_fcoe_packets: 0
    tx_fcoe_dwords: 0
    os2bmc_rx_by_bmc: 0
    os2bmc_tx_by_bmc: 0
    os2bmc_tx_by_host: 0
    os2bmc_rx_by_host: 0
    tx_queue_0_packets: 6881
    tx_queue_0_bytes: 4200864
    tx_queue_1_packets: 0
    tx_queue_1_bytes: 0
    tx_queue_2_packets: 0
    tx_queue_2_bytes: 0
    tx_queue_3_packets: 0
    tx_queue_3_bytes: 0
    tx_queue_4_packets: 263411
    tx_queue_4_bytes: 499851410
    tx_queue_5_packets: 0
    tx_queue_5_bytes: 0
    tx_queue_6_packets: 0
    tx_queue_6_bytes: 0
    tx_queue_7_packets: 0
    tx_queue_7_bytes: 0
    rx_queue_0_packets: 8
    rx_queue_0_bytes: 480
    rx_queue_1_packets: 0
    rx_queue_1_bytes: 0
    rx_queue_2_packets: 0
    rx_queue_2_bytes: 0
    rx_queue_3_packets: 0
    rx_queue_3_bytes: 0
    rx_queue_4_packets: 0
    rx_queue_4_bytes: 0
    rx_queue_5_packets: 1
    rx_queue_5_bytes: 590
    rx_queue_6_packets: 0
    rx_queue_6_bytes: 0
    rx_queue_7_packets: 0
    rx_queue_7_bytes: 0

     
  • ilyaminkin

    ilyaminkin - 2011-10-11

    Any update? If there is some debug code that you'd like to run let me know.

     
  • Jacob Keller

    Jacob Keller - 2012-05-04

    Please respond if this is still an issue. The bug is over 6 months old. If there is no response in 60 days, the bug will be automatically closed.

     
  • Todd Fujinaka

    Todd Fujinaka - 2013-07-09
    • status: pending --> closed
    • Group: --> in-kernel_driver
     
  • Todd Fujinaka

    Todd Fujinaka - 2013-07-09

    Closed due to inactivity.

     
