#354 Intel PRO/1000 (82574L) randomly resets and breaks connections

closed
None
in-kernel_driver
1
2015-03-02
2012-08-22
Topi Mäenpää
No

I'm having serious trouble with Intel PRO/1000 network adapters and the e1000e Linux driver. I have tried every fix I could think of, and now I'm reaching out to you for new ideas.

The PC (Lanner LEC-2220P) in question comes with two Intel NICs on board. We have installed a third card in a PCIe slot. The first one is connected to a 10/100 embedded device, the second one to a PC with a 1 Gb NIC, and the last one to a GigE camera (which produces at most some 300 Mb/s of data). No switches in between. The problem occurs in a similar way in four identical systems. Thus, I believe we can rule out hardware failures.

The operating system in question is Debian Squeeze, running Linux kernel 2.6.32-5-amd64 by default. It comes with an old e1000e driver module (1.2.something).

Most of the time, the connections work fine. However, any of the three devices may go down after a seemingly random amount of time with no clear reason. What I see in system logs (/var/log/messages and /var/log/syslog) is this:

Aug 17 22:08:56 XXX kernel: [22144.179804] e1000e: eth0 NIC Link is Down
Aug 17 22:08:57 XXX kernel: [22145.806317] e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: Rx/Tx
Aug 17 22:08:57 XXX kernel: [22145.806321] e1000e 0000:01:00.0: eth0: 10/100 speed: disabling TSO

Sometimes, it is only one of the devices (in this case eth0), sometimes two, sometimes all three. The order in which the links go down follows no pattern. After the failure, the link seems to recover, but the outage causes the camera to lose some data and its connection.

Another case, this time in /var/log/kern.log:

Aug 19 10:58:37 XXXX kernel: [10928.297751] e1000e: eth0 NIC Link is Down
Aug 19 10:58:37 XXXX kernel: [10928.306331] e1000e 0000:01:00.0: eth0: Reset adapter
Aug 19 10:58:38 XXXX kernel: [10929.007842] e1000e: eth2 NIC Link is Down
Aug 19 10:58:38 XXXX kernel: [10929.023839] e1000e: eth1 NIC Link is Down
Aug 19 10:58:39 XXXX kernel: [10930.063844] e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: Rx/Tx
Aug 19 10:58:39 XXXX kernel: [10930.063847] e1000e 0000:01:00.0: eth0: 10/100 speed: disabling TSO
Aug 19 10:58:41 XXXX kernel: [10931.614458] e1000e: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
Aug 19 10:58:41 XXXX kernel: [10932.077233] e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx

It took three seconds to bring eth2 back up.
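
The three-second figure above comes from the syslog clock, which has one-second resolution. A minimal sketch of how the outage length could instead be derived from the kernel's bracketed timestamps; the function name `link_downtime` is made up for this example, and it assumes the message format shown in the logs above:

```shell
# Report how long each interface stayed down, based on the kernel
# timestamps ("[seconds.microseconds]") in e1000e link messages.
# Reads log lines on stdin. The function name is hypothetical.
link_downtime() {
    LC_ALL=C awk '
    /NIC Link is (Down|Up)/ {
        # Locate the kernel timestamp; it may be padded with spaces.
        if (!match($0, /\[ *[0-9]+\.[0-9]+\]/)) next
        ts = substr($0, RSTART + 1, RLENGTH - 2) + 0
        # Remainder looks like: e1000e: eth2 NIC Link is Down
        rest = substr($0, RSTART + RLENGTH)
        split(rest, f, " ")
        iface = f[2]
        state = f[6]
        if (state == "Down") {
            down[iface] = ts
        } else if (iface in down) {
            printf "%s down for %.1f s\n", iface, ts - down[iface]
            delete down[iface]
        } else {
            # Covers the cases where no "Link is Down" line was logged.
            printf "%s up without a preceding down message\n", iface
        }
    }'
}
```

It can be run as `link_downtime < /var/log/kern.log`. Fed the kern.log excerpt above, it reports eth0, eth2 and eth1 as down for 1.8 s, 2.6 s and 3.1 s respectively.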

In some (relatively rare) cases, I don't see the "Link is Down" message at all. Just "Link is Up".

I updated the e1000e driver to the latest one found on Intel's site (2.0.0.1). I used "make CFLAGS_EXTRA=-DDISABLE_PM install" to disable power management because this was suggested by some people who had had problems with the driver. The new driver loaded fine but didn't solve the problem.

To avoid power management issues I have disabled both pcie_aspm and acpi at boot.

I tried setting InterruptThrottleRate=3000,3000,3000 in the module parameters. No luck. Setting the rate to 10000 had no effect either. I also tried IntMode=1,1,1.
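
The comma-separated values above carry one entry per port. A small sketch of how such an options line could be generated for any number of ports; the helper name `e1000e_options` and the file name in the usage note are assumptions for illustration, not anything the driver ships:

```shell
# Build a modprobe "options" line that repeats one value per port.
# Usage: e1000e_options PORTS ITR INTMODE   (helper name is hypothetical)
e1000e_options() {
    ports=$1
    itr=$2
    mode=$3
    itr_list=
    mode_list=
    i=0
    while [ "$i" -lt "$ports" ]; do
        # Append with a comma separator, except before the first entry.
        itr_list="${itr_list:+$itr_list,}$itr"
        mode_list="${mode_list:+$mode_list,}$mode"
        i=$((i + 1))
    done
    printf 'options e1000e InterruptThrottleRate=%s IntMode=%s\n' \
        "$itr_list" "$mode_list"
}
```

For the three ports here, `e1000e_options 3 3000 1` prints the settings tried above on a single line, which could be redirected into a file under /etc/modprobe.d/ (for example e1000e.conf, a name chosen only for this example).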

Then, I updated the Linux kernel to 3.2.0 and recompiled the driver as well. That did have an effect: the problem now occurred more frequently, so I switched back to the original kernel. I'm not sure if this happened just by chance, but anyway.

I have now set msglvl to 6 on all interfaces. I'll be posting the logs once the problem reappears.

# lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 6.0.5 (squeeze)
Release:    6.0.5
Codename:   squeeze

# uname -r
2.6.32-5-amd64

# ethtool -i eth0
driver: e1000e
version: 2.0.0.1-NAPI
firmware-version: 0.3-0
bus-info: 0000:01:00.0

# ethtool -i eth1
driver: e1000e
version: 2.0.0.1-NAPI
firmware-version: 0.3-0
bus-info: 0000:02:00.0

# ethtool -i eth2
driver: e1000e
version: 2.0.0.1-NAPI
firmware-version: 1.8-0
bus-info: 0000:03:00.0

# lspci -vvv
00:00.0 Host bridge: Intel Corporation Core Processor DRAM Controller (rev 18)
    Subsystem: Intel Corporation Core Processor DRAM Controller
    Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ >SERR- <PERR- INTx-
    Latency: 0
    Capabilities: [e0] Vendor Specific Information: Len=0c <?>
    Kernel driver in use: agpgart-intel

00:02.0 VGA compatible controller: Intel Corporation Core Processor Integrated Graphics Controller (rev 18) (prog-if 00 [VGA controller])
    Subsystem: Intel Corporation Core Processor Integrated Graphics Controller
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0
    Interrupt: pin A routed to IRQ 11
    Region 0: Memory at fe000000 (64-bit, non-prefetchable) [size=4M]
    Region 2: Memory at d0000000 (64-bit, prefetchable) [size=256M]
    Region 4: I/O ports at f0e0 [size=8]
    Expansion ROM at <unassigned> [disabled]
    Capabilities: [90] MSI: Enable- Count=1/1 Maskable- 64bit-
        Address: 00000000  Data: 0000
    Capabilities: [d0] Power Management version 2
        Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
        Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [a4] PCI Advanced Features
        AFCap: TP+ FLR+
        AFCtrl: FLR-
        AFStatus: TP-

00:16.0 Communication controller: Intel Corporation 5 Series/3400 Series Chipset HECI Controller (rev 06)
    Subsystem: Intel Corporation 5 Series/3400 Series Chipset HECI Controller
    Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0
    Interrupt: pin A routed to IRQ 11
    Region 0: Memory at fe708000 (64-bit, non-prefetchable) [size=16]
    Capabilities: [50] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [8c] MSI: Enable- Count=1/1 Maskable- 64bit+
        Address: 0000000000000000  Data: 0000

00:1a.0 USB Controller: Intel Corporation 5 Series/3400 Series Chipset USB2 Enhanced Host Controller (rev 06) (prog-if 20 [EHCI])
    Subsystem: Intel Corporation 5 Series/3400 Series Chipset USB2 Enhanced Host Controller
    Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0
    Interrupt: pin A routed to IRQ 16
    Region 0: Memory at fe707000 (32-bit, non-prefetchable) [size=1K]
    Capabilities: [50] Power Management version 2
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
        Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [58] Debug port: BAR=1 offset=00a0
    Capabilities: [98] PCI Advanced Features
        AFCap: TP+ FLR+
        AFCtrl: FLR-
        AFStatus: TP-
    Kernel driver in use: ehci_hcd

00:1b.0 Audio device: Intel Corporation 5 Series/3400 Series Chipset High Definition Audio (rev 06)
    Subsystem: Intel Corporation 5 Series/3400 Series Chipset High Definition Audio
    Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0, Cache Line Size: 64 bytes
    Interrupt: pin A routed to IRQ 22
    Region 0: Memory at fe700000 (64-bit, non-prefetchable) [size=16K]
    Capabilities: [50] Power Management version 2
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=55mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
        Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [60] MSI: Enable- Count=1/1 Maskable- 64bit+
        Address: 0000000000000000  Data: 0000
    Capabilities: [70] Express (v1) Root Complex Integrated Endpoint, MSI 00
        DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
            ExtTag- RBE- FLReset+
        DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
            RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
            MaxPayload 128 bytes, MaxReadReq 128 bytes
        DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
        LnkCap: Port #0, Speed unknown, Width x0, ASPM unknown, Latency L0 <64ns, L1 <1us
            ClockPM- Surprise- LLActRep- BwNot-
        LnkCtl: ASPM Disabled; Disabled- Retrain- CommClk-
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed unknown, Width x0, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
    Kernel driver in use: HDA Intel

00:1c.0 PCI bridge: Intel Corporation 5 Series/3400 Series Chipset PCI Express Root Port 1 (rev 06) (prog-if 00 [Normal decode])
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0, Cache Line Size: 64 bytes
    Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
    I/O behind bridge: 0000e000-0000efff
    Memory behind bridge: fe600000-fe6fffff
    Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff
    Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
    BridgeCtl: Parity- SERR- NoISA- VGA- MAbort- >Reset- FastB2B-
        PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
    Capabilities: [40] Express (v2) Root Port (Slot+), MSI 00
        DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
            ExtTag- RBE+ FLReset-
        DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
            RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
            MaxPayload 128 bytes, MaxReadReq 128 bytes
        DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
        LnkCap: Port #1, Speed 2.5GT/s, Width x1, ASPM L0s L1, Latency L0 <256ns, L1 <4us
            ClockPM- Surprise- LLActRep+ BwNot-
        LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt-
        SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
            Slot #0, PowerLimit 10.000W; Interlock- NoCompl+
        SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
            Control: AttnInd Unknown, PwrInd Unknown, Power- Interlock-
        SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
            Changed: MRL- PresDet+ LinkState+
        RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna- CRSVisible-
        RootCap: CRSVisible-
        RootSta: PME ReqID 0000, PMEStatus- PMEPending-
        DevCap2: Completion Timeout: Range BC, TimeoutDis+ ARIFwd-
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- ARIFwd-
        LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
             Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
             Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -6dB
    Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
        Address: fee0300c  Data: 4189
    Capabilities: [90] Subsystem: Intel Corporation 5 Series/3400 Series Chipset PCI Express Root Port 1
    Capabilities: [a0] Power Management version 2
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
        Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
    Kernel driver in use: pcieport

00:1c.1 PCI bridge: Intel Corporation 5 Series/3400 Series Chipset PCI Express Root Port 2 (rev 06) (prog-if 00 [Normal decode])
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0, Cache Line Size: 64 bytes
    Bus: primary=00, secondary=02, subordinate=02, sec-latency=0
    I/O behind bridge: 0000d000-0000dfff
    Memory behind bridge: fe500000-fe5fffff
    Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff
    Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
    BridgeCtl: Parity- SERR- NoISA- VGA- MAbort- >Reset- FastB2B-
        PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
    Capabilities: [40] Express (v2) Root Port (Slot+), MSI 00
        DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
            ExtTag- RBE+ FLReset-
        DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
            RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
            MaxPayload 128 bytes, MaxReadReq 128 bytes
        DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
        LnkCap: Port #2, Speed 2.5GT/s, Width x1, ASPM L0s L1, Latency L0 <256ns, L1 <4us
            ClockPM- Surprise- LLActRep+ BwNot-
        LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt-
        SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
            Slot #1, PowerLimit 10.000W; Interlock- NoCompl+
        SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
            Control: AttnInd Unknown, PwrInd Unknown, Power- Interlock-
        SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
            Changed: MRL- PresDet+ LinkState+
        RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna- CRSVisible-
        RootCap: CRSVisible-
        RootSta: PME ReqID 0000, PMEStatus- PMEPending-
        DevCap2: Completion Timeout: Range BC, TimeoutDis+ ARIFwd-
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- ARIFwd-
        LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
             Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
             Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -6dB
    Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
        Address: fee0300c  Data: 4191
    Capabilities: [90] Subsystem: Intel Corporation 5 Series/3400 Series Chipset PCI Express Root Port 2
    Capabilities: [a0] Power Management version 2
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
        Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
    Kernel driver in use: pcieport

00:1c.2 PCI bridge: Intel Corporation 5 Series/3400 Series Chipset PCI Express Root Port 3 (rev 06) (prog-if 00 [Normal decode])
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0, Cache Line Size: 64 bytes
    Bus: primary=00, secondary=03, subordinate=03, sec-latency=0
    I/O behind bridge: 0000c000-0000cfff
    Memory behind bridge: fe400000-fe4fffff
    Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff
    Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
    BridgeCtl: Parity- SERR- NoISA- VGA- MAbort- >Reset- FastB2B-
        PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
    Capabilities: [40] Express (v2) Root Port (Slot+), MSI 00
        DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
            ExtTag- RBE+ FLReset-
        DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
            RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
            MaxPayload 128 bytes, MaxReadReq 128 bytes
        DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
        LnkCap: Port #3, Speed 2.5GT/s, Width x1, ASPM L0s L1, Latency L0 <256ns, L1 <4us
            ClockPM- Surprise- LLActRep+ BwNot-
        LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt-
        SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
            Slot #2, PowerLimit 10.000W; Interlock- NoCompl+
        SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
            Control: AttnInd Unknown, PwrInd Unknown, Power- Interlock-
        SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
            Changed: MRL- PresDet+ LinkState+
        RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna- CRSVisible-
        RootCap: CRSVisible-
        RootSta: PME ReqID 0000, PMEStatus- PMEPending-
        DevCap2: Completion Timeout: Range BC, TimeoutDis+ ARIFwd-
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- ARIFwd-
        LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
             Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
             Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -6dB
    Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
        Address: fee0300c  Data: 4199
    Capabilities: [90] Subsystem: Intel Corporation 5 Series/3400 Series Chipset PCI Express Root Port 3
    Capabilities: [a0] Power Management version 2
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
        Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
    Kernel driver in use: pcieport

00:1d.0 USB Controller: Intel Corporation 5 Series/3400 Series Chipset USB2 Enhanced Host Controller (rev 06) (prog-if 20 [EHCI])
    Subsystem: Intel Corporation 5 Series/3400 Series Chipset USB2 Enhanced Host Controller
    Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0
    Interrupt: pin A routed to IRQ 23
    Region 0: Memory at fe706000 (32-bit, non-prefetchable) [size=1K]
    Capabilities: [50] Power Management version 2
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
        Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [58] Debug port: BAR=1 offset=00a0
    Capabilities: [98] PCI Advanced Features
        AFCap: TP+ FLR+
        AFCtrl: FLR-
        AFStatus: TP-
    Kernel driver in use: ehci_hcd

00:1e.0 PCI bridge: Intel Corporation 82801 Mobile PCI Bridge (rev a6) (prog-if 01 [Subtractive decode])
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0
    Bus: primary=00, secondary=04, subordinate=04, sec-latency=32
    I/O behind bridge: 0000f000-00000fff
    Memory behind bridge: fff00000-000fffff
    Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff
    Secondary status: 66MHz- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort+ <SERR- <PERR-
    BridgeCtl: Parity- SERR- NoISA- VGA- MAbort- >Reset- FastB2B-
        PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
    Capabilities: [50] Subsystem: Intel Corporation 82801 Mobile PCI Bridge

00:1f.0 ISA bridge: Intel Corporation Mobile 5 Series Chipset LPC Interface Controller (rev 06)
    Subsystem: Intel Corporation Mobile 5 Series Chipset LPC Interface Controller
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0
    Capabilities: [e0] Vendor Specific Information: Len=10 <?>

00:1f.2 IDE interface: Intel Corporation 5 Series/3400 Series Chipset 4 port SATA IDE Controller (rev 06) (prog-if 8a [Master SecP PriP])
    Subsystem: Intel Corporation 5 Series/3400 Series Chipset 4 port SATA IDE Controller
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
    Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0
    Interrupt: pin B routed to IRQ 19
    Region 0: I/O ports at 01f0 [size=8]
    Region 1: I/O ports at 03f4 [size=1]
    Region 2: I/O ports at 0170 [size=8]
    Region 3: I/O ports at 0374 [size=1]
    Region 4: I/O ports at f090 [size=16]
    Region 5: I/O ports at f080 [size=16]
    Capabilities: [70] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [b0] PCI Advanced Features
        AFCap: TP+ FLR+
        AFCtrl: FLR-
        AFStatus: TP-
    Kernel driver in use: ata_piix

00:1f.3 SMBus: Intel Corporation 5 Series/3400 Series Chipset SMBus Controller (rev 06)
    Subsystem: Intel Corporation 5 Series/3400 Series Chipset SMBus Controller
    Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
    Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Interrupt: pin C routed to IRQ 18
    Region 0: Memory at fe705000 (64-bit, non-prefetchable) [size=256]
    Region 4: I/O ports at f000 [size=32]
    Kernel driver in use: i801_smbus

00:1f.5 IDE interface: Intel Corporation 5 Series/3400 Series Chipset 2 port SATA IDE Controller (rev 06) (prog-if 85 [Master SecO PriO])
    Subsystem: Intel Corporation 5 Series/3400 Series Chipset 2 port SATA IDE Controller
    Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
    Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0
    Interrupt: pin B routed to IRQ 19
    Region 0: I/O ports at f070 [size=8]
    Region 1: I/O ports at f060 [size=4]
    Region 2: I/O ports at f050 [size=8]
    Region 3: I/O ports at f040 [size=4]
    Region 4: I/O ports at f030 [size=16]
    Region 5: I/O ports at f020 [size=16]
    Capabilities: [70] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [b0] PCI Advanced Features
        AFCap: TP+ FLR+
        AFCtrl: FLR-
        AFStatus: TP-
    Kernel driver in use: ata_piix

01:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
    Subsystem: Intel Corporation Device 0000
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0, Cache Line Size: 64 bytes
    Interrupt: pin A routed to IRQ 27
    Region 0: Memory at fe600000 (32-bit, non-prefetchable) [size=128K]
    Region 2: I/O ports at e000 [size=32]
    Region 3: Memory at fe620000 (32-bit, non-prefetchable) [size=16K]
    Capabilities: [c8] Power Management version 2
        Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
        Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
    Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Address: 00000000fee0300c  Data: 41c1
    Capabilities: [e0] Express (v1) Endpoint, MSI 00
        DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
            ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
        DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
            RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
            MaxPayload 128 bytes, MaxReadReq 128 bytes
        DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend-
        LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Latency L0 <128ns, L1 <64us
            ClockPM- Surprise- LLActRep- BwNot-
        LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
    Capabilities: [a0] MSI-X: Enable- Count=3 Masked-
        Vector table: BAR=3 offset=00000000
        PBA: BAR=3 offset=00002000
    Kernel driver in use: e1000e

02:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
    Subsystem: Intel Corporation Device 0000
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0, Cache Line Size: 64 bytes
    Interrupt: pin A routed to IRQ 28
    Region 0: Memory at fe500000 (32-bit, non-prefetchable) [size=128K]
    Region 2: I/O ports at d000 [size=32]
    Region 3: Memory at fe520000 (32-bit, non-prefetchable) [size=16K]
    Capabilities: [c8] Power Management version 2
        Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
        Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
    Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Address: 00000000fee0300c  Data: 41d1
    Capabilities: [e0] Express (v1) Endpoint, MSI 00
        DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
            ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
        DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
            RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
            MaxPayload 128 bytes, MaxReadReq 128 bytes
        DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend-
        LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Latency L0 <128ns, L1 <64us
            ClockPM- Surprise- LLActRep- BwNot-
        LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
    Capabilities: [a0] MSI-X: Enable- Count=3 Masked-
        Vector table: BAR=3 offset=00000000
        PBA: BAR=3 offset=00002000
    Kernel driver in use: e1000e

03:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
    Subsystem: Intel Corporation Gigabit CT Desktop Adapter
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0, Cache Line Size: 64 bytes
    Interrupt: pin A routed to IRQ 29
    Region 0: Memory at fe4c0000 (32-bit, non-prefetchable) [size=128K]
    Region 1: Memory at fe400000 (32-bit, non-prefetchable) [size=512K]
    Region 2: I/O ports at c000 [size=32]
    Region 3: Memory at fe4e0000 (32-bit, non-prefetchable) [size=16K]
    Expansion ROM at fe480000 [disabled] [size=256K]
    Capabilities: [c8] Power Management version 2
        Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
        Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
    Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Address: 00000000fee0300c  Data: 41e1
    Capabilities: [e0] Express (v1) Endpoint, MSI 00
        DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
            ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
        DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
            RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
            MaxPayload 128 bytes, MaxReadReq 128 bytes
        DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend-
        LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Latency L0 <128ns, L1 <64us
            ClockPM- Surprise- LLActRep- BwNot-
        LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
    Capabilities: [a0] MSI-X: Enable- Count=5 Masked-
        Vector table: BAR=3 offset=00000000
        PBA: BAR=3 offset=00002000
    Kernel driver in use: e1000e
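
A quick way to confirm from a dump like the one above that ASPM really is off on the NIC links: a sketch that reads `lspci -vvv` output and prints the LnkCtl ASPM state of each Ethernet controller. The function name `nic_aspm` is hypothetical, and the parsing assumes bus addresses without a PCI domain prefix, as in this listing.

```shell
# Print the ASPM state from the LnkCtl line of every Ethernet controller.
# Reads `lspci -vvv` output on stdin. The function name is hypothetical.
nic_aspm() {
    awk '
    # A device header starts at column 0 with bus:device.function.
    /^[0-9a-f][0-9a-f]:[0-9a-f][0-9a-f]\.[0-7] / {
        dev = $1
        nic = ($0 ~ /Ethernet controller/)
    }
    # Inside an Ethernet controller entry, pick out the link control line.
    nic && /LnkCtl:/ {
        s = $0
        sub(/^.*ASPM /, "", s)   # drop everything up to "ASPM "
        sub(/;.*$/, "", s)       # drop everything after the state
        print dev ": ASPM " s
    }'
}
```

Run as `lspci -vvv | nic_aspm`. For the listing above it reports `ASPM Disabled` for 01:00.0, 02:00.0 and 03:00.0.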

Discussion

  • I configured the network interfaces like this:

    # for i in 0 1 2; do ethtool -s eth$i msglvl 6; done
    

    The problem appeared again, but I couldn't find any additional information in system logs. Does the driver write debug logs somewhere?
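
For reference, the `msglvl` value is a bitmask of message classes (the NETIF_MSG_* flags in the kernel's netdevice.h) rather than a verbosity level, and the messages it enables go to the kernel log. A small sketch that decodes a numeric value into the flag names `ethtool` accepts; the helper name `decode_msglvl` is made up for illustration:

```shell
# Decode an ethtool msglvl bitmask into its flag names.
# Bit 0 is "drv", bit 1 "probe", bit 2 "link", and so on, following
# the NETIF_MSG_* order. The function name is hypothetical.
decode_msglvl() {
    mask=$1
    names="drv probe link timer ifdown ifup rx_err tx_err tx_queued intr tx_done rx_status pktdata hw wol"
    bit=1
    out=
    for name in $names; do
        # Keep the name if its bit is set in the mask.
        if [ $((mask & bit)) -ne 0 ]; then
            out="${out:+$out }$name"
        fi
        bit=$((bit << 1))
    done
    printf '%s\n' "$out"
}
```

`decode_msglvl 6` prints `probe link`, so the command above selects only the probe and link message classes.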

     
  • Caught a couple of slightly different instances of the (presumably) same problem. This time the eth1 link, which is 1000 Mbps, loses connection, but is brought back up at 10 Mbps only. It subsequently fails again and remains in the failed state.

    Sep  5 08:19:21 XXX kernel: [   54.215490] e1000e: eth1 NIC Link is Down
    Sep  5 08:19:24 XXX kernel: [   56.672025] e1000e: eth1 NIC Link is Up 10 Mbps Full Duplex, Flow Control: None
    Sep  5 08:19:24 XXX kernel: [   56.672029] e1000e 0000:02:00.0: eth1: 10/100 speed: disabling TSO
    Sep  5 08:19:33 XXX kernel: [   66.134479] e1000e: eth1 NIC Link is Down
    

    Here's one of the cases where there is no "Link is Down" message at all. Eth0 goes down and up within a second. Eth2 has not been "down" at all. Guess it must be pretty high now...

    Sep  5 10:06:00 XXX kernel: [ 6436.017804] e1000e: eth0 NIC Link is Down
    Sep  5 10:06:01 XXX kernel: [ 6437.752106] e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: Rx/Tx
    Sep  5 10:06:01 XXX kernel: [ 6437.752110] e1000e 0000:01:00.0: eth0: 10/100 speed: disabling TSO
    Sep  5 10:06:06 XXX kernel: [ 6442.718259] e1000e: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
    

    Finally, a selection of "ordinary" ones. The problem now occurs three times in 15 min intervals.

    Sep  5 11:37:51 XXX kernel: [11933.359492] e1000e: eth0 NIC Link is Down
    Sep  5 11:37:53 XXX kernel: [11935.073835] e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: Rx/Tx
    Sep  5 11:37:53 XXX kernel: [11935.073839] e1000e 0000:01:00.0: eth0: 10/100 speed: disabling TSO
    
    Sep  5 12:56:28 XXX kernel: [16638.630652] e1000e: eth2 NIC Link is Down
    Sep  5 12:56:31 XXX kernel: [16641.089345] e1000e: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
    
    Sep  5 13:14:56 XXX kernel: [17743.759586] e1000e: eth0 NIC Link is Down
    Sep  5 13:14:58 XXX kernel: [17744.852868] e1000e: eth2 NIC Link is Down
    Sep  5 13:14:58 XXX kernel: [17745.553743] e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: Rx/Tx
    Sep  5 13:14:58 XXX kernel: [17745.553746] e1000e 0000:01:00.0: eth0: 10/100 speed: disabling TSO
    Sep  5 13:15:00 XXX kernel: [17747.379537] e1000e: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
    
     
    Last edit: Topi Mäenpää 2012-09-05
    • assigned_to: Carolyn Wyborny
     
  • I was pretty sure we had responded to you, but I apologize for the delay as I don't see any updates from us here.

    Have you gotten any logs after setting the message level higher? It's difficult to see what might be happening without more data. I don't believe your problem is the power management because only a reboot will bring those devices back online.

    When you get more logs, can you attach the entire log?

    Also, it looks like the message box truncated your lspci output. Can you attach the full output so I can fully review the 82574 entries with the rest?

     
    • Please see above for the log issue. I set msglvl to 6 on all interfaces but couldn't see anything more written to the logs. Maybe I was looking in the wrong place. Where does the driver write debug logs?

       
      Last edit: Topi Mäenpää 2012-09-07
  • I hate this buggy tracker. I've already lost my edits twice. On the
    plus side the page numbering is hilarious: "Page 1.0 of 1.08".

    Anyway, the output of lspci -vvv is now attached.

    I also took a brave look at the source code. I know next to nothing
    about kernel programming, but figured out the following.

    The "Link is Down" message is in e1000_watchdog_task, which seems to
    be scheduled by a timer through the e1000_watchdog interrupt handler.
    The only way one can get to the message is when

    1) e1000e_has_link returns false AND
    2) netif_carrier_ok returns true

    "Carrier" sounds like an old dial-up modem to me, but I assume a
    successful return from the function means the copper is fine. If so,
    I'd be tempted to conclude that the connection was broken on a higher
    layer ("link"). I believe there is no way the physical connection can
    break in my case. If my inference is right, the driver agrees.

    e1000e_has_link checks if hw->mac.get_link_status is non-zero (which
    seems to indicate a failure). If it is, the hw->mac.ops.check_for_link
    function pointer will be called. I didn't try to figure out which
    function actually gets invoked, but I assume it will set
    get_link_status to zero if the link is OK. In the failure situation,
    the link is not OK.

    The next question is why get_link_status turns to a non-zero value?
    There are four places where this happens in the code. Three times in
    netdev.c: e1000_intr, e1000_intr_msi and e1000_msix_other. Interrupt
    handlers, I presume. In all cases the rule that triggers the change is
    "if (icr & E1000_ICR_LSC)". In phy.c, e1000_copper_link_autoneg sets
    the flag to true at the end. I assume one of the interrupt handlers is
    the culprit. And since disabling MSI-X had no effect, all of them
    seem to fail in a similar way.

    Now, I hit a dead end. The er32(ICR) call in the interrupt handlers
    seems to pick the E1000_ICR_LSC (link status change) flag, but I can
    find no place where it is actually ORed to any bit field. An obvious
    answer is that it is done by the hardware. But is it so? And if yes,
    for what reason can this happen?

    One idea that came to mind is to speed up recovery by increasing the
    frequency of the watchdog timer. Would this have side effects? What is
    the current frequency and where is it set?
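
The chain of conditions described above can be written down as a small toy model. This is not the driver code: the class and method names are invented for illustration, the value 0x4 for the LSC bit is an assumption, `phy_link_up` is a stand-in for whatever check_for_link actually reads from the hardware, and the real functions do considerably more.

```python
ICR_LSC = 0x00000004  # assumed value of E1000_ICR_LSC (link status change)


class ToyAdapter:
    """Toy model of the watchdog logic as read from the source above."""

    def __init__(self):
        self.get_link_status = False  # True means "link must be re-checked"
        self.carrier_ok = True        # link state recorded by the last watchdog run
        self.phy_link_up = True       # hypothetical stand-in for the hardware state
        self.log = []

    def interrupt(self, icr):
        # The interrupt handlers only flag that the link needs re-checking.
        if icr & ICR_LSC:
            self.get_link_status = True

    def has_link(self):
        # Model of e1000e_has_link: re-check only when the flag is set.
        if self.get_link_status:
            # Stand-in for check_for_link: clear the flag only if the link is OK.
            if self.phy_link_up:
                self.get_link_status = False
            return not self.get_link_status
        return True

    def watchdog(self):
        link = self.has_link()
        if link and not self.carrier_ok:
            self.carrier_ok = True
            self.log.append("Link is Up")
        elif not link and self.carrier_ok:
            self.carrier_ok = False
            self.log.append("Link is Down")


adapter = ToyAdapter()
adapter.phy_link_up = False   # the link drops...
adapter.interrupt(ICR_LSC)    # ...and the hardware raises a link status change
adapter.watchdog()            # the next watchdog run logs "Link is Down"
adapter.phy_link_up = True
adapter.watchdog()            # recovery is noticed one watchdog run later
print(adapter.log)
```

In this model `carrier_ok` is simply the state recorded at the previous watchdog run, so "Link is Down" marks a transition seen by the driver and does not by itself say which layer failed. It also shows why the watchdog period bounds how quickly a recovery is noticed.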

     
    Attachments
  • Thanks for your review. It does look like either the whole device is going down and up or just the link, but it's difficult to tell without more messages. The code you are looking at with the ICR register is enabling the interrupt on link status changes. It's a hardware setting. All of this is because of the link change, but like I said, we're still not sure if the whole device is resetting or just the link. Can you try attaching a full dmesg log? And also the /var/log/messages file. The extra messaging should appear in the dmesg log, but some other system messages only write to the system log at /var/log/messages. If you attach a full copy of them, I'll dig through and see if I can find anything else to explain what you're seeing. One more thing: are you using Network Manager? If so, as a test, after getting logs to attach here, can you try: service NetworkManager stop and see if that changes the problem in any way?

     
    • The tracker lost my edits again. Shame on you, SF. One could expect
      that a bug tracker would not have such a huge amount of fatal bugs.

      Complaints aside, the dmesg log is attached. The problem appeared
      again, but the customer rebooted the machine so quickly that I couldn't
      get the dmesg log. Now I made it persistent, but we need to wait for
      the next failure. /var/log/messages doesn't contain anything that was
      not known before. Here's the relevant portion:

      Sep 11 04:51:48 XXX kernel: [324434.734941] e1000e: eth0 NIC Link is Down
      Sep 11 04:51:49 XXX kernel: [324435.401286] e1000e: eth2 NIC Link is Down
      Sep 11 04:51:49 XXX kernel: [324435.816293] e1000e: eth1 NIC Link is Down
      Sep 11 04:51:50 XXX kernel: [324436.584956] e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: Rx/Tx
      Sep 11 04:51:50 XXX kernel: [324436.584959] e1000e 0000:01:00.0: eth0: 10/100 speed: disabling TSO
      Sep 11 04:51:52 XXX kernel: [324438.402729] e1000e: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
      Sep 11 04:51:53 XXX kernel: [324439.332570] e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
      Sep 11 04:51:53 XXX dhcpd: Wrote 3 leases to leases file.
      Sep 11 04:51:53 XXX dhcpd: DHCPREQUEST for 10.3.14.162 from c8:60:00:44:07:ef (Hostname Unsuitable for Printing) via eth1
      Sep 11 04:51:53 XXX dhcpd: DHCPACK on 10.3.14.162 to c8:60:00:44:07:ef (Hostname Unsuitable for Printing) via eth1
      Sep 11 04:51:57 XXX dhcpd: DHCPINFORM from 10.3.14.162 via eth1
      Sep 11 04:51:57 XXX dhcpd: DHCPACK to 10.3.14.162 (c8:60:00:44:07:ef) via eth1
      Sep 11 05:06:13 XXX dhcpd: DHCPINFORM from 10.3.14.162 via eth1
      Sep 11 05:06:13 XXX dhcpd: DHCPACK to 10.3.14.162 (c8:60:00:44:07:ef) via eth1
      Sep 11 05:35:23 XXX dhcpd: DHCPINFORM from 10.3.14.162 via eth1
      Sep 11 05:35:23 XXX dhcpd: DHCPACK to 10.3.14.162 (c8:60:00:44:07:ef) via eth1
      Sep 11 05:37:16 XXX dhcpd: DHCPINFORM from 10.3.14.162 via eth1
      Sep 11 05:37:16 XXX dhcpd: DHCPACK to 10.3.14.162 (c8:60:00:44:07:ef) via eth1
      Sep 11 06:25:01 XXX rsyslogd: [origin software="rsyslogd" swVersion="4.6.4" x-pid="1124" x-info="http://www.rsyslog.com"] rsyslogd was HUPed, type 'lightweight'.
      Sep 11 06:35:23 XXX dhcpd: DHCPINFORM from 10.3.14.162 via eth1
      Sep 11 06:35:23 XXX dhcpd: DHCPACK to 10.3.14.162 (c8:60:00:44:07:ef) via eth1
      Sep 11 06:37:16 XXX dhcpd: DHCPINFORM from 10.3.14.162 via eth1
      Sep 11 06:37:16 XXX dhcpd: DHCPACK to 10.3.14.162 (c8:60:00:44:07:ef) via eth1
      Sep 11 07:01:43 XXX kernel: [332210.763028] e1000e: eth1 NIC Link is Down
      Sep 11 07:01:46 XXX kernel: [332213.215587] e1000e: eth1 NIC Link is Up 10 Mbps Full Duplex, Flow Control: None
      Sep 11 07:01:46 XXX kernel: [332213.215591] e1000e 0000:02:00.0: eth1: 10/100 speed: disabling TSO
      

      One thing to note is that eth1 is brought up as 10 Mbps only (it is a
      gigabit link). It is the second time this happens.
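
Spotting such downgrades by eye gets tedious in long logs. A small sketch that flags every link-up at less than the expected speed follows. The function name and the expected-speed table are assumptions for this example (the speeds are the ones seen in the logs in this thread), not anything the driver provides.

```python
import re

UP_RE = re.compile(r"e1000e: (\w+) NIC Link is Up (\d+) Mbps")

# Assumed normal speeds in Mbps, based on the logs in this thread.
EXPECTED_MBPS = {"eth0": 100, "eth1": 1000, "eth2": 1000}


def speed_downgrades(lines, expected=EXPECTED_MBPS):
    """Return (interface, negotiated_mbps, expected_mbps) for each slow link-up."""
    found = []
    for line in lines:
        m = UP_RE.search(line)
        if not m:
            continue
        iface, mbps = m.group(1), int(m.group(2))
        want = expected.get(iface)
        if want is not None and mbps < want:
            found.append((iface, mbps, want))
    return found


SAMPLE = [
    "Sep 11 04:51:53 XXX kernel: [324439.332570] e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx",
    "Sep 11 07:01:43 XXX kernel: [332210.763028] e1000e: eth1 NIC Link is Down",
    "Sep 11 07:01:46 XXX kernel: [332213.215587] e1000e: eth1 NIC Link is Up 10 Mbps Full Duplex, Flow Control: None",
    "Sep 11 07:01:46 XXX kernel: [332213.215591] e1000e 0000:02:00.0: eth1: 10/100 speed: disabling TSO",
]

for iface, got, want in speed_downgrades(SAMPLE):
    print(f"{iface} came up at {got} Mbps, expected {want} Mbps")
```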

      We are not using NetworkManager.

       
    • This time it didn't take long. All interfaces have msglvl set to six, but I cannot see any difference in log outputs. Here's what gets written to dmesg:

      [15932.709469] e1000e 0000:03:00.0: eth2: Reset adapter
      [15935.886252] e1000e: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
      [16503.871040] e1000e: eth0 NIC Link is Down
      [16505.545508] e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: Rx/Tx
      [16505.545512] e1000e 0000:01:00.0: eth0: 10/100 speed: disabling TSO
      [20365.820405] e1000e: eth0 NIC Link is Down
      [20367.694371] e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: Rx/Tx
      [20367.694374] e1000e 0000:01:00.0: eth0: 10/100 speed: disabling TSO
      [20368.725294] e1000e 0000:03:00.0: eth2: Reset adapter
      [20371.918123] e1000e: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
      [20423.441753] e1000e: eth2 NIC Link is Down
      [20423.450329] e1000e 0000:03:00.0: eth2: Reset adapter
      [20425.920472] e1000e: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
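
To help separate full adapter resets from plain link flaps, the events can be tallied per interface. This is only a sketch: the function name and regular expressions are made up for this example and assume the two message formats seen in the dmesg output above.

```python
import re
from collections import Counter

LINK_RE = re.compile(r"e1000e: (\w+) NIC Link is (Down|Up)")
RESET_RE = re.compile(r"e1000e [0-9a-f:.]+: (\w+): Reset adapter")


def count_events(lines):
    """Count 'down', 'up' and 'reset' events, keyed by (interface, kind)."""
    counts = Counter()
    for line in lines:
        m = LINK_RE.search(line)
        if m:
            counts[(m.group(1), m.group(2).lower())] += 1
            continue
        m = RESET_RE.search(line)
        if m:
            counts[(m.group(1), "reset")] += 1
    return counts


SAMPLE = """\
[15932.709469] e1000e 0000:03:00.0: eth2: Reset adapter
[15935.886252] e1000e: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
[16503.871040] e1000e: eth0 NIC Link is Down
[16505.545508] e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: Rx/Tx
[16505.545512] e1000e 0000:01:00.0: eth0: 10/100 speed: disabling TSO
[20365.820405] e1000e: eth0 NIC Link is Down
[20367.694371] e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: Rx/Tx
[20367.694374] e1000e 0000:01:00.0: eth0: 10/100 speed: disabling TSO
[20368.725294] e1000e 0000:03:00.0: eth2: Reset adapter
[20371.918123] e1000e: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
[20423.441753] e1000e: eth2 NIC Link is Down
[20423.450329] e1000e 0000:03:00.0: eth2: Reset adapter
[20425.920472] e1000e: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
""".splitlines()

for (iface, kind), n in sorted(count_events(SAMPLE).items()):
    print(iface, kind, n)
```

On this excerpt the tally gives eth2 three resets but only one "Link is Down", while eth0 flaps twice with no reset logged at all, so both kinds of event seem to occur.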
      
       
  • It seems that the dmesg file I attached vanished. I swear I did attach it. Another try...

     
    Attachments
  • Any ideas on this? It's been a week since I sent the logs.

    You asked for more information to find out whether it is the whole device that is going down or just the link. Doesn't the following line from the original bug report tell anything about this?

    Aug 19 10:58:37 XXXX kernel: [10928.306331] e1000e 0000:01:00.0: eth0: Reset adapter
    

    Solving this problem is of extreme importance to us. If we don't get it closed within a couple of weeks, we need to ship new hardware to many customers and reinstall all systems delivered so far. I welcome ANY suggestion. What can possibly cause such behavior? Could it be hardware (cabling, power)? If it is in the driver, what are the possible causes? If the failure cannot be prevented, can I speed up recovery somehow?

     
  • We're trying to rule out cabling issues. So far there seems to be no
    difference in using either cat 5e or cat 6 cables. Both fail in the
    same way.

    Is it possible that electromagnetic interference in one cable could
    cause another network interface to go down? The reason why I'm asking
    this is because one of the cables we are using is only 20 cm long and
    fully enclosed in a metal box. The interface it is connected to runs
    at 100Mbps only. Still, it may go down just like the other interfaces.

    The other interfaces have rather long cables, but nothing extreme.
    Given the industrial environment they could however be a source of
    EMI. The problem with this theory is that we can reproduce the problem
    in an office environment with cat 6 cables. But anyway: are driver
    instances isolated or can they cause trouble to each other?

     
    • "The problem with this theory" no longer exists. Although we could reproduce a similar problem in the office, the cause turned out to be at the receiving end. But the question remains: could it be possible that a failure in one interface could bring another interface down?

       
      Last edit: Topi Mäenpää 2012-09-20
      • Things are getting interesting. Less than an hour after the last update we finally caught the failure in our test system. It was the 100 Mbps link that went down. Just for a second, but anyway. It is only one of the interfaces, and may again be caused by the receiver, but it looks an awful lot like the ones in production systems. This one is definitely not caused by EMI or grounding issues.

        Sep 20 09:35:09 XXX kernel: [83859.878841] e1000e: eth0 NIC Link is Down
        Sep 20 09:35:10 XXX kernel: [83861.579691] e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: Rx/Tx
        Sep 20 09:35:10 XXX kernel: [83861.579695] e1000e 0000:01:00.0: eth0: 10/100 speed: disabling TSO
        
         
  • We still do not have a definite explanation, but it seems to be the Lanner PC that is causing the trouble. It seems to be extremely sensitive to disturbances in the power supply. We could reproduce the issue in the office by starting a fan about three meters away. I'm becoming convinced this wasn't a software issue after all, but I'd still like to leave this bug open until we repeat the tests with another computer.

     
  • We finally got another PC with about the same chipsets. It works fine in the same set-up. Therefore, I'm pretty sure this was a hardware issue after all. Thanks for the help anyway.

     
    • status: open --> closed