Re: [E1000-devel] e1000 nic hang after a long time running


Hi, all

I am running Linux kernel 3.4.34. A large number of tests, including many 
network operations such as MTU changes, NIC up/down, and multi-queue 
creation, are running on this Linux host. The host runs on vSphere and 
has 5 NICs, all of them e1000 (Intel Corporation 82545EM Gigabit 
Ethernet Controller (Copper) (rev 01), PCI ID 8086:100f). 
The e1000 driver version is 7.3.21-k8-NAPI.
Before the issue occurs, many "Reset adapter" messages are always printed, such as:
/************************/
e1000 0000:02:01.0: eth1: Reset adapter
/************************/

When this problem happened, the following messages appeared.

/*****************************************************/
Jul  6 19:08:28 localhost kernel: e1000 0000:02:08.0: eth7: e1000_reinit_safe set __E1000_RESETTING
Jul  6 19:08:28 localhost kernel: e1000 0000:02:08.0: eth7: e1000_reinit_safe take adapter's mutex
Jul  6 19:08:28 localhost kernel: e1000 0000:02:08.0: eth7: e1000_watchdog take adapter's mutex
Jul  6 19:08:28 localhost kernel: e1000 0000:02:03.0: eth3: e1000_reinit_safe release adapter's mutex
Jul  6 19:08:28 localhost kernel: e1000 0000:02:03.0: eth3: e1000_reinit_safe reset __E1000_RESETTING
/*****************************************************/

I analyzed the source code. There is a window between setting 
__E1000_RESETTING and setting __E1000_DOWN.

If e1000_reinit_safe has set __E1000_RESETTING and taken the adapter's 
mutex but has not yet set __E1000_DOWN, e1000_watchdog can still be 
scheduled and take the adapter's mutex. e1000_reinit_safe then shuts 
down the NIC while e1000_watchdog is still processing, and the e1000 
NIC hangs.

My solution is to prevent e1000_watchdog from running in this window 
between __E1000_RESETTING and __E1000_DOWN.
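
To make the intended check concrete, here is a minimal userspace sketch of 
the guard. It is not driver code: every model_* name, the bit values, and 
the helper functions are hypothetical stand-ins, and it only exercises the 
flag logic in a single thread. It does not model the mutex or the 
workqueue, so it does not prove that the race is closed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins for the driver's state bits. The names
 * echo __E1000_RESETTING and __E1000_DOWN, but the bit values and helpers
 * below are assumptions made for illustration only. */
enum model_state_bit {
	MODEL_RESETTING = 0,
	MODEL_DOWN = 1,
};

struct model_adapter {
	unsigned long flags;
};

static bool model_test_bit(int bit, const unsigned long *flags)
{
	return ((*flags >> bit) & 1UL) != 0;
}

static void model_set_bit(int bit, unsigned long *flags)
{
	*flags |= 1UL << bit;
}

static void model_clear_bit(int bit, unsigned long *flags)
{
	*flags &= ~(1UL << bit);
}

/* Mirrors the proposed early-return test: the watchdog body may run only
 * when neither DOWN nor RESETTING is set. */
static bool model_watchdog_may_run(const struct model_adapter *adapter)
{
	if (model_test_bit(MODEL_DOWN, &adapter->flags) ||
	    model_test_bit(MODEL_RESETTING, &adapter->flags))
		return false;
	return true;
}

/* Steps through the flag sequence of a reset and checks the guard at
 * each step. */
static void model_self_check(void)
{
	struct model_adapter adapter = { .flags = 0 };

	assert(model_watchdog_may_run(&adapter));   /* idle: allowed */

	model_set_bit(MODEL_RESETTING, &adapter.flags);
	assert(!model_watchdog_may_run(&adapter));  /* in the window: blocked */

	model_set_bit(MODEL_DOWN, &adapter.flags);
	assert(!model_watchdog_may_run(&adapter));  /* down: blocked */

	model_clear_bit(MODEL_DOWN, &adapter.flags);
	model_clear_bit(MODEL_RESETTING, &adapter.flags);
	assert(model_watchdog_may_run(&adapter));   /* reset finished: allowed */
}
```

Calling model_self_check() from a small main() runs the four steps. The 
second assertion is the case that checking only the DOWN bit would let 
through.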

Is there anything wrong with this solution?

Best Regards!
zhuyj

On 08/15/2013 03:09 PM, zhuyj wrote:
> Hi, maintainer
>
> Would you like to comment on this patch?
> Thanks a lot.
>
> Best Regards!
> Zhu Yanjun
>
> On 08/15/2013 03:01 PM, zhuyj wrote:
>> Hi,
>>
>> After the networking test cases have run for a long time, the e1000 
>> NIC driver may stop working. At that point the system itself is fine: 
>> we can execute non-network commands (such as ls, cp, etc.), but if we 
>> execute a network command (ifconfig), the system hangs there and never 
>> responds.
>> We added some logging to the driver and found that this is caused by 
>> nested mutex acquisition. Normally, one mutex is taken and released 
>> before another is taken, but when the issue occurs, the log shows that 
>> the first mutex was taken and not released before the mutex was taken 
>> again:
>>
>> /*****************************************************/
>> Jul  6 19:08:28 localhost kernel: e1000 0000:02:08.0: eth7: e1000_reinit_safe set __E1000_RESETTING
>> Jul  6 19:08:28 localhost kernel: e1000 0000:02:08.0: eth7: e1000_reinit_safe take adapter's mutex
>> Jul  6 19:08:28 localhost kernel: e1000 0000:02:08.0: eth7: e1000_watchdog take adapter's mutex
>> Jul  6 19:08:28 localhost kernel: e1000 0000:02:03.0: eth3: e1000_reinit_safe release adapter's mutex
>> Jul  6 19:08:28 localhost kernel: e1000 0000:02:03.0: eth3: e1000_reinit_safe reset __E1000_RESETTING
>> /*****************************************************/
>>
>> We made the following patch and applied this patch. This problem 
>> disappeared.
>> Please comment on this patch.
>> Thanks a lot.
>>
>> /***********************************************/
>> diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
>> index 7569ebb..2878308 100644
>> --- a/drivers/net/ethernet/intel/e1000/e1000_main.c
>> +++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
>> @@ -2441,7 +2441,8 @@ static void e1000_watchdog(struct work_struct *work)
>>                struct e1000_tx_ring *txdr = adapter->tx_ring;
>>                u32 link, tctl;
>>
>> -              if (test_bit(__E1000_DOWN, &adapter->flags))
>> +             if (test_bit(__E1000_DOWN, &adapter->flags) ||
>> +                             test_bit(__E1000_RESETTING, &adapter->flags))
>>                                return;
>>
>> /***********************************************/
>>
>> zhuyj
>
