From: zhuyj <zyj...@gm...> - 2013-08-19 05:57:50
Hi all,

I am using Linux kernel 3.4.34. Many tests involving network operations, such as MTU changes, NIC up/down, and multi-queue creation, are running on this Linux host. The host runs on vSphere and has 5 NICs, all of them e1000 (Intel Corporation 82545EM Gigabit Ethernet Controller (Copper) (rev 01), PCI ID 8086:100f). The e1000 driver version is 7.3.21-k8-NAPI.

Before the issue occurs, there are always many "Reset adapter" messages, such as:

/************************/
e1000 0000:02:01.0: eth1: Reset adapter
/************************/

When the problem happened, the following messages appeared:

/*****************************************************/
Jul 6 19:08:28 localhost kernel: e1000 0000:02:08.0: eth7: e1000_reinit_safe set __E1000_RESETTING
Jul 6 19:08:28 localhost kernel: e1000 0000:02:08.0: eth7: e1000_reinit_safe take adapter's mutex
Jul 6 19:08:28 localhost kernel: e1000 0000:02:08.0: eth7: e1000_watchdog take adapter's mutex
Jul 6 19:08:28 localhost kernel: e1000 0000:02:03.0: eth3: e1000_reinit_safe release adapter's mutex
Jul 6 19:08:28 localhost kernel: e1000 0000:02:03.0: eth3: e1000_reinit_safe reset __E1000_RESETTING
/*****************************************************/

I analyzed the source code. There is a window between setting __E1000_RESETTING and setting __E1000_DOWN. When e1000_reinit_safe has set __E1000_RESETTING and taken the adapter's mutex, but has not yet set __E1000_DOWN, e1000_watchdog can be scheduled and takes the adapter's mutex. e1000_reinit_safe then shuts down the NIC while e1000_watchdog is still processing, and the e1000 NIC hangs.

My solution is to prevent e1000_watchdog from being scheduled in the window between __E1000_RESETTING and __E1000_DOWN.

Is there anything wrong with this solution?

Best Regards!
zhuyj

On 08/15/2013 03:09 PM, zhuyj wrote:
> Hi, maintainer
>
> Would you like to comment on this patch?
> Thanks a lot.
>
> Best Regards!
> Zhu Yanjun
>
> On 08/15/2013 03:01 PM, zhuyj wrote:
>> Hi,
>>
>> After a networking test case has been running for a long time, the
>> e1000 NIC driver may stop working. At that point the system itself is
>> okay: we can execute non-network commands (such as ls, cp, etc.), but
>> if we execute a network command (ifconfig), it hangs and never returns.
>> We added some logging to the driver and found this was caused by
>> nested mutex acquisition. Normally the mutex is taken and then
>> released before it is taken again, but when the issue occurs, the log
>> shows that the mutex was taken, not released, and then taken again:
>>
>> /*****************************************************/
>> Jul 6 19:08:28 localhost kernel: e1000 0000:02:08.0: eth7: e1000_reinit_safe set __E1000_RESETTING
>> Jul 6 19:08:28 localhost kernel: e1000 0000:02:08.0: eth7: e1000_reinit_safe take adapter's mutex
>> Jul 6 19:08:28 localhost kernel: e1000 0000:02:08.0: eth7: e1000_watchdog take adapter's mutex
>> Jul 6 19:08:28 localhost kernel: e1000 0000:02:03.0: eth3: e1000_reinit_safe release adapter's mutex
>> Jul 6 19:08:28 localhost kernel: e1000 0000:02:03.0: eth3: e1000_reinit_safe reset __E1000_RESETTING
>> /*****************************************************/
>>
>> We made the following patch and applied it. The problem disappeared.
>> Please comment on this patch.
>> Thanks a lot.
>>
>> /***********************************************/
>> diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
>> index 7569ebb..2878308 100644
>> --- a/drivers/net/ethernet/intel/e1000/e1000_main.c
>> +++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
>> @@ -2441,7 +2441,8 @@ static void e1000_watchdog(struct work_struct *work)
>>  	struct e1000_tx_ring *txdr = adapter->tx_ring;
>>  	u32 link, tctl;
>>
>> -	if (test_bit(__E1000_DOWN, &adapter->flags))
>> +	if (test_bit(__E1000_DOWN, &adapter->flags) ||
>> +	    test_bit(__E1000_RESETTING, &adapter->flags))
>>  		return;
>>
>> /***********************************************/
>>
>> zhuyj
>
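
To make the window described at the top of this message easier to see, here is a minimal user-space sketch of the two early-exit checks, before and after the patch. It is not driver code: the `sketch_` names, the plain `unsigned int` flag word, and the two predicate functions are assumptions made only for illustration, and the sketch models just the flag test, not the mutex or the workqueue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-ins for the driver's state bits. The real driver keeps
 * these in adapter->flags and tests them with test_bit(); the values and
 * names below are hypothetical. */
enum sketch_state_bit {
	SKETCH_RESETTING = 1u << 0,	/* models __E1000_RESETTING */
	SKETCH_DOWN      = 1u << 1,	/* models __E1000_DOWN */
};

struct sketch_adapter {
	unsigned int flags;
};

/* Check before the patch: the watchdog returns early only when DOWN is set. */
static bool watchdog_runs_before_patch(const struct sketch_adapter *a)
{
	assert(a != NULL);
	return !(a->flags & SKETCH_DOWN);
}

/* Check after the patch: the watchdog also returns early while RESETTING
 * is set, which covers the window before DOWN has been set. */
static bool watchdog_runs_after_patch(const struct sketch_adapter *a)
{
	assert(a != NULL);
	return !(a->flags & (SKETCH_DOWN | SKETCH_RESETTING));
}
```

With only the RESETTING bit set (the state right after e1000_reinit_safe sets __E1000_RESETTING in the log above), the first predicate returns true and the second returns false. That difference is the window the patch is meant to close.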