Thread: Re: [Linuxptp-users] [E1000-devel] Need help with igb driver suspend crash issue
PTP IEEE 1588 stack for Linux
Brought to you by:
rcochran
From: Keller, J. E <jac...@in...> - 2016-05-11 16:26:01
|
Hi, > -----Original Message----- > From: vidya sagar [mailto:sag...@gm...] > Sent: Wednesday, May 11, 2016 3:25 AM > To: e10...@li...; linuxptp- > us...@li... > Subject: Re: [E1000-devel] Need help with igb driver suspend crash issue > > <<< Including lin...@li... >>> > > On Wed, May 11, 2016 at 3:51 PM, vidya sagar <sag...@gm...> > wrote: > > > Hi, > > I'm using Intel IGB I350 NIC card on one of our arm based platforms. > > While suspending the system, sometimes we see "igb 0000:01:00.0 eth1: > PCIe > > link lost, device now detached" print in the log and subsequent resume > > causes system to crash. After digging the code (BTW, I'm using kernel-3.18 > > release), it looks like the above print comes because of the following call > > flow, which got executed after igb_suspend() is called ( I confirmed this > > with the help of prints) > > > > [10846.434381] [<ffffffc000089ce4>] dump_backtrace+0x0/0xf8 > > [10846.434386] [<ffffffc000089ea0>] show_stack+0x10/0x1c > > [10846.434393] [<ffffffc000bc3b70>] dump_stack+0x80/0xc4 > > [10846.434397] [<ffffffc000613d3c>] igb_rd32+0xb0/0x1a8 > > [10846.434400] [<ffffffc00062eb0c>] igb_ptp_read_82580+0x18/0x48 > > [10846.434407] [<ffffffc000106e6c>] timecounter_read+0x1c/0x60 > > [10846.434410] [<ffffffc00062f338>] igb_ptp_gettime_82576+0x2c/0x88 > > [10846.434413] [<ffffffc00062f41c>] igb_ptp_overflow_check+0x1c/0x58 > > [10846.434419] [<ffffffc0000ba584>] process_one_work+0x154/0x414 > > [10846.434424] [<ffffffc0000bb338>] worker_thread+0x13c/0x4e4 > > [10846.434428] [<ffffffc0000bfc4c>] kthread+0xf8/0x110 > > > > It looks like reading timer registers would have returned all F's as the > > device is already in D3Hot state. > > Is my understanding correct. Is there any patch available to fix this > > issue? > > Let me know if more information is needed. > > Maybe an ordering bug when doing suspend that we try to read things too late. Is that stack trace the actual crash or did you add the dump_stack yourself? Thanks, Jake > > Thanks, > > Vidya Sagar > > |
From: Keller, J. E <jac...@in...> - 2016-05-11 17:52:28
|
From: vidya sagar [mailto:sag...@gm...] Sent: Wednesday, May 11, 2016 10:44 AM To: Keller, Jacob E <jac...@in...> Cc: e10...@li...; lin...@li... Subject: Re: [E1000-devel] Need help with igb driver suspend crash issue I added the dump_stack() to see the full flow as the error print "PCIe link lost, device now detached" is there as part of igb_rd32() API which is called at many places. Are we not supposed to 'cancel delayed work of igb_ptp_overflow_check() when system goes to suspend state (and schedule when system resumes)? On Wed, May 11, 2016 at 9:54 PM, Keller, Jacob E <jac...@in...<mailto:jac...@in...>> wrote: Hi, > -----Original Message----- > From: vidya sagar [mailto:sag...@gm...<mailto:sag...@gm...>] > Sent: Wednesday, May 11, 2016 3:25 AM > To: e10...@li...<mailto:e10...@li...>; linuxptp- > us...@li...<mailto:us...@li...> > Subject: Re: [E1000-devel] Need help with igb driver suspend crash issue > > <<< Including lin...@li...<mailto:lin...@li...> >>> > > On Wed, May 11, 2016 at 3:51 PM, vidya sagar <sag...@gm...<mailto:sag...@gm...>> > wrote: > > > Hi, > > I'm using Intel IGB I350 NIC card on one of our arm based platforms. > > While suspending the system, sometimes we see "igb 0000:01:00.0 eth1: > PCIe > > link lost, device now detached" print in the log and subsequent resume > > causes system to crash. After digging the code (BTW, I'm using kernel-3.18 > > release), it looks like the above print comes because of the following call > > flow, which got executed after igb_suspend() is called ( I confirmed this > > with the help of prints) > > > > [10846.434381] [<ffffffc000089ce4>] dump_backtrace+0x0/0xf8 > > [10846.434386] [<ffffffc000089ea0>] show_stack+0x10/0x1c > > [10846.434393] [<ffffffc000bc3b70>] dump_stack+0x80/0xc4 > > [10846.434397] [<ffffffc000613d3c>] igb_rd32+0xb0/0x1a8 > > [10846.434400] [<ffffffc00062eb0c>] igb_ptp_read_82580+0x18/0x48 > > [10846.434407] [<ffffffc000106e6c>] timecounter_read+0x1c/0x60 > > [10846.434410] [<ffffffc00062f338>] igb_ptp_gettime_82576+0x2c/0x88 > > [10846.434413] [<ffffffc00062f41c>] igb_ptp_overflow_check+0x1c/0x58 > > [10846.434419] [<ffffffc0000ba584>] process_one_work+0x154/0x414 > > [10846.434424] [<ffffffc0000bb338>] worker_thread+0x13c/0x4e4 > > [10846.434428] [<ffffffc0000bfc4c>] kthread+0xf8/0x110 > > > > It looks like reading timer registers would have returned all F's as the > > device is already in D3Hot state. > > Is my understanding correct. Is there any patch available to fix this > > issue? > > Let me know if more information is needed. > > Maybe an ordering bug when doing suspend that we try to read things too late. Is that stack trace the actual crash or did you add the dump_stack yourself? Thanks, Jake > > Thanks, > > Vidya Sagar > > |
From: Keller, J. E <jac...@in...> - 2016-05-11 17:58:02
|
On Wed, 2016-05-11 at 23:13 +0530, vidya sagar wrote: > I added the dump_stack() to see the full flow as the error print > "PCIe link lost, device now detached" is there as part of igb_rd32() > API which is called at many places. Are we not supposed to 'cancel > delayed work of igb_ptp_overflow_check() when system goes to suspend > state (and schedule when system resumes)? It is probably because we're triggering a surprise removal event which we shouldn't be doing. It likely happens because of the suspend which puts is part way through this state, and we need to disable the overflow check and re-enable it at a resume time, yep. Thanks, Jake |