From: David W. <dw...@in...> - 2010-05-26 11:16:36
|
I've had this crash reported to me... [18842.727906] EIP: [<e082f490>] br2684_push+0x19/0x234 [br2684] SS:ESP 0068:dfb89d14 [18845.090712] [<c13ecff3>] ? do_page_fault+0x0/0x2e1 [18845.120042] [<e082f490>] ? br2684_push+0x19/0x234 [br2684] [18845.153530] [<e084fa13>] solos_bh+0x28b/0x7c8 [solos_pci] [18845.186488] [<e084f711>] ? solos_irq+0x2d/0x51 [solos_pci] [18845.219960] [<c100387b>] ? handle_irq+0x3b/0x48 [18845.247732] [<c10265cb>] ? irq_exit+0x34/0x57 [18845.274437] [<c1025720>] tasklet_action+0x42/0x69 [18845.303247] [<c102643f>] __do_softirq+0x8e/0x129 [18845.331540] [<c10264ff>] do_softirq+0x25/0x2a [18845.358274] [<c102664c>] _local_bh_enable_ip+0x5e/0x6a [18845.389677] [<c102666d>] local_bh_enable+0xb/0xe [18845.417944] [<e08490a8>] ppp_unregister_channel+0x32/0xbb [ppp_generic] [18845.458193] [<e08731ad>] pppox_unbind_sock+0x18/0x1f [pppox] [18845.492712] [<e087f5f7>] pppoe_device_event+0xa7/0x159 [pppoe] [18845.528269] [<c13ed2ff>] notifier_call_chain+0x2b/0x4a [18845.559674] [<c1038a62>] raw_notifier_call_chain+0xc/0xe [18845.592110] [<c1300867>] dev_close+0x51/0x8b [18845.618266] [<c1300927>] rollback_registered_many+0x86/0x15e [18845.652813] [<c1300ab3>] unregister_netdevice_queue+0x67/0x91 [18845.687849] [<c1300b79>] unregister_netdev+0x14/0x1c [18845.718221] [<e082f4d1>] br2684_push+0x5a/0x234 [br2684] [18845.750676] [<e083dc21>] vcc_release+0x64/0x100 [atm] The problem is that the 'find_vcc' functions in these drivers are returning a vcc with the ATM_VF_READY bit cleared, because it's already in the process of being destroyed. If we fix that simple oversight, there's still a race condition because the socket can still be closed (and completely freed, afaict) between our call to find_vcc() and our call to vcc->push() in the RX tasklet. Here's a patch for solos-pci which should fix it. We prevent the race by making the dev->ops->close() function wait for the RX tasklet to complete, so it can't still be using the vcc in question. I think this same approach should work OK for usbatm and he. Less sure about atmtcp -- we may need some extra locking there to protect atmtcp_c_send(). And I'm ignoring eni_proc_read() for now. Can anyone see a better approach -- short of rewriting the whole ATM layer to make the locking saner? diff --git a/drivers/atm/solos-pci.c b/drivers/atm/solos-pci.c index c5f5186..a73f102 100644 --- a/drivers/atm/solos-pci.c +++ b/drivers/atm/solos-pci.c @@ -774,7 +774,8 @@ static struct atm_vcc *find_vcc(struct atm_dev *dev, short vpi, int vci) sk_for_each(s, node, head) { vcc = atm_sk(s); if (vcc->dev == dev && vcc->vci == vci && - vcc->vpi == vpi && vcc->qos.rxtp.traffic_class != ATM_NONE) + vcc->vpi == vpi && vcc->qos.rxtp.traffic_class != ATM_NONE && + test_bit(ATM_VF_READY, &vcc->flags)) goto out; } vcc = NULL; @@ -900,6 +901,10 @@ static void pclose(struct atm_vcc *vcc) clear_bit(ATM_VF_ADDR, &vcc->flags); clear_bit(ATM_VF_READY, &vcc->flags); + /* Hold up vcc_destroy_socket() (our caller) until solos_bh() in the + tasklet has finished processing any incoming packets (and, more to + the point, using the vcc pointer). */ + tasklet_unlock_wait(&card->tlet); return; } -- David Woodhouse Open Source Technology Centre Dav...@in... Intel Corporation |
From: Stanislaw G. <st...@wp...> - 2010-05-26 17:43:35
|
On Wed, May 26, 2010 at 12:16:24PM +0100, David Woodhouse wrote: > The problem is that the 'find_vcc' functions in these drivers are > returning a vcc with the ATM_VF_READY bit cleared, because it's already > in the process of being destroyed. If we fix that simple oversight, > there's still a race condition because the socket can still be closed > (and completely freed, afaict) between our call to find_vcc() and our > call to vcc->push() in the RX tasklet. > > Here's a patch for solos-pci which should fix it. We prevent the race by > making the dev->ops->close() function wait for the RX tasklet to > complete, so it can't still be using the vcc in question. I do not think this is problem with usbatm. We use custom vcc_list with custom usbatm_vcc_data entries. We remove entry on atmdev_ops->close() within tasklet_disable()/tasklet_enable() section. After that rx tasklet will not find usbatm_vcc_data for corresponding vpi, vpi. Stanislaw |
From: David W. <dw...@in...> - 2010-05-28 14:23:14
|
On Fri, 2010-05-28 at 03:46 -0700, David Miller wrote: > From: David Woodhouse <dw...@in...> > Date: Wed, 26 May 2010 12:16:24 +0100 > > > Can anyone see a better approach -- short of rewriting the whole ATM > > layer to make the locking saner? > > There is no doubt in my mind that these VCC objects need to be > refcounted when used like this. Perhaps. Although in the general case they're tied to the 'struct sock' and don't need to outlive it. These drivers which look up the VCC to feed incoming packets to it are the only exception to that rule that I'm aware of. > The only other alternative is to make use of something like RCU. I agree. In fact the use of tasklet_unlock_wait() in my patch is what I settled on when I went looking for 'something like RCU' to solve this particular case. I was _going_ to add RCU stuff, but realised that this was sufficient. In the close() path we clear the READY bit in the VCC, wait for the tasklet to finish using it, and only then do we destroy the VCC. -- David Woodhouse Open Source Technology Centre Dav...@in... Intel Corporation |
From: David W. <dw...@in...> - 2010-06-07 10:03:02
|
On Wed, 2010-05-26 at 12:16 +0100, David Woodhouse wrote: > I've had this crash reported to me... > > [18842.727906] EIP: [<e082f490>] br2684_push+0x19/0x234 [br2684] > SS:ESP 0068:dfb89d14 Nathan, did you manage to get your customer to confirm that this fixes the problem? It'd be useful to get this into 2.6.35 and -stable. -- David Woodhouse Open Source Technology Centre Dav...@in... Intel Corporation |
From: Nathan W. <na...@tr...> - 2010-06-16 00:33:22
|
On 7/06/2010 8:02 PM, David Woodhouse wrote: > On Wed, 2010-05-26 at 12:16 +0100, David Woodhouse wrote: >> I've had this crash reported to me... >> >> [18842.727906] EIP: [<e082f490>] br2684_push+0x19/0x234 [br2684] >> SS:ESP 0068:dfb89d14 > > Nathan, did you manage to get your customer to confirm that this fixes > the problem? It'd be useful to get this into 2.6.35 and -stable. > No, I still haven't heard back from the customer. I'll keep sending email reminders. |
From: Nathan W. <na...@tr...> - 2010-07-27 23:28:22
|
On 7/06/2010 8:02 PM, David Woodhouse wrote: > On Wed, 2010-05-26 at 12:16 +0100, David Woodhouse wrote: >> I've had this crash reported to me... >> >> [18842.727906] EIP: [<e082f490>] br2684_push+0x19/0x234 [br2684] >> SS:ESP 0068:dfb89d14 > > Nathan, did you manage to get your customer to confirm that this fixes > the problem? It'd be useful to get this into 2.6.35 and -stable. > I've had confirmation from the customer. The patch fixed his problem. |
From: David W. <dw...@in...> - 2010-08-06 16:58:13
|
We were seeing faults in the solos-pci receive tasklet when packets arrived for a VCC which was currently being closed: [18842.727906] EIP: [<e082f490>] br2684_push+0x19/0x234 [br2684] SS:ESP 0068:dfb89d14 [18845.090712] [<c13ecff3>] ? do_page_fault+0x0/0x2e1 [18845.120042] [<e082f490>] ? br2684_push+0x19/0x234 [br2684] [18845.153530] [<e084fa13>] solos_bh+0x28b/0x7c8 [solos_pci] [18845.186488] [<e084f711>] ? solos_irq+0x2d/0x51 [solos_pci] [18845.219960] [<c100387b>] ? handle_irq+0x3b/0x48 [18845.247732] [<c10265cb>] ? irq_exit+0x34/0x57 [18845.274437] [<c1025720>] tasklet_action+0x42/0x69 [18845.303247] [<c102643f>] __do_softirq+0x8e/0x129 [18845.331540] [<c10264ff>] do_softirq+0x25/0x2a [18845.358274] [<c102664c>] _local_bh_enable_ip+0x5e/0x6a [18845.389677] [<c102666d>] local_bh_enable+0xb/0xe [18845.417944] [<e08490a8>] ppp_unregister_channel+0x32/0xbb [ppp_generic] [18845.458193] [<e08731ad>] pppox_unbind_sock+0x18/0x1f [pppox] This patch uses an RCU-inspired approach to fix it. In the RX tasklet's find_vcc() function we first refuse to use a VCC which already has the ATM_VF_READY bit cleared. And in the VCC close function, we synchronise with the tasklet to ensure that it can't still be using the VCC before we continue and allow the VCC to be destroyed. Signed-off-by: David Woodhouse <Dav...@in...> Tested-by: Nathan Williams <na...@tr...> Cc: st...@ke... --- Nathan, you probably still ought to work out why your customer's setup keeps disconnecting and closing the connection -- under normal operation on a stable DSL line, this race should almost never have been triggered. diff --git a/drivers/atm/solos-pci.c b/drivers/atm/solos-pci.c index c5f5186..a73f102 100644 --- a/drivers/atm/solos-pci.c +++ b/drivers/atm/solos-pci.c @@ -774,7 +774,8 @@ static struct atm_vcc *find_vcc(struct atm_dev *dev, short vpi, int vci) sk_for_each(s, node, head) { vcc = atm_sk(s); if (vcc->dev == dev && vcc->vci == vci && - vcc->vpi == vpi && vcc->qos.rxtp.traffic_class != ATM_NONE) + vcc->vpi == vpi && vcc->qos.rxtp.traffic_class != ATM_NONE && + test_bit(ATM_VF_READY, &vcc->flags)) goto out; } vcc = NULL; @@ -900,6 +901,10 @@ static void pclose(struct atm_vcc *vcc) clear_bit(ATM_VF_ADDR, &vcc->flags); clear_bit(ATM_VF_READY, &vcc->flags); + /* Hold up vcc_destroy_socket() (our caller) until solos_bh() in the + tasklet has finished processing any incoming packets (and, more to + the point, using the vcc pointer). */ + tasklet_unlock_wait(&card->tlet); return; } -- David Woodhouse Open Source Technology Centre Dav...@in... Intel Corporation |