|
From: Gal R. <ga...@st...> - 2008-07-22 07:35:44
|
Sorry, it is my problem; the fix is good.
I am using qla2xxx that supports Multi-ID, and in that code there is
interrupt when ever the number of vport changed. This interrupt call to
qla24xx_report_id_acquisition() and in this function there is a call to
wake_up_process(ha->dpc_thread) which is not protected.
It should call to qla2xxx_wake_dpc(ha) which is now protected.
Thanks,
Gal.
On Tue, 2008-07-22 at 09:44 +0300, Gal Rosen wrote:
> Revision 470 still panic.
>
> [67324.049497] [4525]: scst: exit_scst:1652:SCST unloaded
> [67324.765692] qla2xxx 0000:06:00.1: LIP reset occured (f8ef).
> [67325.295690] qla2xxx 0000:06:00.1: LIP occured (f8ef).
> [67325.315689] qla2xxx 0000:06:00.1: LOOP UP detected (4 Gbps).
> [67325.415700] Unable to handle kernel NULL pointer dereference at
> 0000000000000008 RIP:
> [67325.416599] [<ffffffff80226b1e>] task_rq_lock+0x2e/0x90
> [67325.418171] PGD 22d737067 PUD 22df2d067 PMD 0
> [67325.419114] Oops: 0000 [1] SMP
> [67325.419736] CPU 0
> [67325.420162] Modules linked in: qla2xxx loop scsi_transport_fc
> [67325.421333] Pid: 4382, comm: qla2xxx_12_dpc Tainted: GF
> 2.6.23.8-64bit-fc-nfs #10
> [67325.422998] RIP: 0010:[<ffffffff80226b1e>] [<ffffffff80226b1e>]
> task_rq_lock+0x2e/0x90
> [67325.424736] RSP: 0018:ffff81022e14bb00 EFLAGS: 00010086
> [67325.425870] RAX: 0000000000000086 RBX: 000000000000000f RCX:
> 0000000000000000
> [67325.427379] RDX: 0000000000000000 RSI: ffff81022e14bb80 RDI:
> 0000000000000000
> [67325.428899] RBP: ffff81022e14bb20 R08: 0000000000000000 R09:
> ffff81022d160010
> [67325.430406] R10: 0000000000000000 R11: ffffffff8050faf0 R12:
> ffffffff807a4000
> [67325.431925] R13: 0000000000000000 R14: ffff81022e14bb80 R15:
> 0000000000000246
> [67325.433433] FS: 0000000000000000(0000) GS:ffffffff8071d000(0000)
> knlGS:0000000000000000
> [67325.435147] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> [67325.436371] CR2: 0000000000000008 CR3: 000000022dec0000 CR4:
> 00000000000006e0
> [67325.437890] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [67325.439398] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
> 0000000000000400
> [67325.440921] Process qla2xxx_12_dpc (pid: 4382, threadinfo
> ffff81022e14a000, task ffff81022c6040c0)
> [67325.442813] Stack: 000000000000000f ffff81022d160000
> 0000000000000000 ffffc20000038000
> [67325.444551] ffff81022e14bbb0 ffffffff80226ddf ffff81022e14bbd0
> 0000000000000046
> [67325.446143] 0000000000000000 ffff810008abc918 000000002c6042e0
> ffff81022c6040c0
> [67325.447697] Call Trace:
> [67325.448252] [<ffffffff80226ddf>] try_to_wake_up+0x2f/0x370
> [67325.449485] [<ffffffff80237854>] lock_timer_base+0x34/0x70
> [67325.450736]
> [<ffffffff8805a9cb>] :qla2xxx:qla24xx_process_response_queue+0x18b/0x1f0
> [67325.452442] [<ffffffff8805bdd8>] :qla2xxx:qla24xx_intr_handler
> +0x158/0x200
> [67325.453976] [<ffffffff80237710>] process_timeout+0x0/0x10
> [67325.455196] [<ffffffff88054b0b>] :qla2xxx:qla2x00_mailbox_command
> +0x1db/0x5d0
> [67325.456787] [<ffffffff80597980>] thread_return+0x0/0x5c0
> [67325.457978] [<ffffffff80237854>] lock_timer_base+0x34/0x70
> [67325.459214] [<ffffffff88056a17>] :qla2xxx:qla2x00_get_adapter_id
> +0x87/0x130
> [67325.460765] [<ffffffff80237710>] process_timeout+0x0/0x10
> [67325.461983] [<ffffffff8804fa9f>] :qla2xxx:qla2x00_configure_loop
> +0x33f/0x17a0
> [67325.463573] [<ffffffff80237854>] lock_timer_base+0x34/0x70
> [67325.464807] [<ffffffff880568ea>] :qla2xxx:qla2x00_get_retry_cnt
> +0x5a/0x100
> [67325.466333] [<ffffffff8805882c>] :qla2xxx:__qla2x00_marker
> +0xec/0x110
> [67325.467781] [<ffffffff880588b0>] :qla2xxx:qla2x00_marker+0x60/0x90
> [67325.469157] [<ffffffff88051b3e>] :qla2xxx:qla2x00_abort_isp
> +0x24e/0x610
> [67325.470638] [<ffffffff8804c290>] :qla2xxx:qla2x00_do_dpc+0x430/0x560
> [67325.472052] [<ffffffff8804be60>] :qla2xxx:qla2x00_do_dpc+0x0/0x560
> [67325.473431] [<ffffffff8804be60>] :qla2xxx:qla2x00_do_dpc+0x0/0x560
> [67325.474820] [<ffffffff80242e8b>] kthread+0x4b/0x80
> [67325.475902] [<ffffffff8020c608>] child_rip+0xa/0x12
> [67325.477005] [<ffffffff80242e40>] kthread+0x0/0x80
> [67325.478069] [<ffffffff8020c5fe>] child_rip+0x0/0x12
> [67325.479171]
> [67325.479486]
> [67325.479486] Code: 49 8b 45 08 4c 89 e3 8b 40 18 48 8b 04 c5 40 8c 74
> 80 48 03
> [67325.481393] RIP [<ffffffff80226b1e>] task_rq_lock+0x2e/0x90
> [67325.482617] RSP <ffff81022e14bb00>
> [67325.483356] CR2: 0000000000000008
>
> Message from syslogd@HP1b at Tue Jul 22 02:25:48 2008 ...
>
> Message from syslogd@HP1b at Tue Jul 22 02:25:48 2008 ...
> HP1b kernel: [67325.419114] Oops: 0000 [1] SMP
> HP1b kernel: [67325.483356] CR2: 0000000000000008
>
> Gal.
>
> On Mon, 2008-07-21 at 21:04 +0400, Vladislav Bolkhovitin wrote:
> > Gal Rosen wrote:
> > > This issue occurs when rmmod'ing the qla target then scst modules and
> > > then the qla initiator from a script, constantly, but it is not related
> > > to the target specifically, it happened also when in one shell loading
> > > the qla driver and in other shell rmmod it.
> > >
> > > Running the script that rmmod'ing the modules cause panic. All modules
> > > except the qla2xxx were unloaded successfully. The panic occur because
> > > the qla2xxx_wake_dpc() is not protected well. If this function called
> > > from one of the dpc threads or from interrupt, and at the same time some
> > > one unloading the module, then the pointer to the task structure (the
> > > dpc_thread) will be changed to NULL while the wake_up_process() try to
> > > use it.
> > >
> > > I suggest the following patch:
> > >
> > > diff --git a/qla_os.c b/qla_os.c
> > > index 594c1ed..7f96723 100644
> > > --- a/qla_os.c
> > > +++ b/qla_os.c
> > > @@ -1807,6 +1807,7 @@ qla2x00_probe_one(struct pci_dev *pdev, const
> > > struct pci_device_id *id)
> > > return 0;
> > >
> > > probe_failed:
> > > + ha->flags.host_shutting_down = 1;
> > > qla2x00_free_device(ha);
> > >
> > > scsi_host_put(host);
> > > @@ -1858,8 +1859,8 @@ qla2x00_free_device(scsi_qla_host_t *ha)
> > > * qla2xxx_wake_dpc checks for ->dpc_thread
> > > * so we need to zero it out.
> > > */
> > > - ha->dpc_thread = NULL;
> > > kthread_stop(t);
> > > + ha->dpc_thread = NULL;
> > > }
> > >
> > > if (ha->eft)
> > > @@ -2397,6 +2398,9 @@ qla2x00_do_dpc(void *data)
> > > if (!ha->flags.init_done)
> > > continue;
> > >
> > > + if (ha->flags.host_shutting_down)
> > > + continue;
> > > +
> > > DEBUG3(printk("scsi(%ld): DPC handler\n", ha->host_no));
> > >
> > > ha->dpc_active = 1;
> > > @@ -2568,7 +2572,7 @@ qla2x00_do_dpc(void *data)
> > > void
> > > qla2xxx_wake_dpc(scsi_qla_host_t *ha)
> > > {
> > > - if (ha->dpc_thread)
> > > + if (ha->dpc_thread && !ha->flags.host_shutting_down)
> > > wake_up_process(ha->dpc_thread);
> > > }
> > >
> > > This is not the latest version but basically what I am suggesting is to
> > > check also the shutdown flag in the qla2xxx_wake_dpc() function and in
> > > the dpc while loop, and one more thing which is to set the dpc_thread to
> > > NULL only after kthread_stop().
> > >
> > > Any comments?
> >
> > Unfortunately, it's still racy. I've fixed in in a simpler and more
> > reliable way (r470).
> >
> > Thanks,
> > Vlad
|