Re: [Scst-devel] Qlogic race condition issue

SourceForge Headquarters 1320 Columbia Street Suite 310 San Diego, CA 92101 +1 (858) 422-6466

Sorry, it is my problem; the fix is good.
I am using qla2xxx that supports Multi-ID, and in that code there is
interrupt when ever the number of vport changed. This interrupt call to
qla24xx_report_id_acquisition() and in this function there is a call to
wake_up_process(ha->dpc_thread) which is not protected.
It should call to qla2xxx_wake_dpc(ha) which is now protected.

Thanks,
Gal.

On Tue, 2008-07-22 at 09:44 +0300, Gal Rosen wrote:
> Revision 470 still panic.
> 
> [67324.049497] [4525]: scst: exit_scst:1652:SCST unloaded
> [67324.765692] qla2xxx 0000:06:00.1: LIP reset occured (f8ef).
> [67325.295690] qla2xxx 0000:06:00.1: LIP occured (f8ef).
> [67325.315689] qla2xxx 0000:06:00.1: LOOP UP detected (4 Gbps).
> [67325.415700] Unable to handle kernel NULL pointer dereference at
> 0000000000000008 RIP: 
> [67325.416599]  [<ffffffff80226b1e>] task_rq_lock+0x2e/0x90
> [67325.418171] PGD 22d737067 PUD 22df2d067 PMD 0 
> [67325.419114] Oops: 0000 [1] SMP 
> [67325.419736] CPU 0 
> [67325.420162] Modules linked in: qla2xxx loop scsi_transport_fc
> [67325.421333] Pid: 4382, comm: qla2xxx_12_dpc Tainted: GF
> 2.6.23.8-64bit-fc-nfs #10
> [67325.422998] RIP: 0010:[<ffffffff80226b1e>]  [<ffffffff80226b1e>]
> task_rq_lock+0x2e/0x90
> [67325.424736] RSP: 0018:ffff81022e14bb00  EFLAGS: 00010086
> [67325.425870] RAX: 0000000000000086 RBX: 000000000000000f RCX:
> 0000000000000000
> [67325.427379] RDX: 0000000000000000 RSI: ffff81022e14bb80 RDI:
> 0000000000000000
> [67325.428899] RBP: ffff81022e14bb20 R08: 0000000000000000 R09:
> ffff81022d160010
> [67325.430406] R10: 0000000000000000 R11: ffffffff8050faf0 R12:
> ffffffff807a4000
> [67325.431925] R13: 0000000000000000 R14: ffff81022e14bb80 R15:
> 0000000000000246
> [67325.433433] FS:  0000000000000000(0000) GS:ffffffff8071d000(0000)
> knlGS:0000000000000000
> [67325.435147] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> [67325.436371] CR2: 0000000000000008 CR3: 000000022dec0000 CR4:
> 00000000000006e0
> [67325.437890] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [67325.439398] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
> 0000000000000400
> [67325.440921] Process qla2xxx_12_dpc (pid: 4382, threadinfo
> ffff81022e14a000, task ffff81022c6040c0)
> [67325.442813] Stack:  000000000000000f ffff81022d160000
> 0000000000000000 ffffc20000038000
> [67325.444551]  ffff81022e14bbb0 ffffffff80226ddf ffff81022e14bbd0
> 0000000000000046
> [67325.446143]  0000000000000000 ffff810008abc918 000000002c6042e0
> ffff81022c6040c0
> [67325.447697] Call Trace:
> [67325.448252]  [<ffffffff80226ddf>] try_to_wake_up+0x2f/0x370
> [67325.449485]  [<ffffffff80237854>] lock_timer_base+0x34/0x70
> [67325.450736]
> [<ffffffff8805a9cb>] :qla2xxx:qla24xx_process_response_queue+0x18b/0x1f0
> [67325.452442]  [<ffffffff8805bdd8>] :qla2xxx:qla24xx_intr_handler
> +0x158/0x200
> [67325.453976]  [<ffffffff80237710>] process_timeout+0x0/0x10
> [67325.455196]  [<ffffffff88054b0b>] :qla2xxx:qla2x00_mailbox_command
> +0x1db/0x5d0
> [67325.456787]  [<ffffffff80597980>] thread_return+0x0/0x5c0
> [67325.457978]  [<ffffffff80237854>] lock_timer_base+0x34/0x70
> [67325.459214]  [<ffffffff88056a17>] :qla2xxx:qla2x00_get_adapter_id
> +0x87/0x130
> [67325.460765]  [<ffffffff80237710>] process_timeout+0x0/0x10
> [67325.461983]  [<ffffffff8804fa9f>] :qla2xxx:qla2x00_configure_loop
> +0x33f/0x17a0
> [67325.463573]  [<ffffffff80237854>] lock_timer_base+0x34/0x70
> [67325.464807]  [<ffffffff880568ea>] :qla2xxx:qla2x00_get_retry_cnt
> +0x5a/0x100
> [67325.466333]  [<ffffffff8805882c>] :qla2xxx:__qla2x00_marker
> +0xec/0x110
> [67325.467781]  [<ffffffff880588b0>] :qla2xxx:qla2x00_marker+0x60/0x90
> [67325.469157]  [<ffffffff88051b3e>] :qla2xxx:qla2x00_abort_isp
> +0x24e/0x610
> [67325.470638]  [<ffffffff8804c290>] :qla2xxx:qla2x00_do_dpc+0x430/0x560
> [67325.472052]  [<ffffffff8804be60>] :qla2xxx:qla2x00_do_dpc+0x0/0x560
> [67325.473431]  [<ffffffff8804be60>] :qla2xxx:qla2x00_do_dpc+0x0/0x560
> [67325.474820]  [<ffffffff80242e8b>] kthread+0x4b/0x80
> [67325.475902]  [<ffffffff8020c608>] child_rip+0xa/0x12
> [67325.477005]  [<ffffffff80242e40>] kthread+0x0/0x80
> [67325.478069]  [<ffffffff8020c5fe>] child_rip+0x0/0x12
> [67325.479171] 
> [67325.479486] 
> [67325.479486] Code: 49 8b 45 08 4c 89 e3 8b 40 18 48 8b 04 c5 40 8c 74
> 80 48 03 
> [67325.481393] RIP  [<ffffffff80226b1e>] task_rq_lock+0x2e/0x90
> [67325.482617]  RSP <ffff81022e14bb00>
> [67325.483356] CR2: 0000000000000008
> 
> Message from syslogd@HP1b at Tue Jul 22 02:25:48 2008 ...
> 
> Message from syslogd@HP1b at Tue Jul 22 02:25:48 2008 ...
> HP1b kernel: [67325.419114] Oops: 0000 [1] SMP 
> HP1b kernel: [67325.483356] CR2: 0000000000000008
> 
> Gal.
> 
> On Mon, 2008-07-21 at 21:04 +0400, Vladislav Bolkhovitin wrote:
> > Gal Rosen wrote:
> > > This issue occurs when rmmod'ing the qla target then scst modules and
> > > then the qla initiator from a script, constantly, but it is not related
> > > to the target specifically, it happened also when in one shell loading
> > > the qla driver and in other shell rmmod it.
> > > 
> > > Running the script that rmmod'ing the modules cause panic. All modules
> > > except the qla2xxx were unloaded successfully. The panic occur because
> > > the qla2xxx_wake_dpc() is not protected well. If this function called
> > > from one of the dpc threads or from interrupt, and at the same time some
> > > one unloading the module, then the pointer to the task structure (the
> > > dpc_thread) will be changed to NULL while the wake_up_process() try to
> > > use it.
> > > 
> > > I suggest the following patch:
> > > 
> > >  diff --git a/qla_os.c b/qla_os.c
> > > index 594c1ed..7f96723 100644
> > > --- a/qla_os.c
> > > +++ b/qla_os.c
> > > @@ -1807,6 +1807,7 @@ qla2x00_probe_one(struct pci_dev *pdev, const
> > > struct pci_device_id *id)
> > >         return 0;
> > >  
> > >  probe_failed:
> > > +       ha->flags.host_shutting_down = 1;
> > >         qla2x00_free_device(ha);
> > >  
> > >         scsi_host_put(host);
> > > @@ -1858,8 +1859,8 @@ qla2x00_free_device(scsi_qla_host_t *ha)
> > >                  * qla2xxx_wake_dpc checks for ->dpc_thread
> > >                  * so we need to zero it out.
> > >                  */
> > > -               ha->dpc_thread = NULL;
> > >                 kthread_stop(t);
> > > +               ha->dpc_thread = NULL;
> > >         }
> > >  
> > >         if (ha->eft)
> > > @@ -2397,6 +2398,9 @@ qla2x00_do_dpc(void *data)
> > >                 if (!ha->flags.init_done)
> > >                         continue;
> > >  
> > > +               if (ha->flags.host_shutting_down)
> > > +                       continue;
> > > +
> > >                 DEBUG3(printk("scsi(%ld): DPC handler\n", ha->host_no));
> > >  
> > >                 ha->dpc_active = 1;
> > > @@ -2568,7 +2572,7 @@ qla2x00_do_dpc(void *data)
> > >  void
> > >  qla2xxx_wake_dpc(scsi_qla_host_t *ha)
> > >  {
> > > -       if (ha->dpc_thread)
> > > +       if (ha->dpc_thread && !ha->flags.host_shutting_down)
> > >                 wake_up_process(ha->dpc_thread);
> > >  }
> > > 
> > > This is not the latest version but basically what I am suggesting is to
> > > check also the shutdown flag in the qla2xxx_wake_dpc() function and in
> > > the dpc while loop, and one more thing which is to set the dpc_thread to
> > > NULL only after kthread_stop().
> > > 
> > > Any comments?
> > 
> > Unfortunately, it's still racy. I've fixed in in a simpler and more 
> > reliable way (r470).
> > 
> > Thanks,
> > Vlad