[Scst-devel] commands stuck after user space crash

Hi,

I have lately been seeing problems when testing my application, which
works with scst_user.
I am working with SCST 2.1.1-pre on CentOS 6.3 (but I have already seen
this problem with previous versions).

I am testing crashes of my application, and there are scenarios where
some IOs cannot be released.
An important note is that I see this only when the system is under
heavy load, both in terms of IO on the FC (100-200K IOPS) and on the CPUs.
When the application crashes, we attempt to stop IO as fast as possible,
so we do an "echo 1 > $PATH/abort_isp" for each FC port sysfs subsystem.
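
For readers who want to reproduce the abort step, here is a minimal Python
sketch of the per-port write described above. The function name
abort_isp_on_ports and the idea of passing the directories in as a list are
assumptions made for illustration; the actual sysfs directories (the $PATH
placeholder above) are not given in this post and must be supplied by the
caller.

```python
from pathlib import Path


def abort_isp_on_ports(port_dirs):
    """Write "1" to the abort_isp attribute inside each given directory.

    port_dirs is a caller-supplied list of per-port sysfs directories
    (hypothetical input; the real locations are not stated in the post).
    Returns a list of (attribute_path, exception) pairs for failed writes,
    so that one bad port does not stop the remaining ports.
    """
    failed = []
    for port_dir in port_dirs:
        attr = Path(port_dir) / "abort_isp"
        try:
            # Same effect as: echo 1 > <port_dir>/abort_isp
            attr.write_text("1\n")
        except OSError as exc:
            failed.append((str(attr), exc))
    return failed
```

Collecting failures instead of raising on the first one keeps the loop
moving, which matters when the goal is to stop IO on every port as quickly
as possible.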

I see the following in /var/log/messages when this happens:
Feb 12 00:42:03 server-09 kernel: [112291.656297] [1923]: scst_user: 
dev_user_process_cleanup:3867:dev ffff8801317ab000 cleanup_done && 
rc==-EAGAIN, but rc1 == 1
Feb 12 00:42:03 server-09 kernel: [112291.656300] [1923]: scst_user: 
dev_user_get_next_cmd:2087:No ready commands, returning -11
Feb 12 00:42:03 server-09 kernel: [112291.658283] [1923]: scst_user: 
dev_user_process_cleanup:3861:Cleanuping dev ffff8801317ab000
Feb 12 00:42:03 server-09 kernel: [112291.658285] [1923]: scst_user: 
dev_user_unjam_dev:2524:Unjamming dev ffff8801317ab000
Feb 12 00:42:03 server-09 kernel: [112291.658288] [1923]: scst_user: 
dev_user_unjam_dev:2547:ucmd ffff8801d57d5968 not sent to user, (state 
4, ref 1), sent_to_user 0, seen_by_user 1, aborted 0, jammed 0, scst_cmd 
(null) this_state_unjammed 0
Feb 12 00:42:03 server-09 kernel: [112291.658292] [1923]: scst_user: 
dev_user_process_cleanup:3867:dev ffff8801317ab000 cleanup_done && 
rc==-EAGAIN, but rc1 == 1
Feb 12 00:42:03 server-09 kernel: [112291.658295] [1923]: scst_user: 
dev_user_get_next_cmd:2087:No ready commands, returning -11
Feb 12 00:42:03 server-09 kernel: [112291.660279] [1923]: scst_user: 
dev_user_process_cleanup:3861:Cleanuping dev ffff8801317ab000
Feb 12 00:42:03 server-09 kernel: [112291.660281] [1923]: scst_user: 
dev_user_unjam_dev:2524:Unjamming dev ffff8801317ab000

This sequence repeats over and over again at a very high rate (roughly
every 2 ms, judging by the kernel timestamps above). The only way I
can get the scst_user driver removed is by rebooting the server.
Unfortunately, my logs had wrapped around, so I cannot see what
happened before this...
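
To put a number on the repetition, an excerpt like the one above can be
summarized with a small script. The following is only a sketch: the helper
names join_wrapped and summarize are made up for illustration, and the
regular expressions assume the exact message layout shown in this post
(syslog prefix, kernel timestamp, thread id, then "scst_user:
function:line:"). It first rejoins records that the mail client wrapped,
then counts messages per (function, source line) and reports the first and
last kernel timestamps.

```python
import re
from collections import Counter

# Start of a syslog record, e.g. "Feb 12 00:42:03 server-09 kernel: ..."
RECORD_START = re.compile(r"^[A-Z][a-z]{2} +\d+ \d{2}:\d{2}:\d{2} ")
# Kernel timestamp, then thread id, then "scst_user: <function>:<line>:"
SCST_USER = re.compile(r"\[\s*(\d+\.\d+)\] \[\d+\]: scst_user: (\w+):(\d+):")


def join_wrapped(text):
    """Rejoin log records that were wrapped across several lines."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if RECORD_START.match(line) or not records:
            records.append(line)
        else:
            # Continuation of the previous record
            records[-1] += " " + line
    return records


def summarize(text):
    """Return (counts, first_ts, last_ts) for scst_user messages in text.

    counts maps (function_name, source_line) to the number of occurrences;
    first_ts and last_ts are kernel timestamps in seconds, or None if no
    scst_user message was found.
    """
    counts = Counter()
    first = last = None
    for record in join_wrapped(text):
        match = SCST_USER.search(record)
        if not match:
            continue
        ts = float(match.group(1))
        if first is None:
            first = ts
        last = ts
        counts[(match.group(2), int(match.group(3)))] += 1
    return counts, first, last
```

Dividing the count of one recurring message, for example the
dev_user_unjam_dev one, by (last_ts - first_ts) gives an approximate loop
rate for the captured window.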

The system which encountered this problem is using iscsi-scst, 
scst_local, qla2xxx, qla2x00tgt, scst and scst_user as the target stack.
Most of the IO is fibre channel (the iscsi and local drivers are used 
for some internal cluster management and are hardly active).

Have you seen this kind of behaviour before?
Do you understand a scenario in which this can happen?

Thanks,
Shahar