From: <sha...@ka...> - 2013-02-18 12:40:29
Hi,

I have lately been seeing problems when testing my application, which works with scst_user. I am working with SCST 2.1.1-pre on CentOS 6.3 (but I have already seen this problem with previous versions). I am testing crashes of my application, and there are scenarios where some IOs cannot be released. An important note: I see this only when the system is under heavy load, both in terms of IO on the FC (100-200K IOPS) and on the CPUs.

When the application crashes, we attempt to stop IO as fast as possible, so we do an "echo 1 > $PATH/abort_isp" for each FC port sysfs subsystem. I see the following in /var/log/messages when this happens:

Feb 12 00:42:03 server-09 kernel: [112291.656297] [1923]: scst_user: dev_user_process_cleanup:3867:dev ffff8801317ab000 cleanup_done && rc==-EAGAIN, but rc1 == 1
Feb 12 00:42:03 server-09 kernel: [112291.656300] [1923]: scst_user: dev_user_get_next_cmd:2087:No ready commands, returning -11
Feb 12 00:42:03 server-09 kernel: [112291.658283] [1923]: scst_user: dev_user_process_cleanup:3861:Cleanuping dev ffff8801317ab000
Feb 12 00:42:03 server-09 kernel: [112291.658285] [1923]: scst_user: dev_user_unjam_dev:2524:Unjamming dev ffff8801317ab000
Feb 12 00:42:03 server-09 kernel: [112291.658288] [1923]: scst_user: dev_user_unjam_dev:2547:ucmd ffff8801d57d5968 not sent to user, (state 4, ref 1), sent_to_user 0, seen_by_user 1, aborted 0, jammed 0, scst_cmd (null) this_state_unjammed 0
Feb 12 00:42:03 server-09 kernel: [112291.658292] [1923]: scst_user: dev_user_process_cleanup:3867:dev ffff8801317ab000 cleanup_done && rc==-EAGAIN, but rc1 == 1
Feb 12 00:42:03 server-09 kernel: [112291.658295] [1923]: scst_user: dev_user_get_next_cmd:2087:No ready commands, returning -11
Feb 12 00:42:03 server-09 kernel: [112291.660279] [1923]: scst_user: dev_user_process_cleanup:3861:Cleanuping dev ffff8801317ab000
Feb 12 00:42:03 server-09 kernel: [112291.660281] [1923]: scst_user: dev_user_unjam_dev:2524:Unjamming dev ffff8801317ab000

This sequence is repeated over and over again at a very high rate. The only way I can get the scst_user driver removed is by rebooting the server. Unfortunately, my logs had wrapped around, so I do not see what happened before this...

The system which encountered this problem is using iscsi-scst, scst_local, qla2xxx, qla2x00tgt, scst and scst_user as the target stack. Most of the IO is Fibre Channel (the iSCSI and local drivers are used for some internal cluster management and are hardly active).

Have you seen this kind of behaviour before? Do you understand a scenario in which this can happen?

Thanks,
Shahar
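
P.S. For reference, the abort step described above can be sketched roughly as follows. This is only an illustration, not our actual script: the function name abort_all_ports is hypothetical, and the port directories are passed in by the caller, so no particular sysfs layout is assumed. The only thing taken from the description above is writing 1 to the abort_isp attribute of each FC port.

```shell
#!/bin/sh
# Hypothetical helper (name and structure are assumptions for illustration).
# For each port directory given as an argument, write 1 to its abort_isp
# attribute. Returns non-zero if any port could not be written.
abort_all_ports() {
    rc=0
    for port_dir in "$@"; do
        attr="$port_dir/abort_isp"
        if [ -w "$attr" ]; then
            # Same operation as the manual "echo 1 > .../abort_isp"
            echo 1 > "$attr" || rc=1
        else
            echo "skipping $port_dir: abort_isp not writable" >&2
            rc=1
        fi
    done
    return "$rc"
}
```

It would be called with the per-port sysfs directories as arguments, for example abort_all_ports "$PORT_DIR_A" "$PORT_DIR_B", where both variables are placeholders for the real locations on the system in question.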