[Scst-svn] SF.net SVN: scst: [287] trunk

Revision: 287
          http://scst.svn.sourceforge.net/scst/?rev=287&view=rev
Author:   vlnb
Date:     2008-02-13 09:15:47 -0800 (Wed, 13 Feb 2008)

Log Message:
-----------
 - Fixed minor problem in iSCSI-SCST
 - Important reference counting and barriers usage cleanups
 - Sense buffer made dynamic
 - Other minor improvements and cleanups
 - Docs updates

Modified Paths:
--------------
    trunk/iscsi-scst/README
    trunk/iscsi-scst/include/iscsi_u.h
    trunk/iscsi-scst/kernel/iscsi.c
    trunk/iscsi-scst/kernel/iscsi.h
    trunk/iscsi-scst/kernel/nthread.c
    trunk/mpt/mpt_scst.c
    trunk/qla2x00t/qla_init.c
    trunk/scst/README
    trunk/scst/include/scsi_tgt.h
    trunk/scst/include/scst_const.h
    trunk/scst/src/dev_handlers/scst_tape.c
    trunk/scst/src/dev_handlers/scst_user.c
    trunk/scst/src/dev_handlers/scst_vdisk.c
    trunk/scst/src/scst_lib.c
    trunk/scst/src/scst_main.c
    trunk/scst/src/scst_priv.h
    trunk/scst/src/scst_targ.c

Added Paths:
-----------
    trunk/AskingQuestions
    trunk/iscsi-scst/AskingQuestions
    trunk/qla2x00t/qla2x00-target/AskingQuestions
    trunk/scst/AskingQuestions

Added: trunk/AskingQuestions
===================================================================

--- trunk/AskingQuestions	                        (rev 0)
+++ trunk/AskingQuestions	2008-02-13 17:15:47 UTC (rev 287)
@@ -0,0 +1,316 @@
+Before asking me or the scst-devel mailing list any questions, make
+sure that you have read *ALL* relevant documentation files (at least 2
+README files: one for SCST and one for the target driver you are using)
+and *understood* *ALL* that is written there. I very much like working
+with people who understand what they are doing, and I hate it when
+somebody tries to use me as a replacement for his brain and to save his
+time at the expense of mine. So, in such cases, don't be surprised if
+your question is ignored or answered in RTFM style.
+
+In particular, I will refuse to answer any questions about low
+performance unless you *explicitly* state in your question that you are
+not using the debug build and have ensured (say how) that your target
+and backstorage devices don't share the same PCI bus.
+
+Another too frequently asked area is "What do those aborts and resets,
+which your target logs from time to time, mean and what should I do
+about them?", "Are they related to the I/O stalls I sometimes
+experience?" and "Why was my device put offline after them?".
+
+Sorry if the above sounds too harsh. Unfortunately, my energy is
+limited and I can't waste it repeatedly explaining basic concepts and
+answering the same questions.
+
+Example of a really bad question:
+
+======================================================================
+
+In our user space driver , i use epoll_wait to wait on multiple file
+descriptors for multiple devices. Apparently when i wait on the ioctl in
+blocking mode , everything works well , but when i wait on epoll , and
+try to  attach a target device , i get immediately a "Bad address" error
+value from the epoll.
+
+What is the reason ?
+
+======================================================================
+
+It is bad because, apparently, the author was doing something wrong
+with epoll, but instead of checking the source code to find out when
+the "Bad address" error can be returned and to understand the possible
+reasons for it, he expected me to do that for him. He didn't even bother
+to look in the kernel log, where, very probably, the reason for the
+error was logged.
+
+
+Here are three examples of good questions:
+
+======================================================================
+
+I'm looking for a help in understanding of SCST internal architecture
+and operation. The problem I'm experiencing now is that SCST seems to
+process deferred commands incorrectly in some cases. More specifically,
+I'm confused with the 'while' loop in scst_send_to_midlev function.
+
+As far as I understand, the basic execution path consists of a call to
+scst_do_send_to_midlev followed by taking of a decision on this command
+(continue with this command, reschedule it, or move to the next one),
+the decision is stored in 'int res', which is then returned from the
+function.
+
+However, if there are deferred commands on the device, the function does
+not return but makes another call to scst_do_send_to_midlev, analyzes
+the return code again and stores the decision in 'int res' thereby
+erasing the decision for the previous command. If scst_send_to_midlev
+exits now, it will return the _new_ decision (for the deferred command)
+whereas the scst_process_active_cmd will think that it is the decision
+for the command that was originally passed to scst_send_to_midlev.
+
+For example, this will cause problems in the following situation:
+1. scst_send_to_midlev is called with cmd == 0x80000100
+2. scst_do_send_to_midlev is called with cmd == 0x80000100
+3. scst_do_send_to_midlev returns with SCST_EXEC_COMPLETED
+   (in certain scenarios the command is already destroyed at this point)
+4. scst_check_deferred_commands finds the deferred cmd == 0x80000200
+5. scst_do_send_to_midlev is called with cmd == 0x80000200
+6. scst_do_send_to_midlev returns with SCST_EXEC_NEED_THREAD
+7. scst_send_to_midlev returns with SCST_CMD_STATE_RES_NEED_THREAD
+8. Now, the scst_process_active_cmd will try to reschedule command 0x80000100
+   which is already destroyed at this point !
+   
+Can anyone on the list confirm my guess? Or, this situation should never
+happen because of some other condition which I may have missed? Right
+now I can't think of any of simple methods to work around the issue,
+i.e. any of my ideas require rewriting significant part of the code.
+
+======================================================================
+
+Hello,
+
+I have two machines (SCST targets) with the following parameters:
+- two dual core Xeon CPUs
+- QLA2342 FC HBA
+- Areca SATA RAID HBA
+- Linux 2.6.21.3, running in 64 bit mode with 16G RAM
+- SCST trunk version
+
+On the client side there is a Solaris 10 U3 machine, with the same (chip 
+wise) Qlogic controller.
+
+There is an FC switch between the three machines, and each of the 
+targets are zoned to the client's port in a one-by-one manner, so HBA 
+port 1 sees only target 1 and port 2 sees only target 2.
+
+The targets are configured with two large sparse files on XFS (8 TB 
+each, with dd if=/dev/zero of=file bs=1M count=0 seek=8388608).
+
+In Solaris I do various tests with SVM (Sun's built in volume manager) 
+and multiterabyte UFS. Occasionally, there are some strange write
+errors, where the volume  manager drops its volumes and without a VM, a
+simple UFS fs write can  fail too.
+
+I see various errors logged by the kernel (Solaris'), these are some 
+examples, both with and without SVM:
+Jun 21 10:42:14 solaris fctl: [ID 517869 kern.warning] WARNING: 
+fp(1)::GPN_ID for D_ID=621200 failed
+Jun 21 10:42:14 solaris fctl: [ID 517869 kern.warning] WARNING: 
+fp(1)::N_x Port with D_ID=621200, PWWN=210000e08b944419 disappeared from 
+fabric
+Jun 21 10:42:53 solaris scsi: [ID 107833 kern.warning] WARNING: 
+/pci@1,0/pci1022,7450@a/pcie11,105@1,1/fp@0,0/disk@w210000e08b944419,0 
+(sd1):
+Jun 21 10:42:53 solaris         SCSI transport failed: reason 
+'tran_err': retrying command
+Jun 21 10:43:06 solaris scsi: [ID 107833 kern.warning] WARNING: 
+/pci@1,0/pci1022,7450@a/pcie11,105@1,1/fp@0,0/disk@w210000e08b944419,0 
+(sd1):
+Jun 21 10:43:06 solaris         SCSI transport failed: reason 'timeout': 
+retrying command
+Jun 21 10:43:13 solaris scsi: [ID 107833 kern.notice]   Device is gone
+Jun 21 10:43:13 solaris scsi: [ID 107833 kern.warning] WARNING: 
+/pci@1,0/pci1022,7450@a/pcie11,105@1,1/fp@0,0/disk@w210000e08b944419,0 
+(sd1):
+Jun 21 10:43:13 solaris         transport rejected fatal error
+Jun 21 10:43:13 solaris md_stripe: [ID 641072 kern.warning] WARNING: md: 
+d10: write error on /dev/dsk/c2t210000E08B944419d0s6
+Jun 21 10:43:13 solaris last message repeated 9 times
+Jun 21 10:43:13 solaris scsi: [ID 243001 kern.info] 
+/pci@1,0/pci1022,7450@a/pcie11,105@1,1/fp@0,0 (fcp1):
+Jun 21 10:43:13 solaris         offlining lun=0 (trace=0), target=621200 
+(trace=2800004)
+Jun 21 10:43:13 solaris ufs: [ID 702911 kern.warning] WARNING: Error 
+writing master during ufs log roll
+Jun 21 10:43:13 solaris ufs: [ID 127457 kern.warning] WARNING: ufs log 
+for /mnt changed state to Error
+Jun 21 10:43:13 solaris ufs: [ID 616219 kern.warning] WARNING: Please 
+umount(1M) /mnt and run fsck(1M)
+Jun 21 11:08:55 solaris scsi: [ID 107833 kern.warning] WARNING: 
+/pci@1,0/pci1022,7450@a/pcie11,105@1,1/fp@0,0/disk@w210000e08b944419,0 
+(sd1):
+Jun 21 11:08:55 solaris         offline or reservation conflict
+Jun 21 11:09:41 solaris scsi: [ID 107833 kern.warning] WARNING: 
+/pci@1,0/pci1022,7450@a/pcie11,105@1,1/fp@0,0/disk@w210000e08b944419,0 
+(sd1):
+Jun 21 11:09:41 solaris         offline or reservation conflict
+Jun 21 11:09:41 solaris scsi: [ID 107833 kern.warning] WARNING: 
+/pci@1,0/pci1022,7450@a/pcie11,105@1,1/fp@0,0/disk@w210000e08b944419,0 
+(sd1):
+Jun 21 11:09:41 solaris         offline or reservation conflict
+Jun 21 11:09:41 solaris scsi: [ID 107833 kern.warning] WARNING: 
+/pci@1,0/pci1022,7450@a/pcie11,105@1,1/fp@0,0/disk@w210000e08b944419,0 
+(sd1):
+Jun 21 11:09:41 solaris         i/o to invalid geometry
+Jun 21 11:09:41 solaris scsi: [ID 107833 kern.warning] WARNING: 
+/pci@1,0/pci1022,7450@a/pcie11,105@1,1/fp@0,0/disk@w210000e08b944419,0 
+(sd1):
+Jun 21 11:09:41 solaris         offline or reservation conflict
+Jun 21 11:09:41 solaris scsi: [ID 107833 kern.warning] WARNING: 
+/pci@1,0/pci1022,7450@a/pcie11,105@1,1/fp@0,0/disk@w210000e08b944419,0 
+(sd1):
+Jun 21 11:09:41 solaris         i/o to invalid geometry
+Jun 21 11:09:41 solaris scsi: [ID 107833 kern.warning] WARNING: 
+/pci@1,0/pci1022,7450@a/pcie11,105@1,1/fp@0,0/disk@w210000e08b944419,0 
+(sd1):
+Jun 21 11:09:41 solaris         offline or reservation conflict
+Jun 21 11:09:41 solaris scsi: [ID 107833 kern.warning] WARNING: 
+/pci@1,0/pci1022,7450@a/pcie11,105@1,1/fp@0,0/disk@w210000e08b944419,0 
+(sd1):
+Jun 21 11:09:41 solaris         i/o to invalid geometry
+Jun 21 11:09:43 solaris scsi: [ID 107833 kern.warning] WARNING: 
+/pci@1,0/pci1022,7450@a/pcie11,105@1,1/fp@0,0/disk@w210000e08b944419,0 
+(sd1):
+Jun 21 11:09:43 solaris         offline or reservation conflict
+Jun 21 11:09:43 solaris scsi: [ID 107833 kern.warning] WARNING: 
+/pci@1,0/pci1022,7450@a/pcie11,105@1,1/fp@0,0/disk@w210000e08b944419,0 
+(sd1):
+Jun 21 11:09:43 solaris         SYNCHRONIZE CACHE command failed (5)
+
+I don't see anything in the dmesg on the target side.
+
+After these errors SCST seems to be dead. I can't unload its modules and 
+can't communicate it via /proc.
+A simple cat vdisk just waits and waits.
+
+Could you please help? What should I set/collect/send in this case to 
+help resolving this issue?
+
+======================================================================
+
+Hello,
+
+I am trying to get scst working on an Opteron machine.
+
+After some hours, playing with different kernel versions and different
+missing functions, I've sticked with a 2.6.15 and a
+drivers/scsi/scsi_lib.c hack from 2.6.14, which contains the
+scsi_wait_req. (Linux is a mess, each point release changes something.
+How can developers keep up with this?)
+
+Now everything seems to be OK, I could load the modules and such.
+
+I have a setup of two machines connected to each other in an FC-P2P
+manner. The two machines has two 2G links between them. On the initiator
+side I have FreeBSD, because I know that better and this is what I did
+some target mode tests.
+
+The strange thing is that the loop seems to be only running at 1 Gbps:
+[   61.731265] QLogic Fibre Channel HBA Driver
+[   61.731454] GSI 21 sharing vector 0xD1 and IRQ 21
+[   61.731563] ACPI: PCI Interrupt 0000:06:01.0[A] -> GSI 36 (level, low) -> IRQ 21
+[   61.731821] qla2300 0000:06:01.0: Found an ISP2312, irq 21, iobase 0xffffc200
+00014000
+[   61.732194] qla2300 0000:06:01.0: Configuring PCI space...
+[   61.732441] qla2300 0000:06:01.0: Configure NVRAM parameters...
+[   61.816885] qla2300 0000:06:01.0: Verifying loaded RISC code...
+[   61.852177] qla2300 0000:06:01.0: Extended memory detected (512 KB)...
+[   61.852294] qla2300 0000:06:01.0: Resizing request queue depth (2048 -> 4096)
+...
+[   61.852604] qla2300 0000:06:01.0: LIP reset occured (f8e8).
+[   61.852740] qla2300 0000:06:01.0: Waiting for LIP to complete...
+[   62.865911] qla2300 0000:06:01.0: LIP occured (f7f7).
+[   62.866042] qla2300 0000:06:01.0: LOOP UP detected (1 Gbps).
+[   62.866269] qla2300 0000:06:01.0: Topology - (Loop), Host Loop address 0x0
+[   62.868285] scsi0 : qla2xxx
+[   62.868507] qla2300 0000:06:01.0:
+[   62.868507]  QLogic Fibre Channel HBA Driver: 8.01.03-k
+[   62.868508]   QLogic QLA2312 -
+[   62.868509]   ISP2312: PCI-X (100 MHz) @ 0000:06:01.0 hdma+, host#=0, fw=3.03.18 IPX
+
+
+I did the following:
+modprobe qla2x00tgt:
+
+[  104.988170] qla2x00tgt: no version for "scst_unregister" found: kernel tainted.
+
+echo "open lun0 /data/lun0" >/proc/scsi_tgt/disk_fileio/disk_fileio
+[  169.102877] scst: Device handler disk_fileio for type 0 loaded successfully
+[  169.103002] scst: Device handler cdrom_fileio for type 5 loaded successfully
+[  191.261000] dev_fileio: Attached SCSI target virtual disk lun0 (file="/data/l
+un0", fs=1000001MB, bs=512, nblocks=2048002048, cyln=1000001)
+[  191.261191] scst: Attached SCSI target mid-level to virtual device lun0 (id 1
+)
+
+and
+echo "add lun0 0" > /proc/scsi_tgt/groups/Default/devices
+
+On the other side a camcontrol rescan all (SCSI rescan) gives me the following with a verbose logging kernel:
+Mar 29 18:09:17 blade2 kernel: pass1: <SCST_FIO lun0 093> Fixed Direct Access SCSI-4 device
+Mar 29 18:09:17 blade2 kernel: pass1: Serial Number 383
+Mar 29 18:09:17 blade2 kernel: pass1: 100.000MB/s transfers
+Mar 29 18:09:17 blade2 kernel: da1 at isp0 bus 0 target 0 lun 0
+Mar 29 18:09:17 blade2 kernel: da1: <SCST_FIO lun0 093> Fixed Direct Access SCSI-4 device
+Mar 29 18:09:17 blade2 kernel: da1: Serial Number 383
+Mar 29 18:09:17 blade2 kernel: da1: 100.000MB/s transfers
+Mar 29 18:09:17 blade2 kernel: da1: 1024MB (2097152 512 byte sectors: 64H 32S/T 1024C)
+Mar 29 18:09:17 blade2 kernel: (probe0:isp0:0:0:1): error 6
+Mar 29 18:09:17 blade2 kernel: (probe0:isp0:0:0:1): Unretryable Error
+Mar 29 18:09:17 blade2 kernel: isp0: data overrun for command on 0.0.0
+Mar 29 18:09:17 blade2 kernel: (da1:isp0:0:0:0): Data Overrun
+Mar 29 18:09:17 blade2 kernel: (da1:isp0:0:0:0): Retrying Command
+Mar 29 18:09:18 blade2 kernel: (probe0:isp0:0:0:2): error 6
+Mar 29 18:09:18 blade2 kernel: (probe0:isp0:0:0:2): Unretryable Error
+Mar 29 18:09:18 blade2 kernel: isp0: data overrun for command on 0.0.0
+Mar 29 18:09:18 blade2 kernel: (da1:isp0:0:0:0): Data Overrun
+Mar 29 18:09:18 blade2 kernel: (da1:isp0:0:0:0): Retrying Command
+Mar 29 18:09:18 blade2 kernel: (probe0:isp0:0:0:3): error 6
+Mar 29 18:09:18 blade2 kernel: (probe0:isp0:0:0:3): Unretryable Error
+Mar 29 18:09:18 blade2 kernel: isp0: data overrun for command on 0.0.0
+Mar 29 18:09:18 blade2 kernel: (da1:isp0:0:0:0): Data Overrun
+Mar 29 18:09:18 blade2 kernel: (da1:isp0:0:0:0): Retrying Command
+Mar 29 18:09:18 blade2 kernel: (probe0:isp0:0:0:4): error 6
+Mar 29 18:09:18 blade2 kernel: (probe0:isp0:0:0:4): Unretryable Error
+Mar 29 18:09:18 blade2 kernel: isp0: data overrun for command on 0.0.0
+Mar 29 18:09:18 blade2 kernel: (da1:isp0:0:0:0): Data Overrun
+Mar 29 18:09:18 blade2 kernel: (da1:isp0:0:0:0): Retrying Command
+Mar 29 18:09:18 blade2 kernel: (probe0:isp0:0:0:5): error 6
+Mar 29 18:09:18 blade2 kernel: (probe0:isp0:0:0:5): Unretryable Error
+Mar 29 18:09:18 blade2 kernel: isp0: data overrun for command on 0.0.0
+Mar 29 18:09:18 blade2 kernel: (da1:isp0:0:0:0): Data Overrun
+Mar 29 18:09:18 blade2 kernel: (da1:isp0:0:0:0): error 5
+Mar 29 18:09:18 blade2 kernel: (da1:isp0:0:0:0): Retries Exausted
+Mar 29 18:09:18 blade2 kernel: (probe0:isp0:0:0:6): error 6
+Mar 29 18:09:18 blade2 kernel: (probe0:isp0:0:0:6): Unretryable Error
+Mar 29 18:09:19 blade2 kernel: (probe0:isp0:0:0:7): error 6
+Mar 29 18:09:19 blade2 kernel: (probe0:isp0:0:0:7): Unretryable Error
+
+
+The device is there, but I cannot use it.
+
+BTW, the target mode machine (Linux) runs on a dual Opteron in 64 bit
+mode, with 8GB of RAM. I've lowered it with mem=800M, but the effect is
+the same.
+
+Assuming that mixed 2.6.14-.15 kernel is the fault, could you please
+tell me what version should I use, for which all of the patches will
+work?
+
+======================================================================
+
+So, the bottom line: if you want me to be friendly, don't ask questions
+whose answers you can find yourself by simply reading the documentation
+and making a minimal thinking effort.
+
+Also, it is very desirable that you attach to your question the full
+kernel log from the target since it was booted.
+
+Vladislav Bolkhovitin <vs...@vl...>, http://scst.sourceforge.net

Added: trunk/iscsi-scst/AskingQuestions
===================================================================
--- trunk/iscsi-scst/AskingQuestions	                        (rev 0)
+++ trunk/iscsi-scst/AskingQuestions	2008-02-13 17:15:47 UTC (rev 287)
@@ -0,0 +1,316 @@
+Before asking me or the scst-devel mailing list any questions, make
+sure that you have read *ALL* relevant documentation files (at least 2
+README files: one for SCST and one for the target driver you are using)
+and *understood* *ALL* that is written there. I very much like working
+with people who understand what they are doing, and I hate it when
+somebody tries to use me as a replacement for his brain and to save his
+time at the expense of mine. So, in such cases, don't be surprised if
+your question is ignored or answered in RTFM style.
+
+In particular, I will refuse to answer any questions about low
+performance unless you *explicitly* state in your question that you are
+not using the debug build and have ensured (say how) that your target
+and backstorage devices don't share the same PCI bus.
+
+Another too frequently asked area is "What do those aborts and resets,
+which your target logs from time to time, mean and what should I do
+about them?", "Are they related to the I/O stalls I sometimes
+experience?" and "Why was my device put offline after them?".
+
+Sorry if the above sounds too harsh. Unfortunately, my energy is
+limited and I can't waste it repeatedly explaining basic concepts and
+answering the same questions.
+
+Example of a really bad question:
+
+======================================================================
+
+In our user space driver , i use epoll_wait to wait on multiple file
+descriptors for multiple devices. Apparently when i wait on the ioctl in
+blocking mode , everything works well , but when i wait on epoll , and
+try to  attach a target device , i get immediately a "Bad address" error
+value from the epoll.
+
+What is the reason ?
+
+======================================================================
+
+It is bad because, apparently, the author was doing something wrong
+with epoll, but instead of checking the source code to find out when
+the "Bad address" error can be returned and to understand the possible
+reasons for it, he expected me to do that for him. He didn't even bother
+to look in the kernel log, where, very probably, the reason for the
+error was logged.
+
+
+Here are three examples of good questions:
+
+======================================================================
+
+I'm looking for a help in understanding of SCST internal architecture
+and operation. The problem I'm experiencing now is that SCST seems to
+process deferred commands incorrectly in some cases. More specifically,
+I'm confused with the 'while' loop in scst_send_to_midlev function.
+
+As far as I understand, the basic execution path consists of a call to
+scst_do_send_to_midlev followed by taking of a decision on this command
+(continue with this command, reschedule it, or move to the next one),
+the decision is stored in 'int res', which is then returned from the
+function.
+
+However, if there are deferred commands on the device, the function does
+not return but makes another call to scst_do_send_to_midlev, analyzes
+the return code again and stores the decision in 'int res' thereby
+erasing the decision for the previous command. If scst_send_to_midlev
+exits now, it will return the _new_ decision (for the deferred command)
+whereas the scst_process_active_cmd will think that it is the decision
+for the command that was originally passed to scst_send_to_midlev.
+
+For example, this will cause problems in the following situation:
+1. scst_send_to_midlev is called with cmd == 0x80000100
+2. scst_do_send_to_midlev is called with cmd == 0x80000100
+3. scst_do_send_to_midlev returns with SCST_EXEC_COMPLETED
+   (in certain scenarios the command is already destroyed at this point)
+4. scst_check_deferred_commands finds the deferred cmd == 0x80000200
+5. scst_do_send_to_midlev is called with cmd == 0x80000200
+6. scst_do_send_to_midlev returns with SCST_EXEC_NEED_THREAD
+7. scst_send_to_midlev returns with SCST_CMD_STATE_RES_NEED_THREAD
+8. Now, the scst_process_active_cmd will try to reschedule command 0x80000100
+   which is already destroyed at this point !
+   
+Can anyone on the list confirm my guess? Or, this situation should never
+happen because of some other condition which I may have missed? Right
+now I can't think of any of simple methods to work around the issue,
+i.e. any of my ideas require rewriting significant part of the code.
+
+======================================================================
+
+Hello,
+
+I have two machines (SCST targets) with the following parameters:
+- two dual core Xeon CPUs
+- QLA2342 FC HBA
+- Areca SATA RAID HBA
+- Linux 2.6.21.3, running in 64 bit mode with 16G RAM
+- SCST trunk version
+
+On the client side there is a Solaris 10 U3 machine, with the same (chip 
+wise) Qlogic controller.
+
+There is an FC switch between the three machines, and each of the 
+targets are zoned to the client's port in a one-by-one manner, so HBA 
+port 1 sees only target 1 and port 2 sees only target 2.
+
+The targets are configured with two large sparse files on XFS (8 TB 
+each, with dd if=/dev/zero of=file bs=1M count=0 seek=8388608).
+
+In Solaris I do various tests with SVM (Sun's built in volume manager) 
+and multiterabyte UFS. Occasionally, there are some strange write
+errors, where the volume  manager drops its volumes and without a VM, a
+simple UFS fs write can  fail too.
+
+I see various errors logged by the kernel (Solaris'), these are some 
+examples, both with and without SVM:
+Jun 21 10:42:14 solaris fctl: [ID 517869 kern.warning] WARNING: 
+fp(1)::GPN_ID for D_ID=621200 failed
+Jun 21 10:42:14 solaris fctl: [ID 517869 kern.warning] WARNING: 
+fp(1)::N_x Port with D_ID=621200, PWWN=210000e08b944419 disappeared from 
+fabric
+Jun 21 10:42:53 solaris scsi: [ID 107833 kern.warning] WARNING: 
+/pci@1,0/pci1022,7450@a/pcie11,105@1,1/fp@0,0/disk@w210000e08b944419,0 
+(sd1):
+Jun 21 10:42:53 solaris         SCSI transport failed: reason 
+'tran_err': retrying command
+Jun 21 10:43:06 solaris scsi: [ID 107833 kern.warning] WARNING: 
+/pci@1,0/pci1022,7450@a/pcie11,105@1,1/fp@0,0/disk@w210000e08b944419,0 
+(sd1):
+Jun 21 10:43:06 solaris         SCSI transport failed: reason 'timeout': 
+retrying command
+Jun 21 10:43:13 solaris scsi: [ID 107833 kern.notice]   Device is gone
+Jun 21 10:43:13 solaris scsi: [ID 107833 kern.warning] WARNING: 
+/pci@1,0/pci1022,7450@a/pcie11,105@1,1/fp@0,0/disk@w210000e08b944419,0 
+(sd1):
+Jun 21 10:43:13 solaris         transport rejected fatal error
+Jun 21 10:43:13 solaris md_stripe: [ID 641072 kern.warning] WARNING: md: 
+d10: write error on /dev/dsk/c2t210000E08B944419d0s6
+Jun 21 10:43:13 solaris last message repeated 9 times
+Jun 21 10:43:13 solaris scsi: [ID 243001 kern.info] 
+/pci@1,0/pci1022,7450@a/pcie11,105@1,1/fp@0,0 (fcp1):
+Jun 21 10:43:13 solaris         offlining lun=0 (trace=0), target=621200 
+(trace=2800004)
+Jun 21 10:43:13 solaris ufs: [ID 702911 kern.warning] WARNING: Error 
+writing master during ufs log roll
+Jun 21 10:43:13 solaris ufs: [ID 127457 kern.warning] WARNING: ufs log 
+for /mnt changed state to Error
+Jun 21 10:43:13 solaris ufs: [ID 616219 kern.warning] WARNING: Please 
+umount(1M) /mnt and run fsck(1M)
+Jun 21 11:08:55 solaris scsi: [ID 107833 kern.warning] WARNING: 
+/pci@1,0/pci1022,7450@a/pcie11,105@1,1/fp@0,0/disk@w210000e08b944419,0 
+(sd1):
+Jun 21 11:08:55 solaris         offline or reservation conflict
+Jun 21 11:09:41 solaris scsi: [ID 107833 kern.warning] WARNING: 
+/pci@1,0/pci1022,7450@a/pcie11,105@1,1/fp@0,0/disk@w210000e08b944419,0 
+(sd1):
+Jun 21 11:09:41 solaris         offline or reservation conflict
+Jun 21 11:09:41 solaris scsi: [ID 107833 kern.warning] WARNING: 
+/pci@1,0/pci1022,7450@a/pcie11,105@1,1/fp@0,0/disk@w210000e08b944419,0 
+(sd1):
+Jun 21 11:09:41 solaris         offline or reservation conflict
+Jun 21 11:09:41 solaris scsi: [ID 107833 kern.warning] WARNING: 
+/pci@1,0/pci1022,7450@a/pcie11,105@1,1/fp@0,0/disk@w210000e08b944419,0 
+(sd1):
+Jun 21 11:09:41 solaris         i/o to invalid geometry
+Jun 21 11:09:41 solaris scsi: [ID 107833 kern.warning] WARNING: 
+/pci@1,0/pci1022,7450@a/pcie11,105@1,1/fp@0,0/disk@w210000e08b944419,0 
+(sd1):
+Jun 21 11:09:41 solaris         offline or reservation conflict
+Jun 21 11:09:41 solaris scsi: [ID 107833 kern.warning] WARNING: 
+/pci@1,0/pci1022,7450@a/pcie11,105@1,1/fp@0,0/disk@w210000e08b944419,0 
+(sd1):
+Jun 21 11:09:41 solaris         i/o to invalid geometry
+Jun 21 11:09:41 solaris scsi: [ID 107833 kern.warning] WARNING: 
+/pci@1,0/pci1022,7450@a/pcie11,105@1,1/fp@0,0/disk@w210000e08b944419,0 
+(sd1):
+Jun 21 11:09:41 solaris         offline or reservation conflict
+Jun 21 11:09:41 solaris scsi: [ID 107833 kern.warning] WARNING: 
+/pci@1,0/pci1022,7450@a/pcie11,105@1,1/fp@0,0/disk@w210000e08b944419,0 
+(sd1):
+Jun 21 11:09:41 solaris         i/o to invalid geometry
+Jun 21 11:09:43 solaris scsi: [ID 107833 kern.warning] WARNING: 
+/pci@1,0/pci1022,7450@a/pcie11,105@1,1/fp@0,0/disk@w210000e08b944419,0 
+(sd1):
+Jun 21 11:09:43 solaris         offline or reservation conflict
+Jun 21 11:09:43 solaris scsi: [ID 107833 kern.warning] WARNING: 
+/pci@1,0/pci1022,7450@a/pcie11,105@1,1/fp@0,0/disk@w210000e08b944419,0 
+(sd1):
+Jun 21 11:09:43 solaris         SYNCHRONIZE CACHE command failed (5)
+
+I don't see anything in the dmesg on the target side.
+
+After these errors SCST seems to be dead. I can't unload its modules and 
+can't communicate it via /proc.
+A simple cat vdisk just waits and waits.
+
+Could you please help? What should I set/collect/send in this case to 
+help resolving this issue?
+
+======================================================================
+
+Hello,
+
+I am trying to get scst working on an Opteron machine.
+
+After some hours, playing with different kernel versions and different
+missing functions, I've sticked with a 2.6.15 and a
+drivers/scsi/scsi_lib.c hack from 2.6.14, which contains the
+scsi_wait_req. (Linux is a mess, each point release changes something.
+How can developers keep up with this?)
+
+Now everything seems to be OK, I could load the modules and such.
+
+I have a setup of two machines connected to each other in an FC-P2P
+manner. The two machines has two 2G links between them. On the initiator
+side I have FreeBSD, because I know that better and this is what I did
+some target mode tests.
+
+The strange thing is that the loop seems to be only running at 1 Gbps:
+[   61.731265] QLogic Fibre Channel HBA Driver
+[   61.731454] GSI 21 sharing vector 0xD1 and IRQ 21
+[   61.731563] ACPI: PCI Interrupt 0000:06:01.0[A] -> GSI 36 (level, low) -> IRQ 21
+[   61.731821] qla2300 0000:06:01.0: Found an ISP2312, irq 21, iobase 0xffffc200
+00014000
+[   61.732194] qla2300 0000:06:01.0: Configuring PCI space...
+[   61.732441] qla2300 0000:06:01.0: Configure NVRAM parameters...
+[   61.816885] qla2300 0000:06:01.0: Verifying loaded RISC code...
+[   61.852177] qla2300 0000:06:01.0: Extended memory detected (512 KB)...
+[   61.852294] qla2300 0000:06:01.0: Resizing request queue depth (2048 -> 4096)
+...
+[   61.852604] qla2300 0000:06:01.0: LIP reset occured (f8e8).
+[   61.852740] qla2300 0000:06:01.0: Waiting for LIP to complete...
+[   62.865911] qla2300 0000:06:01.0: LIP occured (f7f7).
+[   62.866042] qla2300 0000:06:01.0: LOOP UP detected (1 Gbps).
+[   62.866269] qla2300 0000:06:01.0: Topology - (Loop), Host Loop address 0x0
+[   62.868285] scsi0 : qla2xxx
+[   62.868507] qla2300 0000:06:01.0:
+[   62.868507]  QLogic Fibre Channel HBA Driver: 8.01.03-k
+[   62.868508]   QLogic QLA2312 -
+[   62.868509]   ISP2312: PCI-X (100 MHz) @ 0000:06:01.0 hdma+, host#=0, fw=3.03.18 IPX
+
+
+I did the following:
+modprobe qla2x00tgt:
+
+[  104.988170] qla2x00tgt: no version for "scst_unregister" found: kernel tainted.
+
+echo "open lun0 /data/lun0" >/proc/scsi_tgt/disk_fileio/disk_fileio
+[  169.102877] scst: Device handler disk_fileio for type 0 loaded successfully
+[  169.103002] scst: Device handler cdrom_fileio for type 5 loaded successfully
+[  191.261000] dev_fileio: Attached SCSI target virtual disk lun0 (file="/data/l
+un0", fs=1000001MB, bs=512, nblocks=2048002048, cyln=1000001)
+[  191.261191] scst: Attached SCSI target mid-level to virtual device lun0 (id 1
+)
+
+and
+echo "add lun0 0" > /proc/scsi_tgt/groups/Default/devices
+
+On the other side a camcontrol rescan all (SCSI rescan) gives me the following with a verbose logging kernel:
+Mar 29 18:09:17 blade2 kernel: pass1: <SCST_FIO lun0 093> Fixed Direct Access SCSI-4 device
+Mar 29 18:09:17 blade2 kernel: pass1: Serial Number 383
+Mar 29 18:09:17 blade2 kernel: pass1: 100.000MB/s transfers
+Mar 29 18:09:17 blade2 kernel: da1 at isp0 bus 0 target 0 lun 0
+Mar 29 18:09:17 blade2 kernel: da1: <SCST_FIO lun0 093> Fixed Direct Access SCSI-4 device
+Mar 29 18:09:17 blade2 kernel: da1: Serial Number 383
+Mar 29 18:09:17 blade2 kernel: da1: 100.000MB/s transfers
+Mar 29 18:09:17 blade2 kernel: da1: 1024MB (2097152 512 byte sectors: 64H 32S/T 1024C)
+Mar 29 18:09:17 blade2 kernel: (probe0:isp0:0:0:1): error 6
+Mar 29 18:09:17 blade2 kernel: (probe0:isp0:0:0:1): Unretryable Error
+Mar 29 18:09:17 blade2 kernel: isp0: data overrun for command on 0.0.0
+Mar 29 18:09:17 blade2 kernel: (da1:isp0:0:0:0): Data Overrun
+Mar 29 18:09:17 blade2 kernel: (da1:isp0:0:0:0): Retrying Command
+Mar 29 18:09:18 blade2 kernel: (probe0:isp0:0:0:2): error 6
+Mar 29 18:09:18 blade2 kernel: (probe0:isp0:0:0:2): Unretryable Error
+Mar 29 18:09:18 blade2 kernel: isp0: data overrun for command on 0.0.0
+Mar 29 18:09:18 blade2 kernel: (da1:isp0:0:0:0): Data Overrun
+Mar 29 18:09:18 blade2 kernel: (da1:isp0:0:0:0): Retrying Command
+Mar 29 18:09:18 blade2 kernel: (probe0:isp0:0:0:3): error 6
+Mar 29 18:09:18 blade2 kernel: (probe0:isp0:0:0:3): Unretryable Error
+Mar 29 18:09:18 blade2 kernel: isp0: data overrun for command on 0.0.0
+Mar 29 18:09:18 blade2 kernel: (da1:isp0:0:0:0): Data Overrun
+Mar 29 18:09:18 blade2 kernel: (da1:isp0:0:0:0): Retrying Command
+Mar 29 18:09:18 blade2 kernel: (probe0:isp0:0:0:4): error 6
+Mar 29 18:09:18 blade2 kernel: (probe0:isp0:0:0:4): Unretryable Error
+Mar 29 18:09:18 blade2 kernel: isp0: data overrun for command on 0.0.0
+Mar 29 18:09:18 blade2 kernel: (da1:isp0:0:0:0): Data Overrun
+Mar 29 18:09:18 blade2 kernel: (da1:isp0:0:0:0): Retrying Command
+Mar 29 18:09:18 blade2 kernel: (probe0:isp0:0:0:5): error 6
+Mar 29 18:09:18 blade2 kernel: (probe0:isp0:0:0:5): Unretryable Error
+Mar 29 18:09:18 blade2 kernel: isp0: data overrun for command on 0.0.0
+Mar 29 18:09:18 blade2 kernel: (da1:isp0:0:0:0): Data Overrun
+Mar 29 18:09:18 blade2 kernel: (da1:isp0:0:0:0): error 5
+Mar 29 18:09:18 blade2 kernel: (da1:isp0:0:0:0): Retries Exausted
+Mar 29 18:09:18 blade2 kernel: (probe0:isp0:0:0:6): error 6
+Mar 29 18:09:18 blade2 kernel: (probe0:isp0:0:0:6): Unretryable Error
+Mar 29 18:09:19 blade2 kernel: (probe0:isp0:0:0:7): error 6
+Mar 29 18:09:19 blade2 kernel: (probe0:isp0:0:0:7): Unretryable Error
+
+
+The device is there, but I cannot use it.
+
+BTW, the target mode machine (Linux) runs on a dual Opteron in 64 bit
+mode, with 8GB of RAM. I've lowered it with mem=800M, but the effect is
+the same.
+
+Assuming that mixed 2.6.14-.15 kernel is the fault, could you please
+tell me what version should I use, for which all of the patches will
+work?
+
+======================================================================
+
+So, as a bottom line, if you want me to be friendly, don't ask questions
+whose answers you can find yourself by simply reading the documentation
+and making a minimal thinking effort.
+
+Also, it is very desirable that you attach to your question the full
+kernel log from the target since it was booted.
+
+Vladislav Bolkhovitin <vs...@vl...>, http://scst.sourceforge.net

Modified: trunk/iscsi-scst/README
===================================================================
--- trunk/iscsi-scst/README	2008-02-13 16:28:35 UTC (rev 286)
+++ trunk/iscsi-scst/README	2008-02-13 17:15:47 UTC (rev 287)
@@ -111,8 +111,9 @@
 
 If under high load you experience I/O stalls or see in the kernel log
 abort or reset messages, then try to reduce QueuedCommands parameter in
-iscsi-scstd.conf file for the corresponding target. See also SCST README
-file for more details about that issue.
+iscsi-scstd.conf file for the corresponding target to some lower value,
+like 8 (default is 32). See also SCST README file for more details about
+that issue.
 
 Sometimes, when there are communication problems with initiator(s),
 shutting down iSCSI-SCST can take very long time, up to about 10

Modified: trunk/iscsi-scst/include/iscsi_u.h
===================================================================
--- trunk/iscsi-scst/include/iscsi_u.h	2008-02-13 16:28:35 UTC (rev 286)
+++ trunk/iscsi-scst/include/iscsi_u.h	2008-02-13 17:15:47 UTC (rev 287)
@@ -20,7 +20,7 @@
 #include <sys/uio.h>
 #endif
 
-#define ISCSI_VERSION_STRING	"0.9.6/0.4.15r145"
+#define ISCSI_VERSION_STRING	"0.9.6/0.4.15r147"
 
 /* The maximum length of 223 bytes in the RFC. */
 #define ISCSI_NAME_LEN	256

Modified: trunk/iscsi-scst/kernel/iscsi.c
===================================================================
--- trunk/iscsi-scst/kernel/iscsi.c	2008-02-13 16:28:35 UTC (rev 286)
+++ trunk/iscsi-scst/kernel/iscsi.c	2008-02-13 17:15:47 UTC (rev 287)
@@ -454,12 +454,14 @@
 
 static inline struct iscsi_cmnd *get_rsp_cmnd(struct iscsi_cmnd *req)
 {
-	struct iscsi_cmnd *res;
+	struct iscsi_cmnd *res = NULL;
 
 	/* Currently this lock isn't needed, but just in case.. */
 	spin_lock_bh(&req->rsp_cmd_lock);
-	res = list_entry(req->rsp_cmd_list.prev, struct iscsi_cmnd,
-		rsp_cmd_list_entry);
+	if (!list_empty(&req->rsp_cmd_list)) {
+		res = list_entry(req->rsp_cmd_list.prev, struct iscsi_cmnd,
+			rsp_cmd_list_entry);
+	}
 	spin_unlock_bh(&req->rsp_cmd_lock);
 
 	return res;
@@ -472,6 +474,8 @@
 	struct iscsi_conn *conn = rsp->conn;
 	struct list_head *pos, *next;
 
+	sBUG_ON(list_empty(send));
+
 	/*
 	 * If we don't remove hashed req cmd from the hash list here, before
 	 * submitting it for transmittion, we will have a race, when for
@@ -618,8 +622,8 @@
 	rsp_hdr->cmd_status = status;
 	rsp_hdr->itt = cmnd_hdr(req)->itt;
 
-	if (status == SAM_STAT_CHECK_CONDITION) {
-		TRACE_DBG("%s", "CHECK_CONDITION");
+	if (SCST_SENSE_VALID(sense_buf)) {
+		TRACE_DBG("%s", "SENSE VALID");
 		/* ToDo: __GFP_NOFAIL ?? */
 		sg = rsp->sg = scst_alloc(PAGE_SIZE, GFP_KERNEL|__GFP_NOFAIL,
 					&rsp->sg_cnt);
@@ -920,12 +924,13 @@
 	TRACE_DBG("%p", req);
 
 	rsp = get_rsp_cmnd(req);
+	if (rsp == NULL)
+		goto skip;
+
 	rsp_hdr = (struct iscsi_scsi_rsp_hdr *)&rsp->pdu.bhs;
-	if (unlikely(cmnd_opcode(rsp) != ISCSI_OP_SCSI_RSP)) {
-		PRINT_ERROR("Unexpected response command %u", cmnd_opcode(rsp));
-		return;
-	}
 
+	sBUG_ON(cmnd_opcode(rsp) != ISCSI_OP_SCSI_RSP);
+
 	size = cmnd_write_size(req);
 	if (size) {
 		rsp_hdr->flags |= ISCSI_FLG_RESIDUAL_UNDERFLOW;
@@ -941,6 +946,8 @@
 			rsp_hdr->residual_count = cpu_to_be32(size);
 		}
 	}
+
+skip:
 	req->pdu.bhs.opcode =
 		(req->pdu.bhs.opcode & ~ISCSI_OPCODE_MASK) | ISCSI_OP_SCSI_REJECT;
 
@@ -1285,7 +1292,6 @@
 	if (unlikely(req->scst_state != ISCSI_CMD_STATE_AFTER_PREPROC)) {
 		TRACE_DBG("req %p is in %x state", req, req->scst_state);
 		if (req->scst_state == ISCSI_CMD_STATE_PROCESSED) {
-			/* Response is already prepared */
 			cmnd_prepare_skip_pdu_set_resid(req);
 			goto out;
 		}
@@ -1437,7 +1443,8 @@
 
 	iscsi_extracheck_is_rd_thread(cmnd->conn);
 
-	if (!(cmnd->conn->ddigest_type & DIGEST_NONE)) {
+	if (!(cmnd->conn->ddigest_type & DIGEST_NONE) &&
+	    !cmnd->ddigest_checked) {
 		cmd_add_on_rx_ddigest_list(req, cmnd);
 		cmnd_get(cmnd);
 	}
@@ -1900,12 +1907,16 @@
 		logout_exec(cmnd);
 		break;
 	case ISCSI_OP_SCSI_REJECT:
-		TRACE_MGMT_DBG("REJECT cmnd %p (scst_cmd %p)", cmnd,
-			cmnd->scst_cmd);
-		iscsi_cmnd_init_write(get_rsp_cmnd(cmnd),
-			ISCSI_INIT_WRITE_REMOVE_HASH | ISCSI_INIT_WRITE_WAKE);
+	{
+		struct iscsi_cmnd *rsp = get_rsp_cmnd(cmnd);
+		TRACE_MGMT_DBG("REJECT cmnd %p (scst_cmd %p), rsp %p", cmnd,
+			cmnd->scst_cmd, rsp);
+		if (rsp != NULL)
+			iscsi_cmnd_init_write(rsp, ISCSI_INIT_WRITE_REMOVE_HASH |
+							 ISCSI_INIT_WRITE_WAKE);
 		req_cmnd_release(cmnd);
 		break;
+	}
 	default:
 		PRINT_ERROR("unexpected cmnd op %x", cmnd_opcode(cmnd));
 		req_cmnd_release(cmnd);
@@ -2281,10 +2292,14 @@
 		data_out_end(cmnd);
 		break;
 	case ISCSI_OP_PDU_REJECT:
-		iscsi_cmnd_init_write(get_rsp_cmnd(cmnd),
-			ISCSI_INIT_WRITE_REMOVE_HASH | ISCSI_INIT_WRITE_WAKE);
+	{
+		struct iscsi_cmnd *rsp = get_rsp_cmnd(cmnd);
+		if (rsp != NULL)
+			iscsi_cmnd_init_write(rsp, ISCSI_INIT_WRITE_REMOVE_HASH |
+							ISCSI_INIT_WRITE_WAKE);
 		req_cmnd_release(cmnd);
 		break;
+	}
 	case ISCSI_OP_DATA_REJECT:
 		req_cmnd_release(cmnd);
 		break;

Modified: trunk/iscsi-scst/kernel/iscsi.h
===================================================================
--- trunk/iscsi-scst/kernel/iscsi.h	2008-02-13 16:28:35 UTC (rev 286)
+++ trunk/iscsi-scst/kernel/iscsi.h	2008-02-13 17:15:47 UTC (rev 287)
@@ -237,6 +237,7 @@
 	unsigned int data_waiting:1;
 	unsigned int force_cleanup_done:1;
 	unsigned int dec_active_cmnds:1;
+	unsigned int ddigest_checked:1;
 #ifdef EXTRACHECKS
 	unsigned int on_rx_digest_list:1;
 	unsigned int release_called:1;

Modified: trunk/iscsi-scst/kernel/nthread.c
===================================================================
--- trunk/iscsi-scst/kernel/nthread.c	2008-02-13 16:28:35 UTC (rev 286)
+++ trunk/iscsi-scst/kernel/nthread.c	2008-02-13 17:15:47 UTC (rev 287)
@@ -622,7 +622,17 @@
 			break;
 	case RX_CHECK_DDIGEST:
 		conn->read_state = RX_END;
-		if (cmnd_opcode(cmnd) == ISCSI_OP_SCSI_CMD) {
+		if (cmnd->pdu.datasize <= 16*1024) {
+			/* It's cache hot, so let's compute it inline */
+			TRACE_DBG("cmnd %p, opcode %x: checking RX "
+				"ddigest inline", cmnd, cmnd_opcode(cmnd));
+			cmnd->ddigest_checked = 1;
+			rc = digest_rx_data(cmnd);
+			if (unlikely(rc != 0)) {
+				mark_conn_closed(conn);
+				goto out;
+			}
+		} else if (cmnd_opcode(cmnd) == ISCSI_OP_SCSI_CMD) {
 			cmd_add_on_rx_ddigest_list(cmnd, cmnd);
 			cmnd_get(cmnd);
 		} else if (cmnd_opcode(cmnd) != ISCSI_OP_SCSI_DATA_OUT) {
@@ -631,12 +641,12 @@
 			 * specify how to deal with digest errors in this case.
 			 * Is closing connection correct?
 			 */
-			TRACE_DBG("cmnd %p, opcode %x: checking RX "
-				"ddigest inline", cmnd, cmnd_opcode(cmnd));
+			TRACE_DBG("cmnd %p, opcode %x: checking NOP RX "
+				"ddigest", cmnd, cmnd_opcode(cmnd));
 			rc = digest_rx_data(cmnd);
 			if (unlikely(rc != 0)) {
-				conn->read_state = RX_CHECK_DDIGEST;
 				mark_conn_closed(conn);
+				goto out;
 			}
 		}
 		break;

Modified: trunk/mpt/mpt_scst.c
===================================================================
--- trunk/mpt/mpt_scst.c	2008-02-13 16:28:35 UTC (rev 286)
+++ trunk/mpt/mpt_scst.c	2008-02-13 17:15:47 UTC (rev 287)
@@ -1611,7 +1611,7 @@
 
 	TRACE_DBG("rq_result=%x, resp_flags=%x, %x, %d", prm.rq_result, 
 			resp_flags, prm.bufflen, prm.sense_buffer_len);
-	if (prm.rq_result != 0)
+	if ((prm.rq_result != 0) && (prm.sense_buffer != NULL))
 		TRACE_BUFFER("Sense", prm.sense_buffer, prm.sense_buffer_len);
 
 	if ((resp_flags & SCST_TSC_FLAG_STATUS) == 0) {

Added: trunk/qla2x00t/qla2x00-target/AskingQuestions
===================================================================
--- trunk/qla2x00t/qla2x00-target/AskingQuestions	                        (rev 0)
+++ trunk/qla2x00t/qla2x00-target/AskingQuestions	2008-02-13 17:15:47 UTC (rev 287)
@@ -0,0 +1,316 @@
+Before asking me or the scst-devel mailing list any questions, make
+sure that you have read *ALL* relevant documentation files (at least 2
+README files: one for SCST and one for the target driver you are using)
+and *understood* *ALL* that is written there. I personally very much like
+working with people who understand what they are doing and hate it when
+somebody tries to use me as a replacement for their brain and to save
+their time at the expense of mine. So, in such cases don't be surprised
+if your question is ignored or answered in the RTFM style.
+
+In particular, I will refuse to answer any questions about low
+performance unless you *explicitly* write in your question that you
+don't use the debug build and that you have ensured (write how) that
+your target and backstorage devices don't share the same PCI bus.
+
+Another too frequently asked area is "What do those aborts and resets,
+which your target logs from time to time, mean and what to do with
+them?", "Do they relate to the I/O stalls I sometimes experience?" and
+"Why was my device put offline after them?".
+
+Sorry if the above sounds too harsh. Unfortunately, my time and energy
+are limited and I can't waste them on repeatedly explaining basic
+concepts and answering the same questions.
+
+Example of a really bad question:
+
+======================================================================
+
+In our user space driver , i use epoll_wait to wait on multiple file
+descriptors for multiple devices. Apparently when i wait on the ioctl in
+blocking mode , everything works well , but when i wait on epoll , and
+try to  attach a target device , i get immediately a "Bad address" error
+value from the epoll.
+
+What is the reason ?
+
+======================================================================
+
+It is bad because, apparently, the author was doing something wrong
+with epoll, but instead of checking the source code to find out when
+the "Bad address" error can be returned and to understand possible
+reasons for it, he expected me to do that for him. He didn't even bother
+to look in the kernel log, where, very probably, the reason for the
+error was logged.
+
+
+Here are three examples of good questions:
+
+======================================================================
+
+I'm looking for a help in understanding of SCST internal architecture
+and operation. The problem I'm experiencing now is that SCST seems to
+process deferred commands incorrectly in some cases. More specifically,
+I'm confused with the 'while' loop in scst_send_to_midlev function.
+
+As far as I understand, the basic execution path consists of a call to
+scst_do_send_midlev followed by taking of a decision on this command
+(continue with this command, reschedule it, or move to the next one),
+the decision is stored in 'int res', which is then returned from the
+function.
+
+However, if there are deferred commands on the device, the function does
+not return but makes another call to scst_do_send_to_midlev, analyzes
+the return code again and stores the decision in 'int res' thereby
+erasing the decision for the previous command. If scst_send_to_midlev
+exits now, it will return the _new_ decision (for the deferred command)
+whereas the scst_process_active_cmd will think that it is the decision
+for the command that was originally passed to scst_send_to_midlev.
+
+For example, this will cause problems in the following situation:
+1. scst_send_to_midlev is called with cmd == 0x80000100
+2. scst_do_send_to_midlev is called with cmd == 0x8000100
+3. scst_do_send_to_midlev returns with SCST_EXEC_COMPLETED
+   (in certain scenarios the command is already destroyed at this point)
+4. scst_check_deferred_commands finds the defferred cmd == 0x80000200
+5. scst_do_send_to_midlev is called with cmd == 0x80000200
+6. scst_do_send_to_midlev returns with SCST_EXEC_NEED_THREAD
+7. scst_send_to_midlev returns with SCST_CMD_STATE_RES_NEED_THREAD
+8. Now, the scst_process_active_cmd will try to reschedule command 0x8000100
+   which is already destroyed at this point !
+   
+Can anyone on the list confirm my guess? Or, this situation should never
+happen because of some other condition which I may have missed? Right
+now I can't think of any of simple methods to work around the issue,
+i.e. any of my ideas require rewriting significant part of the code.
+
+======================================================================
+
+Hello,
+
+I have two machines (SCST targets) with the following parameters:
+- two dual core Xeon CPUs
+- QLA2342 FC HBA
+- Areca SATA RAID HBA
+- Linux 2.6.21.3, running in 64 bit mode with 16G RAM
+- SCST trunk version
+
+On the client side there is a Solaris 10 U3 machine, with the same (chip 
+wise) Qlogic controller.
+
+There is an FC switch between the three machines, and each of the 
+targets are zoned to the client's port in a one-by-one manner, so HBA 
+port 1 sees only target 1 and port 2 sees only target 2.
+
+The targets are configured with two large sparse files on XFS (8 TB 
+each, with dd if=/dev/zero of=file bs=1M count=0 seek=8388608).
+
+In Solaris I do various tests with SVM (Sun's built in volume manager) 
+and multiterabyte UFS. Occasionally, there are some strange write
+errors, where the volume  manager drops its volumes and without a VM, a
+simple UFS fs write can  fail too.
+
+I see various errors logged by the kernel (Solaris'), these are some 
+examples, both with and without SVM:
+Jun 21 10:42:14 solaris fctl: [ID 517869 kern.warning] WARNING: 
+fp(1)::GPN_ID for D_ID=621200 failed
+Jun 21 10:42:14 solaris fctl: [ID 517869 kern.warning] WARNING: 
+fp(1)::N_x Port with D_ID=621200, PWWN=210000e08b944419 disappeared from 
+fabric
+Jun 21 10:42:53 solaris scsi: [ID 107833 kern.warning] WARNING: 
+/pci@1,0/pci1022,7450@a/pcie11,105@1,1/fp@0,0/disk@w210000e08b944419,0 
+(sd1):
+Jun 21 10:42:53 solaris         SCSI transport failed: reason 
+'tran_err': retrying command
+Jun 21 10:43:06 solaris scsi: [ID 107833 kern.warning] WARNING: 
+/pci@1,0/pci1022,7450@a/pcie11,105@1,1/fp@0,0/disk@w210000e08b944419,0 
+(sd1):
+Jun 21 10:43:06 solaris         SCSI transport failed: reason 'timeout': 
+retrying command
+Jun 21 10:43:13 solaris scsi: [ID 107833 kern.notice]   Device is gone
+Jun 21 10:43:13 solaris scsi: [ID 107833 kern.warning] WARNING: 
+/pci@1,0/pci1022,7450@a/pcie11,105@1,1/fp@0,0/disk@w210000e08b944419,0 
+(sd1):
+Jun 21 10:43:13 solaris         transport rejected fatal error
+Jun 21 10:43:13 solaris md_stripe: [ID 641072 kern.warning] WARNING: md: 
+d10: write error on /dev/dsk/c2t210000E08B944419d0s6
+Jun 21 10:43:13 solaris last message repeated 9 times
+Jun 21 10:43:13 solaris scsi: [ID 243001 kern.info] 
+/pci@1,0/pci1022,7450@a/pcie11,105@1,1/fp@0,0 (fcp1):
+Jun 21 10:43:13 solaris         offlining lun=0 (trace=0), target=621200 
+(trace=2800004)
+Jun 21 10:43:13 solaris ufs: [ID 702911 kern.warning] WARNING: Error 
+writing master during ufs log roll
+Jun 21 10:43:13 solaris ufs: [ID 127457 kern.warning] WARNING: ufs log 
+for /mnt changed state to Error
+Jun 21 10:43:13 solaris ufs: [ID 616219 kern.warning] WARNING: Please 
+umount(1M) /mnt and run fsck(1M)
+Jun 21 11:08:55 solaris scsi: [ID 107833 kern.warning] WARNING: 
+/pci@1,0/pci1022,7450@a/pcie11,105@1,1/fp@0,0/disk@w210000e08b944419,0 
+(sd1):
+Jun 21 11:08:55 solaris         offline or reservation conflict
+Jun 21 11:09:41 solaris scsi: [ID 107833 kern.warning] WARNING: 
+/pci@1,0/pci1022,7450@a/pcie11,105@1,1/fp@0,0/disk@w210000e08b944419,0 
+(sd1):
+Jun 21 11:09:41 solaris         offline or reservation conflict
+Jun 21 11:09:41 solaris scsi: [ID 107833 kern.warning] WARNING: 
+/pci@1,0/pci1022,7450@a/pcie11,105@1,1/fp@0,0/disk@w210000e08b944419,0 
+(sd1):
+Jun 21 11:09:41 solaris         offline or reservation conflict
+Jun 21 11:09:41 solaris scsi: [ID 107833 kern.warning] WARNING: 
+/pci@1,0/pci1022,7450@a/pcie11,105@1,1/fp@0,0/disk@w210000e08b944419,0 
+(sd1):
+Jun 21 11:09:41 solaris         i/o to invalid geometry
+Jun 21 11:09:41 solaris scsi: [ID 107833 kern.warning] WARNING: 
+/pci@1,0/pci1022,7450@a/pcie11,105@1,1/fp@0,0/disk@w210000e08b944419,0 
+(sd1):
+Jun 21 11:09:41 solaris         offline or reservation conflict
+Jun 21 11:09:41 solaris scsi: [ID 107833 kern.warning] WARNING: 
+/pci@1,0/pci1022,7450@a/pcie11,105@1,1/fp@0,0/disk@w210000e08b944419,0 
+(sd1):
+Jun 21 11:09:41 solaris         i/o to invalid geometry
+Jun 21 11:09:41 solaris scsi: [ID 107833 kern.warning] WARNING: 
+/pci@1,0/pci1022,7450@a/pcie11,105@1,1/fp@0,0/disk@w210000e08b944419,0 
+(sd1):
+Jun 21 11:09:41 solaris         offline or reservation conflict
+Jun 21 11:09:41 solaris scsi: [ID 107833 kern.warning] WARNING: 
+/pci@1,0/pci1022,7450@a/pcie11,105@1,1/fp@0,0/disk@w210000e08b944419,0 
+(sd1):
+Jun 21 11:09:41 solaris         i/o to invalid geometry
+Jun 21 11:09:43 solaris scsi: [ID 107833 kern.warning] WARNING: 
+/pci@1,0/pci1022,7450@a/pcie11,105@1,1/fp@0,0/disk@w210000e08b944419,0 
+(sd1):
+Jun 21 11:09:43 solaris         offline or reservation conflict
+Jun 21 11:09:43 solaris scsi: [ID 107833 kern.warning] WARNING: 
+/pci@1,0/pci1022,7450@a/pcie11,105@1,1/fp@0,0/disk@w210000e08b944419,0 
+(sd1):
+Jun 21 11:09:43 solaris         SYNCHRONIZE CACHE command failed (5)
+
+I don't see anything in the dmesg on the target side.
+
+After these errors SCST seems to be dead. I can't unload its modules and 
+can't communicate it via /proc.
+A simple cat vdisk just waits and waits.
+
+Could you please help? What should I set/collect/send in this case to 
+help resolving this issue?
+
+======================================================================
+
+Hello,
+
+I am trying to get scst working on an Opteron machine.
+
+After some hours, playing with different kernel versions and different
+missing functions, I've sticked with a 2.6.15 and a
+drivers/scsi/scsi_lib.c hack from 2.6.14, which contains the
+scsi_wait_req. (Linux is a mess, each point release changes something.
+How can developers keep up with this?)
+
+Now everything seems to be OK, I could load the modules and such.
+
+I have a setup of two machines connected to each other in an FC-P2P
+manner. The two machines has two 2G links between them. On the initiator
+side I have FreeBSD, because I know that better and this is what I did
+some target mode tests.
+
+The strange thing is that the loop seems to be only running at 1 Gbps:
+[   61.731265] QLogic Fibre Channel HBA Driver
+[   61.731454] GSI 21 sharing vector 0xD1 and IRQ 21
+[   61.731563] ACPI: PCI Interrupt 0000:06:01.0[A] -> GSI 36 (level, low) -> IRQ 21
+[   61.731821] qla2300 0000:06:01.0: Found an ISP2312, irq 21, iobase 0xffffc200
+00014000
+[   61.732194] qla2300 0000:06:01.0: Configuring PCI space...
+[   61.732441] qla2300 0000:06:01.0: Configure NVRAM parameters...
+[   61.816885] qla2300 0000:06:01.0: Verifying loaded RISC code...
+[   61.852177] qla2300 0000:06:01.0: Extended memory detected (512 KB)...
+[   61.852294] qla2300 0000:06:01.0: Resizing request queue depth (2048 -> 4096)
+...
+[   61.852604] qla2300 0000:06:01.0: LIP reset occured (f8e8).
+[   61.852740] qla2300 0000:06:01.0: Waiting for LIP to complete...
+[   62.865911] qla2300 0000:06:01.0: LIP occured (f7f7).
+[   62.866042] qla2300 0000:06:01.0: LOOP UP detected (1 Gbps).
+[   62.866269] qla2300 0000:06:01.0: Topology - (Loop), Host Loop address 0x0
+[   62.868285] scsi0 : qla2xxx
+[   62.868507] qla2300 0000:06:01.0:
+[   62.868507]  QLogic Fibre Channel HBA Driver: 8.01.03-k
+[   62.868508]   QLogic QLA2312 -
+[   62.868509]   ISP2312: PCI-X (100 MHz) @ 0000:06:01.0 hdma+, host#=0, fw=3.03.18 IPX
+
+
+I did the following:
+modprobe qla2x00tgt:
+
+[  104.988170] qla2x00tgt: no version for "scst_unregister" found: kernel tainted.
+
+echo "open lun0 /data/lun0" >/proc/scsi_tgt/disk_fileio/disk_fileio"
+[  169.102877] scst: Device handler disk_fileio for type 0 loaded successfully
+[  169.103002] scst: Device handler cdrom_fileio for type 5 loaded successfully
+[  191.261000] dev_fileio: Attached SCSI target virtual disk lun0 (file="/data/l
+un0", fs=1000001MB, bs=512, nblocks=2048002048, cyln=1000001)
+[  191.261191] scst: Attached SCSI target mid-level to virtual device lun0 (id 1
+)
+
+and
+echo "add lun0 0" > /proc/scsi_tgt/groups/Default/devices
+
+On the other side a camcontrol rescan all (SCSI rescan) gives me the following with a verbose logging kernel:
+Mar 29 18:09:17 blade2 kernel: pass1: <SCST_FIO lun0 093> Fixed Direct Access SCSI-4 device
+Mar 29 18:09:17 blade2 kernel: pass1: Serial Number 383
+Mar 29 18:09:17 blade2 kernel: pass1: 100.000MB/s transfers
+Mar 29 18:09:17 blade2 kernel: da1 at isp0 bus 0 target 0 lun 0
+Mar 29 18:09:17 blade2 kernel: da1: <SCST_FIO lun0 093> Fixed Direct Access SCSI-4 device
+Mar 29 18:09:17 blade2 kernel: da1: Serial Number 383
+Mar 29 18:09:17 blade2 kernel: da1: 100.000MB/s transfers
+Mar 29 18:09:17 blade2 kernel: da1: 1024MB (2097152 512 byte sectors: 64H 32S/T 1024C)
+Mar 29 18:09:17 blade2 kernel: (probe0:isp0:0:0:1): error 6
+Mar 29 18:09:17 blade2 kernel: (probe0:isp0:0:0:1): Unretryable Error
+Mar 29 18:09:17 blade2 kernel: isp0: data overrun for command on 0.0.0
+Mar 29 18:09:17 blade2 kernel: (da1:isp0:0:0:0): Data Overrun
+Mar 29 18:09:17 blade2 kernel: (da1:isp0:0:0:0): Retrying Command
+Mar 29 18:09:18 blade2 kernel: (probe0:isp0:0:0:2): error 6
+Mar 29 18:09:18 blade2 kernel: (probe0:isp0:0:0:2): Unretryable Error
+Mar 29 18:09:18 blade2 kernel: isp0: data overrun for command on 0.0.0
+Mar 29 18:09:18 blade2 kernel: (da1:isp0:0:0:0): Data Overrun
+Mar 29 18:09:18 blade2 kernel: (da1:isp0:0:0:0): Retrying Command
+Mar 29 18:09:18 blade2 kernel: (probe0:isp0:0:0:3): error 6
+Mar 29 18:09:18 blade2 kernel: (probe0:isp0:0:0:3): Unretryable Error
+Mar 29 18:09:18 blade2 kernel: isp0: data overrun for command on 0.0.0
+Mar 29 18:09:18 blade2 kernel: (da1:isp0:0:0:0): Data Overrun
+Mar 29 18:09:18 blade2 kernel: (da1:isp0:0:0:0): Retrying Command
+Mar 29 18:09:18 blade2 kernel: (probe0:isp0:0:0:4): error 6
+Mar 29 18:09:18 blade2 kernel: (probe0:isp0:0:0:4): Unretryable Error
+Mar 29 18:09:18 blade2 kernel: isp0: data overrun for command on 0.0.0
+Mar 29 18:09:18 blade2 kernel: (da1:isp0:0:0:0): Data Overrun
+Mar 29 18:09:18 blade2 kernel: (da1:isp0:0:0:0): Retrying Command
+Mar 29 18:09:18 blade2 kernel: (probe0:isp0:0:0:5): error 6
+Mar 29 18:09:18 blade2 kernel: (probe0:isp0:0:0:5): Unretryable Error
+Mar 29 18:09:18 blade2 kernel: isp0: data overrun for command on 0.0.0
+Mar 29 18:09:18 blade2 kernel: (da1:isp0:0:0:0): Data Overrun
+Mar 29 18:09:18 blade2 kernel: (da1:isp0:0:0:0): error 5
+Mar 29 18:09:18 blade2 kernel: (da1:isp0:0:0:0): Retries Exausted
+Mar 29 18:09:18 blade2 kernel: (probe0:isp0:0:0:6): error 6
+Mar 29 18:09:18 blade2 kernel: (probe0:isp0:0:0:6): Unretryable Error
+Mar 29 18:09:19 blade2 kernel: (probe0:isp0:0:0:7): error 6
+Mar 29 18:09:19 blade2 kernel: (probe0:isp0:0:0:7): Unretryable Error
+
+
+The device is there, but I cannot use it.
+
+BTW, the target mode machine (Linux) runs on a dual Opteron in 64 bit
+mode, with 8GB of RAM. I've lowered it with mem=800M, but the effect is
+the same.
+
+Assuming that mixed 2.6.14-.15 kernel is the fault, could you please
+tell me what version should I use, for which all of the patches will
+work?
+
+======================================================================
+
+So, as a bottom line, if you want me to be friendly, don't ask questions
+whose answers you can find yourself by simply reading the documentation
+and making a minimal thinking effort.
+
+Also, it is very desirable that you attach to your question the full
+kernel log from the target since it was booted.
+
+Vladislav Bolkhovitin <vs...@vl...>, http://scst.sourceforge.net

Modified: trunk/qla2x00t/qla_init.c
===================================================================
--- trunk/qla2x00t/qla_init.c	2008-02-13 16:28:35 UTC (rev 286)
+++ trunk/qla2x00t/qla_init.c	2008-02-13 17:15:47 UTC (rev 287)
@@ -4135,8 +4135,9 @@
 
 	ENTER(__func__);
 	
-	if ((tgt_data == NULL) || (tgt_data->magic != QLA2X_TARGET_MAGIC))
-	{
+	if ((tgt_data == NULL) || (tgt_data->magic != QLA2X_TARGET_MAGIC)) {
+		printk("***ERROR*** Wrong version of the target driver: %d\n",
+			(tgt_data != NULL) ? tgt_data->magic : 0);
 		res = -EINVAL;
 		goto out;
 	}

Added: trunk/scst/AskingQuestions
===================================================================
--- trunk/scst/AskingQuestions	                        (rev 0)
+++ trunk/scst/AskingQuestions	2008-02-13 17:15:47 UTC (rev 287)
@@ -0,0 +1,316 @@
+Before asking me or the scst-devel mailing list any questions, make
+sure that you have read *ALL* relevant documentation files (at least 2
+README files: one for SCST and one for the target driver you are using)
+and *understood* *ALL* that is written there. I personally very much like
+working with people who understand what they are doing and hate it when
+somebody tries to use me as a replacement for their brain and to save
+their time at the expense of mine. So, in such cases don't be surprised
+if your question is ignored or answered in the RTFM style.
+
+In particular, I will refuse to answer any questions about low
+performance unless you *explicitly* write in your question that you
+don't use the debug build and that you have ensured (write how) that
+your target and backstorage devices don't share the same PCI bus.
+
+Another too frequently asked area is "What do those aborts and resets,
+which your target logs from time to time, mean and what to do with
+them?", "Do they relate to the I/O stalls I sometimes experience?" and
+"Why was my device put offline after them?".
+
+Sorry if the above sounds too harsh. Unfortunately, my time and energy
+are limited and I can't waste them on repeatedly explaining basic
+concepts and answering the same questions.
+
+Example of a really bad question:
+
+======================================================================
+
+In our user space driver , i use epoll_wait to wait on multiple file
+descriptors for multiple devices. Apparently when i wait on the ioctl in
+blocking mode , everything works well , but when i wait on epoll , and
+try to  attach a target device , i get immediately a "Bad address" error
+value from the epoll.
+
+What is the reason ?
+
+======================================================================
+
+It is bad because, apparently, the author was doing something wrong
+with epoll, but instead of checking the source code to find out when
+the "Bad address" error can be returned and to understand possible
+reasons for it, he expected me to do that for him. He didn't even bother
+to look in the kernel log, where, very probably, the reason for the
+error was logged.
+
+
+Here are three examples of good questions:
+
+======================================================================
+
+I'm looking for a help in understanding of SCST internal architecture
+and operation. The problem I'm experiencing now is that SCST seems to
+process deferred commands incorrectly in some cases. More specifically,
+I'm confused with the 'while' loop in scst_send_to_midlev function.
+
+As far as I understand, the basic execution path consists of a call to
+scst_do_send_midlev followed by taking of a decision on this command
+(continue with this command, reschedule it, or move to the next one),
+the decision is stored in 'int res', which is then returned from the
+function.
+
+However, if there are deferred commands on the device, the function does
+not return but makes another call to scst_do_send_to_midlev, analyzes
+the return code again and stores the decision in 'int res' thereby
+erasing the decision for the previous command. If scst_send_to_midlev
+exits now, it will return the _new_ decision (for the deferred command)
+whereas the scst_process_active_cmd will think that it...
 
[truncated message content]