From: Vladislav B. <vs...@vl...> - 2014-12-19 03:59:24
Hi Shahar,

It's a good finding! But I can't say that the suggested fix is fully right, because it assumes that all SCST_RX_STATUS_ERROR cases are sporadic, hence retriable. It's better to set the sense in the corresponding place in qla2x00t and return SCST_RX_STATUS_ERROR_SENSE_SET instead of SCST_RX_STATUS_ERROR_FATAL, see r5940.

Thanks,
Vlad

shahar.salzman wrote on 12/16/2014 07:51 AM:
> Hi,
>
> We were having some fabric problems which were causing FC HW timeouts.
> This was OK, until one of the OSes (RHEL 5.8) did not like the sense code,
> and remounted the file system (ext3) in read-only mode.
> I tracked the problem to the following code:
>
> void scst_rx_data(struct scst_cmd *cmd, int status,
>         enum scst_exec_context pref_context)
> {
> ...
>         case SCST_RX_STATUS_ERROR_FATAL:
>                 set_bit(SCST_CMD_NO_RESP, &cmd->cmd_flags);
>                 /* go through */
>         case SCST_RX_STATUS_ERROR:
>                 scst_set_cmd_error(cmd,
>                         SCST_LOAD_SENSE(scst_sense_hardw_error));
>                 scst_set_cmd_abnormal_done_state(cmd);
>                 pref_context = SCST_CONTEXT_THREAD;
>                 break;
>
> Recreating this problem in our lab, I checked how RHEL 5.8 and CentOS 6.6
> react to different sense codes (on the same path), checking both a
> burst of errors and sporadic errors (one every 10K IOs).
>
> It took several sporadic IO errors to cause a remount in read-only mode on
> RHEL 5.8; CentOS 6.6 was much more stable, and did not seem to notice these
> sporadic errors.
> When testing the error bursts, both OSes remounted the file system in
> read-only mode.
>
> I experimented with a different error code:
> ABORTED_COMMAND, 0x44, 0 (Aborted Command - internal target failure)
> instead of HARDWARE_ERROR, 0x44, 0 (Hardware Error - internal target
> failure)
>
> In this case both OSes behaved the same way: on sporadic errors, no action
> was taken; on a burst of these errors, multipath marked the path as
> failed, and moved to the additional path.
> On my CentOS 6.6 server, paths were immediately failed back when I
> stopped returning the error burst.
> Isn't the above behaviour the desired one for these types of errors?
>
> What do you think of the following patch which converts the hardware
> error to aborted command?
> diff --git a/scst/src/scst_targ.c b/scst/src/scst_targ.c
> index 2757b72..01cf146 100644
> --- a/scst/src/scst_targ.c
> +++ b/scst/src/scst_targ.c
> @@ -1507,7 +1507,7 @@ void scst_rx_data(struct scst_cmd *cmd, int status,
>                  /* go through */
>          case SCST_RX_STATUS_ERROR:
>                  scst_set_cmd_error(cmd,
> -                        SCST_LOAD_SENSE(scst_sense_hardw_error));
> +                        SCST_LOAD_SENSE(scst_sense_internal_failure));
>                  scst_set_cmd_abnormal_done_state(cmd);
>                  pref_context = SCST_CONTEXT_THREAD;
>                  break;
>
> Shahar
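
The difference between the two approaches discussed in this thread can be sketched in a small standalone C model. This is only an illustration, not SCST or qla2x00t code: every `demo_` name is hypothetical, the status values merely mirror the names mentioned above, and the actual change is in r5940. The idea it tries to show is that the core keeps HARDWARE ERROR for generic failures, while a driver that knows a failure is a retriable transport timeout sets ABORTED COMMAND itself and reports that the sense is already set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Sense keys as defined by the SCSI Primary Commands standard. */
#define HARDWARE_ERROR  0x04
#define ABORTED_COMMAND 0x0B

/* Hypothetical stand-ins for the status values named in the thread. */
enum demo_rx_status {
    DEMO_RX_STATUS_SUCCESS,
    DEMO_RX_STATUS_ERROR,
    DEMO_RX_STATUS_ERROR_SENSE_SET,
    DEMO_RX_STATUS_ERROR_FATAL
};

/* Hypothetical, heavily simplified command state. */
struct demo_cmd {
    unsigned char key, asc, ascq;
    bool sense_valid;
    bool no_resp;
};

void demo_set_sense(struct demo_cmd *cmd, unsigned char key,
                    unsigned char asc, unsigned char ascq)
{
    assert(cmd != NULL);
    cmd->key = key;
    cmd->asc = asc;
    cmd->ascq = ascq;
    cmd->sense_valid = true;
}

/* Core side: same shape as the switch quoted in the original report. */
void demo_rx_data(struct demo_cmd *cmd, enum demo_rx_status status)
{
    switch (status) {
    case DEMO_RX_STATUS_SUCCESS:
        break;
    case DEMO_RX_STATUS_ERROR_SENSE_SET:
        /* The driver already chose the sense; leave it untouched. */
        break;
    case DEMO_RX_STATUS_ERROR_FATAL:
        cmd->no_resp = true;
        /* fall through */
    case DEMO_RX_STATUS_ERROR:
        /* Generic failure: HARDWARE ERROR, internal target failure. */
        demo_set_sense(cmd, HARDWARE_ERROR, 0x44, 0x00);
        break;
    }
}

/* Driver side (assumption): a transport timeout is treated as retriable. */
void demo_driver_on_timeout(struct demo_cmd *cmd)
{
    demo_set_sense(cmd, ABORTED_COMMAND, 0x44, 0x00);
    demo_rx_data(cmd, DEMO_RX_STATUS_ERROR_SENSE_SET);
}
```

In this model, calling `demo_driver_on_timeout()` on a zero-initialized command leaves it with ABORTED COMMAND 0x44/0x00 and a response still pending, while `demo_rx_data()` with `DEMO_RX_STATUS_ERROR` keeps reporting HARDWARE ERROR for failures that the driver cannot classify as retriable.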