Re: [Scst-devel] SCST backend device activation problems : scst_translate_lun:FLAG SUSPENDED set, skipping

Vlad,

I kept digging into my issue today, and I had the occasion to look at the
destination node's /var/log/messages file. For clarity during possible debug
sessions on my cluster nodes, I have set up an rsyslog config that redirects
the various cluster logs from /var/log/messages into dedicated files:
drbd.log, scst.log, ra.log, etc. The files I sent yesterday were extracted
from scst.log, for instance.
Normally, the only entries I see in /var/log/messages are minor ntpd ones.
But today, when I happened to look into it, I discovered very weird things
that obviously look directly linked to the symptom we're facing here. It
means that what I sent you last night is incomplete, and probably hardly
usable to you as is.
I've extracted the logs covering the same time frame as the 2 log files I
sent yesterday. Please have a look at them and let me know what you think;
I've never seen this before. I swear it's not the output of a random
character generator, nor an extract from The Matrix.
A hardware issue in the end? If not, that would be a very ugly software one!
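
For anyone wanting to line up entries that rsyslog has split across several files, here is a minimal Python sketch of that kind of extraction. It is not the procedure used above: the function names are hypothetical, and it assumes the traditional "Mon DD HH:MM:SS" syslog timestamp (which carries no year, so the year is passed in) and an English/C locale for the month names.

```python
from datetime import datetime

def parse_syslog_time(line, year):
    """Parse the leading 'Mon DD HH:MM:SS' stamp of a traditional syslog line."""
    # The first 15 characters hold the timestamp, e.g. 'Jul 17 23:08:01'.
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")

def merge_window(sources, start, end, year):
    """Return lines from all sources within [start, end], sorted by time.

    sources maps a label (for example a file name) to an iterable of lines.
    """
    selected = []
    for label, lines in sources.items():
        for line in lines:
            line = line.rstrip("\n")
            try:
                stamp = parse_syslog_time(line, year)
            except ValueError:
                continue  # skip lines without a parseable timestamp
            if start <= stamp <= end:
                selected.append((stamp, label, line))
    # Sort on the timestamp only; the sort is stable, so lines with equal
    # stamps keep the order in which they were read.
    selected.sort(key=lambda item: item[0])
    return [f"{label}: {line}" for _, label, line in selected]
```

Open file objects can be passed directly as the iterables, e.g. `merge_window({"messages": open("/var/log/messages"), "scst.log": open("scst.log")}, start, end, 2012)`, where the second path is a placeholder for wherever rsyslog writes that file.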

The best thing, to my mind, would be to temporarily redirect the SCST logs
back to /var/log/messages and retry a failing failover, so that I can
provide you with a really complete sequence of events. Unfortunately, the
cluster is in production at the moment, and my only test volume is currently
used to host files that will allow me to empty the other volumes and
transition them from block device files to regular xfs files. I may be able
to do that tomorrow.

Thanks for your patience and your help anyhow!

Best regards,

Pascal.

-----Original Message-----
From: Vladislav Bolkhovitin [mailto:vs...@vl...]
Sent: Tuesday, July 17, 2012 23:08
To: Pascal BERTON
Cc: scs...@li...
Subject: Re: [Scst-devel] SCST backend device activation problems :
scst_translate_lun:FLAG SUSPENDED set, skipping

Hi,

It seems your config has some weird circular dependency, like: the backend
devices don't start working, and so never complete received requests, until
the SCST config is done, while the SCST config can't finish because it is
waiting for the devices to complete.

Logs from the beginning of failover can shed some light on this.

Vlad

Pascal BERTON, on 07/14/2012 08:26 AM wrote:
> Hi all !
>
> I'm currently facing weird problems with SCST, and after days of
> various experiments and observations, trying to isolate the problem
> as precisely as possible, I conclude that I now need a hand. Could
> somebody help me a bit with that?
>
> Basically, we're running a 2-node single-primary DRBD/Pacemaker
> cluster (kernel version 2.6.32-71.7.1., based on the Openfiler 2.99
> distro) hosting 4 DRBD resources, each presented to 4 VMware hosts
> (ESXi 4.1) using two SCST (vdisk_fileio) and iSCSI-SCST targets
> (version 2.0.0.1 at first, now 2.2.0, but the problem persists) per
> resource. Resources are spread over the 2 nodes, 3 active TB per node
> overall. The DRBD replication link is a dual 10GbE link bonded in LACP
> (mode 4). Volumes are hardware RAID5 made up of 9 15krpm 146 or 300GB
> SAS drives (I mean, disk IO performance doesn't seem to be the cause).
>
> Basically, the issue is: the cluster starts resources, the 4 DRBD
> primaries go up, then the 4 pairs of virtual IPs, then the SCST
> services, and things run fine. Until you try to migrate resources back
> and forth. When you do that, it works once, twice, sometimes even 3
> times, but then you can see DRBD promoted correctly, then the IPs come
> up, but the SCST resource remains stuck down, running into the timeout
> after the configured 60s. At that moment, everything fails back to its
> former place, as it should. If you try again, same story. In the end,
> you have a cluster, but the resource is stuck on a node, unable to
> fail over either manually or, more embarrassingly, following a node
> crash (which we inevitably faced recently, thanks Mr Murphy).
>
> After digging through the various logs, what I see is:
>
> - DRBD does its job 100% correctly
>
> - Pacemaker seems to do its job, with the resources it has, in the
> state they are in. (I mean, the errors it mentions look like normal
> errors in the global failing context)
>
> - SCST starts its job, but hangs in the device handling section
> (BTW, my RA agent uses the sysfs interface and is based on Patrick
> Zwahlen's implementation, which I customized a bit, mostly to add
> friendlier tracing, and also to invert the order of activation: iSCSI
> target first, then the backend device, instead of the reverse,
> although I now doubt it has a real impact). Basically, all the iSCSI
> target setup stuff runs fine, but then:
>
> o Either it hangs on backend device creation
>
> o Or it hangs on LUN 0 assignment
>
> From that point on, it hangs until the configured start timeout, and
> then everybody goes back home anyway. The backend device that refused
> to get created correctly has in fact been created and remains, possibly
> along with the target directory and sometimes even the LUN0 directory
> in it too. From that point on, it turns into a good mess; in fact,
> problems start here. After that, any migration attempt is doomed to
> failure! If I reboot the node, it will accept a couple of migrations
> again, and then fail again in the same manner.
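
To make the activation order described in the quoted message concrete, here is a minimal Python sketch of the sysfs writes such a resource agent might issue. It is not Pascal's or Patrick Zwahlen's actual RA. The function names and arguments are hypothetical, the final "enable the target" step is an assumption the message does not mention, and the paths and commands follow the SCST sysfs layout as documented in the SCST README, so they should be checked against the installed version.

```python
SCST_ROOT = "/sys/kernel/scst_tgt"  # assumed location of the SCST sysfs tree

def activation_steps(iqn, device, filename, target_first=True):
    """Return the ordered (sysfs_path, command) writes for one resource."""
    iscsi = f"{SCST_ROOT}/targets/iscsi"
    create_target = (f"{iscsi}/mgmt", f"add_target {iqn}")
    create_device = (
        f"{SCST_ROOT}/handlers/vdisk_fileio/mgmt",
        f"add_device {device} filename={filename}",
    )
    assign_lun0 = (f"{iscsi}/{iqn}/luns/mgmt", f"add {device} 0")
    # Assumption: the target is enabled last; the message does not say so.
    enable_target = (f"{iscsi}/{iqn}/enabled", "1")

    if target_first:
        first, second = create_target, create_device
    else:
        first, second = create_device, create_target
    # LUN assignment needs both the target and the device to exist already.
    return [first, second, assign_lun0, enable_target]

def apply_steps(steps, write=None):
    """Perform each write in order; `write` can be replaced for a dry run."""
    if write is None:
        def write(path, command):
            with open(path, "w") as attr:
                attr.write(command)
    for path, command in steps:
        write(path, command)
```

Passing `write=lambda path, command: print(path, command)` to `apply_steps` gives a dry run that shows the sequence without touching sysfs. The two hang points reported in the message correspond to the `add_device` and `add ... 0` writes.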
