From: Pascal B. <pas...@fr...> - 2012-07-18 01:48:27
Vlad,

Here are my logs from both the source and destination nodes following a manual failover attempt launched at 03:04:46. Hope it will tell you something... In the meantime, I've investigated a couple of things:

- I reduced the QueuedCommands parameter on my targets to 16 (not possible from my vSphere 4.1 initiators, which would require v5, so I did it from my RA agent), and it feels better: I see more abort activity if I set it back to 32. But the "FLAG SUSPENDED set, skipping" persists, even if I lower it further, down to 8.

- I noticed that in Patrick's sample implementation, he provides block device files (i.e. /dev/drbd0) to the vdisk_fileio handler, and after checking, I found out that I had done the same thing. Today I have "reshaped" 3 of my 4 targets, so I now have 3 regular files as backend devices (XFS filesystem) but still 1 /dev DRBD backend. Should I consider using device files in a fileio context a mistake? Patrick's samples are public, and somehow I thought that if it had turned out to be a problem, somebody would have complained? Don't know... Anyhow, using regular files can do no harm, right?

All in all, neither of these 2 changes made things better; failover is still manual AND painful! Apart from that, I'm starting to have doubts about my RA agent, especially after your reply; it's starting to smell like a timing issue. Is there an RA agent out there I could trust fairly blindly? I've read somebody saying one was included in the SCST bits, but I must have searched in the wrong place, since there is obviously nothing of the sort in my scst or iscsi-scst 2.2.0 directories...

Ah, I also had the surprise of discovering that the target's eui is an ASCII equivalent of the backend device file name, limited to 8 chars (quite logical for an eui)... This gave me some fun when changing my /dev/drbdX block devices to /drbd0/vol_drbd0, /drbd1/vol_drbd1. I scratched my head a little while... Oh well... :)

Thanks for your help!

Regards,

Pascal.

-----Original Message-----
From: Vladislav Bolkhovitin [mailto:vs...@vl...]
Sent: Tuesday, July 17, 2012 23:08
To: Pascal BERTON
Cc: scs...@li...
Subject: Re: [Scst-devel] SCST backend device activation problems : scst_translate_lun: FLAG SUSPENDED set, skipping

Hi,

It seems your config has some weird circular dependency, like: the backend devices never start working (never completing received requests) until the SCST config is done, while the SCST config can't finish because it is waiting for the devices to complete. Logs from the beginning of the failover can shed some light on this.

Vlad

Pascal BERTON, on 07/14/2012 08:26 AM wrote:
> Hi all!
>
> I'm currently facing weird problems with SCST, and after days of
> various experiments and observations, trying to isolate the problem as
> precisely as possible, I conclude that I now need a hand. Could
> somebody help me a bit with that?
>
> Basically, we're running a 2-node single-primary DRBD/Pacemaker
> cluster (kernel version 2.6.32-71.7.1, based on the Openfiler 2.99
> distro) hosting 4 DRBD resources, each presented to 4 VMware hosts
> (ESXi 4.1) using two SCST (vdisk_fileio) and iSCSI-SCST targets
> (version 2.0.0.1 at first, now 2.2.0, but the problem persists) per
> resource. Resources are spread over the 2 nodes, 3 active TB per node
> overall. The DRBD replication link is a dual 10GbE link bonded in LACP
> (mode 4). Volumes are hardware RAID5 made up of 9 15krpm 146 or 300GB
> SAS drives (I mean, disk IO performance doesn't seem to be the cause).
>
> Basically, the issue is: the cluster starts resources, the 4 DRBD
> primaries go up, then the 4 pairs of virtual IPs, then the SCST
> services, and things run fine. Until you try to migrate resources back
> and forth. When you do that, it works once, twice, sometimes even 3
> times, but then you can see DRBD promoted correctly, then the IPs wake
> up, but the SCST resource remains stuck down, running into a timeout
> after the configured 60s. At that moment, everything fails back to its
> former place, as it should. If you try again, same story.
> In the end, you obtain a cluster where the resource is stuck on a node,
> unable to fail over either manually or, more embarrassingly, following
> a node crash (which we inevitably faced recently, thanks Mr. Murphy).
>
> After digging through the various logs, what I see is:
>
> - DRBD does its job 100% correctly.
>
> - Pacemaker seems to do its job, with the resources it has, in the
> state they are in. (I mean, the errors it mentions look like normal
> errors in the global failing context.)
>
> - SCST starts its job, but hangs in the device handling section.
> (BTW, my RA agent uses the sysfs interface and is based on Patrick
> Zwahlen's implementation, which I customized a bit, mostly to add more
> friendly tracing and also to invert the order of activation: iSCSI
> target first, then the backend device, instead of the reverse,
> although I now doubt it has a real impact.) Basically, all the iSCSI
> target setup stuff runs fine, but then:
>
> o Either it hangs on backend device creation,
>
> o Or it hangs on LUN 0 assignment.
>
> From that point on, it hangs until the configured start timeout, and
> then everybody goes back home. However, the backend device that refused
> to get created correctly has been created and remains, possibly along
> with the target directory and even sometimes the LUN 0 directory in it.
> From that point on, it turns into a good mess; in fact the problems
> start here. After that, any migration attempt is doomed to failure! If
> I reboot the node, it will accept a couple of migrations again, and
> then fail again in the same manner.
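[Editor's note: for readers following the sysfs steps discussed in this thread, here is a minimal sketch of the start sequence an RA agent like the one described might perform. This is not Patrick Zwahlen's actual agent; the sysfs paths follow the SCST 2.x sysfs layout, and the device name, file path, and IQN are made-up examples.]

```shell
#!/bin/sh
# Hedged sketch of an SCST "start" sequence over sysfs (SCST 2.x layout).
# Names below (vol_drbd0, /drbd0/vol_drbd0, the IQN) are hypothetical.
set -e
SCST=/sys/kernel/scst_tgt
DEV=vol_drbd0
IQN=iqn.2012-07.example:drbd0

# 1. Create the fileio backend device (a regular file, not /dev/drbdX).
echo "add_device $DEV filename=/drbd0/vol_drbd0; nv_cache=1" \
    > "$SCST/handlers/vdisk_fileio/mgmt"

# 2. Assign the device as LUN 0 of the iSCSI target.
echo "add $DEV 0" > "$SCST/targets/iscsi/$IQN/luns/mgmt"

# 3. Optionally cap the per-target queue depth, as experimented with above.
echo 16 > "$SCST/targets/iscsi/$IQN/QueuedCommands"

# 4. Enable the target, then the iSCSI-SCST driver itself.
echo 1 > "$SCST/targets/iscsi/$IQN/enabled"
echo 1 > "$SCST/targets/iscsi/enabled"
```

A matching "stop" action would undo these steps in reverse order (disable the target, remove the LUN, delete the device); if the DRBD resource under the backing file is not yet Primary when step 1 runs, the device creation can block, which is consistent with the circular-dependency hypothesis in Vlad's reply.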