Thread: [SSI-devel] CFS failover stops before rc.sysrecover on this 1 box why?
From: Roger T. <op...@bl...> - 2005-02-26 09:53:38
Hi,

During my investigation into whether my latest DRBD-SSI code (drbd-0.7.10-ssi_rc*) was causing the problem with CFS failover on SSI-1.2.0-FC2 on the P3 architecture, I noticed that on a node-down event I get the same fsck error message for the root filesystem on the other failover nodes, regardless of architecture. From fsck.ext3 I always get a message along the lines of "failure to open specified device /dev/drbd/0", which is the root hard-mount filesystem, followed by a paragraph about how the superblock may be corrupt and suggesting I try an alternate superblock.

I assume that on the failover node fsck is run on the root filesystem after DRBD completes all CLMS callbacks for its /dev/drbd/* devices and before /etc/rc.d/rc.sysrecover - or so it seems. DRBD registers priority 0 with CLMS. On my AMD64 box, /etc/rc.d/rc.sysrecover runs soon after that fsck superblock paragraph appears, and after the whole shebang, node down completes.

On this particular P3 machine, I sit there for more than 30 seconds and still don't see /etc/rc.d/rc.sysrecover getting called, so I power off the machine. I've tried the old drbd-ssi-1.2.0.i686.tar.gz too, with no positive results. I'm at a loss, because I remember having had no glitches at all (with drbd-ssi-1.2.0.i686.tar.gz) not too long ago when I was doing DRBD failover tests around the time of the SSI-1.2.0 release. I did a DRBD-SSI code review, but nothing much came to mind. I think it's possibly something besides DRBD.

If anybody has any clue, that would be greatly appreciated. I'm particularly interested in how this affects DRBD, perhaps having to do with CLMS service priority assignment (-1, 0, 1), but I doubt it's that. Maybe there is a semaphore issue during failover and it just happens to be the right time of the year.

Thanks.

-Roger
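As an aside for anyone reproducing this: fsck's exit status encodes what happened, and the "failure to open specified device" case is an operational error, distinct from a genuinely bad superblock. A minimal sketch interpreting the documented fsck(8) exit codes (the helper name is illustrative, not part of SSI):

```shell
# classify_fsck_status: interpret an fsck(8) exit status.
# The numeric values are the documented fsck exit codes; the helper
# itself is an illustrative sketch, not part of the SSI scripts.
classify_fsck_status() {
    case "$1" in
        0)  echo "no errors" ;;
        1)  echo "errors corrected" ;;
        2)  echo "errors corrected, reboot needed" ;;
        4)  echo "errors left uncorrected" ;;
        8)  echo "operational error (e.g. could not open the device)" ;;
        16) echo "usage or syntax error" ;;
        *)  echo "other/combined status $1" ;;
    esac
}

# The message described above ("failure to open specified device
# /dev/drbd/0") corresponds to the operational-error case:
classify_fsck_status 8
```

So a failover script that only checks "did fsck print a superblock warning" can't distinguish a repairable filesystem from a device that simply wasn't available yet when fsck ran.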
From: Roger T. <op...@bl...> - 2005-02-28 06:04:27
Attachments:
drbd-0.7.10-ssi_rc9_devel.tar.gz
I've narrowed the problem down to SSI's cfs_svc.c:cfs_root_failover() spawning /sbin/ckroot.ssi. fsck from e2fsprogs-1.35-7.1_ssi_3 gets an error opening the file or directory for /dev/drbd/0 (the root filesystem) and prints the superblock error, which seems fine, but on this particular node fsck doesn't return in /sbin/ckroot.ssi - i.e. it just waits, with no exit code. Upon further testing, I realized that /sbin/ckroot.ssi itself doesn't exit, so cfs_root_failover() waits indefinitely for spawn_failover_user_proc() to return.

I don't see any kernel panic during repetitive testing, but perhaps I would if I waited for the CLMS callback timeout.

In SSI's cfs_svc.c:cfs_root_failover(), I assume we stopped defining the NOTYET_DISABLED directive (line 464) in SSI-1.2.0-FC, whereas it had been defined in SSI-1.1.x-FC, on which En-Chiang and I tested DRBD-SSI failover. Maybe the change in which directives were enabled/disabled is causing a problem on some SSI root failover nodes - like this particular P3. Maybe it has to do with spawn_failover_user_proc(). I haven't checked, but I'll be looking at CVS to confirm this possibility.

Can someone verify this? John/En-chiang? Who is responsible for CFS root filesystem failover?

The node that is having this cfs_root_failover problem is a tiny P3 UP machine with 384MB SDRAM running SSI-1.2.0-FC2 and drbd-0.7.10-ssi_rc9_devel.tar.gz (attached for developer testing only). Patch that against the drbd-0.7.10 sources.

My other machine that I use for testing doesn't have this problem; it is an AMD64 with 1GB RAM. Both have identical rootfs, of course.

Thanks.

-Roger
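Since the hang is cfs_root_failover() waiting forever on spawn_failover_user_proc(), one way to confirm that the fsck inside /sbin/ckroot.ssi is the piece that never returns is to wrap the suspect command with a watchdog. This is a generic diagnostic sketch, not SSI code; run_with_timeout is my own name:

```shell
# run_with_timeout SECONDS CMD [ARGS...]
# Run CMD in the background and SIGTERM it if it outlives SECONDS.
# Generic diagnostic sketch; not part of /sbin/ckroot.ssi.
run_with_timeout() {
    secs="$1"; shift
    "$@" &
    cmd_pid=$!
    # Watchdog subshell: sleep, then kill the command if still alive.
    ( sleep "$secs"; kill -TERM "$cmd_pid" 2>/dev/null ) &
    watchdog_pid=$!
    wait "$cmd_pid"
    status=$?
    kill "$watchdog_pid" 2>/dev/null
    return "$status"
}

# A hung command gets terminated (nonzero status, 128+SIGTERM),
# while a fast command completes normally.
run_with_timeout 2 sleep 30 || echo "hung command terminated (status $?)"
run_with_timeout 5 true && echo "fast command completed (status 0)"
```

Dropped into ckroot.ssi around the fsck invocation, something like this would at least let the script exit with a distinguishable status instead of leaving cfs_root_failover() blocked forever.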
From: Roger T. <op...@bl...> - 2005-02-28 06:14:47
Attachments:
drbd-0.7.10-ssi_rc9_devel.patch.gz
There are some files inside that tar that are useful, but the drbd patch wasn't properly created. Try this patch instead.

-Roger
From: Brian J. W. <Bri...@hp...> - 2005-02-28 21:00:01
Roger Tsang wrote:
> Can someone verify this? John/En-chiang? Who is responsible for CFS root filesystem failover?

Since David left, John is now responsible for the in-kernel CFS failover code, along with his many other responsibilities. He's been on vacation since Friday. I don't know if he gets back today or tomorrow.

Brian
From: Aneesh K. <ane...@gm...> - 2005-02-28 07:01:11
Hi,

On Sat, 26 Feb 2005 04:53:30 -0500 (EST), Roger Tsang <op...@bl...> wrote:
> I assume on the failover node fsck is run on the root filesystem after DRBD completes all CLMS callback for its /dev/drbd/* devices and before /etc/rc.d/rc.sysrecover - or so it seems.

First the recovery script /etc/init.d/rc.sysrecover runs, and then the node-down script /etc/init.d/rc.nodedown. In /etc/init.d/rc.nodedown we declare the node as down - it's the last line in that script.

-aneesh
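The ordering Aneesh describes can be sketched as a tiny driver: recovery first, then the node-down script, whose final action is the down declaration. The two paths are real scripts named in the thread, but the function bodies below are placeholders, not the actual SSI script contents:

```shell
# Sketch of the node-down handling order described above. The function
# bodies are placeholders; on a real SSI node the work is done by
# /etc/init.d/rc.sysrecover and /etc/init.d/rc.nodedown.
rc_sysrecover() {
    echo "rc.sysrecover: running failover/recovery actions"
}

rc_nodedown() {
    echo "rc.nodedown: node-down cleanup"
    # Declaring the node down is the last line of the real script.
    echo "declare node down"
}

handle_node_down() {
    rc_sysrecover    # first: recover services from the dead node
    rc_nodedown      # then: node-down processing, ending with the declaration
}

handle_node_down
```

This makes the symptom on the P3 box concrete: if fsck never returns during recovery, execution never reaches rc.nodedown, so the node is never declared down.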
From: Roger T. <op...@bl...> - 2005-03-02 10:04:50
My latest discovery: root failover on the failover node stops at fsck inside /sbin/ckroot.ssi, and the node waits indefinitely to continue. There is no CLMS timeout and no kernel panic, no matter how long I wait. I waited past the CLMS callback 300(?)-second timeout; I even waited maybe 10 minutes, and the failover node continues to stay up. The machine just sits there - I can type on the console, but no commands will execute. If I was in a shell before failover, I can press enter multiple times after root failover "pauses" at fsck and still see the bash prompts.

I then booted up the initnode that I had powered off earlier to trigger root failover. This node, coming up, waits indefinitely at "Running pre-root cluster initialization". Perhaps it is waiting for the failover node to return to the UP state. At that point I wanted the failover node to join, so I had to manually reboot it. In the end, both nodes are sitting there at "Running pre-root cluster initialization" forever. Interesting CLMS split-brain...

-Roger
From: Roger T. <op...@bl...> - 2005-03-02 10:10:58
> In the end, both nodes are sitting there at "Running pre-root cluster initialization" forever. Interesting CLMS split-brain...

Or rather... deadlock.