Posted to the xCAT mailing list 2/25/2014:
Looks like you may have run across a bug. I guess not many people use site.nodestatus=0, because it looks like the problem has been in xCAT for a while.

The updateflag.awk script that is NOT getting run is in your generated /install/autoinst/<node> file. Somewhere, xCAT creates an entry something like this towards the end of the %post script section:

    if [ -z "$NODESTATUS" ] || [ "$NODESTATUS" != "0" -a "$NODESTATUS" != "N" -a "$NODESTATUS" != "n" ]; then
        updateflag.awk $MASTER 3002
    fi

In this case, the conditional should NOT be there; updateflag.awk should always run.

When you run nodeset to create your autoinst file, xCAT uses:

    /opt/xcat/share/xcat/install/scripts/post.xcat

to build this section of the file. So, to experiment, try commenting out the conditional in your autoinst file to force the updateflag.awk call and see if that fixes the install loop. If that works, rather than remembering to edit your autoinst file after each nodeset run, you can change that post.xcat template file so nodeset will generate it correctly for you.

Linda

From: Russell Jones <russell-list@jonesmail.me>
To: xcat-user@lists.sourceforge.net
Date: 02/25/2014 12:53 PM
Subject: Re: [xcat-user] Node reinstall loop

I've figured out what it is. site.nodestatus got set to 0 in our configuration. This seems to have a side effect of making diskfull nodes enter an install loop. Is this expected behavior?

On 2/25/2014 11:08 AM, Russell Jones wrote:
> So I put some breadcrumbs in the autoinst file and it seems like there's
> a section at the bottom where if NODESTATUS != 0, it will run the
> updateflag.awk to flip the node over to boot. I exported NODESTATUS
> right before that if statement is ran and it is 0. There's exports for
> NODESTATUS above that file that sets it to 0, and I am not seeing
> anywhere else where NODESTATUS could have the potential to be set to
> anything but 0.
>
> Thoughts?
>
> On 2/25/2014 10:27 AM, Russell Jones wrote:
>> Sorry, just for clarification that's /var/log/messages on the node
>> showing those messages, not xcat.log.
>>
>> On 2/25/2014 10:20 AM, Russell Jones wrote:
>>> Hi all,
>>>
>>> I have a strange issue with a CentOS 5 compute node that is in a
>>> reinstall loop. I've checked the usual things, such as DNS forward and
>>> reverse resolution, network configuration, etc, and the node should have
>>> no problem talking to it's servicenode/xcatmaster.
>>>
>>> I've forced the node to boot after an install and am trying to replicate
>>> running './updateflag.awk $MASTER 3002 "installstatus booted"' manually
>>> to see if it will flip itself over to boot per docs and other mailing
>>> list posts I've read. The xcat.log file on the node shows:
>>>
>>> xcat: ready
>>> xcat: done
>>>
>>> .... everytime I do that, however it still doesn't flip itself over to
>>> boot when I check "nodeset $node stat" on the xcatmaster. Neither the
>>> service node nor management node are logging anything when I do that.
>>>
>>> Any ideas on how I can dive further into this and see what's going
>>> wrong? Is there a better test to manually replicate the node telling
>>> it's master that it is done installing?
>>>
>>> Thanks!
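A rough sketch of the experiment suggested in the reply above: check whether a generated autoinst file wraps the updateflag.awk call in a NODESTATUS test, and produce a copy with that wrapper commented out. The function names are made up for illustration, and the sed range assumes the conditional looks roughly like the snippet quoted in the email (an "if" line mentioning NODESTATUS, closed by a bare "fi"); a real autoinst file may differ, so review the output before using it.

```shell
#!/bin/sh
# Hypothetical helper: for each uncommented line mentioning updateflag.awk,
# print "gated" if it sits inside an "if ... NODESTATUS ..." block,
# "ungated" otherwise. Prints "missing" if no such line is found.
check_updateflag_gate() {
    awk '
        /^[ \t]*if .*NODESTATUS/ { in_gate = 1 }
        /updateflag\.awk/ && !/^[ \t]*#/ {
            print (in_gate ? "gated" : "ungated")
            found = 1
        }
        /^[ \t]*fi[ \t]*$/ { in_gate = 0 }
        END { if (!found) print "missing" }
    ' "$1"
}

# Hypothetical helper: write to stdout a copy of the file in which every
# line of the NODESTATUS block is commented out except the updateflag.awk
# call itself. The input file is not modified.
ungate_updateflag() {
    sed '/^[[:space:]]*if .*NODESTATUS/,/^[[:space:]]*fi[[:space:]]*$/{
/updateflag\.awk/!s/^/#/
}' "$1"
}

# Example (paths are placeholders):
#   check_updateflag_gate /install/autoinst/<node>
#   ungate_updateflag /install/autoinst/<node> > /tmp/autoinst.patched
```

Writing the patched copy to a separate file keeps the generated autoinst file intact until the diff has been checked by hand. Any other statement inside the conditional block would also be commented out by this sketch.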
Yang Song,
I think the original change was meant to stop the nodestatus update, not the boot status update. The nodestatus is updated in 'xcatinstallpost'.
Hi Linda,
I am not sure whether this is an issue. site.nodestatus is used for tuning large clusters: its purpose is to keep updateflag.awk from interacting with xcatd, which reduces the load on the network and the management node. I agree that the description of site.nodestatus is not good enough. The doc http://sourceforge.net/apps/mediawiki/xcat/index.php?title=Hints_and_Tips_for_Large_Scale_Clusters reads:
nodestatus
If set to 'n', the nodelist.status column will not be updated during the node deployment, node discovery and power operations. Default is 'y', always update nodelist.status. Setting this to 'n' for large clusters can eliminate one node-to-server contact and one xCAT database write operation for each node during node deployment, but you will then need to determine deployment status through some other means.
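Note that the doc text only mentions 'n', while the %post conditional quoted earlier in this thread also treats 0 and N as "off". A small sketch of that same test, useful for checking what a given site.nodestatus value will do; the function name is hypothetical, and the gettab call in the comment assumes it is run on the management node:

```shell
#!/bin/sh
# Mirror the %post conditional quoted in this thread: an empty value, or
# anything other than 0, N or n, means status updates are enabled.
nodestatus_enabled() {
    case "$1" in
        0|N|n) return 1 ;;   # updates disabled
        *)     return 0 ;;   # empty or any other value: updates enabled
    esac
}

# Possible usage on the management node (assumes xCAT's gettab is on PATH):
#   value=$(gettab key=nodestatus site.value)
#   if nodestatus_enabled "$value"; then
#       echo "nodestatus updates enabled"
#   else
#       echo "nodestatus updates disabled"
#   fi
```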
If the user set site.nodestatus intentionally for large cluster tuning, we probably should not have updateflag.awk interact with xcatd at all. For the infinite installation loop, we could document a procedure:
After the node installation starts:
1) for non-UEFI mode, run rsetboot <noderange> hd
2) run nodeset <noderange> offline
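The two steps above could be wrapped roughly as follows. This is only a sketch: the function name, the "uefi" argument and the DRY_RUN variable are illustrative and not part of xCAT, and it must be run only after the installation has started, as noted above.

```shell
#!/bin/sh
# Hypothetical wrapper around the manual procedure.
#   $1 = noderange
#   $2 = "uefi" to skip the rsetboot step (step 1 is for non-UEFI mode only)
# Set DRY_RUN=1 to print the commands instead of running them.
break_install_loop() {
    noderange="$1"
    mode="${2:-bios}"
    run=${DRY_RUN:+echo}

    if [ "$mode" != "uefi" ]; then
        # Step 1: make the node boot from hard disk next time.
        $run rsetboot "$noderange" hd
    fi
    # Step 2: stop xCAT from serving the install boot configuration.
    $run nodeset "$noderange" offline
}

# Example:
#   DRY_RUN=1
#   break_install_loop compute01-compute16
```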
We could add a more descriptive explanation of site.nodestatus to the tabdump -d site output, but my opinion is that we should not change the code logic.
fixed in 2.8 and 2.9:
commit e071f801b27b98c39d51dac7bb3ca5caf4329175
Merge: bb4ff64 b9d2723
Author: immarvin yangsbj@cn.ibm.com
Date: Tue Apr 29 00:19:43 2014 -0700
commit bb4ff64e3348f7fccdf77b9d53a492f14d7abe86
Author: immarvin yangsbj@cn.ibm.com
Date: Tue Apr 29 00:18:59 2014 -0700
commit 71ed00d1a4bda588ec6795adae251783e81fd9e0
Author: immarvin yangsbj@cn.ibm.com
Date: Tue Apr 29 00:18:59 2014 -0700