#3997 node reinstall loop when site.nodestatus=0

2.8.4
closed
yangsong
None
linux provisioning
5
2015-02-16
2014-02-25
No

Posted to the xCAT mailing list 2/25/2014:

Looks like you may have run across a bug. I guess not many people use site.nodestatus=0, because it looks like the problem has been in xCAT for a while.

The updateflag.awk call that is NOT getting run is in your generated /install/autoinst/<node> file. Somewhere, xCAT creates an entry something like this towards the end of the %post script section:
if [ -z "$NODESTATUS" ] || [ "$NODESTATUS" != "0" -a "$NODESTATUS" != "N" -a "$NODESTATUS" != "n" ]; then
      updateflag.awk $MASTER 3002
fi

In this case, the conditional should NOT be there; updateflag.awk should always run.
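
To see which NODESTATUS values skip the call, the conditional quoted above can be exercised on its own. This is only a minimal sketch: it assumes the test is closed with "]; then", and updateflag_would_run is a hypothetical name that is not part of xCAT. It prints "run" or "skip" instead of calling updateflag.awk.

```shell
#!/bin/sh
# Sketch only: mirrors the conditional from the generated autoinst file.
# updateflag_would_run is a hypothetical helper, not an xCAT function.
updateflag_would_run() {
    NODESTATUS=$1
    if [ -z "$NODESTATUS" ] || [ "$NODESTATUS" != "0" -a "$NODESTATUS" != "N" -a "$NODESTATUS" != "n" ]; then
        echo "run"     # updateflag.awk would be called here
    else
        echo "skip"    # updateflag.awk is never called
    fi
}

# Try an empty value plus a few settings, including the one from this report (0).
for value in "" y 1 0 n N; do
    printf 'NODESTATUS=%s -> %s\n' "${value:-<unset>}" "$(updateflag_would_run "$value")"
done
```

With NODESTATUS set to 0, n, or N the branch is skipped, which matches the behavior described in this report: the node never tells its master that the install finished.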

When you run nodeset to create your autoinst file, xCAT uses:
   /opt/xcat/share/xcat/install/scripts/post.xcat
to build this section of the file. 

So, to experiment, try commenting out the conditional in your autoinst file to force the updateflag.awk call and see if that fixes the install loop.  If that works, rather than remembering to edit your autoinst file after each nodeset run, you can change that post.xcat template file so nodeset will generate it correctly for you.
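
Before editing either file, it may help to locate the updateflag.awk call and the lines around it. The following is a rough sketch, assuming a grep that supports the -B and -A context options; show_updateflag_lines is a hypothetical helper name, and the paths in the comments are the ones mentioned in this message.

```shell
#!/bin/sh
# Sketch: print each line that mentions updateflag.awk, with its line number
# and one line of context on either side, so the enclosing "if ... fi" is visible.
# show_updateflag_lines is a hypothetical helper, not an xCAT command.
show_updateflag_lines() {
    grep -n -B1 -A1 'updateflag.awk' "$1"
}

# Example invocations (replace <node> with a real node name):
# show_updateflag_lines "/install/autoinst/<node>"
# show_updateflag_lines /opt/xcat/share/xcat/install/scripts/post.xcat
```

Matching lines are shown as "N:text" and context lines as "N-text", which gives the line numbers to comment out.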

Linda

From: Russell Jones <russell-list@jonesmail.me>
To: xcat-user@lists.sourceforge.net
Date: 02/25/2014 12:53 PM
Subject: Re: [xcat-user] Node reinstall loop

I've figured out what it is. site.nodestatus got set to 0 in our 
configuration. This seems to have a side effect of making diskful nodes 
enter an install loop.

Is this expected behavior?

On 2/25/2014 11:08 AM, Russell Jones wrote:
> So I put some breadcrumbs in the autoinst file and it seems like there's
> a section at the bottom where if NODESTATUS != 0, it will run the
> updateflag.awk to flip the node over to boot. I exported NODESTATUS
> right before that if statement is run and it is 0. There are exports of
> NODESTATUS above that in the file that set it to 0, and I am not seeing
> anywhere else where NODESTATUS could have the potential to be set to
> anything but 0.
>
> Thoughts?
>
>
> On 2/25/2014 10:27 AM, Russell Jones wrote:
>> Sorry, just for clarification that's /var/log/messages on the node
>> showing those messages, not xcat.log.
>>
>>
>> On 2/25/2014 10:20 AM, Russell Jones wrote:
>>> Hi all,
>>>
>>> I have a strange issue with a CentOS 5 compute node that is in a
>>> reinstall loop. I've checked the usual things, such as DNS forward and
>>> reverse resolution, network configuration, etc, and the node should have
>>> no problem talking to its servicenode/xcatmaster.
>>>
>>> I've forced the node to boot after an install and am trying to replicate
>>> running './updateflag.awk $MASTER 3002 "installstatus booted"' manually
>>> to see if it will flip itself over to boot per docs and other mailing
>>> list posts I've read. The xcat.log file on the node shows:
>>>
>>> xcat: ready
>>> xcat: done
>>>
>>> .... every time I do that, however it still doesn't flip itself over to
>>> boot when I check "nodeset $node stat" on the xcatmaster. Neither the
>>> service node nor management node are logging anything when I do that.
>>>
>>> Any ideas on how I can dive further into this and see what's going
>>> wrong? Is there a better test to manually replicate the node telling
>>> its master that it is done installing?
>>>
>>> Thanks!
>>>
>>> 
>

Discussion

  • XiaoPeng Wang

    XiaoPeng Wang - 2014-02-26

    Yang Song,
    I think the original change was trying to stop the nodestatus update instead of updating the boot status. The nodestatus is updated in 'xcatinstallpost'.

     
  • XiaoPeng Wang

    XiaoPeng Wang - 2014-02-26
    • assigned_to: yangsong
     
  • Guang Cheng Li

    Guang Cheng Li - 2014-02-26

    Hi Linda,

    I am not sure whether this is a bug. site.nodestatus is used for tuning large clusters: its purpose is to keep updateflag.awk from interacting with xcatd, which reduces the load on the network and the management node. I agree that the description of site.nodestatus is not good enough. The doc http://sourceforge.net/apps/mediawiki/xcat/index.php?title=Hints_and_Tips_for_Large_Scale_Clusters reads:

    nodestatus
    If set to 'n', the nodelist.status column will not be updated during the node deployment, node discovery and power operations. Default is 'y', always update nodelist.status. Setting this to 'n' for large clusters can eliminate one node-to-server contact and one xCAT database write operation for each node during node deployment, but you will then need to determine deployment status through some other means.

    If the user set site.nodestatus intentionally for large-cluster tuning, we probably should not have updateflag.awk interact with xcatd anyway. For the infinite installation loop, we could document a procedure:

    After the node installation starts:
    1) for non-UEFI mode, run rsetboot <noderange> hd
    2) run nodeset <noderange> offline
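
The two steps above could be wrapped as follows. This is a sketch only: break_install_loop and the DRY_RUN switch are hypothetical names, and by default the commands are printed rather than executed. rsetboot and nodeset are the xCAT commands named in the steps; step 1 applies to non-UEFI nodes.

```shell
#!/bin/sh
# Sketch of the two-step workaround. break_install_loop and DRY_RUN are
# hypothetical; set DRY_RUN=0 on a management node to actually run the commands.
break_install_loop() {
    noderange=$1
    if [ "${DRY_RUN:-1}" = "1" ]; then
        run=echo        # print the commands only
    else
        run=            # execute the commands
    fi
    $run rsetboot "$noderange" hd       # step 1: boot from hard disk (non-UEFI mode)
    $run nodeset "$noderange" offline   # step 2: stop offering the install to the node
}

# Example with a made-up noderange:
# break_install_loop compute01
```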

    We could add more descriptive information for site.nodestatus in tabdump -d site, but my opinion is that we should not change the code logic.

     
  • yangsong

    yangsong - 2014-04-29

    fixed in 2.8 and 2.9:
    commit e071f801b27b98c39d51dac7bb3ca5caf4329175
    Merge: bb4ff64 b9d2723
    Author: immarvin yangsbj@cn.ibm.com
    Date: Tue Apr 29 00:19:43 2014 -0700

    Merge branch '2.8' of ssh://git.code.sf.net/p/xcat/xcat-core into 2.8
    

    commit bb4ff64e3348f7fccdf77b9d53a492f14d7abe86
    Author: immarvin yangsbj@cn.ibm.com
    Date: Tue Apr 29 00:18:59 2014 -0700

    fix defect #3997 node reinstall loop when site.nodestatus=0 Edit
    

    commit 71ed00d1a4bda588ec6795adae251783e81fd9e0
    Author: immarvin yangsbj@cn.ibm.com
    Date: Tue Apr 29 00:18:59 2014 -0700

    fix defect #3997 node reinstall loop when site.nodestatus=0 Edit
    
     
  • yangsong

    yangsong - 2014-04-29
    • status: open --> pending
     
  • Anonymous

    Anonymous - 2014-04-29
    • status: pending --> closed