
#4579 rhels7 anaconda systemd and chroot problems

Milestone: 2.10
Status: closed
Labels: postscripts
Priority: 8
Updated: 2015-07-06
Created: 2015-02-24
Creator: Arif Ali
Private: No

Hi all,

We have an issue where the syncfiles script fails to run correctly inside anaconda on an el7 system.

After the install, updatenode <nodename> -F works as expected, and running the syncfiles script manually via /xcatpost/mypostscript also works after reboot.

This could be an issue that we need to work around.

Also, running service sshd restart (or any other service restart) in preboot doesn't work, as the systemctl command is not supported in a chrooted environment. Maybe we can use systemd-nspawn?

This is a critical issue for me, as we have 3 systems about to go live, and we'd like to understand and work together to get the problem fixed.

Below is the excerpt from /var/log/xcat.log, after adding set -ax to the syncfiles script:

Tue Feb 24 11:36:20 GMT 2015 Running postscript: syncfiles
+ '[' -n 0 ']'
+ '[' 0 -eq 1 ']'
+ '[' -n '' ']'
+ logger -t xCAT -p local4.info 'Performing syncfiles postscript'
++ uname
+ osname=Linux
+ xcatpostdir=/xcatpost
+ logger -t xCAT -p local4.info './syncfiles: the OS name = Linux'
+ quit=no
+ count=5
+ '[' no = no ']'
+ '[' Linux = Linux ']'
++ /xcatpost/startsyncfiles.awk
+ returncode=1
+ '[' 1 -eq 0 ']'
+ '[' 5 -eq 0 ']'
+ let SLI=29009%10
+ let SLI=SLI+10
+ sleep 19
+ let count=count-1
+ '[' no = no ']'
+ '[' Linux = Linux ']'
++ /xcatpost/startsyncfiles.awk
+ returncode=1
+ '[' 1 -eq 0 ']'
+ '[' 4 -eq 0 ']'
+ let SLI=7396%10
+ let SLI=SLI+10
+ sleep 16
+ let count=count-1
+ '[' no = no ']'
+ '[' Linux = Linux ']'
++ /xcatpost/startsyncfiles.awk
+ returncode=1
+ '[' 1 -eq 0 ']'
+ '[' 3 -eq 0 ']'
+ let SLI=23745%10
+ let SLI=SLI+10
+ sleep 15
+ let count=count-1
+ '[' no = no ']'
+ '[' Linux = Linux ']'
++ /xcatpost/startsyncfiles.awk
+ returncode=1
+ '[' 1 -eq 0 ']'
+ '[' 2 -eq 0 ']'
+ let SLI=6936%10
+ let SLI=SLI+10
+ sleep 16
+ let count=count-1
+ '[' no = no ']'
+ '[' Linux = Linux ']'
++ /xcatpost/startsyncfiles.awk
+ returncode=1
+ '[' 1 -eq 0 ']'
+ '[' 1 -eq 0 ']'
+ let SLI=22698%10
+ let SLI=SLI+10
+ sleep 18
+ let count=count-1
+ '[' no = no ']'
+ '[' Linux = Linux ']'
++ /xcatpost/startsyncfiles.awk
+ returncode=1
+ '[' 1 -eq 0 ']'
+ '[' 0 -eq 0 ']'
+ quit=yes
+ let count=count-1
+ '[' yes = no ']'
+ '[' 1 -eq 0 ']'
+ logger -t xCAT -p local4.err './syncfiles: Perform Syncing File action encountered error'
+ exit 0
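
Reading the trace, the retry logic appears to be roughly the following. This is only a sketch reconstructed from the xtrace output, not the actual syncfiles script. `try_sync` is a hypothetical stand-in for /xcatpost/startsyncfiles.awk, and `SLEEP_CMD` and `run_sync_with_retries` are names made up for illustration.

```shell
#!/bin/bash
# Sketch of the retry loop implied by the trace (NOT the real syncfiles script).

# Hypothetical stand-in for /xcatpost/startsyncfiles.awk; always fails here,
# like the calls in the trace.
try_sync() {
    return 1
}

# Assumption: SLEEP_CMD exists only so the delay can be stubbed out when testing.
SLEEP_CMD=${SLEEP_CMD:-sleep}

run_sync_with_retries() {
    local count=5 quit=no returncode=1 delay
    while [ "$quit" = "no" ]; do
        if try_sync; then
            returncode=0
        else
            returncode=$?
        fi
        if [ "$returncode" -eq 0 ] || [ "$count" -eq 0 ]; then
            # Either the sync worked or the retries are used up.
            quit=yes
        else
            # Random back-off of 10 to 19 seconds, as in "let SLI=...%10; let SLI=SLI+10".
            delay=$(( RANDOM % 10 + 10 ))
            $SLEEP_CMD "$delay"
        fi
        count=$(( count - 1 ))
    done
    return "$returncode"
}
```

With count starting at 5 this gives six attempts and five sleeps, which matches the trace. Note that the traced script logs the error and then runs `exit 0`, so the installer never sees the failure; the sketch returns the last return code instead so that a caller could react to it.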

Discussion

  • Arif Ali

    Arif Ali - 2015-02-24
    • Priority: 7 --> 9
     
  • Arif Ali

    Arif Ali - 2015-02-24

    Just updated the priority to 9, as we now have 5 clusters that are having this issue.

    I am also actively looking at it, to see if I can also find the fix.

     
  • Guang Cheng Li

    Guang Cheng Li - 2015-03-02

    Yang Song, could you work with Arif on this issue? Thanks.

     
  • Guang Cheng Li

    Guang Cheng Li - 2015-03-02
    • assigned_to: yangsong
     
  • yangsong

    yangsong - 2015-03-03
    • Priority: 9 --> 8
     
  • yangsong

    yangsong - 2015-03-03

    Hi Arif, I am trying to recreate the problem on my env.

    I changed the priority to "8" to avoid some process troubles since we are doing the release review.

     
  • Arif Ali

    Arif Ali - 2015-03-03

    Yang, No problems,

    thanks for looking

    much appreciated

     
  • yangsong

    yangsong - 2015-03-03

    hi Arif,

    would you please provide the following info:
    1. the "synclists" attribute in your node/osimage definition, and the content of the synclist file

    2. the content of /xcatpost/mypostscript on your CN after provision

    some hints for your debugging:
    As you can see in syncfiles and startsyncfiles.awk, the postscript sends a "syncfiles" request to xcatd on the MN. The plugin that processes the request on the MN is: /opt/xcat/lib/perl/xCAT_plugin/syncfiles.pm

    You can run xcat with "xcatd -f" on the MN during cn provision and print debug info in the plugin

     
  • Arif Ali

    Arif Ali - 2015-03-03

    Hi Yang,

    See below for the xcatd -f output

    xCAT: Allowing syncfiles from compute03
    dcp command failed, Return code=1.
    xCAT: Allowing syncfiles from compute04
    dcp command failed, Return code=1.
    xCAT: Allowing syncfiles from compute06
    dcp command failed, Return code=1.
    xCAT: Allowing syncfiles from compute05
    dcp command failed, Return code=1.
    xCAT: Allowing syncfiles from compute03
    dcp command failed, Return code=1.
    xCAT: Allowing syncfiles from compute04
    dcp command failed, Return code=1.
    xCAT: Allowing syncfiles from compute06
    dcp command failed, Return code=1.
    xCAT: Allowing syncfiles from compute03
    dcp command failed, Return code=1.
    xCAT: Allowing syncfiles from compute05
    dcp command failed, Return code=1.
    

    and I have attached the mypostscript from compute03

     

    Last edit: Arif Ali 2015-03-03
  • Arif Ali

    Arif Ali - 2015-03-03

    Just remembered,

    As part of the remoteshell script, xCAT tries to restart the sshd daemon. But rhel/centos 7 based systems use systemd, which does not like to restart services in chrooted environments, so sshd is not restarted and the MN will not be able to ssh to the machine.

    This makes sense, and therefore syncfiles will not work over rsync or ssh

    Maybe I am barking up the wrong tree !!

     
  • yangsong

    yangsong - 2015-03-08

    hi Arif,

    It seems there is a problem with "syncfiles" on redhat7. When "syncfiles" runs, the file system hierarchy seen by the MN is the anaconda file system. At that moment, the file system that will be used after the 1st reboot is mounted under "/mnt/sysimage", so the files in the synclist are lost after installation. I also found the same problem in rhels6.4 and will try to fix it.

    As a workaround, you can specify the destination path in the synclist according to the anaconda file system hierarchy.

     
  • Arif Ali

    Arif Ali - 2015-03-10

    Hi Yang,

    Will this be done for 2.9.1?

    My workaround is to add the syncfiles in postbootscripts, which works. But there may be some things we'd like to happen before booting, i.e. in anaconda

    thanks
    Arif

     
  • yangsong

    yangsong - 2015-03-11

    hi Arif,

    For Redhat, have you tried to specify the destination path with the base directory "/mnt/sysimage/"?

    For example, syncing /etc/hosts from MN to CN, the entry in synclist can be specified as:
    "/etc/hosts -> /mnt/sysimage/etc/hosts"

    thanks

     
  • Arif Ali

    Arif Ali - 2015-03-15

    I could, but then the syncfiles wouldn't work with "updatenode -F", so that would mean having two separate sets of syncfiles, one for anaconda and one for post boot, or having duplicate entries in syncfiles.
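
One way to avoid maintaining two lists by hand might be to generate the anaconda variant from the normal synclist. The following is a minimal sketch, assuming the synclist contains only simple `source -> destination` lines with absolute destination paths. `to_anaconda_synclist` is a hypothetical helper, not part of xCAT.

```shell
#!/bin/bash
# Hypothetical helper (not part of xCAT): reads a synclist on stdin and
# writes a copy whose destinations are prefixed with /mnt/sysimage.
to_anaconda_synclist() {
    # Lines starting with '#' are left alone. On every other line, the
    # absolute path after the last " -> " gets the /mnt/sysimage prefix.
    # Lines without " -> /" pass through unchanged.
    sed -E '/^[[:space:]]*#/!s|^(.*[[:space:]]->[[:space:]]+)(/.*)$|\1/mnt/sysimage\2|'
}
```

For example, the line `/etc/hosts -> /etc/hosts` becomes `/etc/hosts -> /mnt/sysimage/etc/hosts`. The filter is not idempotent, so it should always be run on the original synclist, never on its own output.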

     
  • yangsong

    yangsong - 2015-03-16

    Hi Arif,

    Sorry for the inconvenience.

    This is just a workaround. I am still looking into the fix for this problem, but since xCAT 2.9.1 will be released in a week, the fix cannot be included in that release.

    The fix will be uploaded when it is finished.

    thanks

     
  • yangsong

    yangsong - 2015-03-17
    • Milestone: 2.9.1 --> 2.10
     
  • XiaoPeng Wang

    XiaoPeng Wang - 2015-04-01

    The root cause was that sshd was not started successfully on the compute node before the first reboot. That means that when the syncfiles postscript was running, sshd was still not ready.

    So my fix is very straightforward:
    At the end of remoteshell, check whether sshd has been started. If not, start it by calling the binary directly, i.e. running '/usr/sbin/sshd'.

    Commit:
    2.9.2: 150a661
    2.10: 9616e68

     
  • XiaoPeng Wang

    XiaoPeng Wang - 2015-04-01
    • status: open --> pending
    • assigned_to: yangsong --> XiaoPeng Wang
     
  • Arif Ali

    Arif Ali - 2015-04-01

    Quick question: how often are the snap builds? I'd like to pick up the new RPM so that this can be deployed on the sites where I am having issues.

     
  • Guang Cheng Li

    Guang Cheng Li - 2015-04-02

    Hi Arif,

    I just did a 2.10 snapshot build 1 minute ago, so you should be able to pick up the latest development build to get this fix. We do not have a plan for 2.9.2 yet, so I did not snapshot 2.9.2.

    Or, another simple way is to just pick up the changes in the commit. The change is quite minor.

    [ligc@ligc-1 xcat-core]$ git show 9616e68
    commit 9616e681a5eb600fbd7c07ea800ee4be35ee4eaa
    Author: WangXiaoPeng <daniceexi@163.com>
    Date:   Wed Apr 1 03:50:28 2015 -0400
    
        defect 4579: check the running of sshd at end of remoteshell, start it if needed
    
    diff --git a/xCAT/postscripts/remoteshell b/xCAT/postscripts/remoteshell
    index dc9c14e..a3895c4 100755
    --- a/xCAT/postscripts/remoteshell
    +++ b/xCAT/postscripts/remoteshell
    @@ -469,4 +469,14 @@ else
         restartservice sshd
     fi
    
    +# check whether the sshd daemon has been started successfully
    +# As we known that for rh7 the sshd cannot be started by systemctl in chroot mode
    +ps aux | grep -v grep | grep sshd
    +
    +if [ $? -ne 0 ]; then
    +    if [ -e "/usr/sbin/sshd" ]; then
    +        /usr/sbin/sshd
    +    fi
    +fi
    +
     kill -9 $CREDPID
    [ligc@ligc-1 xcat-core]$ 
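
The check in the patch could also be written without the `ps | grep` pipeline, which matches any process with "sshd" anywhere in its command line. The following is a sketch only, not the committed code. It assumes `pgrep` (from procps) is available in the install environment; `sshd_is_running`, `ensure_sshd_running` and `SSHD_BIN` are hypothetical names.

```shell
#!/bin/bash
# Sketch of an alternative sshd check (NOT the code from commit 9616e68).

# Assumption: SSHD_BIN is overridable only so the function can be tried
# without a real sshd binary.
SSHD_BIN=${SSHD_BIN:-/usr/sbin/sshd}

# Assumption: pgrep exists in the environment where the postscript runs.
sshd_is_running() {
    # -x matches the exact process name instead of a substring
    pgrep -x sshd >/dev/null 2>&1
}

ensure_sshd_running() {
    if sshd_is_running; then
        return 0
    fi
    if [ -x "$SSHD_BIN" ]; then
        # Start the daemon directly, bypassing systemctl, as the patch does.
        "$SSHD_BIN"
        return $?
    fi
    echo "sshd is not running and $SSHD_BIN is not executable" >&2
    return 1
}
```

Unlike the patch, this reports a failure through its return code when sshd can neither be found running nor started, which a caller could log.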
    
     
  • Arif Ali

    Arif Ali - 2015-04-02

    Thanks Guang,

    I need it to be re-distributable, so that future installs of xCAT 2.9 are good as well, and I don't really want to have to remember to patch every time. Also, when we re-install and do xCAT-HA through pacemaker/corosync, I need to make sure that things are and can be automated.

    I have already gone through the process and created an RPM so that the patch will be available.

    For reference, 2 links below show the commit in my git repo, and then the build I did

    I believe this is a critical bug that will come and bite a lot of people, as it has us, and I think we should consider releasing a patched update to solve the problem for any customers who are planning to move to centos/rhel 7.

    Thanks for all the efforts

    regards,
    Arif

     
  • XiaoPeng Wang

    XiaoPeng Wang - 2015-07-06
    • status: pending --> closed