xCAT / Bugs / #4579 rhels7 anaconda systemd and chroot problems

Arif Ali - 2015-02-24

Priority: 7 --> 9

Arif Ali - 2015-02-24

I have just raised the priority to 9, as we now have 5 clusters affected by this issue.

I am also actively looking into it, to see whether I can find the fix myself.

Guang Cheng Li - 2015-03-02

Yang Song, could you work with Arif on this issue? Thanks.

Guang Cheng Li - 2015-03-02

assigned_to: yangsong

yangsong - 2015-03-03

Priority: 9 --> 8

yangsong - 2015-03-03

Hi Arif, I am trying to recreate the problem in my environment.

I changed the priority to "8" to avoid some process trouble, since we are doing the release review.

Arif Ali - 2015-03-03

Yang, No problems,

thanks for looking

much appreciated

yangsong - 2015-03-03

hi Arif,

Would you please provide the following info:
1. the "synclists" attribute in your node/osimage definition, and the content of the synclist file
2. the content of /xcatpost/mypostscript on your CN after provisioning

Some hints for your debugging:
As you can see in syncfiles and startsyncfiles.awk, the postscript sends a "syncfiles" request to xcatd on the MN. The plugin that processes the request on the MN is /opt/xcat/lib/perl/xCAT_plugin/syncfiles.pm

You can run xcatd with "xcatd -f" on the MN during CN provisioning and print debug info from the plugin.

Arif Ali - 2015-03-03

Hi Yang,

See below for the xcatd -f output:

xCAT: Allowing syncfiles from compute03
dcp command failed, Return code=1.
xCAT: Allowing syncfiles from compute04
dcp command failed, Return code=1.
xCAT: Allowing syncfiles from compute06
dcp command failed, Return code=1.
xCAT: Allowing syncfiles from compute05
dcp command failed, Return code=1.
xCAT: Allowing syncfiles from compute03
dcp command failed, Return code=1.
xCAT: Allowing syncfiles from compute04
dcp command failed, Return code=1.
xCAT: Allowing syncfiles from compute06
dcp command failed, Return code=1.
xCAT: Allowing syncfiles from compute03
dcp command failed, Return code=1.
xCAT: Allowing syncfiles from compute05
dcp command failed, Return code=1.

I have also attached the mypostscript from compute03.

Last edit: Arif Ali 2015-03-03

mypostscript

Arif Ali - 2015-03-03

Just remembered,

The remoteshell script tries to restart the sshd daemon. But RHEL/CentOS 7 based systems use systemd, which does not like restarting services in chrooted environments, so sshd is not restarted and the MN will not be able to ssh to the machine.

That would explain why syncfiles does not work over rsync or ssh.

Maybe I am barking up the wrong tree!

yangsong - 2015-03-08

hi Arif,

It seems there is a problem with "syncfiles" on Red Hat 7. When "syncfiles" runs, the file system hierarchy seen by the MN is the anaconda file system. At that moment, the file system that will be used after the first reboot is located under "/mnt/sysimage", so the files in the synclist are lost after installation. I also found the same problem in rhels6.4 and will try to fix it.

As a workaround, you can specify the destination path in the synclist according to the anaconda file system hierarchy.

Arif Ali - 2015-03-10

Hi Yang,

Will this be done for 2.9.1?

My workaround is to add syncfiles to the postbootscripts, which works. But there may be some things we'd like to happen before booting, i.e. in anaconda.

thanks
Arif

yangsong - 2015-03-11

hi Arif,

For Red Hat, have you tried specifying the destination path with the base directory "/mnt/sysimage/"?

For example, to sync /etc/hosts from the MN to the CN, the entry in the synclist can be specified as:
"/etc/hosts -> /mnt/sysimage/etc/hosts"

thanks

Arif Ali - 2015-03-15

I could, but then the syncfiles wouldn't work with "updatenode -F". That would mean having two separate sets of syncfiles, one for anaconda and one for post boot, or having duplicate entries in the syncfiles.
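
One way to avoid maintaining the two lists by hand would be to generate the install-time list from the regular one. The sketch below is only an illustration under a few assumptions: the function name `anaconda_synclist` is made up, the synclist is assumed to contain simple `source -> destination` lines with absolute destinations, and every other line is passed through untouched. Whether a separate synclist can be selected for the install phase is not something this thread settles.

```shell
# Hypothetical helper (not part of xCAT): read a synclist on stdin and
# prefix every absolute destination path with the installer's mount point
# for the target system. The prefix defaults to /mnt/sysimage.
anaconda_synclist() {
    prefix="${1:-/mnt/sysimage}"
    while IFS= read -r line || [ -n "$line" ]; do
        case "$line" in
            *" -> /"*)
                # Everything before the first " -> " is the source part,
                # everything after it is the destination path.
                src="${line%% -> *}"
                dst="${line#* -> }"
                printf '%s -> %s%s\n' "$src" "$prefix" "$dst"
                ;;
            *)
                # Lines without an absolute destination are left alone.
                printf '%s\n' "$line"
                ;;
        esac
    done
}
```

With placeholder file names, this could be used as `anaconda_synclist < synclist > synclist.anaconda`, keeping the regular synclist as the only file edited by hand.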

yangsong - 2015-03-16

Hi Arif,

Sorry for the inconvenience.

This is just a workaround; I am still looking into the fix for this problem. Since xCAT 2.9.1 will be released in a week, the fix cannot be included in this xCAT release.

The fix will be uploaded when it is finished.

thanks

yangsong - 2015-03-17

Milestone: 2.9.1 --> 2.10

XiaoPeng Wang - 2015-04-01

The root cause was that sshd was not started successfully on the compute node before the first reboot. That means sshd was still not ready when the syncfiles postscript was running.

So my fix is very straightforward:
At the end of remoteshell, check whether sshd has been started. If not, start it by calling the command directly, i.e. by running '/usr/sbin/sshd'.

Commit:
2.9.2: 150a661
2.10: 9616e68

XiaoPeng Wang - 2015-04-01

status: open --> pending

assigned_to: yangsong --> XiaoPeng Wang

Arif Ali - 2015-04-01

Quick question: how often are the snap builds done? I'd like to pick up the new RPM so that it can be deployed on the sites where I am having issues.

Hi Arif,

I just did a 2.10 snapshot build 1 minute ago, so you should be able to pick up the latest development build to get this fix. We do not have a plan for 2.9.2 yet, so I did not do a 2.9.2 snapshot build.

Or, another simple way is to just pick up the changes in the commit. The change is quite minor.

[ligc@ligc-1 xcat-core]$ git show 9616e68
commit 9616e681a5eb600fbd7c07ea800ee4be35ee4eaa
Author: WangXiaoPeng <daniceexi@163.com>
Date:   Wed Apr 1 03:50:28 2015 -0400

    defect 4579: check the running of sshd at end of remoteshell, start it if needed

diff --git a/xCAT/postscripts/remoteshell b/xCAT/postscripts/remoteshell
index dc9c14e..a3895c4 100755
--- a/xCAT/postscripts/remoteshell
+++ b/xCAT/postscripts/remoteshell
@@ -469,4 +469,14 @@ else
     restartservice sshd
 fi

+# check whether the sshd daemon has been started successfully
+# As we known that for rh7 the sshd cannot be started by systemctl in chroot mode
+ps aux | grep -v grep | grep sshd
+
+if [ $? -ne 0 ]; then
+    if [ -e "/usr/sbin/sshd" ]; then
+        /usr/sbin/sshd
+    fi
+fi
+
 kill -9 $CREDPID
[ligc@ligc-1 xcat-core]$
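
For anyone adapting the patch by hand, the same check can be sketched as a standalone function. This is only an illustration and not part of the xCAT commit: the function name `ensure_sshd_running` and the `SSHD_BIN` override are made up here, and it assumes `pgrep` is available in the installer environment, which may not hold everywhere. Compared with `ps aux | grep sshd`, `pgrep -x` only matches a process whose name is exactly sshd, and its output is discarded instead of being printed by the postscript.

```shell
# Hypothetical helper (not from the xCAT commit): make sure an sshd
# process exists, starting the binary directly if none is found.
# SSHD_BIN may be set to override the default path /usr/sbin/sshd.
ensure_sshd_running() {
    sshd_bin="${SSHD_BIN:-/usr/sbin/sshd}"

    # Nothing to do if a process named exactly "sshd" already exists.
    if pgrep -x sshd >/dev/null 2>&1; then
        return 0
    fi

    # No sshd process found: bypass the service manager and run the
    # daemon binary directly.
    if [ -x "$sshd_bin" ]; then
        "$sshd_bin"
    else
        echo "sshd binary not found at $sshd_bin" >&2
        return 1
    fi
}
```

Calling `ensure_sshd_running` at the end of a postscript returns a non-zero status when sshd could not be started, so the caller can log the failure instead of finding out later when dcp fails.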

Arif Ali - 2015-04-02

Thanks Guang,

I need it to be redistributable, so that future installs of xCAT 2.9 are good as well; I don't really want to have to remember to patch every time. Also, when we re-install and do xCAT-HA through pacemaker/corosync, I need to make sure that things are, and can be, automated.

I have already gone through the process and created an RPM so that the patch will be available.

For reference, the 2 links below show the build I did and the commit in my git repo:

http://xcat.ocf.co.uk/ocf_builds/2.9.1/

http://gitlab.ocf.co.uk/aali/xcat-core/compare/2.9.1...2.9.1-ocf

I believe this is a critical bug that will come and bite a lot of people, as it has us, and I think we should consider releasing a patched update to solve the problem for any customers who are planning to move to CentOS/RHEL 7.

Thanks for all the efforts

regards,
Arif

XiaoPeng Wang - 2015-07-06

status: pending --> closed