Hi all,
We have an issue where the syncfiles script fails to run correctly inside anaconda on an el7 system.
After the install, updatenode <nodename> -F works as expected, and running the syncfiles script manually via /xcatpost/mypostscript also works after reboot.
This could be an issue that we need to work around.
Also, running service sshd restart (or any service restart) in preboot doesn't work, as the systemctl command is not supported in a chrooted environment. Maybe we can use systemd-nspawn?
This is a critical issue for me, as we have 3 systems about to go live, and we'd like to understand and work together to get the problem fixed.
Below is the excerpt from /var/log/xcat.log, after adding set -ax into the syncfiles script:
Tue Feb 24 11:36:20 GMT 2015 Running postscript: syncfiles
+ '[' -n 0 ']'
+ '[' 0 -eq 1 ']'
+ '[' -n '' ']'
+ logger -t xCAT -p local4.info 'Performing syncfiles postscript'
++ uname
+ osname=Linux
+ xcatpostdir=/xcatpost
+ logger -t xCAT -p local4.info './syncfiles: the OS name = Linux'
+ quit=no
+ count=5
+ '[' no = no ']'
+ '[' Linux = Linux ']'
++ /xcatpost/startsyncfiles.awk
+ returncode=1
+ '[' 1 -eq 0 ']'
+ '[' 5 -eq 0 ']'
+ let SLI=29009%10
+ let SLI=SLI+10
+ sleep 19
+ let count=count-1
+ '[' no = no ']'
+ '[' Linux = Linux ']'
++ /xcatpost/startsyncfiles.awk
+ returncode=1
+ '[' 1 -eq 0 ']'
+ '[' 4 -eq 0 ']'
+ let SLI=7396%10
+ let SLI=SLI+10
+ sleep 16
+ let count=count-1
+ '[' no = no ']'
+ '[' Linux = Linux ']'
++ /xcatpost/startsyncfiles.awk
+ returncode=1
+ '[' 1 -eq 0 ']'
+ '[' 3 -eq 0 ']'
+ let SLI=23745%10
+ let SLI=SLI+10
+ sleep 15
+ let count=count-1
+ '[' no = no ']'
+ '[' Linux = Linux ']'
++ /xcatpost/startsyncfiles.awk
+ returncode=1
+ '[' 1 -eq 0 ']'
+ '[' 2 -eq 0 ']'
+ let SLI=6936%10
+ let SLI=SLI+10
+ sleep 16
+ let count=count-1
+ '[' no = no ']'
+ '[' Linux = Linux ']'
++ /xcatpost/startsyncfiles.awk
+ returncode=1
+ '[' 1 -eq 0 ']'
+ '[' 1 -eq 0 ']'
+ let SLI=22698%10
+ let SLI=SLI+10
+ sleep 18
+ let count=count-1
+ '[' no = no ']'
+ '[' Linux = Linux ']'
++ /xcatpost/startsyncfiles.awk
+ returncode=1
+ '[' 1 -eq 0 ']'
+ '[' 0 -eq 0 ']'
+ quit=yes
+ let count=count-1
+ '[' yes = no ']'
+ '[' 1 -eq 0 ']'
+ logger -t xCAT -p local4.err './syncfiles: Perform Syncing File action encountered error'
+ exit 0
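For anyone reading the trace, the retry logic it implies can be sketched roughly as below. This is reconstructed from the trace only and is not the actual syncfiles source. SYNC_CMD and SLEEP_CMD are hypothetical hooks added so the loop can be tried without xCAT, and using the exit status of startsyncfiles.awk as the return code is an assumption.

```shell
# Sketch of the retry loop implied by the trace; NOT the real syncfiles script.
# SYNC_CMD and SLEEP_CMD are hypothetical hooks for trying the loop without xCAT.
SYNC_CMD=${SYNC_CMD:-/xcatpost/startsyncfiles.awk}
SLEEP_CMD=${SLEEP_CMD:-sleep}

retry_sync() {
    local count=5 quit=no returncode=1
    while [ "$quit" = no ]; do
        # Assumption: success is signalled by a zero exit status.
        if "$SYNC_CMD" >/dev/null 2>&1; then
            returncode=0
        else
            returncode=$?
        fi
        if [ "$returncode" -eq 0 ] || [ "$count" -eq 0 ]; then
            quit=yes
        else
            # Random back-off of 10 to 19 seconds, like "let SLI=29009%10; let SLI=SLI+10".
            "$SLEEP_CMD" $(( RANDOM % 10 + 10 ))
        fi
        count=$(( count - 1 ))
    done
    # The real script logs an error and still exits 0; this sketch returns the status.
    return "$returncode"
}
```

With count starting at 5, this gives six attempts and five sleeps before giving up, which matches the trace: every attempt returns 1, so the script logs the error after a bit over a minute of waiting.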
Just updated the priority to 9, as we now have 5 clusters that are having this issue.
I am also actively looking at it, to see if I can also find the fix.
Yang Song, could you work with Arif on this issue? Thanks.
Hi Arif, I am trying to recreate the problem on my env.
I changed the priority to "8" to avoid some process troubles since we are doing the release review.
Yang, no problem. Thanks for looking, much appreciated.
hi Arif,
would you please provide the following info:
1. the "synclists" attribute in your node/osimage definition, and the content of the synclist file
Some hints for your debugging:
As you can see in syncfiles and startsyncfiles.awk, the postscript sends a "syncfiles" request to xcatd on the MN. The plugin that processes the request on the MN is /opt/xcat/lib/perl/xCAT_plugin/syncfiles.pm
You can run xcatd in the foreground with "xcatd -f" on the MN during CN provisioning and print debug info in the plugin.
Hi Yang,
See the xcatd -f output below.
I have also attached the mypostscript from compute03.
Last edit: Arif Ali 2015-03-03
Just remembered,
the remoteshell script tries to restart the sshd daemon. But RHEL/CentOS 7 based systems use systemd, which does not like restarting services in chrooted environments, and therefore the MN will not be able to ssh to the machine.
This makes sense, and therefore syncfiles will not work over rsync or ssh
Maybe I am barking up the wrong tree!
hi Arif,
It seems there is a problem with "syncfiles" on Red Hat 7. When "syncfiles" runs, the file system hierarchy seen by the MN is the anaconda file system. At that moment, the file system that will be used after the first reboot is mounted under "/mnt/sysimage", so the files in the synclist are lost after installation. I also found the same problem in rhels6.4 and will try to fix it.
As a workaround, you can specify the destination path in the synclist according to the anaconda file system hierarchy.
Hi Yang,
Will this be done for 2.9.1?
My workaround is to add the syncfiles in postbootscripts, which works. But there may be some things we'd like to happen before booting, i.e. in anaconda
thanks
Arif
hi Arif,
For Red Hat, have you tried specifying the destination path with the base directory "/mnt/sysimage/"?
For example, syncing /etc/hosts from MN to CN, the entry in synclist can be specified as:
"/etc/hosts -> /mnt/sysimage/etc/hosts"
thanks
I could, but then the syncfiles wouldn't work with "updatenode -F", so that would mean having two separate sets of syncfiles, one for anaconda and one for post boot, or having duplicate entries in syncfiles.
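If maintaining two lists by hand is the main pain, one option is to generate the install-time list from the normal one. The helper below is only a sketch of that idea: to_anaconda_synclist is a hypothetical name, and it assumes simple "source -> /absolute/destination" entries. Comment lines and anything it does not recognise pass through unchanged, and it does not address how the second list would be selected at install time.

```shell
# Hypothetical helper, not part of xCAT: read a synclist on stdin and write a copy
# whose destinations are prefixed with /mnt/sysimage, as in the workaround above.
# Assumes simple "source -> /absolute/destination" entries.
to_anaconda_synclist() {
    sed -E 's|^([^#]*[^[:space:]#])[[:space:]]*->[[:space:]]*/|\1 -> /mnt/sysimage/|'
}
```

For example, piping the line "/etc/hosts -> /etc/hosts" through it gives "/etc/hosts -> /mnt/sysimage/etc/hosts". It is not idempotent, so it should only be run against the original list, never against its own output.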
Hi Arif,
Sorry for the inconvenience.
This is just a workaround. I am still looking into a fix for this problem, but since xCAT 2.9.1 will be released in a week, the fix cannot be included in this release.
The fix will be uploaded when it is finished.
thanks
The root cause was that sshd was not started successfully on the compute node before the first reboot. That means sshd was still not ready when the syncfiles postscript was running.
So my fix is very straightforward:
At the end of remoteshell, check whether sshd has been started. If not, start it directly by running '/usr/sbin/sshd'.
Commit:
2.9.2: 150a661
2.10: 9616e68
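For readers who cannot look at the commits, the check described above could look roughly like this. It is a sketch under assumptions, not the code from the commits: ensure_running is a hypothetical helper name, and using pgrep to detect the daemon is my choice for illustration.

```shell
# Sketch only; NOT the code from the commits.
# ensure_running NAME CMD...: run CMD if no process named exactly NAME is found.
ensure_running() {
    local name=$1
    shift
    if pgrep -x "$name" >/dev/null 2>&1; then
        return 0    # already running, nothing to do
    fi
    "$@"            # start the daemon directly, bypassing systemctl
}

# Intended use at the end of remoteshell (commented out so the sketch has no side effects):
# ensure_running sshd /usr/sbin/sshd
```

Starting the binary directly avoids systemctl, which is the part that does not work in the chrooted install environment.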
Quick question: how often are the snapshot builds done? I'd like to pick up the new RPM so that it can be deployed on the sites where I am having issues.
Hi Arif,
I just did a 2.10 snapshot build 1 minute ago, so you should be able to pick up the latest development build to get this fix. We do not have a plan for 2.9.2 yet, so I did not do a 2.9.2 snapshot.
Or, another simple way is to just pick up the changes in the commit. The change is quite minor.
Thanks Guang,
I need it to be re-distributable, so that future installs of xCAT 2.9 are good as well, and I don't really want to have to remember to patch every time. Also, when we re-install and do xCAT-HA through pacemaker/corosync, I need to make sure that things are and can be automated.
I have already gone through the process and created an RPM so that the patch will be available.
For reference, 2 links below show the commit in my git repo, and then the build I did
I believe this is a critical bug, that will come and bite a lot of people, as it has us, and I think we should think about releasing a patched update to solve the problems to any customers who are proposing to go to centos/rhel 7.
Thanks for all the efforts
regards,
Arif