
#4243 [Hierarchical] Associating CNs with the remoteshell postscript in a hierarchical env causes provisioning to fail

2.9
closed
PCM
5
2014-12-10
2014-08-11
fengli
No

Description:
Root cause: the ssh key information defined in the zone table cannot be used in a hierarchical cluster.
cn1-sn1 is a stateful compute node whose service node is sn1. It also belongs to the "__Managed" group, which PCM defines by default.
Provisioning it fails because the remoteshell postscript fails (ssh public key error); see the log messages below for details:

cn1-sn1 info:
[root@pcm187 ~]# lsdef cn1-sn1
Object name: cn1-sn1
appstatus=provision=defined
appstatustime=08-09-2014 14:32:04
arch=x86_64
bmc=9.111.251.140
bmcpassword=PASSW0RD
bmcusername=USERID
chain=runcmd=bmcsetup,osimage=rhels6.4-x86_64-stateful-compute:--noupdateinitrd:reboot4deploy
cmdmapping=/opt/pcm/etc/hwmgt/mappings/HWCmdMapping_rackmount_x.xml
conserver=sn1
currchain=boot
currstate=boot
groups=Managed,NetworkProfile_cn-sn-profile,ImageProfile_rhels6.4-x86_64-stateful-compute,HardwareProfile_IBM_System_x_M4,compute
initrd=xcat/osimage/rhels6.4-x86_64-stateful-compute/initrd.img
installnic=eth1
ip=10.1.0.3
kcmdline=quiet repo=http://192.168.0.1:80/install/rhels6.4/x86_64 ks=http://192.168.0.1:80/install/autoinst/cn1-sn1 ksdevice=eth1 cmdline console=tty0 console=ttyS0,115200n8r
kernel=xcat/osimage/rhels6.4-x86_64-stateful-compute/vmlinuz
mac=6c:ae:8b:3c:9a:9b
mgt=ipmi
monserver=sn1,sn1-c
netboot=xnba
nfsserver=192.168.0.1
nichostnamesuffixes.bmc=-bmc
nichostnamesuffixes.eth1=-eth1
nicips.bmc=9.111.251.140
nicips.eth1=10.1.0.3
nicnetworks.bmc=public
nicnetworks.eth1=cn-snnet
nictypes.bmc=BMC
nictypes.eth1=Ethernet
os=rhels6.4
postbootscripts=syncfiles,ospkgs,otherpkgs,mountnfs,confignics,setupcnldaplient
postscripts=syslog,remoteshell,syncfiles,setminiuidgid,mkresolvconf,setupntp,setnetboot
primarynic=eth1
profile=compute
provmethod=rhels6.4-x86_64-stateful-compute
serialflow=hard
serialport=0
serialspeed=115200
servicenode=sn1
status=booted
statustime=08-09-2014 14:44:00
updatestatustime=08-09-2014 14:32:04
xcatmaster=10.1.0.1

log info (note: 10.1.0.1 is its service node's IP address):

[root@pcm187 ~]# vim /var/log/messages
Aug 8 02:38:14 cn1-sn1 xCAT: ./syncfiles: the OS name = Linux
Aug 6 10:38:19 sn1 xCAT[22086]: xCAT: Allowing syncfiles from cn1-sn1
Aug 8 02:38:14 cn1-sn1 sshd[5329]: Failed password for root from 10.1.0.1 port 33277 ssh2
Aug 8 02:38:14 cn1-sn1 sshd[5329]: Failed password for root from 10.1.0.1 port 33277 ssh2
Aug 8 02:38:14 cn1-sn1 sshd[5330]: Connection closed by 10.1.0.1

[root@cn1-sn1 log]# vim xcat/xcat.log
Fri Aug 8 02:33:11 CST 2014 Running postscript: remoteshell
<error>Unable to read root's public ssh key</error>
<error>Unable to read root's private ssh key</error>
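The two remoteshell errors mean the postscript could not read root's keypair from the zone's key directory. A minimal sanity check on the management node, assuming the Managed zone's sshkeydir is /etc/xcat/sshkeys/Managed/.ssh as shown in the tabdump output in this ticket:

```shell
# On the MN, verify that the zone's key directory (the sshkeydir
# column of the zone table) actually contains root's keypair:
ls -l /etc/xcat/sshkeys/Managed/.ssh/
```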

PCM has a temporary manual workaround for this issue: modify the zone and site tables in the database on the MN.
Steps:
[root@pcm187 hostkeys]# tabdump zone

zonename,sshkeydir,sshbetweennodes,defaultzone,comments,disable

"Managed","/etc/xcat/sshkeys/Managed/.ssh","yes","yes",, ==============>>> delete this line, and change defaultzone from no to yes for the xcatdefault zone.
"xcatdefault","/root/.ssh","yes","no",,
[root@pcm187 hostkeys]# ^C
[root@pcm187 hostkeys]# tabdump site |grep ssh
"maxssh","8",,
"sshbetweennodes","no",, =====================>>>> change this value from no to yes
[root@pcm187 hostkeys]# tabedit site
[root@pcm187 hostkeys]# tabedit zone
[root@pcm187 hostkeys]# tabdump zone

zonename,sshkeydir,sshbetweennodes,defaultzone,comments,disable

"xcatdefault","/root/.ssh","yes","yes",,
[root@pcm187 hostkeys]# tabdump site |grep ssh
"maxssh","8",,
"sshbetweennodes","yes",,

After applying the workaround above, everything works.
Fix needed: additional ssh entries defined in the zone table should also work in a hierarchical cluster.
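The manual table edits above can also be applied non-interactively with chtab instead of tabedit; a minimal sketch, assuming the zone names shown in the tabdump output in this ticket:

```shell
# Delete the Managed zone row, then make xcatdefault the default zone again:
chtab -d zonename=Managed zone
chtab zonename=xcatdefault zone.defaultzone=yes

# Allow ssh between nodes cluster-wide in the site table:
chtab key=sshbetweennodes site.value=yes

# Verify the result:
tabdump zone
tabdump site | grep ssh
```

Note, though, that per the discussion below the supported way to change zones is the zone commands (mkzone/chzone/rmzone), not direct table edits.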

Discussion

  • Guang Cheng Li

    Guang Cheng Li - 2014-08-11
    • assigned_to: Lissa Valletta
     
  • Guang Cheng Li

    Guang Cheng Li - 2014-08-11

    Lissa, this is related to the multiple zone support, could you help take a look? Thanks.

     
  • Lissa Valletta

    Lissa Valletta - 2014-08-11

Zones do support hierarchy. Did you use this documentation to set up your zones?
    https://sourceforge.net/p/xcat/wiki/Setting_Up_Zones/

If you notice, in that doc sshbetweennodes is not used with zones. Did you run updatenode -k after creating the zone, to update the service nodes as indicated in the documentation? Note also that the documentation says to use only the zone commands to modify the tables; do not edit them directly as you did, because the zone commands do a lot of bookkeeping to keep the zones consistent. If you could, start over and follow this documentation carefully, and if you still have problems, let me log into the management node. Note also that once you start using zones, your whole cluster will be using zones: all other nodes will be assigned to the default zone. So be careful.

I also do not see a zonename attribute defined for your cn1-sn1 node above. Did you remove it? If the node had been defined correctly in the Managed zone, the lsdef output should have shown a zonename=<zonename> attribute (in your case zonename="Managed"). Again, use only the mkzone, chzone, and rmzone commands to create and maintain zones.

If you start over, you will need to delete all entries in the zone table and clean up any zonename attribute in the nodelist table for the nodes.
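As a sketch of the documented flow (the zone name "Managed" and noderange "compute" here are just examples taken from this ticket; see the wiki page above for the full options):

```shell
# Start clean: remove the zone with the zone command, not tabedit
rmzone Managed

# Recreate it with mkzone; -a assigns the noderange to the zone,
# -g also adds the zone name as a group on those nodes
mkzone Managed -a compute -g

# Distribute the zone's ssh keys to the service nodes and nodes
updatenode compute -k
```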

     

    Last edit: Lissa Valletta 2014-08-12
  • Lissa Valletta

    Lissa Valletta - 2014-08-12

Use rmzone as much as possible to clean up. In the end you should have nothing defined in the zone table and no zonename attribute defined for any node, so you can start fresh.
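To confirm the clean state described here, checks along these lines should show an empty zone table and no zonename attributes (the chdef line is only for leftovers rmzone could not remove; the noderange "compute" is an example):

```shell
# The zone table should print only its header line:
tabdump zone

# No node should still show a zonename value:
lsdef all -i zonename

# Clear any leftover zonename attribute by hand:
chdef compute zonename=""
```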

     
  • Lissa Valletta

    Lissa Valletta - 2014-08-21

I have had no follow-up from development, so I am moving this to 2.9. I can provide a patch or fix in the 2.8.5 and 2.8.5-pcm branches when available.

     
  • Lissa Valletta

    Lissa Valletta - 2014-08-21
    • Milestones: 2.8.5 --> 2.9
     
  • Lissa Valletta

    Lissa Valletta - 2014-08-21
    • status: open --> pending
     
  • Xu Bin

    Xu Bin - 2014-08-29

It's okay to leave this for 2.9, since service node support is currently not a basic feature in PCM and is only supported via the solution.

And just to clarify the current PCM use case:
After installation, PCM creates the default zone ('Managed') for all compute nodes. It replaces the original default zone (xcatdefault), so there is actually no zonename="Managed" attribute on the nodes.

     
  • ting ting li

    ting ting li - 2014-12-10
    • status: pending --> closed