From: Christopher H. <chr...@al...> - 2011-04-22 23:26:04
I'm using xcat-core-2.5.2 and xcat-dep-201102240545 to discover and install CentOS 5.6 on x86_64 nodes. This is a fresh install. I have just installed the OS on the management node and configured xCAT and I'm now attempting to perform node discovery and installation on the compute nodes.

The discovery is working, and the installation starts, with all packages being installed, but doesn't finish. It gets stuck, apparently at /tmp/mypostscript.post. There are repeated "node01 xcat: Retrying flag update" messages in /var/log/messages, and the messages preceding those suggest it is getting stuck at the 'updateflag.awk $MASTER 3002 "installstatus booted"' in /tmp/mypostscript.post (the file residing on the node). (This line is repeated (i.e. it appears twice in a row and the two appearances are identical) at the end of /tmp/mypostscript.post, which seems counterintuitive but maybe it's supposed to be that way.)

Upon attempted installation, the last messages in /var/log/messages from the node are

    Apr 22 15:17:34 node01 syslogd 1.4.1: restart.
    Apr 22 15:17:34 node01 xCAT: Install: syslog setup
    Apr 22 15:17:34 node01 kernel: klogd 1.4.1, log source = /proc/kmsg started.
    Apr 22 15:17:34 node01 xcat: Install: setup /etc/ssh/sshd_config
    Apr 22 15:17:34 node01 xcat: Install: setup root .ssh
    Apr 22 15:17:36 node01 xCAT: ssh_dsa_hostkey
    Apr 22 15:17:36 node01 xCAT: ssh_rsa_hostkey
    Apr 22 15:17:37 node01 xCAT: ssh_root_key
    Apr 22 15:17:37 node01 xCAT: start up sshd
    Apr 22 15:17:37 node01 xCAT: Performing syncfiles postscript
    Apr 22 15:17:37 node01 xCAT: /xcatpost/syncfiles: the OS name = Linux
    Apr 22 15:17:38 node01 xCAT: /xcatpost/syncfiles: Perform Syncing File action encountered error
    Apr 22 15:17:38 node01 xcat: repos/centos5.6/x86_64 is not a directory
    Apr 22 15:17:38 node01 xcat: Retrying flag update
    Apr 22 15:18:18 node01 last message repeated 4 times

...and then indefinite repeats of the last message.
(I think the Syncing File action error is just that the syncfile is empty; I don't think it's significant.)

If I then perform "nodeset node01 boot; rpower node01 reset", it boots up, and the last messages in /var/log/messages are

    Apr 22 15:34:24 node01 pcscd: hotplug_libusb.c:411:HPEstablishUSBNotifications() Polling forced every 1 second(s)
    Apr 22 15:34:24 node01 kernel: Bluetooth: HIDP (Human Interface Emulation) ver 1.1
    Apr 22 15:34:24 node01 hidd[6216]: Bluetooth HID daemon
    Apr 22 15:34:26 node01 automount[6271]: lookup_read_master: lookup(nisplus): couldn't locate nis+ table auto.master
    Apr 22 15:34:26 node01 xinetd[6309]: xinetd Version 2.3.14 started with libwrap loadavg labeled-networking options compiled in.
    Apr 22 15:34:26 node01 xinetd[6309]: Started working: 0 available services
    Apr 22 15:34:28 node01 xcat: Retrying flag update

...and so on.

Ports 3001 and 3002 are open on the management node: iptables is stopped on the management node, and running nmap from node01 against the management node returns "open" as the state of ports 3001 and 3002 (and I also get "open" for both ports when running it on the management node against the cluster-facing interface).

If, on the compute node (node01), I run ". /tmp/mypostscript.post" followed by "echo $MASTER", I get the correct IP address for the management node.

There is a file /root/post.log on the compute node. It contains only the line "post scripts".

xCAT processes running on the management node are:

    [root@mnnode admin]# ps axu |grep xcatd
    root 5765 0.0 0.0 61196 776 pts/1 S+ 14:37 0:00 grep xcatd
    root 6604 0.0 0.1 237656 93332 ? Ss Apr21 0:00 xcatd: SSL listener
    root 6605 0.0 0.1 234024 93784 ? S Apr21 0:05 xcatd: DB Access
    root 6606 0.0 0.1 235588 92704 ? S Apr21 0:00 xcatd: UDP listener
    root 6607 0.0 0.1 237656 92796 ? S Apr21 0:00 xcatd: install monitor

On the compute node (after rebooting with nodeset node01 boot; rpower node01 reset) we have:

    [root@node01 ~]# ps axu |grep xcat
    root 6363 0.0 0.0 66092 1512 ? S 23:35 0:00 /bin/sh /etc/rc3.d/S84xcatpostinit1 start
    root 6366 0.0 0.0 65964 1360 ? S 23:35 0:00 /bin/sh /opt/xcat/xcatinstallpost
    root 6376 0.0 0.0 65964 776 ? S 23:35 0:00 /bin/sh /opt/xcat/xcatinstallpost
    root 6380 0.0 0.0 63960 1012 ? S 23:35 0:00 /bin/awk -f /xcatpost/updateflag.awk 172.20.1.1 3002 installstatus booted
    root 6381 0.0 0.0 58952 664 ? S 23:35 0:00 logger -t xcat
    root 6480 0.0 0.0 61228 812 pts/0 S+ 23:40 0:00 grep xcat
    [root@node01 ~]#

The site table contains the lines

    "xcatdport","3001",,
    "xcatiport","3002",,

SELinux is disabled. tftpd, conserver, httpd, nfs, named, dhcpd are all running. "master" and "domain" are correctly set in the site table.

I have looked at this discussion, where their installation froze at the same place:

http://www.xcat.org/pipermail/xcat-user/2008-July/006618.html

However, in their case it was an issue with ACLs on the switch. On our switch we have no ACLs defined.

Suggestions much appreciated!

Regards,
Chris
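
P.S. For anyone who wants to repeat the port check from the compute node without nmap, here is a minimal sketch. It is not part of xCAT; the function names port_is_open and check_ports are made up for illustration, and it assumes Python is available on the node. It only confirms that a TCP connection to the management node can be established, which is the same thing nmap reported; it does not show whether xcatd actually answers the flag update.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the host and performs the TCP handshake;
        # the context manager closes the socket again straight away.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_ports(host, ports, timeout=3.0):
    """Map each port number to True (connect succeeded) or False."""
    return {port: port_is_open(host, port, timeout) for port in ports}
```

On node01 this would be called as check_ports("172.20.1.1", [3001, 3002]), using the management node address that updateflag.awk is given in the ps output above and the xcatdport/xcatiport values from the site table.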