From: SourceForge.net <no...@so...> - 2012-03-31 02:31:47
Bugs item #3378662, was opened at 2011-07-26 07:15
Message generated for change (Settings changed) made by daniceexi
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=1006945&aid=3378662&group_id=208749

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Deployment-linux
Group: 2.8
>Status: Closed
Resolution: Fixed
Priority: 5
Private: No
Submitted By: Paul Herb (prherb)
Assigned to: XiaoPeng Wang (daniceexi)
Summary: p7-ih scaling issues during boot.

Initial Comment:
I was asked to open this by John Simpson, it is related to booting 251 diskless nodes on p7-ih building block..

Here is a brief description:
This is a 3 frame p7-ih system.. 125 compute lpars on one SN.. 126 on the other. All ram based stateless with some statelite persistent files.

procedure...
- rpower off all 251 compute nodes... check to make sure they are all 'not activated'
- bootlist is set to boot off of hf0
- rpower 251 compute nodes on.
- wait until I can xdsh a simple command 'date' and they all come back with no errors.
- total time was ~ 30 minutes.
- The image loads pretty fast, watching a console.
- The big delay seems to be with xcat on the service nodes.. occasionally issuing a simple xcat command on the service nodes, results in xcat timeouts.

bb21s2a: Unable to open socket connection to xcatd daemon on localhost:3001.
bb21s2a: Verify that the xcatd daemon is running and that your SSL setup is correct.
bb21s2a: Connection failure: IO::Socket::SSL: Timeout at /opt/xcat/lib/perl/xCAT/Client.pm line 159.
bb21s1a: Unable to open socket connection to xcatd daemon on localhost:3001.
bb21s1a: Verify that the xcatd daemon is running and that your SSL setup is correct.
bb21s1a: Connection failure: IO::Socket::SSL: Timeout at /opt/xcat/lib/perl/xCAT/Client.pm line 159.
----------------------------------------------------------------------

Comment By: Paul Herb (prherb)
Date: 2012-03-22 04:28

Message:
I concur, this is fine now.

----------------------------------------------------------------------

Comment By: XiaoPeng Wang (daniceexi)
Date: 2012-03-22 00:42

Message:
This has been fixed by Jarrod with a change that queues all the TCP connections and handles them FIFO. It needs to be verified.
revision 11937

----------------------------------------------------------------------

Comment By: XiaoPeng Wang (daniceexi)
Date: 2012-02-19 19:07

Message:
This issue can be recreated in a scaling environment. Hua Zhong has recreated it but did not find a good way to fix it. It is probably caused by the performance of the MN/SN: all CPU and memory are used up handling the OS deployment and running the postscripts. Limiting the number of nodes booted in parallel would be a possible workaround.

----------------------------------------------------------------------

Comment By: Guang Cheng Li (ligc)
Date: 2012-02-16 22:43

Message:
Xiao Peng, are we still having this problem with the latest xCAT build? Since we have run a lot of testing in the scalability environment, I am assuming this should not be a problem any more.

----------------------------------------------------------------------

Comment By: XiaoPeng Wang (daniceexi)
Date: 2011-08-10 01:05

Message:
This is a performance issue, and a possible solution is to retry in the xCAT client.

----------------------------------------------------------------------

Comment By: Paul Herb (prherb)
Date: 2011-08-09 04:10

Message:
These xcat timeouts were observed while all the nodes were booting. After everything is up and running, the command returns fine. This system has been powered off, so you can put this on hold.
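
The "limit the number of nodes booted in parallel" idea from the 2012-02-19 comment could look roughly like the sketch below. It is only an illustration, not something from the xCAT tree: the function name boot_in_batches, the POWER_CMD variable, and the batch size and pause arguments are all hypothetical. POWER_CMD defaults to echo, so as written this is a dry run that only prints the noderange it would power on.

```shell
#!/bin/bash
# Hypothetical helper (not part of xCAT): power on nodes in fixed-size batches.
# POWER_CMD is an assumption of this sketch; it defaults to "echo" (dry run).
POWER_CMD="${POWER_CMD:-echo}"

# boot_in_batches BATCH_SIZE PAUSE_SECONDS
# Reads one node name per line from stdin.
boot_in_batches() {
    local batch_size="$1" pause="$2"
    local batch="" count=0 node
    while read -r node; do
        [ -z "$node" ] && continue          # skip blank lines
        batch="${batch:+$batch,}$node"      # build a comma-separated noderange
        count=$((count + 1))
        if [ "$count" -ge "$batch_size" ]; then
            $POWER_CMD "$batch" on          # start this batch
            sleep "$pause"                  # let the service node catch up
            batch=""
            count=0
        fi
    done
    # Start the final, partially filled batch, if any.
    if [ -n "$batch" ]; then
        $POWER_CMD "$batch" on
    fi
}
```

Something like `nodels compute | POWER_CMD=rpower boot_in_batches 32 60` would then issue `rpower <noderange> on` for 32 nodes at a time with a 60 second pause between batches. The group name "compute" and both numbers are placeholders; suitable values would have to be found by measurement on the actual system.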
----------------------------------------------------------------------

Comment By: Linda Mellor (mellor)
Date: 2011-08-05 06:07

Message:
The otherpkgs postscript has checks in it that should prevent it from trying to update software when run from postscripts during diskless/statelite boot. The assumption is that otherpkgs were handled during genimage. This code is at the very beginning of my /install/postscripts/otherpkgs file. You should see the message "Did not install any extra rpms" in the console output for your node during boot:

# do nothing for diskless deployment case because it is done in the image already
if [[ $UPDATENODE -ne 1 ]]; then
    if [ "$NODESETSTATE" = "netboot" -o \
         "$NODESETSTATE" = "statelite" -o \
         "$NODESETSTATE" = "diskless" -o \
         "$NODESETSTATE" = "dataless" ]
    then
        echo " Did not install any extra rpms."
        exit 0
    fi
fi

----------------------------------------------------------------------

Comment By: XiaoPeng Wang (daniceexi)
Date: 2011-08-05 01:57

Message:
For each compute node, running the postscripts involves several interactions (getpostscript, getcredentials, syncfiles) between the CN and the SN through port 3001 on the SN. If 125 nodes are booting in parallel, the socket resources of the SN will be very tight.

I think the CN already retries when it runs getpostscript, getcredentials and syncfiles.

Could you try writing a small program that runs 'lsdef' in a loop, to see whether it succeeds after retrying?

----------------------------------------------------------------------

Comment By: Paul Herb (prherb)
Date: 2011-08-04 07:24

Message:
Found the main culprit of the long boot process. The stateless compute nodes had postbootscripts=otherpkgs, so they were installing all the rpms that genimage did.. again. I will get another timing during the next full boot... opened 3386191 for that problem.
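
The "small program to run 'lsdef' in a loop" asked for in the 2011-08-05 comment could be as simple as the sketch below. The helper name retry_cmd and its arguments are hypothetical, and the attempt count and delay are arbitrary; it makes no claim about how the xCAT client itself retries.

```shell
#!/bin/bash
# Hypothetical helper (not part of xCAT).
# retry_cmd MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it exits 0 or MAX_ATTEMPTS is reached,
# and reports which attempt succeeded.
retry_cmd() {
    local max="$1" delay="$2"
    shift 2
    local attempt=1
    while [ "$attempt" -le "$max" ]; do
        # Discard the command's own output; only the exit status matters here.
        if "$@" >/dev/null 2>&1; then
            echo "succeeded on attempt $attempt"
            return 0
        fi
        echo "attempt $attempt failed"
        attempt=$((attempt + 1))
        # Sleep only if another attempt will follow.
        if [ "$attempt" -le "$max" ]; then
            sleep "$delay"
        fi
    done
    echo "giving up after $max attempts"
    return 1
}
```

Run on a service node while the compute nodes are booting, for example as `retry_cmd 10 5 lsdef -t site -l`, this would show whether the timeout is transient (a later attempt succeeds) or persistent (every attempt fails).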
----------------------------------------------------------------------

Comment By: Paul Herb (prherb)
Date: 2011-07-27 04:11

Message:
yes.. I was attempting to run these commands while the 251 compute nodes were booting up.. I did this again today.. here are two commands that I attempted to run on the service nodes while everything was booting:

xdsh service "lsxcatd -d"
bb21s2a: Unable to open socket connection to xcatd daemon on localhost:3001.
bb21s2a: Verify that the xcatd daemon is running and that your SSL setup is correct.
bb21s2a: Connection failure: IO::Socket::SSL: Timeout at /opt/xcat/lib/perl/xCAT/Client.pm line 159.
bb21s1a: Unable to open socket connection to xcatd daemon on localhost:3001.
bb21s1a: Verify that the xcatd daemon is running and that your SSL setup is correct.
bb21s1a: Connection failure: IO::Socket::SSL: Timeout at /opt/xcat/lib/perl/xCAT/Client.pm line 159.

[root@edems1b rh]# xdsh service "lsdef -t site -l"
bb21s2a: Unable to open socket connection to xcatd daemon on localhost:3001.
bb21s2a: Verify that the xcatd daemon is running and that your SSL setup is correct.
bb21s2a: Connection failure: IO::Socket::SSL: Timeout at /opt/xcat/lib/perl/xCAT/Client.pm line 159.
bb21s1a: Unable to open socket connection to xcatd daemon on localhost:3001.
bb21s1a: Verify that the xcatd daemon is running and that your SSL setup is correct.
bb21s1a: Connection failure: IO::Socket::SSL: Timeout at /opt/xcat/lib/perl/xCAT/Client.pm line 159.

----------------------------------------------------------------------

Comment By: Guang Cheng Li (ligc)
Date: 2011-07-26 19:32

Message:
Hi Paul,

One quick question: you mentioned "occasionally issuing a simple xcat command on the service nodes, results in xcat timeouts". Was this simple xcat command run while all the compute nodes were booting up? After the compute nodes are up and running, if you run this "simple xcat command", do you still see the error?
How long does it take to run a simple xcat command on the service node? I want to determine whether this is really a scalability problem or an xcatd problem on the service node.

Thanks.

----------------------------------------------------------------------