From: SourceForge.net <no...@so...> - 2009-10-15 07:06:46
Bugs item #2871929, was opened at 2009-10-03 01:55
Message generated for change (Settings changed) made by daniceexi
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=1006945&aid=2871929&group_id=208749

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.

Category: None
Group: v2.3
>Status: Pending
Resolution: Accepted
Priority: 8
Private: No
Submitted By: Connie Graff (cgraff)
Assigned to: XiaoPeng Wang (daniceexi)
Summary: AMM firmware, Blade FW/BIOS, etc. levels needed for xCAT 2.3

Initial Comment:
Here is the level of xCAT on my management node:

root@c954mgrs1:/ > rpm -qa | grep xCAT
perl-xCAT-2.3-snap200909280836
xCAT-client-2.3-snap200909280837
xCAT-server-2.3-snap200909280837
xCAT-2.3-snap200909280837
xCAT-rmc-2.3-snap200909280837

The cluster nodes are mostly JS22 blades. Here is the failure:

root@c954mgrs1:/ > rspconfig amm pd1=redwoperf pd2=redwoperf
c954c1mm1: Error: Unable to change power management settings, domain may be oversubscribed.
c954c2mm1: Error: Unable to change power management settings, domain may be oversubscribed.
c954c1mm1: Error: Unable to change power management settings, domain may be oversubscribed.
c954c2mm1: Error: Unable to change power management settings, domain may be oversubscribed.
c954c3mm1: Error: Unable to change power management settings, domain may be oversubscribed.
c954c3mm1: Error: Unable to change power management settings, domain may be oversubscribed.

amm is a node group containing the three AMMs listed in the error messages.

This bug has effectively blocked my install of the JS22 blades because it prevents me from collecting the MAC addresses with the getmacs command. The cluster currently has 42 blades but will grow to 98. I could manually collect the MACs myself, but that is not a viable option for a cluster this size.
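The failure output above can be summarized per AMM with a short script. This is only a sketch: it assumes the "node: Error: message" line shape seen in the transcript, and the function name count_errors is made up for illustration; it is not part of xCAT.

```python
from collections import Counter

# Output copied from the failing rspconfig run quoted above.
RSPCONFIG_OUTPUT = """\
c954c1mm1: Error: Unable to change power management settings, domain may be oversubscribed.
c954c2mm1: Error: Unable to change power management settings, domain may be oversubscribed.
c954c1mm1: Error: Unable to change power management settings, domain may be oversubscribed.
c954c2mm1: Error: Unable to change power management settings, domain may be oversubscribed.
c954c3mm1: Error: Unable to change power management settings, domain may be oversubscribed.
c954c3mm1: Error: Unable to change power management settings, domain may be oversubscribed.
"""


def count_errors(output: str) -> Counter:
    """Count 'Error:' lines per node in 'node: message' style output.

    Hypothetical helper; the line format is assumed from the transcript.
    """
    errors = Counter()
    for line in output.splitlines():
        node, sep, message = line.partition(": ")
        # Lines without the separator, or without an Error: prefix, are skipped.
        if sep and message.startswith("Error:"):
            errors[node.strip()] += 1
    return errors


# Print one summary line per AMM that reported at least one error.
for node, count in sorted(count_errors(RSPCONFIG_OUTPUT).items()):
    print(f"{node}: {count} error(s)")
```

Each of the three AMMs shows up with two errors, which matches the two settings (pd1 and pd2) requested in the command.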
There were changes to the network recently to accommodate merging two smaller clusters into one, and I realized earlier today that I had not changed the eth0 interface of the AMM to reflect that change. I have since corrected the eth0 configuration on one AMM. It did not resolve the problem. I believe my network definition is correct.

root@c954mgrs1:/ > netstat -rn
Routing tables
Destination        Gateway           Flags   Refs     Use  If   Exp  Groups

Route Tree for Protocol Family 2 (Internet):
default            9.114.70.254      UG        2   138137  en0    -    -
9.114.70.0         9.114.70.120      UHSb      0        0  en0    -    -   =>

root@c954mgrs1:/ > lsdef -t network -l
Object name: inst1
    gateway=9.114.70.254
    mask=255.255.255.0
    net=9.114.70.0

The MN has AIX 61H (0939B gold build) installed. I have successfully run the command on a smaller JS21 cluster with an earlier xCAT build.

----------------------------------------------------------------------

>Comment By: XiaoPeng Wang (daniceexi)
Date: 2009-10-15 15:06

Message:
I verified the hardware-related commands on the following JS blade hardware and found no issues. We can therefore recommend that users run these firmware versions when they use xCAT as the management software. I'll verify these hardware commands on the JS23 and JS43 when a test environment is ready.
Hardware type and relevant firmware version:

JS21 Blade
    AMM      BPET51F
    FW/BIOS  MB246_060
JS22 Blade
    AMM      BPET51F
    FW/BIOS  EA350_021

Tested hardware related commands of xCAT:
rpower, rbootseq, getmacs, rcons, rspconfig, rscan, rinv, rvitals

CASES:
rpower js21n03 status
rpower js21n03 stat
rpower js21n03 off
rpower js21n03 on
rpower js21n03 reset
rbootseq js21n03 hd,net
rbootseq js21n03 net,hd
rbootseq js21n03 hd0,net
getmacs js21n03
rcons js21n03
rspconfig mm_js_1 snmpcfg=enable sshcfg=enable
rspconfig mm_js_1 pd1=redwoperf pd2=redwoperf
rscan mm_js_1
rscan mm_js_1 -z
rinv js21n03
rvitals js21n03

----------------------------------------------------------------------

Comment By: XiaoPeng Wang (daniceexi)
Date: 2009-10-13 21:31

Message:
I'll verify hardware functions on the latest firmware for the JS22 blade.

----------------------------------------------------------------------

Comment By: Bruce (bp-sawyers)
Date: 2009-10-12 08:27

Message:
Jarrod says: BPET50C badly breaks rbootseq. BPET51E restores it in my experience. I do not know when the BPET51 branch is going to ship personally.

----------------------------------------------------------------------

Comment By: Connie Graff (cgraff)
Date: 2009-10-10 02:09

Message:
I have re-opened this defect and changed the summary field. Based on my resolution of the rspconfig error in the original bug reported here, and on other errors I have seen in the getmacs and rbootseq commands, it is clear that xCAT 2.3 will not function with just any level of AMM firmware. Before we can support xCAT 2.3 on Blades, we need to identify the following:

1. AMM firmware level
2. Blade FW/BIOS
3. I/O Module Firmware
4. Mellanox adapter firmware (for HPC customers with IB fabric)

It may be that not all of those are of equal importance, and we may find more. From what I have seen so far, the AMM firmware is a must.
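The tested levels XiaoPeng reports above could be kept in a small lookup and compared against what is installed on a chassis. The sketch below is an illustration, not an xCAT feature: the function name untested_components and the shape of the installed dictionary are hypothetical, and it treats any level other than the exact tested one as "untested", which may be stricter than necessary.

```python
# Levels reported as tested in this thread, keyed by blade type.
TESTED_LEVELS = {
    "JS21": {"AMM": "BPET51F", "FW/BIOS": "MB246_060"},
    "JS22": {"AMM": "BPET51F", "FW/BIOS": "EA350_021"},
}


def untested_components(blade_type: str, installed: dict) -> dict:
    """Return {component: (installed_level, tested_level)} for every mismatch.

    Hypothetical helper. 'installed' maps component names ("AMM", "FW/BIOS")
    to the level found on the hardware; how that inventory is collected is
    outside the scope of this sketch.
    """
    if blade_type not in TESTED_LEVELS:
        raise KeyError(f"no tested levels recorded for {blade_type}")
    tested = TESTED_LEVELS[blade_type]
    return {
        component: (installed.get(component), level)
        for component, level in tested.items()
        if installed.get(component) != level
    }


# Example: a JS22 chassis still on the AMM level that Jarrod reports as
# breaking rbootseq.
print(untested_components("JS22", {"AMM": "BPET50C", "FW/BIOS": "EA350_021"}))
```

An empty result means every recorded component matches a level that was tested; it does not say anything about I/O module or Mellanox adapter firmware, which the thread has not yet pinned down.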
----------------------------------------------------------------------

Comment By: Connie Graff (cgraff)
Date: 2009-10-07 00:46

Message:
You can cancel this one. With a pointer from Shujun I was able to convince myself the problem is with my AMM, not xCAT. I was not able to set the power policy from the AMM command line or from the Web interface. After upgrading the firmware on the three AMMs in the cluster, I am able to run rspconfig on two of the three:

root@c954mgrs1:/ > rspconfig c954c3mm1 pd1=redwoperf pd2=redwoperf
c954c3mm1: pd2: redwoperf
c954c3mm1: pd1: redwoperf

On c954c1mm1, I am able to change the power policy for pd2 from the command line or from the Web interface. I am able to change pd1 to a different policy but not to the one xCAT wants. I will most likely have to set c954c1mm1 back to the factory default to see if that will fix the issue.

----------------------------------------------------------------------

Comment By: Connie Graff (cgraff)
Date: 2009-10-06 02:11

Message:
I keep thinking the problem is linked to the network changes we made last week. I know I did not update the AMM network adapters to reflect the netmask and default route change before my initial test. Friday I made that change on c954c1mm1, but I still got the error. Over the weekend, it occurred to me that the ARP data for c954c1mm1 was likely stale. I checked and found it was. I deleted the stale ARP entry for c954c1mm1 and ran the rspconfig command again for that AMM. Once again it failed the same way. I have no new ideas, but I am still thinking!

----------------------------------------------------------------------

Comment By: Brian Croswell (bcroswell)
Date: 2009-10-03 03:27

Message:
Connie ... I don't know how your C954 blade configuration is set up. It looks like the BC MM is active for c954c1mm1 (9.114.70.1), but the other BC MMs, C954c2mm1 (9.114.70.19) and C954c3mm1 (9.114.70.37), are currently not active: the ping to those IPs is failing.
Can you try running rspconfig against MM node c954c1mm1 only, and work with your network admin or PPSLAB admin to check the other two MMs that cannot be reached?
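Since the thread keeps returning to the netmask and route change, one quick sanity check is whether the MM addresses Brian lists fall inside the network object from the initial comment (net=9.114.70.0, mask=255.255.255.0). The sketch below uses Python's standard ipaddress module; the helper name outside_network is hypothetical, and the check says nothing about whether an MM actually answers a ping.

```python
import ipaddress

# Network definition taken from the lsdef output in the initial comment.
NETWORK = ipaddress.ip_network("9.114.70.0/255.255.255.0")

# MM addresses as listed in Brian's comment.
MM_ADDRESSES = {
    "c954c1mm1": "9.114.70.1",
    "c954c2mm1": "9.114.70.19",
    "c954c3mm1": "9.114.70.37",
}


def outside_network(addresses: dict, network) -> list:
    """Return the names whose address is not inside the given network.

    Hypothetical helper for illustration only.
    """
    return sorted(
        name
        for name, address in addresses.items()
        if ipaddress.ip_address(address) not in network
    )


# An empty list means every MM address lies within the defined network.
print(outside_network(MM_ADDRESSES, NETWORK))
```

All three addresses are inside 9.114.70.0/24, so the network definition itself is consistent with them; a failing ping would have to come from somewhere else, such as the AMM's own interface settings or stale ARP data.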