<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Recent changes to Cluster_Recovery</title><link>https://sourceforge.net/p/xcat/wiki/Cluster_Recovery/</link><description>Recent changes to Cluster_Recovery</description><atom:link href="https://sourceforge.net/p/xcat/wiki/Cluster_Recovery/feed" rel="self"/><language>en</language><lastBuildDate>Wed, 23 Jul 2014 15:21:34 -0000</lastBuildDate><atom:link href="https://sourceforge.net/p/xcat/wiki/Cluster_Recovery/feed" rel="self" type="application/rss+xml"/><item><title>Cluster_Recovery modified by Lissa Valletta</title><link>https://sourceforge.net/p/xcat/wiki/Cluster_Recovery/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v42
+++ v43
@@ -325,8 +325,8 @@

 The first task is to make sure that the current backup xCAT SN is working in the P775 cluster, and the admin will execute the xCAT SN failover tasks, which reassign the xCAT compute nodes to the backup xCAT SN. After you accomplish the recovery of the failed xCAT SN, you will then want to reallocate the xCAT compute nodes back to their primary xCAT SN. This xCAT SN failover scenario is documented in the appropriate xCAT AIX/Linux Hierarchical Cluster documents. 

-    https://sourceforge.net/apps/mediawiki/xcat/index.php?title=Setting_Up_a_Linux_Hierarchical_Cluster
-    https://sourceforge.net/apps/mediawiki/xcat/index.php?title=Setting_Up_an_AIX_Hierarchical_Cluster 
+[Setting_Up_a_Linux_Hierarchical_Cluster]
+[Setting_Up_an_AIX_Hierarchical_Cluster] 

 The recovery flow for the xCAT SN is to first debug the issue on the xCAT SN, where it is best to recover the xCAT SN without walking through the FIP process. But if there is an issue with the P775 xCAT SN octant, the admin will need to work closely with the IBM PE service representative to understand the xCAT SN hardware failure. The FIP recovery is to power down the P775 CECs for both the failed xCAT SN and for the P775 CEC where the FIP available octant is located. The IBM PE will then make sure the appropriate physical I/O ethernet and disk resources are moved to the new FIP available octant. The xCAT admin will make sure the I/O ethernet adapter and the SAS disk I/O resources are allocated to the newly designated FIP node octant using xCAT "chvm". They will execute the xCAT swapnodes command, which places the bad octant into the fip_defective node group and allocates the new octant as the new xCAT SN node definition. Based on the xCAT SN hardware failure, the admin may need to do multiple tasks. For a new ethernet adapter, they will need to retrieve the new ethernet MAC address for the xCAT SN. For a replaced SAS disk, the xCAT admin will need to reinstall the xCAT SN using the same xCAT OS disk full image used by the previous xCAT SN octant. If the xCAT SN needs to be moved to a different P775 CEC, then the admin needs to make sure the VPD and the remote power attributes are reflected in the xCAT SN object. Once the new xCAT SN is properly up and running to the satisfaction of the xCAT administrator, they can plan to execute the xCAT SN failover tasks to move the xCAT compute nodes back to the rebuilt primary xCAT SN. 
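The swapnodes call plus the fip_defective/fip_available group bookkeeping described above can be sketched as one small helper. This is a hedged dry-run sketch: swapnodes and chdef are real xCAT commands, but RUN=echo only prints them, and the fip_swap wrapper name is ours, not part of xCAT.

```shell
# Dry-run sketch of the octant swap plus FIP group bookkeeping described above.
# swapnodes/chdef are xCAT commands; RUN=echo prints them instead of running them.
# The fip_swap helper name is hypothetical, not part of xCAT.
RUN=echo
fip_swap() {
    bad=$1   # failed node currently defined as the xCAT SN
    fip=$2   # FIP available node taking over
    $RUN swapnodes -c "$bad" -f "$fip"
    $RUN chdef -t group -o fip_defective members="$fip"
    $RUN chdef -t group -o fip_available -d members="$fip"
}
fip_swap xcatsn1 fipcec1n5
```

On a management node where the xCAT commands exist, setting RUN to empty would execute the same sequence for real.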
@@ -343,8 +343,8 @@
         rinv cec1 deconfig . 
     (1)Admin has executed manual xCAT SN fail over using multiple xCAT commands including "snmove" to have compute nodes use the
        backup xCAT SN. These tasks are defined in the Hierarchical Cluster documentation in section "Using a backup service node". 
-        https://sourceforge.net/apps/mediawikixcat/index.php?title=Setting_Up_an_AIX_Hierarchical_Cluster
-        https://sourceforge.net/apps/mediawiki/xcat/index.php?title=Setting_Up_a_Linux_Hierarchical_Cluster 
+[Setting_Up_an_AIX_Hierarchical_Cluster]
+[Setting_Up_a_Linux_Hierarchical_Cluster] 
     (2) Admin contacts IBM Service indicating that there is a bad xCAT SN octant where a PE person will be available to
         physically move the I/O resources from octant 0 (LPARid 1) to octant 1 (LPARid 5) in cec1.
    (3)Drain any compute nodes found on cec1 and remove the LL resource group. Admin will then power off cec1. 
@@ -394,8 +394,8 @@
         rinv cec1 deconfig   
     (1)Admin has executed manual xCAT SN fail over using multiple xCAT commands including "snmove" to have compute nodes use the
        backup xCAT SN. These tasks are defined in the Hierarchical Cluster documentation in section "Using a backup service node". 
-        https://sourceforge.net/apps/mediawikixcat/index.php?title=Setting_Up_an_AIX_Hierarchical_Cluster
-        https://sourceforge.net/apps/mediawiki/xcat/index.php?title=Setting_Up_a_Linux_Hierarchical_Cluster 
+[Setting_Up_an_AIX_Hierarchical_Cluster]
+[Setting_Up_a_Linux_Hierarchical_Cluster] 
     (2) Admin contacts IBM Service indicating that there is a bad xCAT SN octant where a PE person will be available to
         physically move the I/O resources from octant 0 (LPARid 1) to octant 1 (LPARid 5) in cec1.
    (3) Since the admin needs to use the FIP available node on a different CEC, we need to drain any compute nodes found on cec1 and 
&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Lissa Valletta</dc:creator><pubDate>Wed, 23 Jul 2014 15:21:34 -0000</pubDate><guid>https://sourceforge.net7a9d23f084ca0d1c79a1f40000a65b7d83d4b637</guid></item><item><title>Cluster_Recovery modified by Lissa Valletta</title><link>https://sourceforge.net/p/xcat/wiki/Cluster_Recovery/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v41
+++ v42
@@ -1,6 +1,8 @@
 [TOC]

 # Introduction
+
+**Note: This document is no longer updated. Refer to [Power_775_Cluster_Recovery] for current information.**

 ## Overview

&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Lissa Valletta</dc:creator><pubDate>Mon, 23 Jun 2014 15:35:45 -0000</pubDate><guid>https://sourceforge.net0ce865721b62c14d9be42e50a47ccb1797edab2d</guid></item><item><title>Cluster_Recovery modified by Brian  Croswell</title><link>https://sourceforge.net/p/xcat/wiki/Cluster_Recovery/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v40
+++ v41
@@ -383,7 +383,7 @@
         that is described in step 1  of this scenario .

-### xCAT SN Disk replacement on scenario a Different CEC
+### xCAT SN Disk replacement scenario on a Different CEC

 This FIP scenario specifies the xCAT admin activity required to move or replace a bad octant being supported as an xCAT SN working in a different P775 CEC. The expectation is that the xCAT admin has noted that there is a failure with a disk, and that the octant 0 (LPAR id 1) with xCAT SN "xcatsn1" is not working in "cec1". The P775 admin identified an available FIP octant 0 (LPAR id 1) with xCAT node "fipcec2n1", which is available in a different CEC "cec2", that can be used to set up the xCAT SN. 

&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Brian  Croswell</dc:creator><pubDate>Mon, 23 Jun 2014 15:35:44 -0000</pubDate><guid>https://sourceforge.net38caf94ebb271242ed6e2146bcbb33f1640fef74</guid></item><item><title>Cluster_Recovery modified by Brian  Croswell</title><link>https://sourceforge.net/p/xcat/wiki/Cluster_Recovery/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v39
+++ v40
@@ -383,10 +383,9 @@
         that is described in step 1  of this scenario .

-### xCAT SN Disk replacement on a Different CEC scenario
-    
-    This FIP scenario specifies the xCAT admin activity required to move or replace a bad octant being supported as an xCAT SN working in a different P775 CEC. The expectation is that the xCAT admin has noted that there is a failure with both a disk,and that the octant 0 (LPAR id 1) with xCAT SN "xcatsn1"  is not working in "cec1". The P775 admin identified an available FIP octant 0 (LPAR id 1) with xCAT node "fipcec2n1" which is available is a different CEC "cec2" that can be used to setup the xCAT SN.    
-    
+### xCAT SN Disk replacement on scenario a Different CEC
+
+This FIP scenario specifies the xCAT admin activity required to move or replace a bad octant being supported as an xCAT SN working in a different P775 CEC. The expectation is that the xCAT admin has noted that there is a failure with a disk, and that the octant 0 (LPAR id 1) with xCAT SN "xcatsn1" is not working in "cec1". The P775 admin identified an available FIP octant 0 (LPAR id 1) with xCAT node "fipcec2n1", which is available in a different CEC "cec2", that can be used to set up the xCAT SN. 

     *  Admin has noted a failure with cec1 octant0 where communication is lost to HFI and the disk is bad on xCAT SN "xcatsn1".
        It was found that xCAT SN xcatsn1 is seeing deconfigured resources in octant 0 for cec1.   
&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Brian  Croswell</dc:creator><pubDate>Mon, 23 Jun 2014 15:35:44 -0000</pubDate><guid>https://sourceforge.neta88fb2f3f24988df6df9363c43b84b6daaca2317</guid></item><item><title>Cluster_Recovery modified by Brian  Croswell</title><link>https://sourceforge.net/p/xcat/wiki/Cluster_Recovery/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v38
+++ v39
@@ -383,9 +383,62 @@
         that is described in step 1  of this scenario .

-### xCAT SN Disk scenario
-    
-    TBD  Need  LoadL  and xCAT  commands
+### xCAT SN Disk replacement on a Different CEC scenario
+    
+    This FIP scenario specifies the xCAT admin activity required to move or replace a bad octant being supported as an xCAT SN working in a different P775 CEC. The expectation is that the xCAT admin has noted that there is a failure with a disk, and that the octant 0 (LPAR id 1) with xCAT SN "xcatsn1" is not working in "cec1". The P775 admin identified an available FIP octant 0 (LPAR id 1) with xCAT node "fipcec2n1", which is available in a different CEC "cec2", that can be used to set up the xCAT SN.    
+    
+    
+    *  Admin has noted a failure with cec1 octant0 where communication is lost to HFI and the disk is bad on xCAT SN "xcatsn1".
+        It was found that xCAT SN xcatsn1 is seeing deconfigured resources in octant 0 for cec1.   
+        rinv cec1 deconfig   
+    (1)Admin has executed manual xCAT SN fail over using multiple xCAT commands including "snmove" to have compute nodes use the
+       backup xCAT SN. These tasks are defined in the Hierarchical Cluster documentation in section "Using a backup service node". 
+        https://sourceforge.net/apps/mediawikixcat/index.php?title=Setting_Up_an_AIX_Hierarchical_Cluster
+        https://sourceforge.net/apps/mediawiki/xcat/index.php?title=Setting_Up_a_Linux_Hierarchical_Cluster 
+    (2) Admin contacts IBM Service indicating that there is a bad xCAT SN octant where a PE person will be available to
+        physically move the I/O resources from octant 0 (LPARid 1) to octant 1 (LPARid 5) in cec1.
+    (3) Since the admin needs to use the FIP available node on a different CEC, we need to drain any compute nodes found on cec1
+        and cec2, and remove them from the LL resource group. Admin will then need to power off cec1 and cec2. 
+        ll commands
+        rpower cec1,cec2  off
+    (4) PE Rep will do a physical update to cec1 where the disk is replaced, and the xCAT SN required I/O resources (ethernet and disk
+        adapters) are physically installed and allocated to octant 0 in cec2. We will use the HFI interfaces in octant 0 in cec2.
+        The IBM PE service team needs to note which I/O slot resources have been moved.
+    (5) Admin executes the xCAT swapnodes command to have xCAT SN xcatsn1 now use the FIP node fipcec2n1 settings with cec2
+        octant 0 (lpar id 1) resources in the xCAT DB. This indicates that xCAT SN xcatsn1 will have new VPD and MTMS attributes.
+        The FIP node fipcec2n1 will now take ownership of the bad octant 0 in cec1.
+        It is also a good time to make updates to the FIP node groups fip_defective and fip_available for node "fipcec2n1"
+          swapnodes -c xcatsn1 -f fipcec2n1
+          chdef  -t group  -o fip_defective  members=fipcec2n1
+          chdef  -t group  -o fip_available -d members=fipcec2n1
+    (6) Admin powers up cec1 and cec2 so resources can be seen. The admin executes lsvm cec2 to note the octant resources. If changes
+        are needed, the admin executes lsvm on the xCAT SN to produce an output file, then updates the file to represent the proper
+        I/O settings. The admin then feeds the updated file to the chvm command. 
+         rpower  cec1,cec2 on 
+         lsvm    cec2 
+         lsvm   xcatsn1 &amp;gt;/tmp/xcatsn1.info
+         edit /tmp/xcatsn1.info .. Make updates for octant information, and save file
+         cat  /tmp/xcatsn1.info | chvm  xcatsn1
+         lsvm  xcatsn1
+    (7) Admin executes the "getmacs" command to validate that the proper MAC address of the ethernet adapter is found. Make sure this
+        MAC address is placed in the xcatsn1 node object. The admin will want to recreate the xcatsn1 NIM object to reflect the new
+        MAC interface if working with an AIX cluster.
+          getmacs  xcatsn1 -D
+          lsdef xcatsn1 -i mac
+          xcat2nim xcatsn1 -f  (AIX only)
+    (8) Since the disk subsystem was affected, we will need to reinstall the xCAT SN xcatsn1 on the new disk. The admin will
+        need to validate that all of the service node and installation attributes are properly defined. They will need to execute
+        a diskful installation on the xCAT SN. Please reference the proper xCAT SN Hierarchical Cluster documentation.
+        The admin should do a thorough checkout making sure all xCAT SN environments (ssh, DB2, and installation) are working
+        properly after the xCAT SN installation. 
+           lsdef xcatsn1       (check all install and SN  attributes)
+           rnetboot xcatsn1    (execute network boot to reinstall xcatsn1 on cec2)
+           ssh root@xcatsn1   (try to login and validate OS and xCAT commands)         
+    (9) Once the admin has validated that the xCAT SN xcatsn1 is running properly, they can schedule the appropriate time to execute
+        manual xCAT SN fail over task to have the selected compute nodes move from the backup xCAT SN.  This is the same activity
+        that is described in step 1  of this scenario . The admin should plan to reinstall the diskless compute nodes working 
+        with the rebuilt xcatsn1. They can also reinstate the good compute nodes in cec1 and cec2 into the LL resources except for
+        bad octant 0 in cec1 (fipcec2n1).
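The getmacs/lsdef MAC validation in step (7) above amounts to a pattern check on the node's mac attribute; a minimal hedged sketch, where the sample `mac=` line is a hypothetical stand-in for `lsdef xcatsn1 -i mac` output (exact format assumed, not taken from xCAT documentation):

```shell
# Hedged sketch: check that the recorded MAC attribute looks well-formed before
# recreating the NIM object. The sample line is hypothetical; real 'lsdef -i mac'
# output may differ.
line='    mac=00:11:22:33:44:55'
if echo "$line" | grep -Eq 'mac=([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}$'; then
    result="MAC attribute looks valid"
else
    result="MAC attribute missing or malformed"
fi
echo "$result"
```

On a real management node the `line` variable would come from `lsdef xcatsn1 -i mac` instead of a literal.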

 ## FIP GPFS Node Implementation
&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Brian  Croswell</dc:creator><pubDate>Mon, 23 Jun 2014 15:35:43 -0000</pubDate><guid>https://sourceforge.netcece44ca4edaf36d32c3afc236cfe2e2599c068c</guid></item><item><title>Cluster_Recovery modified by Brian  Croswell</title><link>https://sourceforge.net/p/xcat/wiki/Cluster_Recovery/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v37
+++ v38
@@ -394,200 +394,3 @@

     TBD  Need GPFS and xCAT  commands

-
-### Original FIP data
-    
-     Keep for reference will delete .. 
-    
-
-  
-The administrator can use "mkdef -t group -o fip_defective" to create a fip_defective group. And the fip_defective group will be used to mark the nodes which have a defect before, and the nodes could not be used as non-compute node again. 
-
-If a non-compute node failure comes out , the following steps will be done by administrators: 
-
-(1)determine the nodename for the failed octant/lpar If the failure lpar is the last lpar in the CEC, the administrators use rmhwconn to disconnect failed lpar from hdwr_svr 
-
-(2)choose a new octant for the location of this node through lsdef/nodels command: (2.1)If there isn't a fully functional octant in the CEC, the administrators will choose an alternate CEC and move PCI cards to the same slot numbers in the new CEC manually; and also use lshwconn/mkhwconn to make sure the hdwr_svr connection to new cec; and then switch to step(3); (2.2)If there is a fully functional octant in the CEC and this is not a PCI failure, the administrator will switch to step(3); (2.3)If there is a fully functional octant in the CEC and this is a PCI failure, call home for SSR to move PCI cards to the fully functional octant , and the administrator will switch to step(3); 
-
-(3)Drain the fully functional compute octant in the CEC-- rpower off the functional compute node 
-
-(4)swap the the failure node and the node in the available FIP resource(if one way, please use the swapnodes with -o option): (4.1)If they are in the same CEC, 
-
-(4.1.1) Split the new chosen octant by running a mkvm cmd for example: mkvm compute2,tmpnode -id 5 -r 1:5 (note that tmpnode is a new node name that didn't exist before) 
-
-(4.1.2)Swap the service node attributes and IO assignments: swapnodes -c sn1 -f compute2 
-
-(4.1.3) Swap the utility node attributes and IO assignments: swapnodes -c util1 -f tmpnode 
-
-(4.1.4) switch to step(5) 
-
-(4.2)If they are not in the same CEC, after running swapnodes, the administrator should use lsvm/chvm to assign the related I/O slots to the current_node in the new CEC; and then switch to step(5); 
-
-(5) Run "chdef fip_node groups=...,fip_defective" to put the fip_node into fip_defective group; 
-
-(6)The order may change after IO re-assignment, so the administrators need to run rbootseq to set the boot string for the current_node; 
-
-(7)If the non-compute node is a service node, (7.1)If a stateful (diskful) service node, (7.1.1)reinstall the service node, or boot the service node from the disk. (7.1.2) reboot its compute nodes if they are stateless/statelite nodes. (7.2)If it's a linux diskless service node, 
-
-(7.2.1 ) boot the service node (7.2.2)reboot the compute nodes if they are nfs-based statelite. If they are ramdisk-based stateless or statelite (only available on linux), the compute nodes shouldn't need rebooting. 
-
-(8)If the node-compute node is a storage node, boot the node in its new location. 3.2 swapnodes command 3.2.1 main function 
-
-The swapnodes command will keep the current node name in the xCAT table, and use the FIP_node's hardware resource. Besides that, the IO adapters will be assigned to the new hardware resource if they are in the same CEC. 
-
-So the swapnodes command will be designed to do 2 things: (1)swap the location info in the db between 2 nodes: all the ppc table attributes (including hcp, id, parent (cec), supernode) all nodepos table attributes(including rack,u,chassis,slot,room) (2)swap the IO adapters between the 2 lpars (only for swapping within a cec) using fsp-api 
-
-The swapnodes command shouldn't make the decision of which 2 nodes are swapped. It will just received the 2 node names as cmd line parameters. 
-
-After running swapnodes command, the order may change after IO re-assignment, so the administrator needs to run rbootseq to set the boot string for the current_node. And then boot the node with the same image and same postscripts because they have the same attributes. 3.2.2 External 
-
-The following part will focus on the external interface of swapnodes command. Its syntax is as following: 
-
-1) SYNTAX 
-
-swapnodes [-h| --help] 
-
-swapnodes -c current_node -f fip_node [-o] 
-
-2) OPTIONS 
-
--c current_node --- the failure node 
-
--f fip_node --- the node which will be swapped as a non-compute node 
-
--o one way. Move the current_node definition to fip_node (the 2nd octant), and not move the fip_node definition to the 1st octant. 
-
-3) OUTPUT 
-
-current_node: success 
-
-## List the deconfigured resources (The NetC interface is under discussion)
-
-4.1 main function 
-
-Deconfigured resources are hw components (cpus, memory, etc.) that have failed so the firmware has automatically turned those components off. The FIP architecture requires that we have the ability to easily list this information and display it for the SE to figure out what is deconfigured in a given server. 
-
-There will be a NETC interface is for querying the deconfigured resources in p7IH. 
-
-xCAT will provide a HMC Alternative command(fsp-api) to query the deconfigured resources through NETC. Currently there is no such NETC command. It is under discussion.(From John's mail), (1)how to call the corresponding chic interface through NETC (2)what information we would get back from this interface? 4.2 External 
-
-This function will be implemented in rinv command. 
-
-The NETC interface is under discussion, so I just list some basic components in the command. Once the NETC interface is finished, we should add more components. It's syntax is as following. 
-
-1)SYNTAX 
-
-rinv noderange deconfig 
-
-  
-2)OPTIONS 
-
-deconfig - list all the deconfigure resources 
-
-3)OUTPUT: nodename: procdecfg: value 
-
-nodename: memdecfg: value 
-
-..... 
-
-## Getting the hardware VPD data from each CEC for the cluster
-
-5.1 main function 
-
-This requirement is to get the hardware VPD data and collect it from each CEC for the cluster to have a single collection of all VPD. The intention is that this could be used by the SE or sent back to IBM to help with problems that are called in. 
-
-Our goal is to get a much richer list of the hardware available. (For example, like the output lsvpd on an AIX system). This is needed for some of the ISNM and TEAL support which we are integrating into our systems. The command provide the following information: (1)System MTMS (2)All P7 chip FIP Resource Location Code, module part number and serial number (3)All Torrent Hub module FIP Resource Location Code, part number and serial number. 
-
-xCAT will provide a fsp-api to query this information through NETC, and will use rinv with vpd to invoke this fsp-api. 5.2 External 
-
-This function will be implemented in rinv command with vpd function. 
-
-Its syntax is as following: 
-
-1)SYNTAX 
-
-rinv noderange vpd 
-
-NOTE: the noderange only could be CECs. 
-
-2)OUTPUT 
-
-There are two styles of the format of the command's output. I choose the option 2. 
-
-Option 1: 
-
-nodename: 
-    
-               Hardware VPD Data:
-    
-    
-               System MTMS: value
-    
-    
-               Processor 1 : Processor Location Code : value
-    
-    
-                           Processor Part Number: value
-    
-    
-                           Processor Serial Number: value
-    
-    
-               Processor 2 : Processor Location Code :
-    
-    
-                           Processor Part Number:
-    
-    
-                           Processor Serial Number:
-    
-    
-                   .....
-    
-    
-               Torrent 1 :Torrent Location Code : value
-    
-    
-                           Torrent Part Number: value
-    
-    
-                           Torrent Serial Number: value
-    
-    
-               Torrent 2 : Torrent Location Code : value
-    
-    
-                           Torrent Part Number: value
-    
-    
-                           Torrent Serial Number: value
-    
-    
-                           ....
-    
-
-Option 2: 
-
-nodename: Hardware VPD Data 
-
-nodename: System MTMS 9125-F2C*P7IH028 
-
-nodename: processor id Processor Location Code Processor Part Number Processor Serial Number 
-
-nodename: 0 
-
-nodename: 1 
-
-nodename: 2 
-
-nodename: ... 
-
-nodename: Torrent id Torrent Location Code Torrent Part Number Torrent Serial Number 
-
-nodename: 0 
-
-nodename: 1 
-
-nodename: 2 
-
-nodename: .... 
&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Brian  Croswell</dc:creator><pubDate>Mon, 23 Jun 2014 15:35:43 -0000</pubDate><guid>https://sourceforge.net20642de1e870df5638462e8857b5605b525936aa</guid></item><item><title>Cluster_Recovery modified by Brian  Croswell</title><link>https://sourceforge.net/p/xcat/wiki/Cluster_Recovery/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v36
+++ v37
@@ -339,9 +339,10 @@
     *  Admin has noted a failure with cec1 octant0 where communication is lost to HFI and ethernet on xCAT SN "xcatsn1".
        It was found that xcatsn1 is seeing deconfigured resources in octant 0 for cec1.   
         rinv cec1 deconfig . 
-    (1)Admin has executed manual xCAT SN fail over using "snmove" to have compute nodes use the backup alternate SN
-    https://sourceforge.net/apps/mediawikixcat/index.php?title=Setting_Up_an_AIX_Hierarchical_Cluster#Using_a_backup_service_node
-    https://sourceforge.net/apps/mediawiki  /xcat/index.php?title=Setting_Up_a_Linux_Hierarchical_Cluster#Appendix_A:_Setup_backup_Service_Nodes 
+    (1)Admin has executed manual xCAT SN fail over using multiple xCAT commands including "snmove" to have compute nodes use the
+       backup xCAT SN. These tasks are defined in the Hierarchical Cluster documentation in section "Using a backup service node". 
+        https://sourceforge.net/apps/mediawikixcat/index.php?title=Setting_Up_an_AIX_Hierarchical_Cluster
+        https://sourceforge.net/apps/mediawiki/xcat/index.php?title=Setting_Up_a_Linux_Hierarchical_Cluster 
     (2) Admin contacts IBM Service indicating that there is a bad xCAT SN octant where a PE person will be available to
         physically move the I/O resources from octant 0 (LPARid 1) to octant 1 (LPARid 5) in cec1.
    (3)Drain any compute nodes found on cec1 and remove the LL resource group. Admin will then power off cec1. 
&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Brian  Croswell</dc:creator><pubDate>Mon, 23 Jun 2014 15:35:42 -0000</pubDate><guid>https://sourceforge.netdc0a6161ecd5ca861b9d1aa8ccaacc0abe6ab268</guid></item><item><title>Cluster_Recovery modified by Brian  Croswell</title><link>https://sourceforge.net/p/xcat/wiki/Cluster_Recovery/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v35
+++ v36
@@ -332,9 +332,54 @@

-### xCAT SN Ethernet/HFI scenario
-    
-    TBD  Need  LoadL  and xCAT  commands   
+### xCAT SN Ethernet/HFI scenario in same CEC
+
+This FIP scenario specifies the xCAT admin activity required to move or replace a bad octant being supported as an xCAT SN working in the same P775 CEC. The expectation is that the xCAT admin has noted that there is a failure with an ethernet adapter and that octant 0 (LPAR id 1) with xCAT SN "xcatsn1" is not working. The P775 admin has identified an available FIP octant 1 (LPAR id 5), xCAT node "fipcec1n5", in the same CEC "cec1" that can be used to set up the xCAT SN.
+    
+    *  Admin has noted a failure with cec1 octant0 where communication is lost to HFI and ethernet on xCAT SN "xcatsn1".
+        It was found that xcatsn1 is seeing deconfigured resources in octant 0 for cec1.   
+        rinv cec1 deconfig . 
+    (1)Admin has executed manual xCAT SN fail over using "snmove" to have compute nodes use the backup alternate SN
+    https://sourceforge.net/apps/mediawikixcat/index.php?title=Setting_Up_an_AIX_Hierarchical_Cluster#Using_a_backup_service_node
+    https://sourceforge.net/apps/mediawiki  /xcat/index.php?title=Setting_Up_a_Linux_Hierarchical_Cluster#Appendix_A:_Setup_backup_Service_Nodes 
+    (2) Admin contacts IBM Service indicating that there is a bad xCAT SN octant where a PE person will be available to
+        physically move the I/O resources from octant 0 (LPARid 1) to octant 1 (LPARid 5) in cec1.
+    (3)Drain any compute nodes found on cec1 and remove the LL resource group. Admin will then power off cec1. 
+        ll commands
+        rpower cec1  off
+    (4) PE Rep will do a physical update to cec1 where the ethernet adapter is replaced and the I/O resources are allocated to octant 1
+       (LPARid 5). The IBM PE service team needs to note which I/O slot resources have been moved.
+    (5) Admin executes the xCAT swapnodes command to have xCAT SN xcatsn1 now use FIP node fipcec1n5
+        octant 1 (lpar id 5) resources in xCAT DB. The FIP node fipcec1n5 will now take ownership of the bad octant 0.
+        It is also a good time to make updates to the FIP node groups fip_defective and fip_available for node "fipcec1n5"
+          swapnodes -c xcatsn1 -f fipcec1n5
+          chdef  -t group  -o fip_defective  members=fipcec1n5
+          chdef  -t group  -o fip_available -d members=fipcec1n5
+    (6) Admin powers up cec1 to standby so resources can be seen. The admin executes lsvm cec1 to note the octant resources. If changes
+        are needed, the admin executes lsvm on the xCAT SN to produce an output file, then updates the file to represent the proper
+        I/O settings. The admin then feeds the updated file to the chvm command. 
+         rpower  cec1 onstandby 
+         lsvm    cec1 
+         lsvm   xcatsn1 &amp;gt;/tmp/xcatsn1.info
+         edit /tmp/xcatsn1.info .. Make updates for octant information, and save file
+         cat  /tmp/xcatsn1.info | chvm  xcatsn1
+         lsvm  xcatsn1
+    (7) Admin executes the "getmacs" command to retrieve the new MAC address of the new ethernet adapter. Make sure this MAC address
+        is placed in the xcatsn1 node object. The admin will want to recreate the xcatsn1 NIM object to reflect the new MAC interface
+        if working with an AIX cluster.
+          getmacs  xcatsn1 -D
+          lsdef xcatsn1 -i mac
+          xcat2nim xcatsn1 -f  (AIX only)
+    (8) Since the disk subsystem was not affected, there is a good chance that you should be able to power up the xCAT SN and other
+        compute node octants located on cec1. The admin should do a thorough checkout making sure all xCAT SN
+        environments (ssh, DB2, and installation) are working properly. It is a good test to execute the xCAT updatenode command
+        against the xCAT SN.  If the xCAT SN is not working properly, the admin may want to do a reinstall on the xCAT SN.
+           rpower xcatsn1  on 
+           ssh root@xcatsn1   (try to login and validate OS and xCAT commands) 
+           updatenode  xcatsn1  
+    (9) Once the admin has validated that the xCAT SN xcatsn1 is running properly, they can schedule the appropriate time to execute
+        manual xCAT SN fail over task to have the selected compute nodes move from the backup xCAT SN.  This is the same activity
+        that is described in step 1  of this scenario .
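The lsvm/edit/chvm round trip in step (6) of the scenario above is just a text-file edit piped back into chvm. A hedged simulation of that edit step: lsvm/chvm are xCAT commands not available outside a management node, so a sample line stands in for lsvm output (its format is assumed for illustration only) and sed stands in for the manual edit.

```shell
# Simulated sketch of: lsvm xcatsn1 >/tmp/xcatsn1.info; edit the file; cat it into chvm.
# lsvm/chvm are xCAT commands not runnable here; the slot line format below is
# assumed for illustration only, not taken from real lsvm output.
info='1: 520/U78A9.001.1122233-P1-C14/0x21010202/2/1'   # hypothetical slot owned by lpar id 1
moved=$(echo "$info" | sed 's/^1:/5:/')                 # reassign ownership to lpar id 5
echo "$moved"
```

On a real cluster the edited file would then be piped into chvm, e.g. `cat /tmp/xcatsn1.info | chvm xcatsn1`, as the scenario shows.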

 ### xCAT SN Disk scenario
&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Brian  Croswell</dc:creator><pubDate>Mon, 23 Jun 2014 15:35:41 -0000</pubDate><guid>https://sourceforge.net1a3cf3342aac4ebc7864001ac7a98cb2407b7086</guid></item><item><title>Cluster_Recovery modified by Brian  Croswell</title><link>https://sourceforge.net/p/xcat/wiki/Cluster_Recovery/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v34
+++ v35
@@ -317,7 +317,7 @@
     TBD  Need  LoadL  and xCAT  commands

-## FIP xCAT Service Node Implementation
+## P775 FIP xCAT Service Node Implementation

 The FIP activity with P775 xCAT SN nodes/octants is to make sure that we have proper FIP nodes available in the P775 CEC that contain the available I/O resources, including an ethernet adapter and the SAS disk adapters. Since the xCAT SN is very prominent in the P775 cluster, the administrator needs to recover the xCAT SN very quickly. It would be advantageous to have the FIP resources made available in the same P775 CEC as the current xCAT SN. This helps the recovery because only one P775 CEC will need to be brought down when reorganizing the I/O resources to a new FIP available resource. Since the xCAT SN CEC has most of the I/O resources, the complexity of an xCAT SN failure will require additional debug activity and many additional xCAT administrator tasks to recover the xCAT SN. 

&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Brian  Croswell</dc:creator><pubDate>Mon, 23 Jun 2014 15:35:41 -0000</pubDate><guid>https://sourceforge.net4118ef58eb1e6396a3f26573ed3ea14272cb6605</guid></item><item><title>Cluster_Recovery modified by Brian  Croswell</title><link>https://sourceforge.net/p/xcat/wiki/Cluster_Recovery/</link><description>&lt;div class="markdown_content"&gt;&lt;pre&gt;--- v33
+++ v34
@@ -308,17 +308,14 @@
     TBD  Need  LoadL  and xCAT  commands (include power up, and compute node installation)

-## FIP Login Node Implementation
+## P775 FIP Login Node Implementation

 The FIP activity with P775 login nodes/octants is to make sure that we have proper FIP nodes available in the P775 CEC that contain an available ethernet adapter, and are available to the same xCAT SN. As with compute nodes, the login nodes are diskless, so they contain octant resources of HFI, CPU, and memory. But they do need an ethernet I/O resource included in the octant configuration. There are multiple P775 login nodes in the cluster, so the plan is that the P775 administrator will instruct the users to use one of the other login nodes while they rebuild a new login node from an available FIP resource. Based on the login node failure, the admin will need to locate a FIP octant, and then allocate the I/O ethernet adapter resource using xCAT "chvm" to the newly designated FIP node octant. They will execute the xCAT swapnodes command, which places the bad octant into the fip_defective node group and allocates the new octant as the new login node definition. The admin will need to see if the new login node requires a new ethernet "MAC" address, and they will install the new P775 login node from the appropriate xCAT SN with the same OS diskless image that was used with the previous login node. 

 ### Login node Replace Ethernet/HFI scenario

-    TBD  Need  LoadL  and xCAT  commands   
-    
-
-  
-
+    TBD  Need  LoadL  and xCAT  commands
+    

 ## FIP xCAT Service Node Implementation

&lt;/pre&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Brian  Croswell</dc:creator><pubDate>Mon, 23 Jun 2014 15:35:41 -0000</pubDate><guid>https://sourceforge.net63c428da3476e13b7586d0b47f03e115a96c44c8</guid></item></channel></rss>