The xCAT Rolling Update support requires the Tivoli Workload Scheduler LoadLeveler to schedule the nodes for update. LoadLeveler has been superseded by LSF in the IBM HPC Software stack, so this xCAT function is no longer supported.
This document provides an overview of the xCAT Rolling Update support available in xCAT 2.6.6 and later releases for AIX and pLinux clusters. This support is available in xCAT 2.5 as beta for AIX Clusters.
The xCAT rolling update support allows you to update the OS image on a subset of cluster nodes at a time, such that the remaining nodes in the cluster can still be running jobs. This process uses the Tivoli Workload Scheduler LoadLeveler to determine node availability and control the update process.
Based on input from the xCAT administrator, LoadLeveler flexible reservations are created for specific sets of nodes, and xCAT will update a set of nodes when its reservation becomes active.
xCAT rollupdate support provides the framework to update sets of nodes in your cluster. Nodes to be updated can be specified in one of two ways: with the updateall method, which updates a simple list of nodes in any order as they become available, or with explicit update groups that identify sets of nodes to be updated together.
Both approaches allow you to limit the number of nodes updated at any one time, to provide a list of commands to run before the nodes in an update group are shut down (prescripts), and to provide a list of commands to run after the nodes have been powered off but before they are rebooted (outofbandcmds).
Prescripts can be useful to notify applications that nodes will be going down, and to move critical services to backup nodes before they are shut down.
Out-of-band commands can be used to perform operations that require the nodes to be powered down, such as firmware updates.
Restrictions:
The following prerequisites must be set up before running the xCAT rolling update support:
If the xCAT management node is not the LoadLeveler central manager, you must do the following:
Set up your LoadLeveler central manager as an xCAT client machine so that it can run the xCAT runrollupdate command. To set up a remote xCAT client, see the xCAT wiki how-to Granting Users xCAT privileges & Setting Up a Remote Client. This includes installing the xCAT client and prerequisite rpms, and setting up the correct port communications with the xCAT management node. If your LoadLeveler Central Manager is an xCAT service node, you do not need to do anything special to set up this remote client -- all xCAT service nodes should already be able to run xCAT commands.
Create an /etc/LoadL.cfg file that allows LoadLeveler to update the xCAT database:
cat /etc/LoadL.cfg
LoadLUserid = <your LL admin userid>
LoadLGroupid = <your LL admin group>
LoadLDB = xcatdb
Note that in order to run the LoadLeveler database configuration option, you should have already set up ODBC support for LoadLeveler on your xCAT management node.
Create a LL machine definition for your xCAT management node. If you do not want any LL daemons to run on the MN (recommended), set the LL machine attributes:
SCHEDD_RUNS_HERE=false
STARTD_RUNS_HERE=false
Start the LoadLeveler master daemon on your xCAT management node:
llctl start
Make sure at least one LoadLeveler schedd machine is set up as a public schedd by setting the LL machine attribute for that machine:
SCHEDD_HOST=true
The LoadLeveler userid specified in the xCAT rollupdate input scheduser stanza must be authorized to run the xCAT "runrollupdate" command. See the xCAT wiki how-to Granting_Users_xCAT_privileges to set this up. This includes creating SSL certificates for the user and changing the xCAT policy table.
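For example, a hedged sketch of those steps for a hypothetical LoadLeveler userid lluser (the policy priority value of 6 is also an assumption; use any unused priority in your policy table):
/opt/xcat/share/xcat/scripts/setup-local-client.sh lluser
chtab priority=6 policy.name=lluser policy.commands=runrollupdate policy.rule=allow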
The input to the xCAT rollupdate command is a stanza file piped through STDIN. xCAT provides sample stanza files in:
/opt/xcat/share/xcat/rollupdate
The rolling update support provides two different methods of specifying the nodes to be updated: the updateall method, which updates all specified nodes in any order as they become available, and the standard method, which requires explicit update groups identifying nodes to be updated together. Different stanza keywords apply to these methods and are noted below.
Stanza input is specified as:
keyword = value
with one keyword per line. Unless otherwise noted in the descriptions below, if multiple stanza lines are specified for the same keyword, only the FIRST entry will be used and all others will be ignored.
Valid keywords are:
scheduler = scheduler
where scheduler is the job scheduler used to submit the rolling update jobs. Currently only "loadleveler" is supported.
scheduser = scheduser
where scheduser is the userid with authority to submit scheduler jobs. Note that LoadLeveler does not allow reservation jobs to be submitted by the root userid.
oldfeature = feature value
(optional) where feature value is an existing LoadLeveler feature value that is set for the nodes being updated. xCAT will remove this value from the LoadLeveler machine definition after the update has completed.
newfeature = feature value
(optional) where feature value is a new LoadLeveler feature value to be set for the nodes being updated. xCAT will add this value to the LoadLeveler machine definition after the update has completed. This can be useful for users that wish to schedule jobs that can only be run on nodes that have been updated.
updateall = yes | no
Specifies whether this rolling update request is for the updateall method of specifying nodes. Default is no.
This method should be used for simple compute node updates that have no special dependencies on other nodes and the update order is not important. Only those nodes that are currently active in the scheduler will be updated.
If updateall=yes, the following stanza entries MUST also be specified:
updateall_nodes
updateall_nodecount
job_template
job_dir
updateall_nodes = xCAT noderange
Used with updateall=yes. The xCAT noderange specifies the list of nodes that are to be included in this rolling update request (see the xCAT noderange man page). All nodes must be active in the job scheduler in order to be updated.
updateall_nodecount = numeric value
Used with updateall=yes. The numeric value specifies the number of nodes that will be reserved at one time in the scheduler and updated together. The smaller the number, the more scheduler reservation jobs that will be submitted.
NOTE: LoadLeveler performance decreases with large numbers of reservations. Do not set this value so low that you will exceed the maximum number of reservations allowed for your cluster or that you will degrade LL performance for your production jobs. You must also ensure that the LL MAX_RESERVATIONS setting is large enough to handle all the reservations that will be created.
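For example, a hedged sketch of the updateall-specific entries described above (the nodegroup name and count are hypothetical):
updateall=yes
updateall_nodes=compute
updateall_nodecount=8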
updategroup = name(noderange)
For standard updates, at least one updategroup or mapgroups stanza must be specified. The name specifies the name to be assigned to this update group. The noderange is an xCAT noderange that specifies the list of nodes that will be included in this update group (see the xCAT noderange man page).
Multiple updategroup stanzas may be supplied, one for each group of nodes to be updated.
mapgroups = nodegroup range
For standard updates, at least one updategroup or mapgroups stanza must be specified. The nodegroup range specifies a list or range of xCAT nodegroup names. This field is processed in the same way that xCAT noderange processing works for node names, except that it will generate a list of xCAT nodegroup names. Each nodegroup in the list will become its own update group for this rolling update request, with the update group name set to the nodegroup name.
Multiple mapgroups stanzas may be supplied.
For example, the following will create 10 updategroups from the 10 nodegroups named block01 to block10:
mapgroups=block[01-10]
mutex = updategroup,updategroup,...
(optional) Mutual exclusion for standard updates. The comma-delimited list of updategroup names specifies which update groups are mutually exclusive and must not be updated at the same time in order to maintain active resources within the cluster. By default, only one updategroup listed in the entry will be updated at a time (see mutex_count below to change this default).
You may list multiple mutex stanzas to identify different sets of mutual exclusion.
For example, the following states that the update processes for ns1 and for ns2 will not be allowed to run at the same time:
mutex=ns1,ns2
mutex = updategroup range,updategroup range,...
(optional) For standard updates. The comma-delimited list of updategroup ranges will be processed in the same way the xCAT noderange processing works for node names, except that it will generate a list of rolling update update group names. The first name in each range is paired together to make a mutual exclusion list, the second name in each range is paired together, etc.
For example, the following single entry:
mutex=block[1-3]a,block[1-3]b,block[1-3]c
would be equivalent to specifying these three entries:
mutex=block1a,block1b,block1c
mutex=block2a,block2b,block2c
mutex=block3a,block3b,block3c
nodegroup_mutex = nodegroup name
(optional) For standard updates. Mutual exclusion for any nodes in this xCAT nodegroup. For each updategroup listed above, if any node in that update group is a member of this xCAT nodegroup, the update group will be added to the mutex entry.
For example, you specify:
nodegroup_mutex=IOservers
Where your xCAT nodegroup is defined as:
IOservers=n4,n8,n12
And your updategroups specified above are:
updategroup=CEC1(n1-n4)
updategroup=CEC2(n5-n8)
updategroup=CEC3(n9-n12)
updategroup=CEC4(n13-n16)
The following mutex will be created:
mutex=CEC1,CEC2,CEC3
By default, only one of these update groups will be updated at a time unless a different mutex_count is specified for this stanza (see mutex_count below).
nodegroup_mutex = nodegroup name range
(optional) For standard updates. Specify multiple nodegroup_mutex statements with a single stanza. The nodegroup name range is expanded to a list of xCAT nodegroup names. This then becomes equivalent to multiple nodegroup_mutex stanzas.
For example, this stanza:
nodegroup_mutex=block[1-3]IO
would be equivalent to:
nodegroup_mutex=block1IO
nodegroup_mutex=block2IO
nodegroup_mutex=block3IO
Which, in turn, would be evaluated to create the correct mutex statements following the nodegroup_mutex processing described above.
mutex_count = numeric value
(optional) where numeric value is the number of update groups in a mutex statement that can be run at the same time. For example, if you have:
mutex=c1,c2,c3,c4
mutex_count=3
No more than 3 of the listed update groups may be processed at the same time, leaving at least one group of nodes active at all times.
The mutex_count stanza ONLY applies to the previous mutex or nodegroup_mutex stanza.
The default is 1.
translatenames = noderange:|pattern|replacement|
translatenames = noderange:/pattern/replacement/
(optional) If your scheduler will be using names for nodes that are different from xCAT node names (e.g. the scheduler is using a different administrative network), you will need to tell xCAT how to translate from xCAT node names to the node names registered with your scheduler.
pattern and replacement are perl regular expressions to be performed on the node names in noderange. See the xcatdb man page for more details on using regular expressions. Multiple translatenames stanzas are allowed. If an xCAT nodename exists in more than one noderange, the last translated value will be used.
For example, to translate names of the form "bb1s1" to "bb1sn1": translatenames=service:|bb(\d+)s(\d+)|bb($1)sn($2)|
To translate names of the form "node20" to "node20-hf2" translatenames=compute:/\z/-hf2/
maxupdates = numeric value | all
where numeric value is the maximum number of update groups that can be updated at one time (i.e. the maximum number of LoadLeveler rolling update reservations that can be active). This allows you to ensure you will always have enough computing resources in your cluster and that not all nodes will attempt to be updated at once.
A value of all specifies that there is no restriction.
reconfiglist = xCAT_nodelist
where xCAT_nodelist is the list of nodes (as known by xCAT) that xCAT will xdsh a LoadLeveler 'llctl reconfig' command to. xCAT will always send the reconfig command to the local xCAT management node, and to all nodes listed as the LL central managers and LL resource managers in the LL database. This is a list of additional machines required to immediately see any database changes xCAT may make. For example, all LL submit-only nodes should be added to this list so that any machine FEATURE changes are visible for job submission.
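For example, to have the reconfig command also sent to two submit-only machines (hypothetical xCAT node names):
reconfiglist=llsub1,llsub2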
jobtemplate = filename
where filename is a filename with full directory path that identifies the scheduler job command file template to be used to submit reservations. See sample LoadLeveler templates in:
/opt/xcat/share/xcat/rollupdate/*.tmpl
It is recommended that you take a copy of one of these sample files and edit it for your cluster. You may need to add a "# @ CLASS" entry or other stanzas in order to properly run the reservation in your cluster.
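For example (the job directory shown here is hypothetical; see the jobdir stanza below):
mkdir -p /u/lluser/rollupdate
cp /opt/xcat/share/xcat/rollupdate/*.tmpl /u/lluser/rollupdate/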
The following substitution values will be replaced by the xCAT rollupdate command to generate a unique job command file for each update group:
**[NODESET]** - the update group name for this reservation
**[JOBDIR]** - the directory specified in the rollupdate input **jobdir** stanza
**[LLHOSTFILE]** - (standard method, do NOT remove) the file generated by the xCAT rollupdate command that contains the list of LL machines in this update group that were available at the time the command was run
**[MUTEXRESOURCES]** - (do NOT remove) the list of LL resources created by xCAT to handle mutual exclusion and maxupdates
**[LLCOUNT]** - (required for the updateall method, do NOT remove) used by xCAT to set the number of machines to reserve
**[UPDATEALLFEATURE]** - (required for the updateall method, do NOT remove) used by xCAT to control the rolling update
jobdir = directory
where directory is the directory to write the generated LoadLeveler job command files and other xCAT rolling update data files to. For LL, this directory needs to be on a filesystem available to all nodes.
reservationcallback = /opt/xcat/bin/rollupdate
INTERNAL KEYWORD used for development only. This is the reservation notify or callback command. For LoadLeveler, this script must reside on the LoadLeveler central manager and will be called when the reservation for an updategroup becomes active.
reservationduration = time
where time is the maximum time to hold a LoadLeveler reservation for the update process. This value in minutes should be longer than the expected time to shutdown, update, and reboot all the nodes in an update group. xCAT will release the nodes from the reservation as they come back up, and will cancel the reservation when the last node has completed.
update_if_down = yes | no | cancel
Specifies whether nodes that are not active in the job scheduler should be updated.
For the rollupdate updateall method, only nodes with active startd daemons can be updated.
For the standard rollupdate method, only reservations for machines with status known to LoadLeveler can be created.
Default: update_if_down=cancel
prescript = command string
prescriptnodes = noderange
(optional) where command is the name of a command to be run on the xCAT management node before issuing the shutdown command for the nodes in the updategroup.
prescriptnodes is only supported with the standard rollupdate method. If it is also specified, the command will only be run for the nodes being updated from the updategroup that are also included in that xCAT noderange. If prescriptnodes is not specified (and for the updateall method), the command will be run for all the nodes in the updategroup.
For prescript, you may specify the string $NODELIST in the command string if you would like the comma-delimited list of xCAT nodenames passed into your command.
Prescripts can be used to run operations such as shutting down the global filesystem on all the nodes, or moving critical services to a backup server for specific nodes.
All prescripts must be executable by root.
Multiple prescript entries or prescript/prescriptnodes pairs of entries may be specified. Each command will be run in order.
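For example, a hedged sketch using a hypothetical administrator-provided script and xCAT nodegroup:
prescript=/install/custom/bin/move_services.sh $NODELIST
prescriptnodes=service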
skipshutdown = yes | no
Specifies whether sending a shutdown command to the nodes should be skipped. This should only be set to "yes" for updates to stateful nodes that are using rollupdate prescripts to apply updates to the nodes. No power down or reboot operation will be done, and no out-of-band scripts will be run. The bringupstatus/bringupappstatus values will be checked to determine when the update is complete after prescripts have run. Once the status is reached, the nodes will be removed from the scheduler reservation.
Default value is "no", all nodes will be shutdown and powered off.
shutdowntimeout = time
(optional) where time is the number of minutes xCAT should wait for an OS shutdown to complete before giving up and issuing a hard power off command and continuing with the rolling update process.
Default: shutdowntimeout=5
outofbandcmd = command string
outofbandnodes = noderange
(optional) where command is the name of a command to be run on the xCAT management node after the node has been shut down but before it is rebooted.
outofbandnodes is only supported with the standard rollupdate method. If it is also specified, the command will only be run for the nodes being updated from the updategroup that are also included in that xCAT noderange. If outofbandnodes is not specified (and for the updateall method), the command will be run for all the nodes in the updategroup.
For outofbandcmd, you may specify the string $NODELIST in the command string if you would like the comma-delimited list of xCAT nodenames passed into your command.
Out-of-band commands can be used to run operations when nodes must be powered down such as firmware updates.
All out-of-band commands must be executable by root.
Multiple outofbandcmd entries or outofbandcmd/outofbandnodes pairs of entries may be specified. Each command will be run in order.
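For example, a hedged sketch using a hypothetical firmware update script:
outofbandcmd=/install/custom/bin/update_firmware.sh $NODELIST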
bringuporder = noderange
(optional for standard update method only) where the nodes being updated from the updategroup that are also included in that xCAT noderange will be brought up first.
If more than one node in the updategroup matches a bringuporder entry, they will be brought up at the same time.
Multiple bringuporder entries may be specified, and they will be processed in order, completing bringup of all nodes in the previous entry before starting to power on the nodes in this entry.
Any nodes in the update group that are not listed in a bringuporder entry will be brought up at the end. Note that bringuporder will only be applied to nodes within an update group and does NOT affect how the scheduler will schedule the order of processing different update groups.
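For example, to bring up any service nodes in each update group first, followed by any storage nodes, and then all remaining nodes (hypothetical xCAT nodegroup names):
bringuporder=service
bringuporder=storage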
bringupstatus = status value
OR
bringupappstatus = appstatus value
(optional) The xCAT database node status or node appstatus value that xCAT will check and will wait for to determine that the node is up. Once this status is reached, xCAT will continue bringing up more nodes (if bringuporder is set) and will release this node from the scheduler reservation. If both attributes are set, only bringupappstatus will be used.
Default: bringupstatus=booted
bringuptimeout = time
(optional) The maximum time in minutes xCAT should wait after issuing the rpower on command for the nodes to reach bringupstatus or bringupappstatus before giving up. If using bringuporder and this timeout is reached for one set of nodes, no additional nodes will be attempted to be brought up. The scheduler reservation will be cancelled.
Default: bringuptimeout=10
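Putting these keywords together, a hedged sketch of a complete stanza file for a standard-method rolling update might look like the following (all node, group, user, and path names are hypothetical):
scheduler=loadleveler
scheduser=lluser
updategroup=CEC1(n1-n4)
updategroup=CEC2(n5-n8)
mutex=CEC1,CEC2
maxupdates=1
jobtemplate=/u/lluser/rollupdate/rollupdate.tmpl
jobdir=/u/lluser/rollupdate
reservationduration=90
update_if_down=cancel
bringupstatus=booted
bringuptimeout=10
This file would then be piped into the rollupdate command as described in the sections below.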
LoadLeveler must be running with the database configuration option in order to use the xCAT rolling update support.
The root userid on the xCAT management node must have LOADL_ADMIN privileges.
LoadLeveler provides many settings to support and control reservations in your job scheduler. Review the LoadLeveler documentation for a full list of these controls to ensure they are set correctly to support the flexible reservation jobs that will be submitted for xCAT rolling updates.
A few key settings are listed here:
SCHEDULER_TYPE=BACKFILL
MAX_RESERVATIONS
For each machine definition (or in the default machine definition):
reservation_permitted = true
For the scheduser LL userid specified in your xCAT rollupdate input stanzas (or the default LL user):
max_reservation_duration
max_reservation_expiration
max_reservations
If you do not have a schedd daemon running on your xCAT management node, at least one LL machine must be defined as a public scheduler, by setting SCHEDD_HOST for that machine:
SCHEDD_HOST=true
Since xCAT changes LL FEATURE values for machine definitions during the rolling update process, prior to LoadLeveler 5.1.0.1-3 you cannot use LL Machine Groups, and you must have explicit machine definitions for each node. In that case, you should run with machine_authenticate set:
MACHINE_AUTHENTICATE=true
Every node that is to be scheduled by LoadLeveler for updating must have a startd daemon running. If you do not wish to have any user jobs running on this node (e.g. this is a storage node), you should still run the startd daemon but not allow any starters:
STARTD_RUNS_HERE=true
MAX_STARTERS=0
During the rolling update process, even though a node may be rebooted, the machine status will still appear to be running when queried with LL status commands. You should have your node configured to automatically start the LL daemons again after it reboots, either through an operating system mechanism such as /etc/inittab or chkconfig, or through an xCAT postscript for diskless nodes. Then when xCAT removes the node from the flexible reservation used to update the node, LL can start running jobs again on the updated node.
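For example, on a stateful AIX node you might add an inittab entry for this (a hedged sketch; the llctl path assumes the default LoadLeveler install location):
mkitab "ll:2:once:/usr/lpp/LoadL/full/bin/llctl start"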
The following LL setting is not unique to xCAT Rolling Updates, but required for any LL cluster that has nodes defined with no swap space, which is the case for xCAT diskless or statelite nodes:
VM_IMAGE_ALGORITHM=FREE_PAGING_SPACE_PLUS_FREE_REAL_MEMORY
LL settings that may be used or changed by xCAT Rolling Updates:
FLOATING_RESOURCES
SCHEDULE_BY_RESOURCES
CENTRAL_MANAGER_LIST
RESOURCE_MGR_LIST
(for machines) FEATURE
The general process flow for the xCAT Rolling Update support is:
tail -f /var/log/xcat/rollupdate.log
xCAT appends to the log file with each rollupdate run, so you may wish to move or remove the log file before a new rollupdate process.
cat <your stanza file> | rollupdate --verbose
If you want to test your input first to make sure it is what you want, and to view the LL reservation job command files and other data files generated by xCAT, run the command with the test option:
cat <your stanza file> | rollupdate --test --verbose
The output files will be placed in the directory you specified in the jobdir stanza. The verbose keyword is optional, but allows you to view detailed progress. The rollupdate command will do the following:
Monitoring:
You can view the flexible reservations that xCAT submitted:
llqres
or
llqres -l -R <reservation id>
for a specific reservation. If any reservation is stuck in a "Waiting" status that you feel should be active, you can check the job associated with the reservation (find the job id from the llqres -l output above):
llq -s <job id>
and debug as you would for any LL reservation or job.
When a reservation becomes active, LoadLeveler will invoke the notify script created above. This will invoke the internal xCAT runrollupdate command for the update group.
Updating xCAT software as part of the rolling update process is not supported. The xCAT software on your management node can be updated without impacting your running cluster, and should be done manually. See Setting_Up_a_Linux_xCAT_Mgmt_Node for instructions on updating your xCAT management node.
In hierarchical clusters, the xCAT rolling update process should not be used to update the xCAT software on service nodes. See the xCAT hierarchical documents: Setting_Up_an_AIX_Hierarchical_Cluster or Setting_Up_a_Linux_Hierarchical_Cluster.
Restrictions:
If you are performing a rolling update in an xCAT hierarchical cluster, there are a few special considerations that you will need to address if your service nodes will be rebooted as part of the update process. Normally, you should try to update your service nodes manually outside of the xCAT rolling update process so that you have more control over the update. However, if you are performing CEC firmware updates, and will need to power down the CEC that contains your service node, you will need to think about the services your service node is providing to your compute nodes and how to plan your updates.
If at all possible, you should create your update groups such that a service node and all of the compute nodes it serves can be updated together at one time. When you use this approach, make sure to use the bringuporder stanza in the rollupdate command input to bring up your service node first so that it is running when your compute nodes start to come up.
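For example (the service node and compute node names are hypothetical):
updategroup=block1(sn1,c1n001-c1n032)
bringuporder=sn1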
If it is not possible to update a service node together with all of its compute nodes because you will lose critical cluster services, you will need to plan more carefully. First, you can only bring down a service node if you have some type of backup for the services it is providing your compute nodes, or if you can tolerate the temporary loss of those services. Some things to consider:
If your cluster is running IBM HPC software such as GPFS or LoadLeveler, you have additional cluster services that you will need to determine how to update and how to keep active during an xCAT rolling update process. Whenever possible, all cluster infrastructure nodes should be manually updated before running the xCAT rolling update process for compute nodes.
The term "infrastructure nodes" will be used to refer to any nodes that are not compute nodes. These include:
It is assumed that most updates to infrastructure nodes can be applied without impacting user jobs running on compute nodes. If xCAT service nodes are stateful (full-disk install) and are also used to run other infrastructure services (such as LL region managers), updates to infrastructure software that runs on these nodes can be applied using the xCAT updatenode command without rebooting the node. For updates to other servers that are running diskless or statelite images (e.g. GPFS IO servers), these nodes can be rebooted individually to load a new OS image without impacting the rest of the cluster.
GPFS software can be migrated to a new level one node at a time without impacting the running GPFS cluster. In order to upgrade GPFS, the GPFS daemons must be stopped and all GPFS filesystems must be unmounted on the node. For GPFS infrastructure nodes, it is important to manage the updates such that all GPFS cluster services remain operational.
When updating GPFS, the following will need to be considered:
Please consult the GPFS documentation for updating your GPFS software.
LoadLeveler infrastructure nodes (nodes running the central manager, resource manager, and region manager daemons) are required to all be running the same version of LL. For maintaining critical cluster services, these daemons should all have backup servers. Locating both primary and backup servers on the xCAT management node or service nodes, and using full-disk install service nodes (i.e. not diskless) will ensure the best cluster stability during a rolling update process. In order to upgrade LL simultaneously on these nodes while still allowing jobs to run on the cluster, the LL software should be updated before running the rolling update process for your cluster. You should do this manually following the documented LoadLeveler procedures:
See the LoadLeveler documentation for more information.
Updating infrastructure nodes that require a node or CEC reboot is more complicated because the nodes and cluster-wide services that depend on them must be considered in the update algorithm.
Update groups must be defined to encompass a complete CEC. When an update group is being updated, the cluster-wide services that need to be maintained during the update are:
Therefore, you should use separate mutex stanzas in your xCAT rollupdate command input to define the following mutual exclusion sets:
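For example, a hedged sketch using hypothetical xCAT nodegroup names for GPFS I/O servers and xCAT service nodes, so that the update groups containing each type of server are not all taken down at the same time:
nodegroup_mutex=gpfsio
nodegroup_mutex=service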
RESTRICTION NOTE: For LoadLeveler servers and their backups that may be running on service nodes in your cluster, xCAT rolling updates currently MUST NOT include any CECs containing nodes actively running as the LL central manager, resource manager, region managers, or schedd servers. Coordinating automatic updates and reboots of these servers while still maintaining full LoadLeveler operation for jobs queued and running across the cluster is very complex. It may be possible to provide rollupdate prescripts to automatically move these services to backups during the update process, but that has not been tested, and the stability of LoadLeveler during such an update has not been verified. At a minimum, the xcatd daemon on the xCAT management node requires continuous contact with the LL central manager to perform the rolling update operations and will not be able to tolerate outages while the central manager migrates to its backup server.
All CECs with xCAT service nodes should be updated before updating the other CECs. This will ensure that the service node will be available when a compute node reboots and needs to load a new OS image. Therefore, two separate xCAT rollupdates should be performed: the first one for the service node CECs only, and the second one for all the other CECs in the cluster. When updating a service node CEC, make sure to set the bringuporder such that the service node is activated before trying to bring up other nodes in that CEC.
Use the prescript and prescriptnodes stanzas to define tasks that should be run to move critical services to their backup servers before shutting down those nodes during an update.
This section contains random hints and tips for working with the xCAT Rolling Update support.
NOTE: This section is an ongoing "work in progress". As you gain experience with the xCAT Rolling Update support, feel free to update this section with your favorite tidbits of information that may help others.
You have run your rollupdate command and submitted the reservations to LoadLeveler. How do you know things are really working? Here are a few hints:
If you have run rollupdate with the --verbose option, watch the log output:
tail -F /var/log/xcat/rollupdate.log
You should see entries appear when xcatd receives a callback invocation from the activation of a LoadLeveler reservation and details of processing that update group.
View the RollUpdate appstatus value for your nodes:
lsdef <noderange> -i appstatus,appstatustime
The rollupdate process will change the appstatus value for the "RollingUpdate" application. The appstatustime is set to the last time ANY application changed the appstatus value, so it may not accurately reflect the last time appstatus was changed for rolling updates. appstatus values include:
update_job_submitted: the LL reservation has been submitted, but xcatd has not received a callback invocation for this node yet
ERROR_bringuptimeout_exceeded_for_previous_node: You have specified a bringuporder for nodes in this update group. This node was waiting for a node higher in the bringuporder to complete its bringup process. However, that node had an ERROR_bringuptimeout_exceeded failure, so the bringup process for this node was cancelled. No 'rpower on' command was sent to this node to attempt to bring it up. It is still powered down.
Use LoadLeveler status and query commands to view the state of your nodes and your flexible reservations. Some commands that are useful:
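llstatus - summary status of all machines
llstatus -l <machine name> - detailed status for a specific machine
llqres - list all reservations
llqres -l -R <reservation id> - details for a specific reservation
llq -s <job id> - detailed information on why a job step has not started
(All of these are standard LoadLeveler commands; the values in angle brackets are placeholders.)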
Verify that your LoadLeveler reservations will eventually be activated once nodes become available. Refer to the LoadLeveler product documentation for help in figuring this out. This may take some debug and investigation work on your part and is outside the scope of this document.
Note that based on your rollupdate mutex, nodegroup_mutex, maxupdates, and updateall input stanza entries, xCAT sets LL FLOATING_RESOURCES, SCHEDULE_BY_RESOURCES, and machine FEATURE values in your LL database that are used by the reservation job command file that is submitted to LL. Errors in these entries may prevent your reservations from activating.
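For example, to check which FEATURE values LoadLeveler currently sees for a machine (the machine name is hypothetical):
llstatus -l c1n01 | grep -i feature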
The xCAT rolling update process is tightly integrated with the LoadLeveler Flexible Reservation support. Based on your input to the rollupdate command, xCAT generates and submits LL flexible reservations, and then relies on LL to notify xcatd when the reservation becomes active so that the nodes can be updated.
As input, the rollupdate command uses the LL job command template file you specify in the jobtemplate input stanza. xCAT provides sample template files in /opt/xcat/share/xcat/rollupdate. When customizing this template, be sure to keep the substitution strings created by xCAT, especially those for node counts, Feature requirements, and step_resources. xCAT relies on these to control the proper activation of reservations based on your input specifications.
The rollupdate command will use the input stanzas and this template to generate multiple files. They will be placed in the directory you specified with the jobdir stanza. To review the generated files without actually submitting the reservations to LoadLeveler, run the command in test mode:
cat <my_rollupdate_input> | rollupdate --verbose --test
While the xCAT rolling update is in process, and depending on the nature of the updates being made, it may be important to control user jobs such that they only run on all old nodes or on all updated nodes. There is nothing inherent in the xCAT rolling update process or in LoadLeveler that will control this for you.
However, you may be able to take advantage of LL machine FEATURE definitions and cluster job filters to help you with this issue. One possible process to control your jobs may be as follows:
1. Add a value to all of your machine definition FEATURE attributes to indicate the nodes are in an "old" state.
2. Create a job filter so that all jobs that are submitted will require this FEATURE value (a hedged sketch of such a filter follows this list).
3. Submit your xCAT rolling update process specifying the "oldfeature" and "newfeature" stanzas.
4. As xCAT updates nodes, it will remove the "old_feature" value and replace it with the "new_feature" value in those LL machine definitions.
5. At some point when a significant number of nodes in your cluster have been updated, you can change your job filter to now require machines with FEATUREs set to the new_feature value.
6. Since the cluster availability of FEATURE requirements is determined at the time a job is submitted, it may be that some jobs requiring the "old_feature" may no longer have available resources by the time that job gets to run. At this point, you can either cancel the jobs and have them be re-submitted, or once all of the nodes have been updated, again set all the LL machine definition FEATUREs to have both "old_feature" and "new_feature" values to allow those jobs to run.
7. Once all the cluster nodes have been updated, you can remove your job filter changes made for this rolling update process.
8. Once all of the jobs that have been submitted using one of the changed filters has completed running, you can remove both the "old_feature" and "new_feature" values from your machine definitions.
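A hedged sketch of such a job filter follows (this is not taken from the xCAT or LoadLeveler documentation). A LoadLeveler submit filter is a program named by the SUBMIT_FILTER configuration keyword; it reads the job command file on stdin and writes the (possibly modified) file to stdout. The filter below inserts a requirement for a hypothetical FEATURE value of "oldimage"; a production filter would also need to handle jobs that already specify their own requirements:
#!/bin/sh
# Sketch of a LoadLeveler submit filter: copy the job command file from stdin
# to stdout, inserting a Feature requirement before each "# @ queue" statement.
# "oldimage" is a hypothetical FEATURE value.
awk '
  /^#[ \t]*@[ \t]*queue/ { print "# @ requirements = (Feature == \"oldimage\")" }
  { print }
'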
Wiki: AIX_System_Migration
Wiki: Granting_Users_xCAT_privileges
Wiki: Power_775_Cluster_Documentation
Wiki: Setting_Up_a_Linux_Hierarchical_Cluster
Wiki: Setting_Up_a_Linux_xCAT_Mgmt_Node
Wiki: Setting_Up_an_AIX_Hierarchical_Cluster
Wiki: Setting_up_the_IBM_HPC_Stack_in_an_xCAT_Cluster
Wiki: Updating_AIX_Software_on_xCAT_Nodes
Wiki: XCAT_2.6.6_Release_Notes