#1784 repeat update same nodes in updateall method

2.6
closed
5
2012-09-19
2011-03-04
No

A single rolling update that uses the updateall method sometimes tries to update the same nodes twice.

The environment has 6 compute nodes, with maxupdates=2 and updateall_nodecount=2. Two reservations therefore become active first while a third waits. The failing case went as follows:

hv32s5fp22/23 and hv32s5fp24/30 were updated first and completed.
When the third reservation became active, it tried to update hv32s5fp22/23 again. Because their appstatus was already update_complete,
rollupdate reported the error: Node hv32s5fp22/23 appstatus not in valid state for rolling update, and then exited. As a result, hv32s5fp21/25 were never updated. To make the appstatus changes easier to trace, I added some print statements to xCAT::Utils::setAppStatus.

Thu Mar 3 21:06:48 2011 Running rollupdate command...
Thu Mar 3 21:06:48 2011 Creating LL job command files
Thu Mar 3 21:06:57 2011 Running local command 'export EXTSHM=ON;llrctl reconfig'
Thu Mar 3 21:07:00 2011 Reading LL job template file /opt/xcat/share/xcat/rollupdate/llall.tmpl
Thu Mar 3 21:07:00 2011 Running command: llstatus -r %n %sta 2>/dev/null
Thu Mar 3 21:07:01 2011 Writing xCAT rolling update data file /u/loadl/rollupdate_jobs/rollupdate_UPDATEALL1299208008.data
Thu Mar 3 21:07:01 2011 Writing LL reservation callback script /u/loadl/rollupdate_jobs/rollupdate_UPDATEALL1299208008.rsvcb
Thu Mar 3 21:07:01 2011 Writing LL job command file /u/loadl/rollupdate_jobs/rollupdate_UPDATEALL1299208008.cmd
Thu Mar 3 21:07:01 2011 Running command 'llconfig -h hv32s5fp21.cluster.net hv32s5fp22.cluster.net hv32s5fp23.cluster.net hv32s5fp24.cluster.net hv32s5fp25.cluster.net hv32s5fp30.cluster.net -d FEATURE'
Thu Mar 3 21:07:04 2011 Return code: 0
Thu Mar 3 21:07:04 2011 Running command 'llconfig -N -h hv32s5fp21.cluster.net hv32s5fp22.cluster.net hv32s5fp23.cluster.net hv32s5fp24.cluster.net hv32s5fp25.cluster.net hv32s5fp30.cluster.net -c FEATURE=" newvalue XCAT_UPDATEALL1299208008"'
Thu Mar 3 21:07:07 2011 Return code: 0
Thu Mar 3 21:07:10 2011 Running local command 'export EXTSHM=ON;llrctl reconfig'

set hv32s5fp21 hv32s5fp22 hv32s5fp23 hv32s5fp24 hv32s5fp25 hv32s5fp30 appstatus RollingUpdate=update_job_submitted
Thu Mar 3 21:07:13 2011 Running command: su - loadl "-c llmkres -x -d 15 -f /u/loadl/rollupdate_jobs/rollupdate_UPDATEALL1299208008.cmd -p /u/loadl/rollupdate_jobs/rollupdate_UPDATEALL1299208008.rsvcb "
[YOU HAVE NEW MAIL]
llmkres: The reservation hv32s5fp03.cluster.net.99.r has been successfully made.
llmkres: The job "hv32s5fp03.cluster.net.99" has been submitted.
Thu Mar 3 21:07:17 2011 runrollupdate request for loadleveler /u/loadl/rollupdate_jobs/rollupdate_UPDATEALL1299208008.data hv32s5fp03.cluster.net.99.r
Thu Mar 3 21:07:17 2011 runrollupdate reading datafile /u/loadl/rollupdate_jobs/rollupdate_UPDATEALL1299208008.data
Thu Mar 3 21:07:17 2011 UPDATEALL1299208008: Running command 'llqres -r -s -R hv32s5fp03.cluster.net.99.r 2>>/var/log/xcat/rollupdate.log'
[YOU HAVE NEW MAIL]
llmkres: The reservation hv32s5fp03.cluster.net.100.r has been successfully made.
llmkres: The job "hv32s5fp03.cluster.net.100" has been submitted.
Thu Mar 3 21:07:19 2011 Return code: 0
Thu Mar 3 21:07:19 2011 hv32s5fp03.99.r!FLEXIBLE!Thu Mar 3 21:07:16 2011!loadl!No_Group!Thu Mar 3 21:07:16 2011!15!Thu Mar 3 21:22:16 2011!hv32s5fp03.cluster.net.99.0!no!no!firm!ACTIVE!loadl!Thu Mar 3 21:07:16 2011!0!!0!!/u/loadl/rollupdate_jobs/rollupdate_UPDATEALL1299208008.rsvcb!!2!hv32s5fp22,hv32s5fp23!XCATROLLINGUPDATE_MAXUPDATES(1/1/1)!0!
Thu Mar 3 21:07:19 2011 Hostlist: hv32s5fp22,hv32s5fp23
Thu Mar 3 21:07:19 2011 runrollupdate request for loadleveler /u/loadl/rollupdate_jobs/rollupdate_UPDATEALL1299208008.data hv32s5fp03.cluster.net.100.r

set hv32s5fp22 hv32s5fp23 appstatus RollingUpdate=running_prescripts
Thu Mar 3 21:07:19 2011 runrollupdate reading datafile /u/loadl/rollupdate_jobs/rollupdate_UPDATEALL1299208008.data
Thu Mar 3 21:07:19 2011 UPDATEALL1299208008: Running command 'llqres -r -s -R hv32s5fp03.cluster.net.100.r 2>>/var/log/xcat/rollupdate.log'

set hv32s5fp22 hv32s5fp23 appstatus RollingUpdate=shutting_down
Thu Mar 3 21:07:19 2011 UPDATEALL1299208008: Running command 'xdsh hv32s5fp22,hv32s5fp23 -v shutdown -F &'
[YOU HAVE NEW MAIL]
llmkres: The reservation hv32s5fp03.cluster.net.101.r has been successfully made.
llmkres: The job "hv32s5fp03.cluster.net.101" has been submitted.
Thu Mar 3 21:07:21 2011 Return code: 0
Thu Mar 3 21:07:21 2011 hv32s5fp03.100.r!FLEXIBLE!Thu Mar 3 21:07:17 2011!loadl!No_Group!Thu Mar 3 21:07:17 2011!15!Thu Mar 3 21:22:17 2011!hv32s5fp03.cluster.net.100.0!no!no!firm!ACTIVE!loadl!Thu Mar 3 21:07:17 2011!0!!0!!/u/loadl/rollupdate_jobs/rollupdate_UPDATEALL1299208008.rsvcb!!2!hv32s5fp24,hv32s5fp30!XCATROLLINGUPDATE_MAXUPDATES(1/1/1)!0!
Thu Mar 3 21:07:21 2011 Hostlist: hv32s5fp24,hv32s5fp30

set hv32s5fp24 hv32s5fp30 appstatus RollingUpdate=running_prescripts

set hv32s5fp24 hv32s5fp30 appstatus RollingUpdate=shutting_down
Thu Mar 3 21:07:21 2011 UPDATEALL1299208008: Running command 'xdsh hv32s5fp24,hv32s5fp30 -v shutdown -F &'
Thu Mar 3 21:07:39 2011 UPDATEALL1299208008: Running command 'rpower hv32s5fp22,hv32s5fp23 stat'
Thu Mar 3 21:07:41 2011 UPDATEALL1299208008: Running command 'rpower hv32s5fp24,hv32s5fp30 stat'
Thu Mar 3 21:08:01 2011 UPDATEALL1299208008: Running command 'rpower hv32s5fp22,hv32s5fp23 stat'
Thu Mar 3 21:08:03 2011 UPDATEALL1299208008: Running command 'rpower hv32s5fp24,hv32s5fp30 stat'
Thu Mar 3 21:08:22 2011 UPDATEALL1299208008: Running command 'rpower hv32s5fp22,hv32s5fp23 stat'
Thu Mar 3 21:08:24 2011 UPDATEALL1299208008: Running command 'rpower hv32s5fp24,hv32s5fp30 stat'
Thu Mar 3 21:08:44 2011 UPDATEALL1299208008: Running command 'rpower hv32s5fp22,hv32s5fp23 stat'

set hv32s5fp22 hv32s5fp23 appstatus RollingUpdate=running_outofbandcmds

set hv32s5fp22 hv32s5fp23 appstatus RollingUpdate=rebooting
Thu Mar 3 21:08:45 2011 UPDATEALL1299208008: Running command 'rpower hv32s5fp22,hv32s5fp23 on'
Thu Mar 3 21:08:46 2011 UPDATEALL1299208008: Running command 'rpower hv32s5fp24,hv32s5fp30 stat'

set hv32s5fp24 hv32s5fp30 appstatus RollingUpdate=running_outofbandcmds

set hv32s5fp24 hv32s5fp30 appstatus RollingUpdate=rebooting
Thu Mar 3 21:08:47 2011 UPDATEALL1299208008: Running command 'rpower hv32s5fp24,hv32s5fp30 on'
Thu Mar 3 21:08:55 2011 UPDATEALL1299208008: Checking hv32s5fp22,hv32s5fp23 xCAT database status for value booted
Thu Mar 3 21:08:59 2011 UPDATEALL1299208008: Checking hv32s5fp24,hv32s5fp30 xCAT database status for value booted
Thu Mar 3 21:09:15 2011 UPDATEALL1299208008: Checking hv32s5fp22,hv32s5fp23 xCAT database status for value booted
Thu Mar 3 21:09:19 2011 UPDATEALL1299208008: Checking hv32s5fp24,hv32s5fp30 xCAT database status for value booted
Thu Mar 3 21:09:35 2011 UPDATEALL1299208008: Checking hv32s5fp22,hv32s5fp23 xCAT database status for value booted
Thu Mar 3 21:09:39 2011 UPDATEALL1299208008: Checking hv32s5fp24,hv32s5fp30 xCAT database status for value booted
Thu Mar 3 21:09:55 2011 UPDATEALL1299208008: Checking hv32s5fp22,hv32s5fp23 xCAT database status for value booted
Thu Mar 3 21:09:59 2011 UPDATEALL1299208008: Checking hv32s5fp24,hv32s5fp30 xCAT database status for value booted
Thu Mar 3 21:10:15 2011 UPDATEALL1299208008: Checking hv32s5fp22,hv32s5fp23 xCAT database status for value booted
Thu Mar 3 21:10:19 2011 UPDATEALL1299208008: Checking hv32s5fp24,hv32s5fp30 xCAT database status for value booted
Thu Mar 3 21:10:35 2011 UPDATEALL1299208008: Checking hv32s5fp22,hv32s5fp23 xCAT database status for value booted
Thu Mar 3 21:10:35 2011 UPDATEALL1299208008: remove_LL_reservations for hv32s5fp23,hv32s5fp22
Thu Mar 3 21:10:35 2011 UPDATEALL1299208008: Running command 'llqres -r -s -R hv32s5fp03.cluster.net.99.r 2>>/var/log/xcat/rollupdate.log'
Thu Mar 3 21:10:35 2011 Return code: 0
Thu Mar 3 21:10:35 2011 hv32s5fp03.99.r!FLEXIBLE!Thu Mar 3 21:07:16 2011!loadl!No_Group!Thu Mar 3 21:07:16 2011!15!Thu Mar 3 21:22:16 2011!hv32s5fp03.cluster.net.99.0!no!no!firm!ACTIVE!loadl!Thu Mar 3 21:07:16 2011!0!!0!!/u/loadl/rollupdate_jobs/rollupdate_UPDATEALL1299208008.rsvcb!!2!hv32s5fp22,hv32s5fp23!XCATROLLINGUPDATE_MAXUPDATES(1/1/1)!0!
Thu Mar 3 21:10:35 2011 UPDATEALL1299208008: Running command 'llconfig -h hv32s5fp23 -d FEATURE'
Thu Mar 3 21:10:38 2011 Return code: 0
Thu Mar 3 21:10:38 2011 UPDATEALL1299208008: Running command 'llconfig -N -h hv32s5fp23 -c FEATURE=" newvalue "'
Thu Mar 3 21:10:39 2011 UPDATEALL1299208008: Checking hv32s5fp24,hv32s5fp30 xCAT database status for value booted
Thu Mar 3 21:10:39 2011 UPDATEALL1299208008: remove_LL_reservations for hv32s5fp30,hv32s5fp24
Thu Mar 3 21:10:39 2011 UPDATEALL1299208008: Running command 'llqres -r -s -R hv32s5fp03.cluster.net.100.r 2>>/var/log/xcat/rollupdate.log'
Thu Mar 3 21:10:40 2011 Return code: 0
Thu Mar 3 21:10:40 2011 hv32s5fp03.100.r!FLEXIBLE!Thu Mar 3 21:07:17 2011!loadl!No_Group!Thu Mar 3 21:07:17 2011!15!Thu Mar 3 21:22:17 2011!hv32s5fp03.cluster.net.100.0!no!no!firm!ACTIVE!loadl!Thu Mar 3 21:07:17 2011!0!!0!!/u/loadl/rollupdate_jobs/rollupdate_UPDATEALL1299208008.rsvcb!!2!hv32s5fp24,hv32s5fp30!XCATROLLINGUPDATE_MAXUPDATES(1/1/1)!0!
Thu Mar 3 21:10:40 2011 UPDATEALL1299208008: Running command 'llconfig -h hv32s5fp30 -d FEATURE'
Thu Mar 3 21:10:42 2011 Return code: 0
Thu Mar 3 21:10:44 2011 Return code: 0
Thu Mar 3 21:10:44 2011 UPDATEALL1299208008: Running command 'llconfig -N -h hv32s5fp30 -c FEATURE=" newvalue "'
Thu Mar 3 21:10:45 2011 Running local command 'export EXTSHM=ON;llrctl reconfig'
Thu Mar 3 21:10:47 2011 Return code: 0
Thu Mar 3 21:10:50 2011 UPDATEALL1299208008: Running command 'llconfig -h hv32s5fp22 -d FEATURE'
Thu Mar 3 21:10:51 2011 Running local command 'export EXTSHM=ON;llrctl reconfig'
Thu Mar 3 21:10:54 2011 Return code: 0
Thu Mar 3 21:10:54 2011 UPDATEALL1299208008: Running command 'llconfig -N -h hv32s5fp22 -c FEATURE=" newvalue "'
Thu Mar 3 21:10:58 2011 UPDATEALL1299208008: Running command 'llconfig -h hv32s5fp24 -d FEATURE'
Thu Mar 3 21:10:58 2011 Return code: 0
Thu Mar 3 21:11:02 2011 Return code: 0
Thu Mar 3 21:11:02 2011 UPDATEALL1299208008: Running command 'llconfig -N -h hv32s5fp24 -c FEATURE=" newvalue "'
Thu Mar 3 21:11:03 2011 Running local command 'export EXTSHM=ON;llrctl reconfig'
Thu Mar 3 21:11:06 2011 Return code: 0
Thu Mar 3 21:11:08 2011 Running command 'llstatus -Lmachine -h hv32s5fp23 hv32s5fp22 -l | grep -i feature | grep -i " XCAT_UPDATEALL1299208008 "'
Thu Mar 3 21:11:09 2011 Return code: 1
Thu Mar 3 21:11:09 2011 UPDATEALL1299208008: Running command 'llrmres -R hv32s5fp03.cluster.net.99.r'
Thu Mar 3 21:11:10 2011 Running local command 'export EXTSHM=ON;llrctl reconfig'

set hv32s5fp23 hv32s5fp22 appstatus RollingUpdate=update_complete
Thu Mar 3 21:11:11 2011 UPDATEALL1299208008: Rolling update complete.

Thu Mar 3 21:11:13 2011 runrollupdate request for loadleveler /u/loadl/rollupdate_jobs/rollupdate_UPDATEALL1299208008.data hv32s5fp03.cluster.net.101.r
Thu Mar 3 21:11:13 2011 runrollupdate reading datafile /u/loadl/rollupdate_jobs/rollupdate_UPDATEALL1299208008.data
Thu Mar 3 21:11:13 2011 UPDATEALL1299208008: Running command 'llqres -r -s -R hv32s5fp03.cluster.net.101.r 2>>/var/log/xcat/rollupdate.log'
Thu Mar 3 21:11:14 2011 Return code: 0
Thu Mar 3 21:11:14 2011 hv32s5fp03.101.r!FLEXIBLE!Thu Mar 3 21:07:20 2011!loadl!No_Group!Thu Mar 3 21:11:11 2011!15!Thu Mar 3 21:26:11 2011!hv32s5fp03.cluster.net.101.0!no!no!firm!ACTIVE!loadl!Thu Mar 3 21:07:20 2011!0!!0!!/u/loadl/rollupdate_jobs/rollupdate_UPDATEALL1299208008.rsvcb!!2!hv32s5fp22,hv32s5fp23!XCATROLLINGUPDATE_MAXUPDATES(1/1/1)!0!
Thu Mar 3 21:11:14 2011 Hostlist: hv32s5fp22,hv32s5fp23
Thu Mar 3 21:11:14 2011 UPDATEALL1299208008: Node hv32s5fp22 appstatus not in valid state for rolling update
The following nodelist will not be processed:
hv32s5fp22,hv32s5fp23
Thu Mar 3 21:11:18 2011 Running command 'llstatus -Lmachine -h hv32s5fp30 hv32s5fp24 -l | grep -i feature | grep -i " XCAT_UPDATEALL1299208008 "'
Thu Mar 3 21:11:19 2011 Return code: 1
Thu Mar 3 21:11:19 2011 UPDATEALL1299208008: Running command 'llrmres -R hv32s5fp03.cluster.net.100.r'

set hv32s5fp30 hv32s5fp24 appstatus RollingUpdate=update_complete
Thu Mar 3 21:11:20 2011 UPDATEALL1299208008: Rolling update complete.
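
The repeated hostlist and the skipped nodes in the log above can be found mechanically. The following is a minimal sketch, not part of xCAT: the function names `dup_hostlists` and `missing_nodes` are hypothetical, and the only assumption about the log format is the "Hostlist: node,node" line shape seen in this report.

```shell
#!/bin/sh
# Hypothetical helpers for inspecting a rollupdate log; not part of xCAT.
# Both read the log on stdin and rely only on the "Hostlist: a,b" lines.

# Print every hostlist that was handed to more than one reservation.
dup_hostlists() {
    awk '
        / Hostlist: / {
            sub(/^.* Hostlist: /, "")   # keep only the comma-separated nodes
            count[$0]++
        }
        END {
            for (h in count)
                if (count[h] > 1)
                    print h
        }
    ' | sort
}

# Print each expected node (passed as arguments) that never appears in
# any hostlist, i.e. that no reservation ever picked up.
missing_nodes() {
    awk -v expected="$*" '
        / Hostlist: / {
            sub(/^.* Hostlist: /, "")
            n = split($0, hosts, ",")
            for (i = 1; i <= n; i++)
                seen[hosts[i]] = 1
        }
        END {
            n = split(expected, want, " ")
            for (i = 1; i <= n; i++)
                if (!(want[i] in seen))
                    print want[i]
        }
    '
}
```

For example, `dup_hostlists < /var/log/xcat/rollupdate.log` lists hostlists that were selected twice, and `missing_nodes hv32s5fp21 hv32s5fp22 hv32s5fp23 hv32s5fp24 hv32s5fp25 hv32s5fp30 < /var/log/xcat/rollupdate.log` lists nodes that were never selected.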

bash-3.2# cat /u/loadl/rollupdate_jobs/rollupdate_UPDATEALL1299208008.*

#!/bin/sh
# Sample job command template file used to generate cluster rolling update
# jobs that will be submitted to LoadLeveler.
# Use this template with the Rolling Update "update_all" feature
# It only uses a node count and not a specific hostlist
# xCAT will substitute the following when creating the LL job command files:
# UPDATEALL1299208008 - the update group name for the nodes in this reservation
# /u/loadl/rollupdate_jobs - the directory specified in the rollupdate input stanza
# jobdir entry
# 2 - REQUIRED - used by xCAT to set the number of machines to
# reserve
# XCAT_UPDATEALL1299208008 - REQUIRED - used by xCAT to control the rolling update
# XCATROLLINGUPDATE_MAXUPDATES(1) - the resources xCAT created for max_updates
# @ job_name = rollupdate_UPDATEALL1299208008
# @ job_type = parallel
# @ node_usage = not_shared
# @ restart = no
# @ error = /u/loadl/rollupdate_jobs/rollupdate_UPDATEALL1299208008.$(Host).$(Cluster).$(Process).err
# @ output = /u/loadl/rollupdate_jobs/rollupdate_UPDATEALL1299208008.$(Host).$(Cluster).$(Process).out
# @ node = 2
# @ requirements = (Feature =="XCAT_UPDATEALL1299208008")
# @ step_resources = XCATROLLINGUPDATE_MAXUPDATES(1)
# @ queue

# xCAT Rolling Update data file for update group UPDATEALL1299208008

updategroup=UPDATEALL1299208008
updatefeature=XCAT_UPDATEALL1299208008
oldfeature=oldvalue
newfeature=newvalue

shutdowntimeout=5

bringupstatus=booted
bringuptimeout=10

#!/bin/sh
# LL Reservation Callback script for xCAT Rolling Update group UPDATEALL1299208008

if [ "$2" == "RESERVATION_ACTIVE" ] ; then
/opt/xcat/bin/runrollupdate --verbose loadleveler /u/loadl/rollupdate_jobs/rollupdate_UPDATEALL1299208008.data $1 &
fi

bash-3.2# cat /test/rollupdate_all.input | rollupdate -V
Running rollupdate command...
Creating LL job command files
Running command on hv32s5fp03: llconfig -d FLOATING_RESOURCES SCHEDULE_BY_RESOURCES CENTRAL_MANAGER_LIST RESOURCE_MGR_LIST 2>&1

Running command on hv32s5fp03: llconfig -N -c FLOATING_RESOURCES=" XCATROLLINGUPDATE_MAXUPDATES(2) " SCHEDULE_BY_RESOURCES=" XCATROLLINGUPDATE_MAXUPDATES " 2>&1

Running command on hv32s5fp03: llconfig -d CENTRAL_MANAGER_LIST RESOURCE_MGR_LIST 2>&1

Running command on hv32s5fp03: export EXTSHM=ON;llrctl reconfig 2>&1

Reading LL job template file /opt/xcat/share/xcat/rollupdate/llall.tmpl
Running command: llstatus -r %n %sta 2>/dev/null
Running command on hv32s5fp03: llstatus -r %n %sta 2>/dev/null 2>&1

Writing xCAT rolling update data file /u/loadl/rollupdate_jobs/rollupdate_UPDATEALL1299208008.data
Writing LL reservation callback script /u/loadl/rollupdate_jobs/rollupdate_UPDATEALL1299208008.rsvcb
Writing LL job command file /u/loadl/rollupdate_jobs/rollupdate_UPDATEALL1299208008.cmd
Running command on hv32s5fp03: llconfig -h hv32s5fp21.cluster.net hv32s5fp22.cluster.net hv32s5fp23.cluster.net hv32s5fp24.cluster.net hv32s5fp25.cluster.net hv32s5fp30.cluster.net -d FEATURE 2>&1

Running command on hv32s5fp03: llconfig -N -h hv32s5fp21.cluster.net hv32s5fp22.cluster.net hv32s5fp23.cluster.net hv32s5fp24.cluster.net hv32s5fp25.cluster.net hv32s5fp30.cluster.net -c FEATURE=" newvalue XCAT_UPDATEALL1299208008" 2>&1

Running command on hv32s5fp03: llconfig -d CENTRAL_MANAGER_LIST RESOURCE_MGR_LIST 2>&1

Running command on hv32s5fp03: export EXTSHM=ON;llrctl reconfig 2>&1

Running command: su - loadl "-c llmkres -x -d 15 -f /u/loadl/rollupdate_jobs/rollupdate_UPDATEALL1299208008.cmd -p /u/loadl/rollupdate_jobs/rollupdate_UPDATEALL1299208008.rsvcb "
Running command on hv32s5fp03: su - loadl "-c llmkres -x -d 15 -f /u/loadl/rollupdate_jobs/rollupdate_UPDATEALL1299208008.cmd -p /u/loadl/rollupdate_jobs/rollupdate_UPDATEALL1299208008.rsvcb " 2>&1

llmkres: The reservation hv32s5fp03.cluster.net.99.r has been successfully made.
llmkres: The job "hv32s5fp03.cluster.net.99" has been submitted.

Running command on hv32s5fp03: su - loadl "-c llmkres -x -d 15 -f /u/loadl/rollupdate_jobs/rollupdate_UPDATEALL1299208008.cmd -p /u/loadl/rollupdate_jobs/rollupdate_UPDATEALL1299208008.rsvcb " 2>&1

llmkres: The reservation hv32s5fp03.cluster.net.100.r has been successfully made.
llmkres: The job "hv32s5fp03.cluster.net.100" has been submitted.

Running command on hv32s5fp03: su - loadl "-c llmkres -x -d 15 -f /u/loadl/rollupdate_jobs/rollupdate_UPDATEALL1299208008.cmd -p /u/loadl/rollupdate_jobs/rollupdate_UPDATEALL1299208008.rsvcb " 2>&1

llmkres: The reservation hv32s5fp03.cluster.net.101.r has been successfully made.
llmkres: The job "hv32s5fp03.cluster.net.101" has been submitted.
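
The three llmkres submissions above (.99, .100, .101) match the arithmetic of the setup: 6 nodes at 2 nodes per reservation, with at most 2 reservations active at once. The sketch below only restates that arithmetic. It assumes one reservation per updateall_nodecount nodes, which is inferred from this report rather than from the rollupdate source, and the function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers; the one-reservation-per-nodecount rule is an
# assumption inferred from the log in this report.

# Number of reservations needed to cover all nodes (ceiling division).
# $1: total node count, $2: nodes per reservation (updateall_nodecount)
reservation_count() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Number of reservations that can be active at the start.
# $1: reservation count, $2: maxupdates
initially_active() {
    if [ "$1" -lt "$2" ]; then
        echo "$1"
    else
        echo "$2"
    fi
}
```

With the values from this report, `reservation_count 6 2` gives 3 and `initially_active 3 2` gives 2, so one reservation has to wait, and each of the three is expected to receive a different pair of nodes.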

Discussion

  • Anonymous

    Anonymous - 2011-03-14

    Ai Rong - I don't know whether this is related to the EXTSHM=ON environment setting that we have discussed in various emails. I also saw some issues in my RH6 cluster where LL and xCAT were getting confused about short vs. long hostnames when referring to hostnames in LL. I need to investigate this issue some more.

     
  • Anonymous

    Anonymous - 2011-03-15

    I made some changes to rollupdate.pm to use LL machine names in the llconfig -h commands. I was having problems with LL short vs. long machine names, and changed the code to use the same names as known by LL.

    This fixed the problem on my test cluster. Not sure if you were seeing the same issue. Reopen this defect if you are still having problems and I will investigate further.

    Change in SVN revision 9057.

     
  • SourceForge Robot

    This Tracker item was closed automatically by the system. It was
    previously set to a Pending status, and the original submitter
    did not respond within 28 days (the time period specified by
    the administrator of this Tracker).
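
The hostname fix described in the 2011-03-15 comment amounts to passing LoadLeveler the machine name exactly as LoadLeveler knows it, rather than the xCAT node name. The following is a rough sketch of that idea only and is not the code from SVN revision 9057. The function `ll_machine_name` is hypothetical, and it assumes the LL machine names have already been extracted to one name per line.

```shell
#!/bin/sh
# Hypothetical helper illustrating the short vs. long hostname mapping;
# not the actual change made in rollupdate.pm.
#
# stdin: machine names as known by LoadLeveler, one per line (assumed input)
# $1:    node name, short or fully qualified
# Prints the matching LL machine name; returns non-zero if none matches.
ll_machine_name() {
    short=${1%%.*}                      # strip any domain from the argument
    awk -v short="$short" '
        {
            name = $0
            sub(/\..*$/, "", name)      # strip any domain from the LL name
        }
        name == short { print $0; found = 1; exit }
        END { if (!found) exit 1 }
    '
}
```

Given a list containing hv32s5fp22.cluster.net, `ll_machine_name hv32s5fp22` prints hv32s5fp22.cluster.net, which is the form that would then be passed to llconfig -h.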