#2838 HASN mkdsklsnode -b shared root nim resource not defined

2.7.3
closed
Norm Nott
7
2012-09-19
2012-05-14
Mark Perez
No

When running the initial mkdsklsnode -b against multiple service nodes in an HA/SN environment, xCAT does not successfully create the shared root NIM object on all of the service nodes. There are a couple of messages in the output of the command:
mkdsklsnode -V -S -b -i 71Ddskls_CSP5_1 compute,storage
that indicate the shared root is not defined on a service node:

Error: c250f12c10ap01: Missing required information for node 'c250f12c12ap29-hf0'.
Running command on c250f12c10ap01: /usr/sbin/lsnim -a location -Z 71Ddskls_CSP5_1_shared_root 2>/dev/null 2>&1
Running command on c250f12c10ap01: /usr/bin/cp /etc/hosts /etc/.client_data/hosts.c250f12c12ap29-hf0 2>/dev/null 2>&1
Error: Could not copy /etc/hosts to /etc/.client_data/hosts.c250f12c12ap29-hf0.

At completion of command, lsnim on the shared root object does not return information on the complaining service node.

xdsh service lsnim | grep shared
Verify the new shared root is defined on all service nodes.
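
The check above can be scripted. The following is only a sketch: missing_shared_root is a hypothetical helper, and it assumes that xdsh prefixes each output line with "node: " and that lsnim lists one object per line as "name class type". It reads the xdsh output on stdin and prints every expected service node that does not report the resource.

```shell
# Hypothetical helper: list service nodes whose lsnim output lacks a resource.
#   $1    = NIM resource name (e.g. 71Ddskls_CSP5_1_shared_root)
#   $2    = file with the expected service node names, one per line
#   stdin = output of "xdsh service lsnim" (assumed format: "node: name class type")
missing_shared_root() {
    resource=$1
    expected=$2
    # Nodes that do report the resource (strip the trailing ":" from the prefix).
    found=$(awk -v r="$resource" '$2 == r { sub(/:$/, "", $1); print $1 }')
    for node in $(cat "$expected"); do
        # Print the node only if it is absent from the "found" list.
        printf '%s\n' "$found" | grep -qxF -- "$node" || printf '%s\n' "$node"
    done
}
```

Possible usage, assuming the node list is saved first: nodels service > /tmp/sn.list; xdsh service lsnim | missing_shared_root 71Ddskls_CSP5_1_shared_root /tmp/sn.list. An empty result would mean every service node reports the shared root.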

The workaround is to run the command against a nodegroup that targets each service node, one at a time. After the first run, subsequent runs can use the "-k" option to speed the process up:

nodels service
c250f12c10ap01
c250f12c12ap01
mkdsklsnode -V -S -b -i 71Ddskls_CSP5_1 SN0group
mkdsklsnode -k -V -S -b -i 71Ddskls_CSP5_1 SN12group
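
The serial workaround above could be wrapped in a small script. This is a sketch under assumptions: run_per_sn_group and the DRY_RUN variable are hypothetical, and the group names (SN0group, SN12group) are the site-specific groups from this report. The first group is run without -k and every later group with -k, one group at a time.

```shell
# Hypothetical wrapper: run mkdsklsnode once per service-node group, serially.
#   $1   = osimage name
#   $2.. = node groups, each assumed to map to a single service node
# Set DRY_RUN=1 to print the commands instead of running them.
run_per_sn_group() {
    image=$1
    shift
    first=1
    for group in "$@"; do
        if [ "$first" -eq 1 ]; then
            opts="-V -S -b"        # first run: same options as the original command
            first=0
        else
            opts="-k -V -S -b"     # later runs add -k, as in the workaround
        fi
        if [ "${DRY_RUN:-0}" -eq 1 ]; then
            echo "mkdsklsnode $opts -i $image $group"
        else
            # $opts is intentionally unquoted so it splits into separate flags.
            mkdsklsnode $opts -i "$image" "$group" || return 1
        fi
    done
}
```

For example, DRY_RUN=1 run_per_sn_group 71Ddskls_CSP5_1 SN0group SN12group prints the two commands shown above without executing them.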

Discussion

  • Norm Nott
    2012-05-21

    added locking

     
  • Norm Nott
    2012-05-23

    This is most likely due to a timing issue between the two SNs.

    The locking code that was added is not quite right and will be fixed.

     
  • Norm Nott
    2012-05-24

    Done

    file: aixinstall.pm

    2.7.2 - r12922.
    2.8 - r12923

     
  • yan feng han
    2012-06-02

    Hi, I recreated this defect on f12 today. The shared_root on one SN (c250f12c12ap01) wasn't created after running the mkdsklsnode -b command for all compute nodes. The command was:

    mkdsklsnode -V -S -b -i 71Ddskls_CSP5_MCR3 compute configdump=selective 2>&1 | tee /tmp/mkdsklsnode.compute.b.out &

    The errors logged in /tmp/mkdsklsnode.compute.b.out were:

    Error: c250f12c12ap01: Could not initialize NIM client named 'c250f12c06ap05-hf0'.

    Error: 0042-001 nim: processing error encountered on "master":
    0042-124 c_ch_nfsexp: NFS option noauto is NOT supported

    rc=53
    0042-053 m_dkls_inst: there is no NIM object named "71Ddskls_CSP5_MCR3_shared_root"

    Please check and fix it.
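
When triaging output like the log above, it can help to pull out just the NIM object names that were reported missing. A minimal sketch, assuming the 0042-053 message keeps the wording shown in this report (missing_nim_objects is a hypothetical helper name):

```shell
# Hypothetical helper: print each NIM object name reported missing by a
# 0042-053 message. Reads the named log files, or stdin if none are given.
missing_nim_objects() {
    sed -n 's/.*0042-053.*there is no NIM object named "\([^"]*\)".*/\1/p' "$@" | sort -u
}
```

For example: missing_nim_objects /tmp/mkdsklsnode.compute.b.out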

     
  • Bruce
    2012-06-07

    Yan Feng,

    In all your bug posts, please tell us exactly what version of xCAT you are running. In this case, was it 2.7.2 (build date 5/23) or earlier? If so, you just missed Norm's fix. Norm checked this fix in with revision 12922, but the last revision that the 2.7.2 build picked up is 12895. Let us know what version you were running.

     
  • Brian Croswell
    2012-06-08

    Bruce, I worked with Scott and Norm last weekend on this issue.
    We need Norm to sync up with Yan Feng and Mark Perez to see if this was
    a configuration issue or a valid recreate.
    Note from 6/2 ..
    Scott, Team,

    I took a look at the frame 12 cluster, checking the xCAT MN and 2 xCAT SNs.
    These nodes do have the latest aixinstall.pm efix installed, which should include a fix for defect 3526650.

    We will need Norm to take a look at the log files and the NIM environment on the cluster to see what caused this recreate.
    I do not think this is a stop ship issue and Norm should look at this cluster when he returns from vacation on Monday.
    If Scott and Han Yan think this defect is a blocking issue for xCAT, please let us know and we will need to make a call out.

     
  • yan feng han
    2012-06-11

    I don't think it is a blocking issue either, because we have a workaround for it. But I suggest we add some comments about the workaround to the guide or the Service Pack README.

     
  • Norm Nott
    2012-06-14

    I just checked the aixinstall.pm file and it does have the locking code which should fix this bug. The locking code makes sure the two instances of NIM do not interfere with each other and seemed to fix the issue on my test system.

    I'm not sure from the notes whether this bug was reproduced with the fixed code or not.

    Could someone please clarify the state of this bug?
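
To illustrate the kind of serialization described in the comment above, here is a generic lock sketch. It is not the locking code from aixinstall.pm (which this ticket does not show); the helper names and lock path are hypothetical. It relies on mkdir being atomic, so only one caller at a time can hold the lock. A lock directory on a local filesystem only serializes processes on one host; coordinating two SNs this way would additionally assume a lock path on storage visible to both.

```shell
# Hypothetical lock helpers built on atomic mkdir.
#   acquire_lock DIR [TRIES]  - try to create DIR, retrying once per second
#   release_lock DIR          - remove DIR
#   with_lock DIR CMD...      - run CMD while holding the lock
acquire_lock() {
    lockdir=$1
    tries=${2:-30}
    while ! mkdir "$lockdir" 2>/dev/null; do
        tries=$((tries - 1))
        if [ "$tries" -le 0 ]; then
            return 1            # gave up: someone else still holds the lock
        fi
        sleep 1
    done
    return 0
}

release_lock() {
    rmdir "$1"
}

with_lock() {
    lockdir=$1
    shift
    acquire_lock "$lockdir" || return 1
    rc=0
    "$@" || rc=$?               # keep the command's exit status
    release_lock "$lockdir"
    return "$rc"
}
```

A usage example under the same assumptions: with_lock /tmp/nim_shared_root.lock some_nim_operation, where some_nim_operation stands for whatever step must not run concurrently.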

     
  • Norm Nott
    2012-06-28

    fixes for mkdsklsnode/rmdsklsnode

     
    Attachments
  • Norm Nott
    2012-07-09

    contains emgr package and README

     
    Attachments