#2843 HASN:disconnect received from SN running xdsh commands.

2.7.2
closed
HA-SN (21)
7
2014-03-06
2012-05-15
No

The cluster is a P7 IH system, frame 12, and the EMS is c250mgrs27-pvt. It is an AIX cluster with HA SN configured.

xCAT is the 2.7.2 0504 build:
[c250mgrs27-pvt][/]> rpm -qa|grep -i xcat
perl-xCAT-2.7.2-snap201205030303
openslp-xcat-1.2.1-1
xCAT-dfm-2.7.0-13
xCAT-IBMhpc-2.7.2-snap201205030304
xCAT-2.7.2-snap201205030304
xCAT-client-2.7.2-snap201205030303
xCAT-rmc-2.7.2-snap201205011649
xCAT-server-2.7.2-snap201205030303

The two SNs are:
[c250mgrs27-pvt][/]> nodels service
c250f12c10ap01
c250f12c12ap01

The compute nodes are split into two node groups: SN10group uses c250f12c10ap01 as its primary SN and c250f12c12ap01 as its backup SN, and SN12group uses c250f12c12ap01 as its primary SN and c250f12c10ap01 as its backup SN.

[c250mgrs27-pvt][/]> nodels SN10group
c250f12c06ap01-hf0
c250f12c06ap05-hf0
c250f12c06ap09-hf0
c250f12c06ap13-hf0
c250f12c06ap17-hf0
c250f12c06ap21-hf0
c250f12c06ap25-hf0
c250f12c07ap01-hf0
c250f12c07ap05-hf0
c250f12c07ap09-hf0
c250f12c07ap13-hf0
c250f12c07ap17-hf0
c250f12c07ap21-hf0
c250f12c07ap25-hf0
c250f12c07ap29-hf0
c250f12c08ap01-hf0
c250f12c08ap05-hf0
c250f12c08ap09-hf0
c250f12c08ap13-hf0
c250f12c08ap17-hf0
c250f12c08ap21-hf0
c250f12c08ap25-hf0
c250f12c08ap29-hf0
c250f12c09ap01-hf0
c250f12c09ap05-hf0
c250f12c09ap09-hf0
c250f12c09ap13-hf0
c250f12c09ap17-hf0
c250f12c09ap21-hf0
c250f12c09ap25-hf0
c250f12c09ap29-hf0
c250f12c10ap05-hf0
c250f12c10ap09-hf0
c250f12c10ap13-hf0
c250f12c10ap17-hf0
c250f12c10ap21-hf0
c250f12c10ap25-hf0
c250f12c10ap29-hf0
c250f12c11ap01-hf0
c250f12c11ap05-hf0
c250f12c11ap09-hf0
c250f12c11ap13-hf0
c250f12c11ap17-hf0
c250f12c11ap21-hf0
c250f12c11ap25-hf0
c250f12c11ap29-hf0
[c250mgrs27-pvt][/]> nodels SN12group
c250f12c01ap01-hf0
c250f12c01ap05-hf0
c250f12c01ap09-hf0
c250f12c01ap13-hf0
c250f12c01ap17-hf0
c250f12c01ap21-hf0
c250f12c01ap25-hf0
c250f12c01ap29-hf0
c250f12c02ap01-hf0
c250f12c02ap05-hf0
c250f12c02ap09-hf0
c250f12c02ap13-hf0
c250f12c02ap17-hf0
c250f12c02ap21-hf0
c250f12c02ap25-hf0
c250f12c02ap29-hf0
c250f12c03ap01-hf0
c250f12c03ap05-hf0
c250f12c03ap09-hf0
c250f12c03ap13-hf0
c250f12c03ap17-hf0
c250f12c03ap21-hf0
c250f12c03ap25-hf0
c250f12c03ap29-hf0
c250f12c04ap01-hf0
c250f12c04ap05-hf0
c250f12c04ap09-hf0
c250f12c04ap13-hf0
c250f12c04ap17-hf0
c250f12c04ap21-hf0
c250f12c04ap25-hf0
c250f12c05ap01-hf0
c250f12c05ap05-hf0
c250f12c05ap09-hf0
c250f12c05ap13-hf0
c250f12c05ap17-hf0
c250f12c05ap21-hf0
c250f12c05ap25-hf0
c250f12c05ap29-hf0
c250f12c12ap05-hf0
c250f12c12ap09-hf0
c250f12c12ap13-hf0
c250f12c12ap17-hf0
c250f12c12ap21-hf0
c250f12c12ap25-hf0
c250f12c12ap29-hf0

I hit this problem when I tried to check the LPPs on the compute nodes. I ran the command below several times and got different output each time. Three different outputs are listed here; the last one is what I expected.

[c250mgrs27-pvt][/]> xdsh compute "lslpp -L|egrep -i 'ppe|LoadL|gpfs|hfi|ml'|sort"|xcoll
ERROR/WARNING: communication with the xCAT server seems to have been ended prematurely

[c250mgrs27-pvt][/]> xdsh compute "lslpp -L|egrep -i 'ppe|LoadL|gpfs|hfi|ml'|sort"|xcoll
ERROR/WARNING: communication with the xCAT server seems to have been ended prematurely
====================================
c250f12c06ap17-hf0
====================================
(HFI) Runtime
(HFI) files
Copper SFP+ 10GbE Adapter
Copper SFP+ 10GbE Adapter
1.1.0.0 C F HFI Rte Msgs - U.S. English
1.1.0.0 C F HFI Runtime Messages - U.S.
5.1.0.0 C F LoadLeveler License Fileset
7.1.0.0 C F PCIe2 2-port 10GbE SFP Copper
7.1.1.0 C F PCIe2 2-port 10GbE SFP+Copper
LoadL.resmgr.full 5.1.0.6 A F LoadLeveler
LoadL.resmgr.loc.license 5.1.0.0 C F LoadLeveler License Fileset
LoadL.resmgr.msg.en_US 5.1.0.4 C F LoadLeveler Messages - U.S.
LoadL.scheduler.full 5.1.0.6 A F LoadLeveler
LoadL.scheduler.loc.license
LoadL.scheduler.msg.en_US 5.1.0.4 C F LoadLeveler Messages - U.S.
LoadL.scheduler.so 5.1.0.6 A F LoadLeveler (Submit only)
LoadL.scheduler.webui 5.1.0.0 C F LoadLeveler Web-based User
bos.mls.lib 7.1.0.0 C F Trusted AIX Libraries
bos.rte.mlslib 7.1.1.0 C F Trusted AIX Libraries
devices.chrp.IBM.HFI.rte 1.1.0.4 A F IBM Host Fabric Interface
devices.common.IBM.hfi.rte
devices.common.IBM.ml 1.5.1.3 A F Multi Link Interface Runtime
devices.msg.en_US.chrp.IBM.HFI.rte
devices.msg.en_US.common.IBM.hfi.rte
devices.msg.en_US.common.IBM.ml
expat 2.0.1-3 C R An XML parser library
gpfs.base 3.4.0.13 A F GPFS File Manager
gpfs.docs.data 3.4.0.0 C F GPFS Server Manpages and
gpfs.gnr 3.4.0.4 A F GPFS Native RAID
gpfs.msg.en_US 3.4.0.13 A F GPFS Server Messages - U.S.
libxml2 2.7.8-1 C R Library providing XML and HTML
ppe.loc.license 1.1.0.0 C F Parallel Environment License
ppe.man 1.1.0.6 A F ppe.man IBM PE Runtime Edition
ppe.openshmem 1.0.0.0 C F IBM openshmem library for AIX
ppe.rte 1.1.0.6 A F poe Parallel Operating
ppe.samples 1.1.0.6 A F ppe.samples Parallel
ppe_man aix7.1-1.1 C R Parallel Environment Runtime
ppe_rte aix7.1-1.1.0.0 C R Parallel Environment Runtime
ppe_samples aix7.1-1.1 C R Parallel Environment Runtime
ppedev.hpct 1.1.0.2 A F hpct IBM High Performance
ppedev.loc.license 1.1.0.0 C F Parallel Environment Developer
ppedev.ptp 1.1.0.2 A F ptp Eclipse Parallel Tools
ppedev.ptp.rte 1.1.0.2 A F ptp Eclipse Parallel Tools
ppedev.rte 1.1.0.2 A F hpct IBM High Performance
ppedev_hpct_aix 1.1.0-2 C R IBM HPC Toolkit for PE
ppedev_ptp_aix 1.1.0-2 C R Eclipse Parallel Tools
ppedev_ptp_rte_aix 1.1.0-2 C R PTP proxies for PE Developer
ppedev_runtime_aix 1.1.0-2 C R IBM High Performance Computing
vac.html.common.search 11.1.0.4 A F Supersede entry, not installed
vac.html.en_US.C 11.1.0.4 A F Supersede entry, not installed
vacpp.html.common 11.1.0.4 A F Supersede entry, not installed
vacpp.html.en_US 11.1.0.4 A F Supersede entry, not installed
xlfcmp.html.common 13.1.0.4 A F Supersede entry, not installed
xlfcmp.html.en_US 13.1.0.4 A F Supersede entry, not installed

[c250mgrs27-pvt][/]> xdsh compute "lslpp -L|egrep -i 'ppe|LoadL|gpfs|hfi|ml'|sort"|xcoll

compute

                                               (HFI) Runtime
                                               (HFI) files
                                               Copper SFP+ 10GbE Adapter
                                               Copper SFP+ 10GbE Adapter
                         1.1.0.0    C     F    HFI Rte Msgs - U.S. English
                         1.1.0.0    C     F    HFI Runtime Messages - U.S.
                         5.1.0.0    C     F    LoadLeveler License Fileset
                         7.1.0.0    C     F    PCIe2 2-port 10GbE SFP Copper
                         7.1.1.0    C     F    PCIe2 2-port 10GbE SFP+Copper

LoadL.resmgr.full 5.1.0.6 A F LoadLeveler
LoadL.resmgr.loc.license 5.1.0.0 C F LoadLeveler License Fileset
LoadL.resmgr.msg.en_US 5.1.0.4 C F LoadLeveler Messages - U.S.
LoadL.scheduler.full 5.1.0.6 A F LoadLeveler
LoadL.scheduler.loc.license
LoadL.scheduler.msg.en_US 5.1.0.4 C F LoadLeveler Messages - U.S.
LoadL.scheduler.so 5.1.0.6 A F LoadLeveler (Submit only)
LoadL.scheduler.webui 5.1.0.0 C F LoadLeveler Web-based User
bos.mls.lib 7.1.0.0 C F Trusted AIX Libraries
bos.rte.mlslib 7.1.1.0 C F Trusted AIX Libraries
devices.chrp.IBM.HFI.rte 1.1.0.4 A F IBM Host Fabric Interface
devices.common.IBM.hfi.rte
devices.common.IBM.ml 1.5.1.3 A F Multi Link Interface Runtime
devices.msg.en_US.chrp.IBM.HFI.rte
devices.msg.en_US.common.IBM.hfi.rte
devices.msg.en_US.common.IBM.ml
expat 2.0.1-3 C R An XML parser library
gpfs.base 3.4.0.13 A F GPFS File Manager
gpfs.docs.data 3.4.0.0 C F GPFS Server Manpages and
gpfs.gnr 3.4.0.4 A F GPFS Native RAID
gpfs.msg.en_US 3.4.0.13 A F GPFS Server Messages - U.S.
libxml2 2.7.8-1 C R Library providing XML and HTML
ppe.loc.license 1.1.0.0 C F Parallel Environment License
ppe.man 1.1.0.6 A F ppe.man IBM PE Runtime Edition
ppe.openshmem 1.0.0.0 C F IBM openshmem library for AIX
ppe.rte 1.1.0.6 A F poe Parallel Operating
ppe.samples 1.1.0.6 A F ppe.samples Parallel
ppe_man aix7.1-1.1 C R Parallel Environment Runtime
ppe_rte aix7.1-1.1.0.0 C R Parallel Environment Runtime
ppe_samples aix7.1-1.1 C R Parallel Environment Runtime
ppedev.hpct 1.1.0.2 A F hpct IBM High Performance
ppedev.loc.license 1.1.0.0 C F Parallel Environment Developer
ppedev.ptp 1.1.0.2 A F ptp Eclipse Parallel Tools
ppedev.ptp.rte 1.1.0.2 A F ptp Eclipse Parallel Tools
ppedev.rte 1.1.0.2 A F hpct IBM High Performance
ppedev_hpct_aix 1.1.0-2 C R IBM HPC Toolkit for PE
ppedev_ptp_aix 1.1.0-2 C R Eclipse Parallel Tools
ppedev_ptp_rte_aix 1.1.0-2 C R PTP proxies for PE Developer
ppedev_runtime_aix 1.1.0-2 C R IBM High Performance Computing
vac.html.common.search 11.1.0.4 A F Supersede entry, not installed
vac.html.en_US.C 11.1.0.4 A F Supersede entry, not installed
vacpp.html.common 11.1.0.4 A F Supersede entry, not installed
vacpp.html.en_US 11.1.0.4 A F Supersede entry, not installed
xlfcmp.html.common 13.1.0.4 A F Supersede entry, not installed
xlfcmp.html.en_US 13.1.0.4 A F Supersede entry, not installed

I have never hit this problem in a non-HA SN environment. Can you check it? Thx.
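The three outcomes above can be told apart mechanically when the command is rerun many times. The following is a minimal sketch, assuming the combined output of each run has been captured as a string; `classify_run` and its three labels are hypothetical names used only for illustration, and the only string taken from the report is the error message itself.

```python
# The error text exactly as it appears in the runs above.
ERROR_MARKER = "communication with the xCAT server seems to have been ended prematurely"


def classify_run(output: str) -> str:
    """Classify one captured run (hypothetical labels).

    'failed'    -> the error appeared and no node output came back
    'truncated' -> the error appeared together with some node output
    'clean'     -> the error did not appear
    """
    lines = [line for line in output.splitlines() if line.strip()]
    errored = any(ERROR_MARKER in line for line in lines)
    has_data = any(ERROR_MARKER not in line for line in lines)
    if not errored:
        return "clean"
    return "truncated" if has_data else "failed"
```

A run labelled 'clean' only means the error message was absent; it does not prove that every node answered.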

Discussion

  • Guang Cheng Li

    Guang Cheng Li - 2012-05-15

    Lissa, could you take a look at this xdsh problem in the ST HA SN environment? thx. Yan Feng will send you the login info through email.

     
  • Lissa Valletta

    Lissa Valletta - 2012-05-15

    This is not an xdsh problem. We are getting disconnects when running this command, possibly some sort of timeout. Check /var/log/messages on the MN:
    May 15 07:49:54 c250f12c12ap01 auth|security:info Message forwarded from c250f12c12ap29-hf0: sshd[1704338]: Received disconnect from 20.12.12.1: 11: disconnected by user.
    I can recreate it with just xdsh compute date. It only depends on whether the data gets back before the disconnect: the more data, the more likely it will not get back in time.
    I only see the disconnect from 20.12.12.1, which seems to be the problem. I did not see a disconnect from 20.12.10.1.
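To check which source address the disconnects come from, the forwarded sshd messages can be tallied by address. This is a rough sketch that assumes the log lines have the same shape as the one quoted above; `DISCONNECT_RE` and `count_disconnects` are hypothetical names, not part of xCAT.

```python
import re
from collections import Counter

# Matches forwarded sshd lines of the form quoted above, e.g.
# "... Message forwarded from <node>: sshd[<pid>]: Received disconnect from <ip>: 11: ..."
DISCONNECT_RE = re.compile(
    r"Message forwarded from (?P<node>\S+): sshd\[\d+\]: "
    r"Received disconnect from (?P<source>[\d.]+): 11:"
)


def count_disconnects(lines):
    """Return a Counter mapping source IP -> number of disconnect messages."""
    counts = Counter()
    for line in lines:
        match = DISCONNECT_RE.search(line)
        if match:
            counts[match.group("source")] += 1
    return counts
```

Passing it the lines of the log file (for example, an open file object) returns one count per source address, so a service node that never shows up is easy to spot.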

     
  • Lissa Valletta

    Lissa Valletta - 2012-05-15

    When I run nodestat compute, I also get all these errors:
    May 15 08:05:22 c250f12c10ap01 auth|security:info Message forwarded from c250f12c11ap13-hf0: sshd[3670088]: Did not receive identification string from 20.12.10.1
    May 15 08:05:22 c250f12c10ap01 auth|security:info Message forwarded from c250f12c11ap17-hf0: sshd[2949916]: Did not receive identification string from 20.12.10.1
    May 15 08:05:22 c250f12c10ap01 auth|security:info Message forwarded from c250f12c11ap21-hf0: sshd[3277002]: Did not receive identification string from 20.12.10.1
    .
    .
    .
    .

     
  • Lissa Valletta

    Lissa Valletta - 2012-05-15

    Another example, sending to just one node:
    May 15 08:11:22 c250f12c10ap01 local4:info Message forwarded from c250f12c10ap01: xCAT: xCAT: Allowing xdsh to c250f12c06ap01-hf0 lslpp -l | grep bos for root from c250mgrs27-pvt
    May 15 08:11:22 c250f12c10ap01 auth|security:info Message forwarded from c250f12c06ap01-hf0: sshd[3670980]: Accepted publickey for root from 20.12.10.1 port 34675 ssh2
    May 15 08:11:22 c250f12c10ap01 auth|security:info Message forwarded from c250f12c06ap01-hf0: sshd[3670980]: Received disconnect from 20.12.10.1: 11: disconnected by user

     
  • Lissa Valletta

    Lissa Valletta - 2012-05-15

    So I can log on to the Service Node and recreate the problem just by running from the SN to the node, but it never occurs in bypass mode. So I think we are looking at an xcatd daemon issue here.

     
  • Lissa Valletta

    Lissa Valletta - 2012-05-15

    I can recreate the problem going to one node from a service node with a simple script:
    [c250f10c12ap17][/]> xdsh c250f10c12ap29-hf0 /tmp/testls
    ERROR/WARNING: communication with the xCAT server seems to have been ended prematurely

    The script:

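    #!/bin/sh
    # Run du on /xcatpost RETRY_LIMIT times to produce a large amount of output.

    RETRY_LIMIT=300

    i=$RETRY_LIMIT
    while :
    do
        /usr/bin/du /xcatpost
        i=$((i - 1))
        if [ "$i" -gt 0 ]; then
            # Print the number of remaining iterations.
            echo "$i"
        else
            break
        fi
    done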

     
  • Lissa Valletta

    Lissa Valletta - 2012-05-15

    Cannot recreate it on a Linux system so far.

     
  • Lissa Valletta

    Lissa Valletta - 2012-05-15

    Jarrod checked in new xcatd
    2.7 Committed r12725
    2.8 revision 12724.

     
  • yan feng han

    yan feng han - 2012-05-25

    Verified on the 2.7.2 05/23 build with Norm's 05/24 efix; could not recreate it. Thx.

     
  • Lissa Valletta

    Lissa Valletta - 2012-06-06

    Summary of fix

    I thought it was Client.pm because openssl s_client redirected to a file always produced clean, valid XML. I was wrong. Also, while AIX sounds like it is more sensitive, I believe this is the same thing that LRZ was running into. If the client (whatever it may be) is too busy doing something else to pull data off the network socket, the data is lost.

    Some may recall a PMR in which a customer would see xCAT clients hang, instead of getting an error message, if they didn't have certs set up. While fixing the client, I realized it highlighted a DoS vulnerability in xCAT and set about fixing it. After finding out that select() and Perl buffered IO didn't play well together, I ultimately set the client socket to non-blocking and used sysread, treating EAGAIN as the test for 'data not available yet'. However, I left the socket in non-blocking mode, and when xcatd got around to 'print' the data back to the client, the write() system call would ultimately exit without sending the data because the client was too busy to receive it.

    For now, I set non-blocking mode in xcatd prior to every read loop, and set the socket back to blocking before going into the large bunch of code that would 'print' without checking for EAGAIN.
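The failure mode and the workaround described above can be illustrated outside xcatd. The sketch below is a Python analogue, not the actual Perl code in xcatd: it assumes a local socket pair stands in for the client connection, a 0.2 second sleep stands in for a "busy" client, and the 8 MiB payload is larger than the socket buffer. All function names are hypothetical.

```python
import socket
import threading
import time

PAYLOAD = b"x" * (8 * 1024 * 1024)  # assumed to be larger than the socket buffer


def read_request(sock, bufsize=4096):
    """Non-blocking read; None means 'data not available yet' (EAGAIN)."""
    sock.setblocking(False)
    try:
        return sock.recv(bufsize)
    except BlockingIOError:  # raised for EAGAIN / EWOULDBLOCK
        return None


def reply_unchecked(sock, payload):
    """The bug: write once while still non-blocking and ignore what was not sent."""
    try:
        sock.send(payload)  # may queue only part of the payload
    except BlockingIOError:
        pass


def reply_fixed(sock, payload):
    """The workaround: switch back to blocking mode before writing."""
    sock.setblocking(True)
    sock.sendall(payload)  # waits until the client has taken everything


def busy_client(sock, received, delay=0.2):
    """A client that is too busy to read for a moment, then drains the socket."""
    time.sleep(delay)
    total = 0
    while True:
        chunk = sock.recv(65536)
        if not chunk:
            break
        total += len(chunk)
    received.append(total)


def bytes_delivered(reply):
    """Run one request/reply exchange and return how many bytes the client got."""
    server, client = socket.socketpair()
    received = []
    reader = threading.Thread(target=busy_client, args=(client, received))
    reader.start()
    read_request(server)  # leaves the server socket in non-blocking mode
    reply(server, PAYLOAD)
    server.close()
    reader.join()
    client.close()
    return received[0]
```

Calling `bytes_delivered(reply_unchecked)` returns fewer bytes than were written, while `bytes_delivered(reply_fixed)` returns the full payload length, which mirrors why larger outputs were more likely to be cut off.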

     
