
#4300 xdsh/nmap intermittent inconsistent status

2.9
closed
Ling
nmap (1)
ubuntu
7
2014-12-10
2014-10-18
James Still
No

Hi all,
xdsh -v sometimes reports that a node is not responding even when it is. This is caused by nmap not reporting the correct status on Ubuntu 14.04 on ppc64le. As a result, the following xCAT commands are broken:
pping
nodestat
xdsh -v

I have tried different ways of invoking nmap and have not found a good solution yet. I also tried increasing the ARP cache size, but that did not help. I'll try more on Monday.

Ling


Dave/Mark, I've asked Ling to look at the xdsh behavior on c656 for the issue that I mentioned to you on one of the scrum calls. Below are two examples of what I'm seeing:
1) xdsh hangs on what I believe is a "bad node" case (i.e., c656f4n15) when it should just bypass the node if there's no connection.
2) Approximately 30 seconds later, using the "-v" option, nodes in the same group that had returned uptime results are now reported as "not responding", and eventually xdsh hangs as in #1 above.

root@c656f2n03:~# xdsh f4 uptime
c656f4n01: 09:55:45 up 1 day, 19:35, 0 users, load average: 0.00, 0.01, 0.05
c656f4n07: 09:55:46 up 1 day, 19:34, 0 users, load average: 0.00, 0.01, 0.05
c656f4n08: 09:55:45 up 1 day, 19:35, 0 users, load average: 0.00, 0.01, 0.05
c656f4n02: 09:55:45 up 1 day, 19:35, 0 users, load average: 0.06, 0.03, 0.05
c656f4n09: 09:55:45 up 1 day, 19:34, 0 users, load average: 0.00, 0.01, 0.05
c656f4n10: 09:55:46 up 1 day, 19:34, 0 users, load average: 0.00, 0.01, 0.05
c656f4n04: 09:55:46 up 1 day, 19:35, 0 users, load average: 0.01, 0.02, 0.05
c656f4n11: 09:55:46 up 1 day, 19:35, 0 users, load average: 0.00, 0.01, 0.05
c656f4n13: 09:55:47 up 1 day, 19:35, 0 users, load average: 0.00, 0.01, 0.05
c656f4n12: 09:55:43 up 1 day, 19:34, 0 users, load average: 0.00, 0.02, 0.05
c656f4n16: 09:55:46 up 1 day, 19:35, 0 users, load average: 0.00, 0.01, 0.05
c656f4n14: 09:55:46 up 1 day, 19:35, 0 users, load average: 0.00, 0.01, 0.05
c656f4n03: 09:55:45 up 1 day, 19:35, 0 users, load average: 0.07, 0.04, 0.05
c656f4n15: ssh: connect to host c656f4n15 port 22: No route to host
-----hangs here

^Croot@c656f2n03:~# xdsh f4 -v uptime
Error: c656f4n01 is not responding. No command will be issued to this host.
Error: c656f4n12 is not responding. No command will be issued to this host.
Error: c656f4n15 is not responding. No command will be issued to this host.
Error: c656f4n16 is not responding. No command will be issued to this host.
c656f4n02: 09:56:09 up 1 day, 19:35, 0 users, load average: 0.04, 0.03, 0.05
c656f4n07: 09:56:09 up 1 day, 19:35, 0 users, load average: 0.00, 0.01, 0.05
c656f4n08: 09:56:09 up 1 day, 19:35, 0 users, load average: 0.00, 0.01, 0.05
c656f4n09: 09:56:09 up 1 day, 19:34, 0 users, load average: 0.00, 0.01, 0.05
c656f4n04: 09:56:10 up 1 day, 19:35, 0 users, load average: 0.00, 0.02, 0.05
c656f4n03: 09:56:09 up 1 day, 19:35, 0 users, load average: 0.12, 0.05, 0.05
c656f4n10: 09:56:09 up 1 day, 19:35, 0 users, load average: 0.00, 0.01, 0.05
c656f4n14: 09:56:09 up 1 day, 19:35, 0 users, load average: 0.00, 0.01, 0.05
c656f4n11: 09:56:10 up 1 day, 19:35, 0 users, load average: 0.00, 0.01, 0.05
c656f4n13: 09:56:10 up 1 day, 19:35, 0 users, load average: 0.06, 0.03, 0.05
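
The two transcripts above can be cross-checked mechanically. The following is a minimal sketch, not part of xCAT: it lists nodes that returned real output in the plain run but were then flagged as "not responding" by the -v run. The function names (responders, flagged, false_negatives) are made up for illustration, and the sample data is an abbreviated copy of the output above.

```shell
#!/bin/sh
# Sketch: find nodes that answered a plain xdsh run but were then
# reported as "not responding" by xdsh -v.
# Sample data is an abbreviated copy of the transcripts in this ticket.

plain_run='c656f4n01: 09:55:45 up 1 day, 19:35, 0 users, load average: 0.00, 0.01, 0.05
c656f4n12: 09:55:43 up 1 day, 19:34, 0 users, load average: 0.00, 0.02, 0.05
c656f4n16: 09:55:46 up 1 day, 19:35, 0 users, load average: 0.00, 0.01, 0.05
c656f4n15: ssh: connect to host c656f4n15 port 22: No route to host'

verify_run='Error: c656f4n01 is not responding. No command will be issued to this host.
Error: c656f4n12 is not responding. No command will be issued to this host.
Error: c656f4n15 is not responding. No command will be issued to this host.
Error: c656f4n16 is not responding. No command will be issued to this host.'

# Nodes that returned real command output (ssh error lines are excluded).
responders() {
    printf '%s\n' "$1" | grep -v ': ssh: ' | sed -n 's/^\([^: ]*\): .*/\1/p' | sort -u
}

# Nodes that xdsh -v refused to contact.
flagged() {
    printf '%s\n' "$1" | sed -n 's/^Error: \([^ ]*\) is not responding\..*/\1/p' | sort -u
}

# Nodes in both lists responded once and were flagged shortly after,
# so they are likely false negatives from the status check.
false_negatives() {
    flagged_list=$(flagged "$2")
    responders "$1" | grep -Fx -e "$flagged_list"
}

false_negatives "$plain_run" "$verify_run"
```

With the sample data this prints c656f4n01, c656f4n12 and c656f4n16. c656f4n15 is left out because it never answered in the first place. In practice the two variables would hold captured xdsh output instead of pasted text.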

Related

Bugs: #4300

Discussion

  • Lissa Valletta

    Lissa Valletta - 2014-10-20

    For now I would not use -v and would just use -t with a short timeout until this is fixed.

     
    • James Still

      James Still - 2014-10-20

      Lissa, thanks for the suggestion. On the c656 cluster (~30 nodes), "-t 4" seems to do the trick.
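
As a rough sketch of the workaround discussed here: drop -v (which depends on nmap) and bound the wait with -t instead. The variable names NODE_RANGE and TIMEOUT_SECS and the helper workaround_cmd are illustrative only. The values "f4" and 4 come from this ticket, and a suitable timeout will likely differ on other clusters.

```shell
#!/bin/sh
# Sketch of the workaround: no -v, short -t timeout.
# NODE_RANGE and TIMEOUT_SECS are illustrative names, not xCAT settings.
NODE_RANGE=${NODE_RANGE:-f4}
TIMEOUT_SECS=${TIMEOUT_SECS:-4}

# Print the command line so it can be reviewed before use.
workaround_cmd() {
    printf 'xdsh %s -t %s uptime\n' "$NODE_RANGE" "$TIMEOUT_SECS"
}

if command -v xdsh >/dev/null 2>&1; then
    # On an xCAT management node, run the command for real.
    xdsh "$NODE_RANGE" -t "$TIMEOUT_SECS" uptime
else
    # Elsewhere, only show what would be run.
    workaround_cmd
fi
```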



  • Ling

    Ling - 2014-10-20

    updatenode is broken too, because it uses xdsh -v. Nodes that are reported as not responding will not get updated.

     
  • Ling

    Ling - 2014-10-21

    We'll add a site table variable called 'nmapoptions' to allow users to specify additional options for the nmap command. nmap is used in pping, xdsh -v, updatenode, nodestat, renergy, etc. We'd also like to use this defect to consolidate the nmap calls into one place.
    Currently, the following files call nmap:
    perl-xCAT/xCAT/NetworkUtils.pm
    perl-xCAT/xCAT/PPCenergy.pm
    perl-xCAT/xCAT/SLP.pm
    xCAT-client/bin/pping
    xCAT-server/lib/xcat/plugins/nodestat.pm
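
To illustrate the consolidation idea only: the files listed above are Perl, and the checked-in change is not shown in this ticket. The shell sketch below shows one hypothetical helper that builds the nmap command line in a single place and appends any user-supplied options. The name build_nmap_cmd and the base option "-sn" (a plain nmap ping scan) are assumptions for illustration, not taken from the xCAT sources. "-T2" is a standard nmap timing template used here as an example value.

```shell
#!/bin/sh
# Hypothetical single entry point for building the nmap command line.
# BASE_NMAP_OPTS is an assumption (a plain ping scan); it is not copied
# from the xCAT sources.
BASE_NMAP_OPTS="-sn"

# build_nmap_cmd EXTRA_OPTS NODE...
# EXTRA_OPTS stands in for the value of the proposed 'nmapoptions'
# site table variable; pass "" when it is unset.
build_nmap_cmd() {
    extra_opts=$1
    shift
    if [ -n "$extra_opts" ]; then
        printf 'nmap %s %s %s\n' "$BASE_NMAP_OPTS" "$extra_opts" "$*"
    else
        printf 'nmap %s %s\n' "$BASE_NMAP_OPTS" "$*"
    fi
}

# Every caller goes through the same helper, so extra options apply everywhere.
build_nmap_cmd "" c656f4n01 c656f4n02
build_nmap_cmd "-T2" c656f4n01 c656f4n02
```

The point of the design is that pping, nodestat and the other callers would no longer assemble their own nmap invocations, so one setting changes all of them.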

     

    Last edit: Ling 2014-10-21
  • Ling

    Ling - 2014-10-21

    nmap sometimes gives unstable output when the network response is slow. Users can pass additional options to nmap to find a good balance between scan time and reliability. For example, increase the minimum timeout value with '--min-rtt-timeout 1s', or choose a slower timing template with '-T2'. Either of these options resolves the problem reported in this defect. The user needs to put one of them in site.nmapoptions.
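
A minimal sketch of setting the option, assuming the site object has the conventional name "clustersite" and that chdef and lsdef are available on the management node. The helper set_cmd is illustrative; the attribute name and option values are the ones given in this comment.

```shell
#!/bin/sh
# Sketch: store extra nmap options in the xCAT site table.
# Assumes the site object is named "clustersite" (the usual default).
NMAP_OPTIONS="--min-rtt-timeout 1s"   # alternative from this ticket: "-T2"

# Print the command line so it can be reviewed before use.
set_cmd() {
    printf 'chdef -t site clustersite nmapoptions="%s"\n' "$NMAP_OPTIONS"
}

if command -v chdef >/dev/null 2>&1; then
    # On an xCAT management node: set the attribute, then display it.
    chdef -t site clustersite "nmapoptions=$NMAP_OPTIONS"
    lsdef -t site -o clustersite -i nmapoptions
else
    # Elsewhere, only show what would be run.
    set_cmd
fi
```

After changing the value, rerunning pping or xdsh -v against the affected node group is a simple way to check whether the status reports have stabilized.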

     
  • Ling

    Ling - 2014-10-22

    Code checked in for xCAT 2.9 in revision 73d08b.

     
  • Lissa Valletta

    Lissa Valletta - 2014-10-23

    I changed the title because this does not just affect xdsh/xdcp. It also affects nodestat, pping, etc. It is really an nmap problem, so anyone searching will probably want to find nmap in the title.

     
  • ting ting li

    ting ting li - 2014-12-10
    • status: pending --> closed
     