Hi all,
xdsh -v sometimes reports that a node is not responding even when it is. This is caused by nmap not reporting the correct status on Ubuntu 14.04 on ppc64le. As a result, the following xCAT commands are currently broken:
pping
nodestat
xdsh -v
I have tried different ways of using nmap and have not found a good solution yet. I also tried increasing the ARP cache size, but that did not help. I'll try more on Monday.
Ling
Dave/Mark, I've asked Ling to look at the xdsh behavior on c656 for the issue that I mentioned to you on one of the scrum calls. Below are two examples of what I'm seeing:
1) xdsh hangs on what I believe is a "bad node" (i.e., c656f4n15) when it should just bypass the node because there is no connection.
2) Approximately 30 seconds later, using the "-v" option, nodes in the same group that had returned uptime results are now reported as "not responding", and eventually xdsh hangs as in #1 above.
root@c656f2n03:~# xdsh f4 uptime
c656f4n01: 09:55:45 up 1 day, 19:35, 0 users, load average: 0.00, 0.01, 0.05
c656f4n07: 09:55:46 up 1 day, 19:34, 0 users, load average: 0.00, 0.01, 0.05
c656f4n08: 09:55:45 up 1 day, 19:35, 0 users, load average: 0.00, 0.01, 0.05
c656f4n02: 09:55:45 up 1 day, 19:35, 0 users, load average: 0.06, 0.03, 0.05
c656f4n09: 09:55:45 up 1 day, 19:34, 0 users, load average: 0.00, 0.01, 0.05
c656f4n10: 09:55:46 up 1 day, 19:34, 0 users, load average: 0.00, 0.01, 0.05
c656f4n04: 09:55:46 up 1 day, 19:35, 0 users, load average: 0.01, 0.02, 0.05
c656f4n11: 09:55:46 up 1 day, 19:35, 0 users, load average: 0.00, 0.01, 0.05
c656f4n13: 09:55:47 up 1 day, 19:35, 0 users, load average: 0.00, 0.01, 0.05
c656f4n12: 09:55:43 up 1 day, 19:34, 0 users, load average: 0.00, 0.02, 0.05
c656f4n16: 09:55:46 up 1 day, 19:35, 0 users, load average: 0.00, 0.01, 0.05
c656f4n14: 09:55:46 up 1 day, 19:35, 0 users, load average: 0.00, 0.01, 0.05
c656f4n03: 09:55:45 up 1 day, 19:35, 0 users, load average: 0.07, 0.04, 0.05
c656f4n15: ssh: connect to host c656f4n15 port 22: No route to host
-----hangs here
^Croot@c656f2n03:~# xdsh f4 -v uptime
Error: c656f4n01 is not responding. No command will be issued to this host.
Error: c656f4n12 is not responding. No command will be issued to this host.
Error: c656f4n15 is not responding. No command will be issued to this host.
Error: c656f4n16 is not responding. No command will be issued to this host.
c656f4n02: 09:56:09 up 1 day, 19:35, 0 users, load average: 0.04, 0.03, 0.05
c656f4n07: 09:56:09 up 1 day, 19:35, 0 users, load average: 0.00, 0.01, 0.05
c656f4n08: 09:56:09 up 1 day, 19:35, 0 users, load average: 0.00, 0.01, 0.05
c656f4n09: 09:56:09 up 1 day, 19:34, 0 users, load average: 0.00, 0.01, 0.05
c656f4n04: 09:56:10 up 1 day, 19:35, 0 users, load average: 0.00, 0.02, 0.05
c656f4n03: 09:56:09 up 1 day, 19:35, 0 users, load average: 0.12, 0.05, 0.05
c656f4n10: 09:56:09 up 1 day, 19:35, 0 users, load average: 0.00, 0.01, 0.05
c656f4n14: 09:56:09 up 1 day, 19:35, 0 users, load average: 0.00, 0.01, 0.05
c656f4n11: 09:56:10 up 1 day, 19:35, 0 users, load average: 0.00, 0.01, 0.05
c656f4n13: 09:56:10 up 1 day, 19:35, 0 users, load average: 0.06, 0.03, 0.05
For now, I would not use -v and would just use -t with a short timeout until this is fixed.
Lisa, thanks for the suggestion. On the c656 cluster (~30 nodes), "-t 4" seems to do the trick.
James E. Still, PMP®
From: "Lissa Valletta" lissav@users.sf.net
To: "[xcat:bugs] " 4300@bugs.xcat.p.re.sf.net
Date: 10/20/2014 09:32 AM
Subject: [xcat:bugs] #4300 xdsh -v intermittent inconsistent status
[bugs:#4300] xdsh -v intermittent inconsistent status
Status: open
Milestones: 2.9
Created: Sat Oct 18, 2014 08:12 PM UTC by James Still
Last Updated: Sat Oct 18, 2014 08:12 PM UTC
Owner: Ling
updatenode is broken. It uses xdsh -v, so nodes that are reported as not responding will not get updated.
We'll add a site table variable called 'nmapoptions' to allow users to specify additional options for the nmap command. nmap is used in pping, xdsh -v, updatenode, nodestat, renergy, etc. We'd also like to use this defect to consolidate the calls to nmap into one place.
Currently, the following files call nmap:
perl-xCAT/xCAT/NetworkUtils.pm
perl-xCAT/xCAT/PPCenergy.pm
perl-xCAT/xCAT/SLP.pm
xCAT-client/bin/pping
xCAT-server/lib/xcat/plugins/nodestat.pm
Last edit: Ling 2014-10-21
nmap sometimes gives unstable output because the network response is slow. Users can pass additional options to nmap to find a good balance between scan time and reliable results. For example, increase the minimum timeout value with '--min-rtt-timeout 1s', or choose a slower timing template with '-T2'. Either of these options resolves the problem reported in this defect. The user needs to put the chosen option in site.nmapoptions.
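To illustrate the idea of a single call site that picks up the extra options, here is a minimal Python sketch. It is not the xCAT code (xCAT is written in Perl); the function name build_nmap_command and the baseline flags (-sn for a ping scan, -oG - for greppable output on stdout) are assumptions chosen for illustration, and the nmapoptions argument stands in for whatever value is stored in site.nmapoptions.

```python
import shlex

# Assumed baseline flags for a host-discovery scan. These are illustrative
# only and are not taken from the xCAT source.
BASE_NMAP_ARGS = ["nmap", "-sn", "-oG", "-"]


def build_nmap_command(nodes, nmapoptions=""):
    """Return the argument list for one nmap scan of the given nodes.

    nmapoptions is a hypothetical stand-in for the site.nmapoptions value.
    It is split shell-style, so '--min-rtt-timeout 1s' becomes two arguments.
    """
    extra = shlex.split(nmapoptions or "")
    return BASE_NMAP_ARGS + extra + list(nodes)


if __name__ == "__main__":
    # Every caller (pping, nodestat, and so on) would go through the same helper.
    print(build_nmap_command(["c656f4n01", "c656f4n12"], "--min-rtt-timeout 1s"))
    print(build_nmap_command(["c656f4n01", "c656f4n12"], "-T2"))
```

Routing every caller through one helper like this means a change to the site setting affects all of the commands at once, which is the point of the consolidation described above.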
Code checked in for xCAT 2.9 in revision 73d08b.
I changed the title because this does not just affect xdsh/xdcp. It also affects nodestat, pping, etc. It is really an nmap problem, and anyone searching for it will probably want to find nmap in the title.