From: David S. <da...@dr...> - 2003-03-13 23:58:37

Hello,

I am new to ganglia and have a question about its capabilities. Is it possible for remote devices to push their stats to another system at a fixed public IP address? The scenario I have is: boxes all over the country without multicast access. They are at client locations. Some will have public IP addresses, but most won't. Furthermore, they may be behind firewalls or routers. I want to load the monitor daemon on them, and then have one server that collects the stats of the monitored systems for central display and tracking. I have a T-1 and plenty of public IPs to dedicate to the server side of this.

With that scenario, would ganglia be capable of providing its functionality in an environment where we cannot rely on the ability to initiate a connection to the monitored systems, but must instead have the monitored systems report their status to the ganglia server?

I believe this is a simple question, and I am sure it can be answered easily. I just want to know if ganglia fits my needs, or whether I should continue to look elsewhere.

Thank you,
David Sumner
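For readers with the same constraint, a minimal sketch of the unicast "push" arrangement being asked about. This assumes a later gmond release with explicit send/receive channels (the 2.5.x series current on this list is multicast-based), and the collector host name is a placeholder:

# gmond.conf on each monitored box: push metrics over unicast UDP
# to the central collector instead of multicasting them
udp_send_channel {
  host = collector.example.com   # the one server with a public IP
  port = 8649
}

# gmond.conf on the central collector: accept those reports
udp_recv_channel {
  port = 8649
}

Since only outbound UDP from the client sites is needed, this matches the "monitored systems report in" constraint; nothing ever initiates a connection back to the clients.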
From: Marc R. <ma...@pa...> - 2003-03-12 19:53:24

Ganglia-General,

I'm trying to install ganglia on a linux cluster. Each node is a dual-cpu 1.4 GHz P3 system. The manager has local disks, is running the 2.4.17 kernel, and has two ethernet interfaces (eth0=10.1.1.1 going to the compute nodes and eth1 going to the outside). The compute nodes are diskless and are running the 2.4.3 kernel and have only one ethernet interface each (10.1.1.*). Manager and compute nodes are connected with a 100baseT switch.

Everything went normally on the manager installation. gmond came up and I can see the manager node with gstat (and gmetad and webfrontend).

# gstat -a
CLUSTER INFORMATION
       Name: unspecified
      Hosts: 2
Gexec Hosts: 0
 Dead Hosts: 0
  Localtime: Wed Mar 12 11:23:51 2003

CLUSTER HOSTS
Hostname   LOAD   CPU   Gexec
 CPUs (Procs/Total) [ 1, 5, 15min] [ User, Nice, System, Idle]
batt001
    2 ( 3/ 119) [ 0.86, 0.28, 0.31] [ 13.8, 0.0, 12.1, 77.5] OFF

When I installed it on a compute node (batt016), it segfaulted until I added the route,

route add -host 239.2.11.71 dev eth0

[root@batt016 ~]# route
Kernel IP routing table
Destination  Gateway  Genmask          Flags Metric Ref Use Iface
239.2.11.71  *        255.255.255.255  UH    0      0   0   eth0
10.1.1.0     *        255.255.255.0    U     0      0   0   eth0
127.0.0.0    *        255.0.0.0        U     0      0   0   lo

and then it worked OK.

# gstat -a
CLUSTER INFORMATION
       Name: unspecified
      Hosts: 1
Gexec Hosts: 0
 Dead Hosts: 1
  Localtime: Wed Mar 12 11:24:11 2003

CLUSTER HOSTS
Hostname   LOAD   CPU   Gexec
 CPUs (Procs/Total) [ 1, 5, 15min] [ User, Nice, System, Idle]
batt016
    2 ( 0/ 33) [ 0.13, 0.03, 0.00] [ 0.6, 0.0, 0.0, 100.0] OFF

Shortly after starting gmond on the compute node, it appears in the manager's gstat output with partial information,

# gstat -a
CLUSTER INFORMATION
       Name: unspecified
      Hosts: 2
Gexec Hosts: 0
 Dead Hosts: 0
  Localtime: Wed Mar 12 11:23:51 2003

CLUSTER HOSTS
Hostname   LOAD   CPU   Gexec
 CPUs (Procs/Total) [ 1, 5, 15min] [ User, Nice, System, Idle]
batt001
    2 ( 3/ 119) [ 0.86, 0.28, 0.31] [ 13.8, 0.0, 12.1, 77.5] OFF
batt016
    0 ( 0/ 0) [ 0.00, 0.00, 0.00] [ 0.0, 0.0, 0.0, 0.0] OFF

but after a few minutes, it is declared dead,

# gstat -d
CLUSTER INFORMATION
       Name: unspecified
      Hosts: 1
Gexec Hosts: 0
 Dead Hosts: 1
  Localtime: Wed Mar 12 11:26:39 2003

DEAD CLUSTER HOSTS
Hostname   Last Reported
batt016    Wed Mar 12 11:25:17 2003

On the compute node, gstat never shows any information about the manager node.

When I ping the multicast address from the manager, it usually only gets responses from itself, though every once in a while it will see a response from the compute node:

# ping 239.2.11.71
PING 239.2.11.71 (239.2.11.71) from 10.1.1.1 : 56(84) bytes of data.
64 bytes from 10.1.1.1: icmp_seq=0 ttl=255 time=68 usec
64 bytes from 10.1.1.1: icmp_seq=1 ttl=255 time=26 usec
64 bytes from 10.1.1.1: icmp_seq=2 ttl=255 time=44 usec
64 bytes from 10.1.1.16: icmp_seq=2 ttl=255 time=166 usec (DUP!)

--- 239.2.11.71 ping statistics ---
3 packets transmitted, 3 packets received, +1 duplicates, 0% packet loss
round-trip min/avg/max/mdev = 0.026/0.076/0.166/0.054 ms

# ping 239.2.11.71
PING 239.2.11.71 (239.2.11.71) from 10.1.1.1 : 56(84) bytes of data.
64 bytes from 10.1.1.1: icmp_seq=0 ttl=255 time=57 usec
64 bytes from 10.1.1.1: icmp_seq=1 ttl=255 time=37 usec
64 bytes from 10.1.1.1: icmp_seq=2 ttl=255 time=33 usec
64 bytes from 10.1.1.1: icmp_seq=3 ttl=255 time=33 usec
64 bytes from 10.1.1.1: icmp_seq=4 ttl=255 time=36 usec
64 bytes from 10.1.1.1: icmp_seq=5 ttl=255 time=36 usec
64 bytes from 10.1.1.1: icmp_seq=6 ttl=255 time=31 usec
64 bytes from 10.1.1.1: icmp_seq=7 ttl=255 time=31 usec
64 bytes from 10.1.1.1: icmp_seq=8 ttl=255 time=28 usec
64 bytes from 10.1.1.1: icmp_seq=9 ttl=255 time=44 usec
64 bytes from 10.1.1.1: icmp_seq=10 ttl=255 time=30 usec
64 bytes from 10.1.1.1: icmp_seq=11 ttl=255 time=31 usec
64 bytes from 10.1.1.1: icmp_seq=12 ttl=255 time=30 usec

Here is the routing table on the manager:

[root@gb0007 /]# route
Kernel IP routing table
Destination  Gateway  Genmask          Flags Metric Ref Use Iface
239.2.11.71  *        255.255.255.255  UH    0      0   0   eth0
10.1.1.0     *        255.255.255.0    U     0      0   0   eth0
192.168.0.0  *        255.255.240.0    U     0      0   0   eth1
127.0.0.0    *        255.0.0.0        U     0      0   0   lo
224.0.0.0    *        240.0.0.0        U     0      0   0   eth0
default      gtwy     0.0.0.0          UG    0      0   0   eth1

Pinging the multicast address from the compute node only gets responses from itself,

[root@batt016 ~]# ping 239.2.11.71
PING 239.2.11.71 (239.2.11.71) from 10.1.1.16 : 56(84) bytes of data.
64 bytes from batt016 (10.1.1.16): icmp_seq=0 ttl=255 time=45 usec
64 bytes from batt016 (10.1.1.16): icmp_seq=1 ttl=255 time=13 usec
64 bytes from batt016 (10.1.1.16): icmp_seq=2 ttl=255 time=9 usec
64 bytes from batt016 (10.1.1.16): icmp_seq=3 ttl=255 time=8 usec
64 bytes from batt016 (10.1.1.16): icmp_seq=4 ttl=255 time=7 usec
64 bytes from batt016 (10.1.1.16): icmp_seq=5 ttl=255 time=7 usec
64 bytes from batt016 (10.1.1.16): icmp_seq=6 ttl=255 time=8 usec

When I ping the entire multicast network from the manager, I see responses from all nodes,

[root@gb0007 /]# ping 224.0.0.1
PING 224.0.0.1 (224.0.0.1) from 10.1.1.1 : 56(84) bytes of data.
64 bytes from 10.1.1.1: icmp_seq=0 ttl=255 time=68 usec
64 bytes from 10.1.1.14: icmp_seq=0 ttl=255 time=181 usec (DUP!)
64 bytes from 10.1.1.12: icmp_seq=0 ttl=255 time=183 usec (DUP!)
64 bytes from 10.1.1.10: icmp_seq=0 ttl=255 time=196 usec (DUP!)
64 bytes from 10.1.1.7: icmp_seq=0 ttl=255 time=207 usec (DUP!)
64 bytes from 10.1.1.6: icmp_seq=0 ttl=255 time=210 usec (DUP!)
64 bytes from 10.1.1.9: icmp_seq=0 ttl=255 time=225 usec (DUP!)
64 bytes from 10.1.1.11: icmp_seq=0 ttl=255 time=234 usec (DUP!)
64 bytes from 10.1.1.13: icmp_seq=0 ttl=255 time=244 usec (DUP!)
64 bytes from 10.1.1.8: icmp_seq=0 ttl=255 time=254 usec (DUP!)
64 bytes from 10.1.1.15: icmp_seq=0 ttl=255 time=263 usec (DUP!)
64 bytes from 10.1.1.16: icmp_seq=0 ttl=255 time=273 usec (DUP!)
64 bytes from 10.1.1.2: icmp_seq=0 ttl=255 time=283 usec (DUP!)
64 bytes from 10.1.1.4: icmp_seq=0 ttl=255 time=293 usec (DUP!)
64 bytes from 10.1.1.5: icmp_seq=0 ttl=255 time=302 usec (DUP!)
64 bytes from 10.1.1.3: icmp_seq=0 ttl=255 time=312 usec (DUP!)
64 bytes from 10.1.1.251: icmp_seq=0 ttl=255 time=530 usec (DUP!)

On the compute node, I can't ping the entire network,

[root@batt016 ~]# ping 224.0.0.1
connect: Network is unreachable

unless I add that route,

[root@batt016 ~]# route add -net 224.0.0.0 netmask 240.0.0.0 dev eth0
[root@batt016 ~]# route
Kernel IP routing table
Destination  Gateway  Genmask          Flags Metric Ref Use Iface
224.0.0.0    *        255.255.255.255  UH    0      0   0   eth0
239.2.11.71  *        255.255.255.255  UH    0      0   0   eth0
10.1.1.0     *        255.255.255.0    U     0      0   0   eth0
127.0.0.0    *        255.0.0.0        U     0      0   0   lo
224.0.0.0    *        240.0.0.0        U     0      0   0   eth0

then it can see all the hosts,

[root@batt016 ~]# ping 224.0.0.1
PING 224.0.0.1 (224.0.0.1) from 10.1.1.16 : 56(84) bytes of data.
64 bytes from batt016 (10.1.1.16): icmp_seq=0 ttl=255 time=49 usec
64 bytes from batt001 (10.1.1.1): icmp_seq=0 ttl=255 time=270 usec (DUP!)
64 bytes from batt008 (10.1.1.8): icmp_seq=0 ttl=255 time=308 usec (DUP!)
64 bytes from batt006 (10.1.1.6): icmp_seq=0 ttl=255 time=321 usec (DUP!)
64 bytes from batt010 (10.1.1.10): icmp_seq=0 ttl=255 time=351 usec (DUP!)
64 bytes from batt009 (10.1.1.9): icmp_seq=0 ttl=255 time=369 usec (DUP!)
64 bytes from batt015 (10.1.1.15): icmp_seq=0 ttl=255 time=390 usec (DUP!)
64 bytes from batt014 (10.1.1.14): icmp_seq=0 ttl=255 time=412 usec (DUP!)
64 bytes from batt013 (10.1.1.13): icmp_seq=0 ttl=255 time=420 usec (DUP!)
64 bytes from batt011 (10.1.1.11): icmp_seq=0 ttl=255 time=431 usec (DUP!)
64 bytes from batt002 (10.1.1.2): icmp_seq=0 ttl=255 time=456 usec (DUP!)
64 bytes from batt007 (10.1.1.7): icmp_seq=0 ttl=255 time=471 usec (DUP!)
64 bytes from batt012 (10.1.1.12): icmp_seq=0 ttl=255 time=513 usec (DUP!)
64 bytes from batt003 (10.1.1.3): icmp_seq=0 ttl=255 time=523 usec (DUP!)
64 bytes from batt005 (10.1.1.5): icmp_seq=0 ttl=255 time=532 usec (DUP!)
64 bytes from batt004 (10.1.1.4): icmp_seq=0 ttl=255 time=548 usec (DUP!)
64 bytes from batt251 (10.1.1.251): icmp_seq=0 ttl=255 time=619 usec (DUP!)

I have a second cluster on which I've installed ROCKS, and gmond works there. Each node can see all the others with gstat, and pinging 239.2.11.71 from any node gets responses from all nodes. I can't figure out why one system works and the other doesn't. The routing tables are the same. The kernels are different because I need nfs-root support for my diskless compute nodes, but I don't see why that should matter.

It seems like there's something intermittent about the multicast connection from compute node to manager. Some partial data gets through on the first attempt, but after that nothing gets through. And nothing ever goes from the manager to the compute node. Have I made a mistake in my multicast configuration? I've tried several different versions of gmond, 2.5.1-1, 2.5.1-3, and 2.5.3-1, and they all have the same behavior. I've tried running in debug mode but that hasn't turned up any smoking guns.

Can anyone suggest what's going wrong and how to fix it? Thanks!
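A distilled version of the workaround in the message above, for anyone building similar diskless nodes: make the multicast route permanent at boot instead of adding it by hand. A sketch assuming a Red Hat-style rc.local; the interface name is whatever faces the cluster:

# /etc/rc.d/rc.local on each compute node (assumed location; adjust
# for your init setup). Give all of multicast (224.0.0.0/4) a route
# out the cluster interface so gmond's traffic to 239.2.11.71 doesn't
# depend on a manual 'route add' after every reboot.
route add -net 224.0.0.0 netmask 240.0.0.0 dev eth0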
From: Leif N. <ni...@ns...> - 2003-03-11 20:59:55

"Steven A. DuChene" <lin...@mi...> writes:

> There is actually a section of the ganglia on-line documentation
> that covers this exact problem. I had the same or similar issues
> until I followed the directions in section 8.3 of the ganglia
> on-line documentation.

Yeah, just go ahead, rub it in. I deserve it. 8^)

-- 
Leif Nixon, Systems expert
National Supercomputer Centre, Linkoping University
From: Steven A. D. <lin...@mi...> - 2003-03-11 20:53:55

There is actually a section of the ganglia on-line documentation that covers this exact problem. I had the same or similar issues until I followed the directions in section 8.3 of the ganglia on-line documentation.

-- 
Steven A. DuChene, Linux cluster consultant
Wash DC & Atlanta, GA

-------Original Message-------
From: Leif Nixon <ni...@ns...>
Sent: 03/11/03 03:26 PM
To: Steven Wagner <sw...@il...>
Subject: Re: [Ganglia-general] Cluster frontend not reporting

> Steven Wagner <sw...@il...> writes:
> I guess it's possible that g0 is sending metrics on one interface and
> listening on another

Which is exactly what's happening, according to my trusty tcpdump.
Is that supposed to be possible? 8^)

-- 
Leif Nixon, Systems expert
National Supercomputer Centre, Linkoping University
From: Leif N. <ni...@ns...> - 2003-03-11 20:53:39

Steven Wagner <sw...@il...> writes:

> It might be time for you to add a route to your routing table for the
> multicast IP that directs it to the proper interface... :)

Well, duh. I should have figured that out. Thanks!

-- 
Leif Nixon, Systems expert
National Supercomputer Centre, Linkoping University
From: Steven W. <sw...@il...> - 2003-03-11 20:33:49

Leif Nixon wrote:
> Steven Wagner <sw...@il...> writes:
>
>> I guess it's possible that g0 is sending metrics on one interface and
>> listening on another
>
> Which is exactly what's happening, according to my trusty tcpdump.
> Is that supposed to be possible? 8^)

Errr... yes... that's a, um, feature. Are you explicitly specifying the interface in gmond.conf?

It might be time for you to add a route to your routing table for the multicast IP that directs it to the proper interface... :)
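A sketch of both suggestions from the reply above, assuming 2.5.x-era gmond.conf options and ganglia's default multicast channel; the interface name is an example:

# /etc/gmond.conf -- bind gmond's multicast traffic to one interface
mcast_channel 239.2.11.71
mcast_port 8649
mcast_if eth0

Alternatively (or additionally), route the multicast group explicitly so sends and receives use the same interface:

route add -host 239.2.11.71 dev eth0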
From: Leif N. <ni...@ns...> - 2003-03-11 20:26:14

Steven Wagner <sw...@il...> writes:

> I guess it's possible that g0 is sending metrics on one interface and
> listening on another

Which is exactly what's happening, according to my trusty tcpdump. Is that supposed to be possible? 8^)

-- 
Leif Nixon, Systems expert
National Supercomputer Centre, Linkoping University
From: Steven W. <sw...@il...> - 2003-03-11 19:04:55

Leif Nixon wrote:
> Steven Wagner <sw...@il...> writes:
>
>> That's how I found out that my front-end was *three* hops away from
>> the test cluster and I'm thinking you have either a monitoring core
>> config issue or a host/network config issue to track down... (maybe a
>> host/network device between the front-end and the cluster is
>> configured to drop multicast packets?)
>
> But would that cause gmond on *g0* not to report metrics for *g0*?
> That's what I'm looking at here.

DOH! Thank you, ladies and gentlemen, I'll be here all week.

I guess it's possible that g0 is sending metrics on one interface and listening on another, or that the routing table or a firewall rule is in the way. The monitoring core seems to think everything is okay... Do the packets show up as going out on tcpdump?
From: Leif N. <ni...@ns...> - 2003-03-11 18:56:05

Steven Wagner <sw...@il...> writes:

> That's how I found out that my front-end was *three* hops away from
> the test cluster and I'm thinking you have either a monitoring core
> config issue or a host/network config issue to track down... (maybe a
> host/network device between the front-end and the cluster is
> configured to drop multicast packets?)

But would that cause gmond on *g0* not to report metrics for *g0*? That's what I'm looking at here.

-- 
Leif Nixon, Systems expert
National Supercomputer Centre, Linkoping University
From: Steven W. <sw...@il...> - 2003-03-11 17:51:02

Leif Nixon wrote:
> Well, this is a new one - at least for me.
>
> One of our clusters was rebooted last week, due to a physical
> relocation. Now the ganglia XML data doesn't contain any mention of
> the cluster frontend, even though gmond is running fine and responding
> on the XML data port:
>
> nixon $ telnet grendel 8649|grep -i "host name"|cut -c -60
> Connection closed by foreign host.
> <!ATTLIST HOST NAME CDATA #REQUIRED>
> <HOST NAME="g10" IP="192.168.1.10" REPORTED="1047377023" TN=
> <HOST NAME="g11" IP="192.168.1.11" REPORTED="1047377026" TN=
> <HOST NAME="g12" IP="192.168.1.12" REPORTED="1047377029" TN=
> <HOST NAME="g13" IP="192.168.1.13" REPORTED="1047377026" TN=
> <HOST NAME="g1" IP="192.168.1.1" REPORTED="1047377032" TN="0
> <HOST NAME="g2" IP="192.168.1.2" REPORTED="1047377029" TN="3
> <HOST NAME="g16" IP="192.168.1.16" REPORTED="1047377022" TN=
> <HOST NAME="g4" IP="192.168.1.4" REPORTED="1047377025" TN="7
> <HOST NAME="g5" IP="192.168.1.5" REPORTED="1047377023" TN="9
> <HOST NAME="g6" IP="192.168.1.6" REPORTED="1047377031" TN="1
> <HOST NAME="g8" IP="192.168.1.8" REPORTED="1047377028" TN="4
> <HOST NAME="g9" IP="192.168.1.9" REPORTED="1047377022" TN="1
> nixon $
>
> The frontend used to turn up as "g0".
>
> The same behaviour is presented by ganglia 2.5.1 and 2.5.3. I've run
> gmond for a while with debug enabled, but nothing in the output seems
> alarming to me. Anyone who wants to take a look can find the log at:
>
> http://www.nsc.liu.se/~nixon/tmp/ganglia.log
>
> What blindingly obvious mistake am I making here?

Maybe you should look at your mcast_ttl value for g0 (I'm assuming it's not running with the IP 192.168.1.0 ... :) ), increase it by one and restart the monitoring core until it shows up on the other gmonds. That's how I found out that my front-end was *three* hops away from the test cluster. I'm thinking you have either a monitoring core config issue or a host/network config issue to track down... (maybe a host/network device between the front-end and the cluster is configured to drop multicast packets?)

I got my hopes up from reading the subject message - a problem I've noticed lately is that the metadaemon seems to "forget" to update all the RRDs for a cluster. But as it turns out, the metadaemon's behaving properly - the front-end isn't recognizing the cluster as down/unreachable, though! I'm not sure if this behavior appeared as a result of the modifications I've made to my installation, though...
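The mcast_ttl experiment suggested above, sketched as a config change (2.5.x-style syntax; the value shown is just one step up from the usual default of 1):

# /etc/gmond.conf on g0 -- raise the multicast TTL one step, restart
# gmond, and repeat until the host shows up on the other gmonds
mcast_ttl 2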
From: Leif N. <ni...@ns...> - 2003-03-11 10:15:20

Well, this is a new one - at least for me.

One of our clusters was rebooted last week, due to a physical relocation. Now the ganglia XML data doesn't contain any mention of the cluster frontend, even though gmond is running fine and responding on the XML data port:

nixon $ telnet grendel 8649|grep -i "host name"|cut -c -60
Connection closed by foreign host.
<!ATTLIST HOST NAME CDATA #REQUIRED>
<HOST NAME="g10" IP="192.168.1.10" REPORTED="1047377023" TN=
<HOST NAME="g11" IP="192.168.1.11" REPORTED="1047377026" TN=
<HOST NAME="g12" IP="192.168.1.12" REPORTED="1047377029" TN=
<HOST NAME="g13" IP="192.168.1.13" REPORTED="1047377026" TN=
<HOST NAME="g1" IP="192.168.1.1" REPORTED="1047377032" TN="0
<HOST NAME="g2" IP="192.168.1.2" REPORTED="1047377029" TN="3
<HOST NAME="g16" IP="192.168.1.16" REPORTED="1047377022" TN=
<HOST NAME="g4" IP="192.168.1.4" REPORTED="1047377025" TN="7
<HOST NAME="g5" IP="192.168.1.5" REPORTED="1047377023" TN="9
<HOST NAME="g6" IP="192.168.1.6" REPORTED="1047377031" TN="1
<HOST NAME="g8" IP="192.168.1.8" REPORTED="1047377028" TN="4
<HOST NAME="g9" IP="192.168.1.9" REPORTED="1047377022" TN="1
nixon $

The frontend used to turn up as "g0".

The same behaviour is presented by ganglia 2.5.1 and 2.5.3. I've run gmond for a while with debug enabled, but nothing in the output seems alarming to me. Anyone who wants to take a look can find the log at:

http://www.nsc.liu.se/~nixon/tmp/ganglia.log

What blindingly obvious mistake am I making here?

-- 
Leif Nixon, Systems expert
National Supercomputer Centre, Linkoping University
From: Steven W. <sw...@il...> - 2003-03-06 19:26:11

Henry Leyh wrote:
> I cannot find anything unreasonable here. The polling interval seems to
> be correct. Note that we do not have private 192.168... addresses for
> the cluster nodes.

Yup, all that looks reasonable. My grab bag o' fixes is officially empty. :)

One thing I guess you could try is removing the source that's being updated at the proper interval, leaving just the three-minute source. Then you could sit and watch (in debug mode) gmetad polling it. That would tell you whether it's polling the data at the right frequency or not, which would at least narrow down your list of troubleshooting targets.

Good luck with it...
From: matt m. <ma...@cs...> - 2003-03-06 19:09:00

yasuhito-

this is a known bug and we are presently testing a solution which will be incorporated into the next release.

-- matt

Tomorrow, Yasuhito Takamiya wrote forth saying...

> Hi,
>
> We are facing problems in running gmetad; we get the following errors
> in /var/log/messages every 1.5 minutes.
>
> > Mar 2 06:56:08 lucie /usr/sbin/gmetad[13864]: summary_RRD_update:
> > illegal attempt to update using time 1046555768 when last update
> > time is 1046555768 (minimum one second step)
>
> It seems the rrd update is very frequent, so this type of error occurs.
> Has anyone else experienced this or know why it happens?
> We are using version 2.5.1 of gmetad.
>
> Thanks,
>
> --yasuhito
From: Yasuhito T. <tak...@ma...> - 2003-03-06 19:05:06

Hi,

We are facing problems in running gmetad; we get the following errors in /var/log/messages every 1.5 minutes:

> Mar 2 06:56:08 lucie /usr/sbin/gmetad[13864]: summary_RRD_update:
> illegal attempt to update using time 1046555768 when last update
> time is 1046555768 (minimum one second step)

It seems the rrd update is very frequent, so this type of error occurs. Has anyone else experienced this or know why it happens? We are using version 2.5.1 of gmetad.

Thanks,

--yasuhito
From: Patrick L. <ple...@st...> - 2003-03-06 18:30:27

Hi,

We are using ganglia for PBS job queue monitoring (from NPACI Rocks) but are having a problem in that the metrics representing jobs are set to expire after a few minutes but are persisting for days and weeks. It seems no cleanup is happening. Has anyone else experienced this or know why it happens? We are using version 2.5.1 of gmetad and gmond.

Thanks,
Patrick

---------------------------------------------
Patrick LeGresley
Stanford University
Department of Aeronautics and Astronautics
http://www.stanford.edu/~plegresl
ple...@st...
---------------------------------------------
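For context, custom metrics like these PBS job metrics are typically injected with gmetric, whose dmax argument sets the expiry being described above; the metric name and values below are made up for illustration:

# Publish a job-count metric that should disappear 120 seconds after
# the last update, if expiry/cleanup is working as intended
gmetric --name=pbs_jobs_running --value=42 --type=int32 --dmax=120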
From: Federico S. <fd...@sd...> - 2003-03-05 19:54:46

Yes, I have seen this problem. I believe you need to set a higher memory limit for PHP. There is a line in /etc/php.ini called 'memory_limit'. Try making this something like '64M', and try again.

Federico

On Friday, February 28, 2003, at 07:26 PM, Vu, Phuong A (MP Technology) wrote:

> I am using 2.5.1 and have been running on a decent size grid of 14
> clusters, each with 16 machines, without any real problem. When I was
> adding more clusters today, I noticed that I started seeing this error:
>
> Fatal error: Allowed memory size of 8388608 bytes exhausted (tried to
> allocate 184320 bytes) in /var/www/html/ganglia-webfrontend-2.5.1/......
>
> I have looked in the discussion archive but did not see reference to
> this. Has anybody seen this before? The webfrontend machine is a
> 2cpus/1GB Pentium III 800 Mhz system.
>
> Thanks,
> Phuong Vu

Federico
Rocks Cluster Group, SDSC, San Diego
GPG Fingerprint: 3C5E 47E7 BDF8 C14E ED92 92BB BA86 B2E6 0390 8845
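The change being suggested, as a one-line sketch (the value is just the suggestion above; tune to your grid size):

; /etc/php.ini -- per-script memory cap for PHP, including the frontend
memory_limit = 64M

Note that mod_php reads php.ini at server startup, so the web server has to be restarted before the new limit takes effect.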
From: Steven W. <sw...@il...> - 2003-03-05 18:47:53

Henry Leyh wrote:
> Hi,
> We have the ganglia monitor core 2.5.2 installed on two clusters (20 and
> 68 hosts, different subnets, connected via gmond's "trusted_hosts") and
> watch it with gmetad/webfrontend 2.5.2 running on a machine which
> belongs to one of the two clusters. What we observe now (after
> installation of 2.5.2, not before) is that the rrd updates from the
> "remote" cluster occur less frequently (3-minute intervals) than
> the updates from the cluster the gmetad machine is a member of.
> That shows in the graphs (sort of "blocky" for that cluster) as well as
> in the update intervals of the rrd files. The gmetad polling interval is
> explicitly set to 15 seconds for both data_sources.
>
> Any ideas why that is?
>
> Thanks & Greetings

It's possible that gmetad is parsing the wrong area of the data_source line for the polling interval. This would make sense if your remote host IP starts with 192.168 (etc.) ... Run gmetad with debug output turned on and check the first 100-200 lines or so of output.

I just got hit by this today, myself... before releasing 2.5.3, someone should take a look at the interval-parsing code (which apparently doesn't accommodate the old syntax) and put a note in the readme/Changelog... [not that I read the Changelog in this particular case myself :) ]
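For reference, a sketch of the data_source syntax under discussion: the optional polling interval, in seconds, sits between the quoted source name and the host list. Names, addresses, and ports here are placeholders:

# /etc/gmetad.conf
data_source "local cluster"  15 localhost
data_source "remote cluster" 15 10.1.1.1:8649

If gmetad mis-parses that interval field, the source gets polled at the wrong rate, which would produce exactly the slow, "blocky" updates described above.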
From: Steven W. <sw...@il...> - 2003-03-05 18:13:58

Santanu Das wrote:
> Actually I did mean to ask how to change the label, i.e. instead of
> "Unspecified Grid" something like "HEP DataGrid" or else.

Did somebody say "undocumented feature"?

gmetad and the web front-end control the "grid" stuff - this is a new feature addition as of 2.5.2, which was released maybe a little more quietly than it should have been. I don't even know if these changes have been documented. But, for the record (take all this as "a bunch of stuff not from the guy who actually wrote these features" - I might not be 100% accurate):

The two settings you'll want to change are "gridname" (for the proper name) and "authority" (for the URL). The defaults are, respectively, "Unspecified" and "http://$HOSTNAME/ganglia-webfrontend."

Note that the "Grid" part is actually defined in one of the PHP scripts (pretty sure it's conf.php - grep for it and I'm sure you'll find it). It's described as the default name for a gmetad source, I believe.

At the abstract level, "grid" is defined as the "cluster of clusters" level. The features that have been added to 2.5.2 make it possible for the web front-end to pull data from more than one gmetad source (basically, to cluster web front-ends!). Useful if you have a lot of clusters, I guess. :)
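A sketch of those two settings in gmetad.conf, using the grid name and web host from this thread purely as examples:

# /etc/gmetad.conf (options added in 2.5.2)
gridname "HEP DataGrid"
authority "http://farm002.hep.phy.cam.ac.uk/ganglia-webfrontend/"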
From: Santanu D. <sa...@he...> - 2003-03-05 16:16:00

Tru Huynh wrote:
> On Tue, Mar 04, 2003 at 10:31:50AM +0000, Santanu Das wrote:
>
>> Hi all,
>
> Hi,
>
> Just a hint from a new Ganglia user.
>
>> I've got 3 questions:
>>
>> 1. I've got 17 nodes in total (single CPU) and I specified 2
>>    data_sources, namely 'Testbed Farm' with 16 nodes and 'Testbed SE'
>>    with only 1 node. But all together it's showing 34 CPUs; 17 for
>>    each data_source. Is that normal, or am I doing something wrong
>>    with the configuration?
>
> You are probably using the same multicast channel for both
> data_sources. That was my problem; here is the configuration I made.

Many thanks Tru, now it's working. Wise advice!!!

> /etc/gmetad.conf
> ...
> data_source "Testbed Farm" HEP.hostname
> data_source "Testbed SE" SE.hostname
> ...
>
>> 2. Every time I restart 'gmetad', an error message comes up ...
>>    question: is that okay or do I need to do something else?
>
> gmond (on your 17 machines) and gmetad (on the web server)
> should be started at boot time.
>
>> 3. Now the report is shown as "Unspecified Grid". What does
>>    it mean and how do I change it to ours?
>
> There is a pull-down at the right of the label.

Actually I meant to ask how to change the label, i.e. instead of "Unspecified Grid" something like "HEP DataGrid" or else.

Cheers,
Santanu
From: Henry L. <hen...@ip...> - 2003-03-05 12:26:44

Hi,

We have the ganglia monitor core 2.5.2 installed on two clusters (20 and 68 hosts, different subnets, connected via gmond's "trusted_hosts") and watch it with gmetad/webfrontend 2.5.2 running on a machine which belongs to one of the two clusters. What we observe now (after installation of 2.5.2, not before) is that the rrd updates from the "remote" cluster occur less frequently (3-minute intervals) than the updates from the cluster the gmetad machine is a member of. That shows in the graphs (sort of "blocky" for that cluster) as well as in the update intervals of the rrd files. The gmetad polling interval is explicitly set to 15 seconds for both data_sources.

Any ideas why that is?

Thanks & Greetings

-- 
Henry
From: Tru H. <tr...@pa...> - 2003-03-04 12:49:17

On Tue, Mar 04, 2003 at 10:31:50AM +0000, Santanu Das wrote:
>
> Hi all,
>

Hi,

Just a hint from a new Ganglia user.

> I've got 3 questions:
>
> 1. I've got 17 nodes in total (single CPU) and I specified 2
>    data_sources, namely 'Testbed Farm' with 16 nodes and 'Testbed SE'
>    with only 1 node. But all together it's showing 34 CPUs; 17 for
>    each data_source. Is that normal, or am I doing something wrong
>    with the configuration?

You are probably using the same multicast channel for both data_sources. That was my problem; here is the configuration I made.

/etc/gmond.conf-Testbed
  name "Testbed Farm"
  owner "HEP"
  mcast_channel 239.1.200.1
  trusted_hosts ....
  num_nodes 17
  setuid nobody
  no_gexec on

/etc/gmond.conf-SE
  name "Testbed SE"
  owner "HEP"
  mcast_channel 239.1.1.1
  trusted_hosts ....
  num_nodes 17
  setuid nobody
  no_gexec on

> 2. Every time I restart 'gmetad', an error message comes up ...

/etc/gmetad.conf
  ...
  data_source "Testbed Farm" HEP.hostname
  data_source "Testbed SE" SE.hostname
  ...

> question: is that okay or do I need to do something else?

gmond (on your 17 machines) and gmetad (on the web server) should be started at boot time.

> 3. Now the report is shown as "Unspecified Grid". What does
>    it mean and how do I change it to ours?

There is a pull-down at the right of the label.

Regards,

Tru
-- 
Dr Tru Huynh | http://www.pasteur.fr/recherche/unites/Binfs/
mailto:tr...@pa... | tel/fax +33 1 45 68 87 37/19
Institut Pasteur, 25-28 rue du Docteur Roux, 75724 Paris CEDEX 15 France
From: Santanu D. <sa...@he...> - 2003-03-04 11:27:49

Hi all,

I sent this mail a few days back. As I didn't get any reply from anybody, I suppose I did something wrong in sending it, so I'm re-mailing it.

I've got 3 questions:

1. I've got 17 nodes in total (single CPU) and I specified 2 data_sources, namely 'Testbed Farm' with 16 nodes and 'Testbed SE' with only 1 node. But all together it's showing 34 CPUs; 17 for each data_source. Is that normal, or am I doing something wrong with the configuration?

   http://farm002.hep.phy.cam.ac.uk/cavendish/

2. Every time I restart 'gmetad', an error message comes up (in the browser): "Ganglia cannot find a data source. Is gmond running?" and it comes back with graphs etc. on restarting 'gmond' on any node from each data_source. Again the same question: is that okay or do I need to do something else?

3. Now the report is shown as "Unspecified Grid". What does it mean and how do I change it to ours?

Any help highly appreciated.

Regards,
Santanu
From: Vu, P. A (MP Technology) <Phu...@bp...> - 2003-03-02 05:31:47

That was the problem... after our webfrontend came back up today with the new php memory limit (16M), I have not run into the memory problem again.

Thanks again Jeremy.

Phuong

-----Original Message-----
From: Jeremy Weatherford [mailto:xi...@xi...]
Sent: Saturday, March 01, 2003 12:39 AM
To: Vu, Phuong A (MP Technology)
Cc: gan...@li...
Subject: RE: [Ganglia-general] memory error in webfrontend when adding more node/cluster

Okay... it would probably be helpful to know the filename and line number where it ran out of memory, to see what it might be doing that requires so much memory. Also, I'm assuming you made sure that the new setting was actually applied (change reflected in the error message, etc.)

Jeremy Weatherford
xi...@xi...
http://xidus.net

On Sat, 1 Mar 2003, Vu, Phuong A (MP Technology) wrote:

> Thanks Jeremy... I tried bumping the php memory to 16M but still ran
> into the same error message. That is probably the problem but for
> some reason it is not taking it.
>
> Thanks for the clue.
>
> Phuong
>
> -----Original Message-----
> From: Jeremy Weatherford [mailto:xi...@xi...]
> Sent: Friday, February 28, 2003 11:43 PM
> To: Vu, Phuong A (MP Technology)
> Cc: 'gan...@li...'; 'gan...@li...'
> Subject: Re: [Ganglia-general] memory error in webfrontend when adding
> more node/cluster
>
> Assuming that this isn't unusual behavior for the web frontend (runaway
> memory allocation, or some such), then you just need to increase PHP's
> memory_limit parameter, usually in /etc/php.ini. Yours is apparently set
> at the default of 8M; I'd try increasing that to 16M or so before
> assuming something's broken.
>
> Jeremy Weatherford
>
> On Fri, 28 Feb 2003, Vu, Phuong A (MP Technology) wrote:
>
> > I am using 2.5.1 and have been running on a decent size grid of 14
> > clusters, each with 16 machines, without any real problem. When I was
> > adding more clusters today, I noticed that I started seeing this error:
> >
> > Fatal error: Allowed memory size of 8388608 bytes exhausted (tried to
> > allocate 184320 bytes) in /var/www/html/ganglia-webfrontend-2.5.1/......
> >
> > I have looked in the discussion archive but did not see reference to
> > this. Has anybody seen this before? The webfrontend machine is a
> > 2cpus/1GB Pentium III 800 Mhz system.
> >
> > Thanks,
> > Phuong Vu
From: Vu, P. A (MP Technology) <Phu...@bp...> - 2003-03-01 06:50:27

My webfrontend machine is off the network now. I rebooted it and it won't come back, and I can't get to it from home, so I will look into it sometime tomorrow. I will send an update.

Thanks again,
Phuong

-----Original Message-----
From: Jeremy Weatherford [mailto:xi...@xi...]
Sent: Saturday, March 01, 2003 12:39 AM
To: Vu, Phuong A (MP Technology)
Cc: gan...@li...
Subject: RE: [Ganglia-general] memory error in webfrontend when adding more node/cluster

Okay... it would probably be helpful to know the filename and line number where it ran out of memory, to see what it might be doing that requires so much memory. Also, I'm assuming you made sure that the new setting was actually applied (change reflected in the error message, etc.)

Jeremy Weatherford
xi...@xi...
http://xidus.net

On Sat, 1 Mar 2003, Vu, Phuong A (MP Technology) wrote:

> Thanks Jeremy... I tried bumping the php memory to 16M but still ran
> into the same error message. That is probably the problem but for
> some reason it is not taking it.
>
> Thanks for the clue.
>
> Phuong
>
> -----Original Message-----
> From: Jeremy Weatherford [mailto:xi...@xi...]
> Sent: Friday, February 28, 2003 11:43 PM
> To: Vu, Phuong A (MP Technology)
> Cc: 'gan...@li...'; 'gan...@li...'
> Subject: Re: [Ganglia-general] memory error in webfrontend when adding
> more node/cluster
>
> Assuming that this isn't unusual behavior for the web frontend (runaway
> memory allocation, or some such), then you just need to increase PHP's
> memory_limit parameter, usually in /etc/php.ini. Yours is apparently set
> at the default of 8M; I'd try increasing that to 16M or so before
> assuming something's broken.
>
> Jeremy Weatherford
>
> On Fri, 28 Feb 2003, Vu, Phuong A (MP Technology) wrote:
>
> > I am using 2.5.1 and have been running on a decent size grid of 14
> > clusters, each with 16 machines, without any real problem. When I was
> > adding more clusters today, I noticed that I started seeing this error:
> >
> > Fatal error: Allowed memory size of 8388608 bytes exhausted (tried to
> > allocate 184320 bytes) in /var/www/html/ganglia-webfrontend-2.5.1/......
> >
> > I have looked in the discussion archive but did not see reference to
> > this. Has anybody seen this before? The webfrontend machine is a
> > 2cpus/1GB Pentium III 800 Mhz system.
> >
> > Thanks,
> > Phuong Vu
From: Jeremy W. <xi...@xi...> - 2003-03-01 06:39:52

Okay... it would probably be helpful to know the filename and line number where it ran out of memory, to see what it might be doing that requires so much memory. Also, I'm assuming you made sure that the new setting was actually applied (change reflected in the error message, etc.)

Jeremy Weatherford
xi...@xi...
http://xidus.net

On Sat, 1 Mar 2003, Vu, Phuong A (MP Technology) wrote:

> Thanks Jeremy... I tried bumping the php memory to 16M but still ran
> into the same error message. That is probably the problem but for
> some reason it is not taking it.
>
> Thanks for the clue.
>
> Phuong
>
> -----Original Message-----
> From: Jeremy Weatherford [mailto:xi...@xi...]
> Sent: Friday, February 28, 2003 11:43 PM
> To: Vu, Phuong A (MP Technology)
> Cc: 'gan...@li...'; 'gan...@li...'
> Subject: Re: [Ganglia-general] memory error in webfrontend when adding
> more node/cluster
>
> Assuming that this isn't unusual behavior for the web frontend (runaway
> memory allocation, or some such), then you just need to increase PHP's
> memory_limit parameter, usually in /etc/php.ini. Yours is apparently set
> at the default of 8M; I'd try increasing that to 16M or so before
> assuming something's broken.
>
> Jeremy Weatherford
>
> On Fri, 28 Feb 2003, Vu, Phuong A (MP Technology) wrote:
>
> > I am using 2.5.1 and have been running on a decent size grid of 14
> > clusters, each with 16 machines, without any real problem. When I was
> > adding more clusters today, I noticed that I started seeing this error:
> >
> > Fatal error: Allowed memory size of 8388608 bytes exhausted (tried to
> > allocate 184320 bytes) in /var/www/html/ganglia-webfrontend-2.5.1/......
> >
> > I have looked in the discussion archive but did not see reference to
> > this. Has anybody seen this before? The webfrontend machine is a
> > 2cpus/1GB Pentium III 800 Mhz system.
> >
> > Thanks,
> > Phuong Vu