[Ganglia-general] Ganglia condor problem

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 454-5900

Hi, I have a problem configurating ganglia in a cluster with condor. I
have the master node with gmetad and gmond running, and then other
computing nodes with gmond installed and running in port 8649.

I do telnet host 8649 and I get an xml with ganglia info. OK!.

But when I enter my ganglia web frontend I can't see any of my computing
nodes.

Here are my gmond.conf and gmetad.conf. 

gmond.conf

# This is an example of a Ganglia Meta Daemon configuration file
#                http://ganglia.sourceforge.net/
#
# $Id: gmetad.conf 667 2006-07-20 08:49:41Z knobi1 $
#
#-------------------------------------------------------------------------------
# Setting the debug_level to 1 will keep daemon in the forground and
# show only error messages. Setting this value higher than 1 will make
# gmetad output debugging information and stay in the foreground.
# default: 0
# debug_level 10
#
#-------------------------------------------------------------------------------
# What to monitor. The most important section of this file.
#
# The data_source tag specifies either a cluster or a grid to
# monitor. If we detect the source is a cluster, we will maintain a
complete
# set of RRD databases for it, which can be used to create historical
# graphs of the metrics. If the source is a grid (it comes from another
gmetad),
# we will only maintain summary RRDs for it.
#
# Format:
# data_source "my cluster" [polling interval] address1:port
addreses2:port ...
#
# The keyword 'data_source' must immediately be followed by a unique
# string which identifies the source, then an optional polling interval
in
# seconds. The source will be polled at this interval on average.
# If the polling interval is omitted, 15sec is asssumed.
#
# A list of machines which service the data source follows, in the
# format ip:port, or name:port. If a port is not specified then 8649
# (the default gmond port) is assumed.
# default: There is no default value
#
# data_source "my cluster" 10 localhost  my.machine.edu:8649
1.2.3.5:8655
# data_source "my grid" 50 1.3.4.7:8655 grid.org:8651
grid-backup.org:8651
# data_source "another source" 1.3.4.7:8655  1.3.4.8

data_source "LINCE" 192.168.0.3:8649 192.168.0.4:8649 192.168.0.5:8649
192.168.0.5:8649 192.168.0.6:8649 192.168.0.7:8649 192.168.0.8:8649
192.168.0.9:8649 192.168.0.10:8649 192.168.0.11:8649 192.168.0.12:8649
192.168.0.13:8649 192.168.0.14:8649 192.168.0.15:8649 192.168.19.16:8649

#
# Round-Robin Archives
# You can specify custom Round-Robin archives here (defaults are listed
below)
#
# RRAs "RRA:AVERAGE:0.5:1:244" "RRA:AVERAGE:0.5:24:244"
"RRA:AVERAGE:0.5:168:244" "RRA:AVERAGE:0.5:672:244" \
#      "RRA:AVERAGE:0.5:5760:374"
#

#
#-------------------------------------------------------------------------------
# Scalability mode. If on, we summarize over downstream grids, and
respect
# authority tags. If off, we take on 2.5.0-era behavior: we do not wrap
our output
# in <GRID></GRID> tags, we ignore all <GRID> tags we see, and always
assume
# we are the "authority" on data source feeds. This approach does not
scale to
# large groups of clusters, but is provided for backwards compatibility.
# default: on
scalable on
#
#-------------------------------------------------------------------------------
# The name of this Grid. All the data sources above will be wrapped in a
GRID
# tag with this name.
# default: Unspecified

gridname "LINCE"
#
#-------------------------------------------------------------------------------
# The authority URL for this grid. Used by other gmetads to locate
graphs
# for our data sources. Generally points to a ganglia/
# website on this machine.

#authority "http://localhost/ganglia/"
#   where hostname is the name of this machine, as defined by
gethostname().
# authority "http://mycluster.org/newprefix/"
#
#-------------------------------------------------------------------------------
# List of machines this gmetad will share XML with. Localhost
# is always trusted.
# default: There is no default value
# trusted_hosts 127.0.0.1 169.229.50.165 my.gmetad.org
#
#-------------------------------------------------------------------------------
# If you want any host which connects to the gmetad XML to receive
# data, then set this value to "on"
# default: off

all_trusted on

#
#-------------------------------------------------------------------------------
# If you don't want gmetad to setuid then set this to off
# default: on
# setuid off
#
#-------------------------------------------------------------------------------
# User gmetad will setuid to (defaults to "nobody")
# default: "nobody"
# setuid_username "nobody"
#
#-------------------------------------------------------------------------------
# The port gmetad will answer requests for XML
# default: 8651
xml_port 8649
#
#-------------------------------------------------------------------------------
# The port gmetad will answer queries for XML. This facility allows
# simple subtree and summation views of the XML tree.
# default: 8652
# interactive_port 8652
#
#-------------------------------------------------------------------------------
# The number of threads answering XML requests
# default: 4
# server_threads 10
#
#-------------------------------------------------------------------------------
# Where gmetad stores its round-robin databases
# default: "/var/lib/ganglia/rrds"

rrd_rootdir "/home/ganglia/bb_dd"

[globus@myr1 etc]$ cat gmond.conf
/* This configuration is as close to 2.5.x default behavior as possible
   The values closely match ./gmond/metric.h definitions in 2.5.x */
globals {
  daemonize = yes
  setuid = yes
  user = nobody
  debug_level = 0
  max_udp_msg_len = 1472
  mute = no
  deaf = no
  host_dmax = 0 /*secs */
  cleanup_threshold = 300 /*secs */
  gexec = no
}

/* If a cluster attribute is specified, then all gmond hosts are wrapped
inside
 * of a <CLUSTER> tag.  If you do not specify a cluster tag, then all
<HOSTS> will
 * NOT be wrapped inside of a <CLUSTER> tag. */
cluster {
  name = "Cluster"
  owner = "unspecified"
  latlong = "unspecified"
  url = "unspecified"
}

/* The host section describes attributes of the host, like the location
*/
host {
  location = "unspecified"
}

/* Feel free to specify as many udp_send_channels as you like.  Gmond
   used to only support having a single channel */
udp_send_channel {
  mcast_join = 239.2.11.71
  port = 8649
  ttl = 1
}

/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
  mcast_join = 239.2.11.71
  port = 8649
  bind = 239.2.11.71
}

/* You can specify as many tcp_accept_channels as you like to share
   an xml description of the state of the cluster */
tcp_accept_channel {
  port = 8649
}

/* The old internal 2.5.x metric array has been replaced by the
following
   collection_group directives.  What follows is the default behavior
for
   collecting and sending metrics that is as close to 2.5.x behavior as
   possible. */

/* This collection group will cause a heartbeat (or beacon) to be sent
every
   20 seconds.  In the heartbeat is the GMOND_STARTED data which
expresses
   the age of the running gmond. */
collection_group {
  collect_once = yes
  time_threshold = 20
  metric {
    name = "heartbeat"
  }
}

/* This collection group will send general info about this host every
1200 secs.
   This information doesn't change between reboots and is only collected
once. */
collection_group {
  collect_once = yes
  time_threshold = 1200
  metric {
    name = "cpu_num"
  }
  metric {
    name = "cpu_speed"
  }
  metric {
    name = "mem_total"
  }
  /* Should this be here? Swap can be added/removed between reboots. */
  metric {
    name = "swap_total"
  }
  metric {
    name = "boottime"
  }
  metric {
    name = "machine_type"
  }
  metric {
    name = "os_name"
  }
  metric {
    name = "os_release"
  }
  metric {
    name = "location"
  }
}

/* This collection group will send the status of gexecd for this host
every 300 secs */
/* Unlike 2.5.x the default behavior is to report gexecd OFF.  */
collection_group {
  collect_once = yes
  time_threshold = 300
  metric {
    name = "gexec"
  }
}

/* This collection group will collect the CPU status info every 20 secs.
   The time threshold is set to 90 seconds.  In honesty, this
time_threshold could be
   set significantly higher to reduce unneccessary network chatter. */
collection_group {
  collect_every = 20
  time_threshold = 90
  /* CPU status */
  metric {
    name = "cpu_user"
    value_threshold = "1.0"
  }
  metric {
    name = "cpu_system"
    value_threshold = "1.0"
  }
  metric {
    name = "cpu_idle"
    value_threshold = "5.0"
  }
  metric {
    name = "cpu_nice"
    value_threshold = "1.0"
  }
  metric {
    name = "cpu_aidle"
    value_threshold = "5.0"
  }
  metric {
    name = "cpu_wio"
    value_threshold = "1.0"
  }
  /* The next two metrics are optional if you want more detail...
     ... since they are accounted for in cpu_system.
  metric {
    name = "cpu_intr"
    value_threshold = "1.0"
  }
  metric {
    name = "cpu_sintr"
    value_threshold = "1.0"
  }
  */
}

collection_group {
  collect_every = 20
  time_threshold = 90
  /* Load Averages */
  metric {
    name = "load_one"
    value_threshold = "1.0"
  }
  metric {
    name = "load_five"
    value_threshold = "1.0"
  }
  metric {
    name = "load_fifteen"
    value_threshold = "1.0"
  }
}

/* This group collects the number of running and total processes */
collection_group {
  collect_every = 80
  time_threshold = 950
  metric {
    name = "proc_run"
    value_threshold = "1.0"
  }
  metric {
    name = "proc_total"
    value_threshold = "1.0"
  }
}

/* This collection group grabs the volatile memory metrics every 40 secs
and
   sends them at least every 180 secs.  This time_threshold can be
increased
   significantly to reduce unneeded network traffic. */
collection_group {
  collect_every = 40
  time_threshold = 180
  metric {
    name = "mem_free"
    value_threshold = "1024.0"
  }
  metric {
    name = "mem_shared"
    value_threshold = "1024.0"
  }
  metric {
    name = "mem_buffers"
    value_threshold = "1024.0"
  }
  metric {
    name = "mem_cached"
    value_threshold = "1024.0"
  }
  metric {
    name = "swap_free"
    value_threshold = "1024.0"
  }
}

collection_group {
  collect_every = 40
  time_threshold = 300
  metric {
    name = "bytes_out"
    value_threshold = 4096
  }
  metric {
    name = "bytes_in"
    value_threshold = 4096
  }
  metric {
    name = "pkts_in"
    value_threshold = 256
  }
  metric {
    name = "pkts_out"
    value_threshold = 256
  }
}

/* Different than 2.5.x default since the old config made no sense */
collection_group {
  collect_every = 1800
  time_threshold = 3600
  metric {
    name = "disk_total"
    value_threshold = 1.0
  }
}

collection_group {
  collect_every = 40
  time_threshold = 180
  metric {
    name = "disk_free"
    value_threshold = 1.0
  }
  metric {
    name = "part_max_used"
    value_threshold = 1.0
  }
}

gmetad.conf

# This is an example of a Ganglia Meta Daemon configuration file
#                http://ganglia.sourceforge.net/
#
# $Id: gmetad.conf 667 2006-07-20 08:49:41Z knobi1 $
#
#-------------------------------------------------------------------------------
# Setting the debug_level to 1 will keep daemon in the forground and
# show only error messages. Setting this value higher than 1 will make
# gmetad output debugging information and stay in the foreground.
# default: 0
# debug_level 10
#
#-------------------------------------------------------------------------------
# What to monitor. The most important section of this file.
#
# The data_source tag specifies either a cluster or a grid to
# monitor. If we detect the source is a cluster, we will maintain a
complete
# set of RRD databases for it, which can be used to create historical
# graphs of the metrics. If the source is a grid (it comes from another
gmetad),
# we will only maintain summary RRDs for it.
#
# Format:
# data_source "my cluster" [polling interval] address1:port
addreses2:port ...
#
# The keyword 'data_source' must immediately be followed by a unique
# string which identifies the source, then an optional polling interval
in
# seconds. The source will be polled at this interval on average.
# If the polling interval is omitted, 15sec is asssumed.
#
# A list of machines which service the data source follows, in the
# format ip:port, or name:port. If a port is not specified then 8649
# (the default gmond port) is assumed.
# default: There is no default value
#
# data_source "my cluster" 10 localhost  my.machine.edu:8649
1.2.3.5:8655
# data_source "my grid" 50 1.3.4.7:8655 grid.org:8651
grid-backup.org:8651
# data_source "another source" 1.3.4.7:8655  1.3.4.8

data_source "LINCE" 192.168.0.3:8649 192.168.0.4:8649 192.168.0.5:8649
192.168.0.5:8649 192.168.0.6:8649 192.168.0.7:8649 192.168.0.8:8649
192.168.0.9:8649 192.168.0.10:8649 192.168.0.11:8649 192.168.0.12:8649
192.168.0.13:8649 192.168.0.14:8649 192.168.0.15:8649 192.168.19.16:8649

#
# Round-Robin Archives
# You can specify custom Round-Robin archives here (defaults are listed
below)
#
# RRAs "RRA:AVERAGE:0.5:1:244" "RRA:AVERAGE:0.5:24:244"
"RRA:AVERAGE:0.5:168:244" "RRA:AVERAGE:0.5:672:244" \
#      "RRA:AVERAGE:0.5:5760:374"
#

#
#-------------------------------------------------------------------------------
# Scalability mode. If on, we summarize over downstream grids, and
respect
# authority tags. If off, we take on 2.5.0-era behavior: we do not wrap
our output
# in <GRID></GRID> tags, we ignore all <GRID> tags we see, and always
assume
# we are the "authority" on data source feeds. This approach does not
scale to
# large groups of clusters, but is provided for backwards compatibility.
# default: on
scalable on
#
#-------------------------------------------------------------------------------
# The name of this Grid. All the data sources above will be wrapped in a
GRID
# tag with this name.
# default: Unspecified

gridname "LINCE"
#
#-------------------------------------------------------------------------------
# The authority URL for this grid. Used by other gmetads to locate
graphs
# for our data sources. Generally points to a ganglia/
# website on this machine.

#authority "http://localhost/ganglia/"
#   where hostname is the name of this machine, as defined by
gethostname().
# authority "http://mycluster.org/newprefix/"
#
#-------------------------------------------------------------------------------
# List of machines this gmetad will share XML with. Localhost
# is always trusted.
# default: There is no default value
# trusted_hosts 127.0.0.1 169.229.50.165 my.gmetad.org
#
#-------------------------------------------------------------------------------
# If you want any host which connects to the gmetad XML to receive
# data, then set this value to "on"
# default: off

all_trusted on

#
#-------------------------------------------------------------------------------
# If you don't want gmetad to setuid then set this to off
# default: on
# setuid off
#
#-------------------------------------------------------------------------------
# User gmetad will setuid to (defaults to "nobody")
# default: "nobody"
# setuid_username "nobody"
#
#-------------------------------------------------------------------------------
# The port gmetad will answer requests for XML
# default: 8651
xml_port 8649
#
#-------------------------------------------------------------------------------
# The port gmetad will answer queries for XML. This facility allows
# simple subtree and summation views of the XML tree.
# default: 8652
# interactive_port 8652
#
#-------------------------------------------------------------------------------
# The number of threads answering XML requests
# default: 4
# server_threads 10
#
#-------------------------------------------------------------------------------
# Where gmetad stores its round-robin databases
# default: "/var/lib/ganglia/rrds"

rrd_rootdir "/home/ganglia/bb_dd"

Does somebody know how to solve my problem, or a way to find out what is
working bad?.

Thanks.

-- 
Iván Fernández Hernández
Becario investigador
Instituto de investigación en Informática de Albacete (I3A)

[Ganglia-general] Ganglia condor problem

Scalable, distributed monitoring system for high-performance computing

[Ganglia-general] Ganglia condor problem