From: richard g. <gr...@ds...> - 2007-10-28 04:20:48
Thomas, my comments are interspersed below, but this is from memory only.

Quoting Thomas Spanner <tho...@gm...>:

>>> The task, in short, is the implementation of real-time monitoring, i.e. a
>>> timeframe of about 30 minutes to 3 hours overall. gmond should collect the
>>> data every 1 or 2 seconds, so state changes can be immediately recognized
>>> and be dealt with.

OK. One of the good things about Ganglia is the relatively small time lag between a change on a monitored host and that change showing up on the Ganglia graphs. So small latency, Ganglia good.

But there are traps with 1-2 second sampling/polling rates. From a host's gmond point of view there is no intrinsic problem with gathering at that rate - except that the O/S itself may not want to provide it. For example, Cygwin smooths the user/system CPU data in /proc over a period much longer than 2 seconds, so polling Cygwin faster than every 5 seconds does not reveal anything more. And the Linux graphs don't look as "spiky" as the Solaris graphs, which may imply the Linux /proc smooths some data too - I haven't checked.

Now to gmetad. Each data source has its own thread, but both the polling of the data source to get the XML and the updating of the RRD files happen in that single thread. This is fine as long as the polling interval leaves the thread enough time to do both its polling and its RRD updating. But if the thread runs out of time before the next poll, you will lose data and potentially have gaps in the RRD graphs. Gmetad is not a fail-graceful queuing design. I say more about gmetad below.

>>> So far, we have been studying the code and the rrdtool, and hopefully
>>> understood how these things work.
>>>
>>> Before we play around with the variables, I'd like to ask your opinion:
>>>
>>> 1. Can this be accomplished?
>>> My concern is that the overhead gets too much,
>>> and ganglia slows the whole cluster down (there must be a reason why the
>>> step variable of the gmetad is 15 seconds as default.)

No, maybe, and with some effort, very nearly. There is code in gmetad that assumes the polling rate of metrics is longer than some amount (say 20 seconds). For example (gmetad/data_thread.c - data_thread(), line 184 or so):

      gettimeofday(&end, NULL);

      /* Sleep somewhere between (step +/- 5sec.) */
      sleep_time = (d->step - 5) + (10 * (rand()/(float)RAND_MAX))
                   - (end.tv_sec - start.tv_sec);
      if( sleep_time > 0 )
         sleep(sleep_time);
   }
   return NULL;

The sleep time of each data-source thread is randomized; the logic, I assume, is there to stop all the threads firing at the same time. There is similar grid-level code:

   gmetad.c: sleep_time = 10 + ((30-10)*1.0) * rand()/(RAND_MAX + 1.0);
   gmetad.c: sleep(sleep_time);

So you need to change some code.

>>> 2. Is it sufficient to lower the "collect_every" and
>>> "time_threshold" variables in gmond.conf to speed up the data
>>> collection, or isn't it that simple?

No; it is necessary but not sufficient.

>>> I want gmond to write its output to the console to see what the
>>> collector does. So I change a value, stop and restart with "gmond -d9
>>> start" but then get:
>>> [...]
>>> tcp_listen() on xml_port failed: Address already in use
>>>
>>> How do I get the output without restarting the computer? (When I start
>>> gmond for the first time it works.) I think I tried to restart gmetad,
>>> too, but the problem remains.

Not sure about this. You can just netcat (nc host 8649) and get the output, but changing the config requires a gmond restart. Remember gmond has debugging modes (-d 3).

The final comments are these:

- The step size of the RRD files is determined by the gmetad poll rate and NOT by the configured polling interval of a metric in gmond. So some metrics will be recorded far more often than their semantics warrant.
- If there is a pre-existing RRD file for a metric, changing the gmetad polling rate will not change the step size in that RRD file. The step size is set at RRD creation time and that's it; to change it, remove the RRD file.
- Faster polling wanted? Ditch all metrics except the critical ones you want to see.
- Put the RRD files on a SAN or a tmpfs filesystem.
- Try 5 second polling, then reduce it. My experience is that 1-2 second polling is not possible.
- You may get data gaps when RRD needs to write some of its rolled-up data points - at every hour, for example.

Phew.

- Richard

>>> Your help is greatly appreciated,
>>> thanks,
>>> Tom and Percy

_______________________________________________
Ganglia-developers mailing list
Gan...@li...
https://lists.sourceforge.net/lists/listinfo/ganglia-developers
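One way to see why the data_thread() sleep logic above fights a 1-2 second step: the arithmetic is reproduced here in a standalone sketch (this is not gmetad code; compute_sleep and the example step/elapsed values are mine, but the formula is the one quoted from data_thread.c). With the default 15 second step there is plenty of sleep headroom; with a 2 second step the (step - 5) term is already negative, so sleep_time is negative even for a cheap poll, the `if (sleep_time > 0)` guard skips the sleep entirely, and the thread polls flat out.

   /* sketch.c - illustrates the sleep computation quoted from
      gmetad/data_thread.c; compute_sleep() is a hypothetical
      wrapper around the same formula, not a gmetad function. */
   #include <assert.h>
   #include <stdio.h>

   /* (step - 5) + random value in [0,10) - seconds spent polling/updating */
   static int compute_sleep(int step, double rand01, int elapsed_sec)
   {
       return (step - 5) + (int)(10.0 * rand01) - elapsed_sec;
   }

   int main(void)
   {
       /* Default 15s step, 2s of poll + RRD work: sleeps 8-18s. */
       printf("step=15: sleep_time=%d\n", compute_sleep(15, 0.5, 2));

       /* A 2s step: negative even when polling cost only 1s, so the
          guard `if (sleep_time > 0)` means the thread never sleeps. */
       int s = compute_sleep(2, 0.0, 1);
       printf("step=2:  sleep_time=%d\n", s);
       assert(s < 0);
       return 0;
   }

This is why lowering gmond's collect_every alone cannot deliver a 1-2 second end-to-end rate: the gmetad thread is hard-coded to aim for roughly step +/- 5 seconds between polls.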