Partial RRD data

Help
M. Litwin
2010-07-09
2013-05-20
  • M. Litwin

    M. Litwin - 2010-07-09

    I have set up nagiosgraph, and it appears that only a small amount of the collected data is being stored in the RRD files, in a completely random distribution. I feel like I had this problem before, where the perfdata was being purged sporadically so that the 30-second runs of insert.pl were only getting a fraction of all the collected data. I was looking at an old thread where I was advised to remove the command that was calling insert.pl with every check, but I am not sure how to do that. I tried it, and when I did, no data was collected. Is there a guide on how to do this properly? Maybe I misread the INSTALL doc, but if this is an obvious problem, please tell me what I did wrong and I will do it again. Thanks so much.

     
  • Matthew Wall

    Matthew Wall - 2010-07-09

    check your nagios configuration files.  ensure that you are using only the batch mode configuration as described in the 'Configuring Data Processing' section of INSTALL.

     
  • M. Litwin

    M. Litwin - 2010-07-10

    I am doing batch mode and as instructed I have service_perfdata_file commented out of nagios.cfg but it is behaving like something else is doing the processing. Very odd.

     
  • M. Litwin

    M. Litwin - 2010-07-10

    I verified that nagios.cfg and nagiosgraph.conf are set up according to INSTALL. service_perfdata_command is definitely not defined in nagios.cfg. I went through it three times, but I am still only getting partial RRD data, like something else is processing the perf data. Is there anything else that might be doing this?

     
  • M. Litwin

    M. Litwin - 2010-07-12

    Me again. This is really confusing. nagiosgraph is definitely behaving as if service_perfdata_command is defined in nagios.cfg, since I am only getting fragments of my data parsed, but I made sure that it is not defined. Is there any other possible reason data is being expunged before batch mode can get to it? I am totally stumped.

     
  • Matthew Wall

    Matthew Wall - 2010-07-12

    what is the service_perfdata_file_processing_interval?

    what are the heartbeat and step for one of the fragmented rrd files?

    is nagios invoking insert.pl?  if so, how often?

    are data going into the nagios perfdata.log file?

    what do you see in the nagiosgraph log from insert.pl when you set debug_insert=5?

    what configuration files have you modified, and how?

    what perl scripts have you modified, and how?

     
  • M. Litwin

    M. Litwin - 2010-07-12

    service_perfdata_file_processing_interval=30

    heartbeat is 600, step is 300

    insert.pl is only called by process-service-perfdata which is only specified in nagios.cfg as: "service_perfdata_file_processing_command=process-service-perfdata"

    perfdata.log fills up merrily

    at debug=5, an entry looks like this:

    Mon Jul 12 22:54:47 2010 insert.pl debug createrrd(srwp01cfn014, Total_Processes, 1278975248, procs)
    Mon Jul 12 22:54:47 2010 insert.pl debug createrrd checking /usr/local/nagios/rrd/srwp01cfn014/Total_Processes___procs.rrd
    Mon Jul 12 22:54:47 2010 insert.pl debug createrrd resolutions: 600 700 775 797
    Mon Jul 12 22:54:47 2010 insert.pl debug createrrd heartbeat: 600
    Mon Jul 12 22:54:47 2010 insert.pl debug createrrd step: 300
    Mon Jul 12 22:54:47 2010 insert.pl debug labels-> = [
      'procs',
      'GAUGE',
      '183'
    ];
    Mon Jul 12 22:54:47 2010 insert.pl debug createminmax opts = {
      'labels' => [
        [
          'procs',
          'GAUGE',
          '183'
        ]
      ],
      'directory' => '/usr/local/nagios/rrd/srwp01cfn014',
      'service' => 'Total_Processes',
      'conf' => 'min'
    };
    Mon Jul 12 22:54:47 2010 insert.pl debug checkminmax(min, Total_Processes, /usr/local/nagios/rrd/srwp01cfn014, Total_Processes___procs.rrd)
    Mon Jul 12 22:54:47 2010 insert.pl debug createminmax opts = {
      'labels' => [
        [
          'procs',
          'GAUGE',
          '183'
        ]
      ],
      'directory' => '/usr/local/nagios/rrd/srwp01cfn014',
      'service' => 'Total_Processes',
      'conf' => 'max'
    };
    Mon Jul 12 22:54:47 2010 insert.pl debug checkminmax(max, Total_Processes, /usr/local/nagios/rrd/srwp01cfn014, Total_Processes___procs.rrd)
    Mon Jul 12 22:54:47 2010 insert.pl debug createrrd filenames = [
      'Total_Processes___procs.rrd'
    ];
    Mon Jul 12 22:54:47 2010 insert.pl debug createrrd datasets = [
      [
        0
      ]
    ];
    Mon Jul 12 22:54:47 2010 insert.pl debug rrdupdate(Total_Processes___procs.rrd, 1278975249, srwp01cfn014)
    Mon Jul 12 22:54:47 2010 insert.pl info runupdate dataset = [
      '/usr/local/nagios/rrd/srwp01cfn014/Total_Processes___procs.rrd',
      '1278975249:183'
    ];
    Mon Jul 12 22:54:47 2010 insert.pl debug getdebug(insert, srwp01cmi003, Partition_Free_-_/)
    Mon Jul 12 22:54:47 2010 insert.pl debug getdebug found debug_insert
    Mon Jul 12 22:54:47 2010 insert.pl debug processdata data = [
      '1278975249',
      'srwp01cmi003',
      'Partition_Free_-_/',
      'DISK OK - free space: / 783 MB (78% inode=96%):',
      '/=217MB;737;895;0;1054'
    ];

    All I have modified was the map file. Otherwise I just configured the latest tarball from sourceforge for my environment. Oh, I did change this in the cgis:
    use lib '/usr/local/nagios/nagiosgraph/etc';

    That's pretty much it, I think.

     
  • M. Litwin

    M. Litwin - 2010-07-22

    Okay, I am pretty stumped here.

    Let me review what I have in nagios.cfg:

    process_performance_data=1
    #service_perfdata_command=process-service-perfdata
    service_perfdata_file=/usr/local/nagios/var/perfdata.log
    service_perfdata_file_template=$LASTSERVICECHECK$||$HOSTNAME$||$SERVICEDESC$||$SERVICEOUTPUT$||$SERVICEPERFDATA$
    service_perfdata_file_mode=a
    service_perfdata_file_processing_interval=30
    service_perfdata_file_processing_command=process-service-perfdata

    This is what my command.cfg looks like

    define command{
            command_name   process-service-perfdata
            command_line   /usr/local/nagios/nagiosgraph/insert.pl
            }
    

    Data is going into /usr/local/nagios/var/perfdata.log, and I am seeing activity in nagiosgraph.log (I showed a smidge of it above), but it clearly is not getting everything.

    I wish I knew more about exactly how the data gets taken from nagios in batch mode so I could provide better forensics, but I am at a loss.
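
For readers following along, here is a minimal sketch of how one line of perfdata.log maps onto the five fields of the service_perfdata_file_template shown above. It is not nagiosgraph's own code. The names `PerfLine` and `parse_perf_line` are made up for illustration, and the sample line is reconstructed from the `processdata` debug output earlier in the thread.

```python
from typing import NamedTuple, Optional

# Separator used between macros in service_perfdata_file_template above.
FIELD_SEPARATOR = "||"


class PerfLine(NamedTuple):
    """Hypothetical container for the five template fields."""
    last_check: int
    host: str
    service: str
    output: str
    perfdata: str


def parse_perf_line(line: str) -> Optional[PerfLine]:
    """Split one perfdata.log line; return None unless it has five fields
    and a numeric timestamp in the first one."""
    parts = line.rstrip("\n").split(FIELD_SEPARATOR)
    if len(parts) != 5 or not parts[0].isdigit():
        return None
    return PerfLine(int(parts[0]), parts[1], parts[2], parts[3], parts[4])


# Sample line reconstructed from the debug log quoted earlier in the thread.
sample = (
    "1278975249||srwp01cmi003||Partition_Free_-_/||"
    "DISK OK - free space: / 783 MB (78% inode=96%):||/=217MB;737;895;0;1054"
)
record = parse_perf_line(sample)
print(record.host, record.service, record.perfdata)
```

One caveat of this simple split: plugin output that itself contains `||` would produce more than five fields and be rejected here.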

     
  • M. Litwin

    M. Litwin - 2010-07-30

    since I have all but given up on getting this working, can someone point me to the 1.3.1 documentation so I can just revert?

     
  • Lars Jørgensen

    Lars Jørgensen - 2010-09-18

    I seem to be having the exact same problem. Did you solve it? Did reverting to 1.3.1 work?

     
  • M. Litwin

    M. Litwin - 2010-09-18

    Never got this fixed, so they ended up taking the time to move us off of nagios onto some commercial product. It's sad.

     
  • Lars Jørgensen

    Lars Jørgensen - 2010-09-18

    Okay, I got that one figured out. When nagiosgraph is running in batch mode, it discards huge amounts of data. As far as I can see, it only reads a few lines of data from the perflog.

    I switched it to immediate mode (we monitor a little less than a thousand services, so the load is not big), and everything made it into the RRDs right away.

     
  • M. Litwin

    M. Litwin - 2010-10-22

    Thank you very much lgord6, I will try that.

    We do have about 7000 services so I will have to be modest about what I add.

    BTW, did you file a bug for this behavior?

     
  • M. Litwin

    M. Litwin - 2010-10-23

    I have moved my configs over to immediate processing and I am still getting N/A for most of the data points in the RRD files. This is really confusing, as the nagiosgraph.log has data being processed that corresponds to the checks I am running. I did notice that my checks are "falling behind schedule" by about 5-10 minutes, which I am trying to resolve separately, but if I am collecting data every 10 minutes, the 5-minute averages in RRD should be showing something, right?
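
A note on the question above: in RRDtool, the heartbeat is the longest gap allowed between two updates before the interval between them is recorded as unknown. With the heartbeat of 600 reported earlier in the thread, checks that arrive even slightly more than 10 minutes apart would leave holes. The sketch below only illustrates that rule; the function name and the timestamps are made up, and it does not read a real RRD file.

```python
from typing import List, Tuple


def unknown_intervals(update_times: List[int], heartbeat: int) -> List[Tuple[int, int]]:
    """Return (previous, next) pairs of consecutive update timestamps that are
    spaced more than `heartbeat` seconds apart."""
    ordered = sorted(update_times)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > heartbeat]


HEARTBEAT = 600  # value reported for the fragmented RRD files in this thread

# Hypothetical update timestamps (seconds); two of the gaps exceed the heartbeat.
updates = [0, 300, 600, 1250, 1550, 2400]
gaps = unknown_intervals(updates, HEARTBEAT)
print(gaps)
```

If delayed checks turn out to be the cause, the usual options are to reduce check latency or to create the RRD files with a larger heartbeat. Which one applies here is not established by the thread.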

     
  • M. Litwin

    M. Litwin - 2010-10-25

    Okay, so I moved from batch to immediate, but the latency became so bad that I wasn't checking often enough to put data in there, so the graphs looked all broken up and spotty. So it looks like if batch mode will not work, I may have to just find some other way to graph, which is really a drag after all the time I have invested in trying to fix it. :(

     
  • Matthew Wall

    Matthew Wall - 2010-10-25

    try the following:

    1) configure to collect data for only one service on one host.  this means you must turn off perfdata processing for all other hosts and services.  use the 'process_perf_data' directive in a service definition template (turn perf data processing off globally, then enable it only for a single service).

    2) verify that data are being collected for the single host-service.
      - watch the nagios perfdata log file.  ensure that it is generated and that it is periodically cleared (by nagiosgraph)
      - watch the nagiosgraph log file with logging set to 5 for insert.  ensure that data are parsed correctly and no errors.
      - watch the RRD file.  ensure that data are added properly.
      - watch the timing across all three of these

    3) check the validity of the RRD file.
      - check the stepsize, heartbeat in the created file
      - if necessary, delete the RRD file and let nagiosgraph create a new one
      - watch the contents of the RRD file ('rrdtool dump filename')

    4) using nagiosgraph, graph the data for the single service on the single host.  ensure there are no gaps

    once you get it working for a single service on a single host, enable collection for all services on that host.  if that looks ok, enable for multiple hosts.

    finally, watch the nagios latencies as you do all of this, whether you use batch mode or immediate mode.  you should find that batch mode is the more robust way to go.  i have found that immediate mode is ok only for small sites - no more than a couple hundred services.  more than that and the latency interferes too much with the nagios monitoring.
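
For step 3 above, the step size and heartbeat of an existing file can be read from the output of `rrdtool info`. The sketch below parses that text. The function name `parse_rrd_info` is made up for illustration, and the sample text is an abbreviated, illustrative excerpt using the values quoted earlier in the thread, not real output captured from the poster's system.

```python
import re
from typing import Dict, Optional, Tuple

STEP_RE = re.compile(r"^step = (\d+)$")
HEARTBEAT_RE = re.compile(r"^ds\[(.+)\]\.minimal_heartbeat = (\d+)$")


def parse_rrd_info(info_text: str) -> Tuple[Optional[int], Dict[str, int]]:
    """Return (step, {data source name: heartbeat}) found in `rrdtool info` text."""
    step: Optional[int] = None
    heartbeats: Dict[str, int] = {}
    for raw in info_text.splitlines():
        line = raw.strip()
        step_match = STEP_RE.match(line)
        if step_match:
            step = int(step_match.group(1))
            continue
        hb_match = HEARTBEAT_RE.match(line)
        if hb_match:
            heartbeats[hb_match.group(1)] = int(hb_match.group(2))
    return step, heartbeats


# Illustrative, abbreviated sample of `rrdtool info` output (assumed values).
sample_info = """\
filename = "Total_Processes___procs.rrd"
step = 300
ds[procs].type = "GAUGE"
ds[procs].minimal_heartbeat = 600
"""
step, heartbeats = parse_rrd_info(sample_info)
print(step, heartbeats)
```

On a real system the text could come from something like `subprocess.run(["rrdtool", "info", path], capture_output=True, text=True).stdout`.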

     
  • Matthew Wall

    Matthew Wall - 2010-10-25

    > Okay, I got that one figured out. When nagiosgraph is running in batch mode, it discards
    > huge amounts of data. As far as I can see, it only reads a few lines of data from the perflog.

    it would really help to know why you get this behavior.  if insert.pl is reading only a few lines from perflog, then either it is dying prematurely, it is being killed, or something else odd is happening.

    i'll try to instrument the nagiosgraph scripts so that debugging is more explicit for more corner cases.  but i'm having a hard time helping with the diagnoses of this one because i cannot duplicate the spotty graph behavior.  i had it once on one of my nagios installations, but switching from immediate mode to batch mode cured that one.

     
  • Lars Jørgensen

    Lars Jørgensen - 2010-10-26

    > it would really help to know why you get this behavior.  if insert.pl is reading only a few lines from perflog, then either it is dying prematurely, it is being killed, or something else odd is happening.

    If I knew why, I wouldn't post to this list :-)

    The logs didn't show anything, just the few services being graphed showed up in the logs. My (wild) guess is that insert.pl stops processing when it encounters a line without perfdata in it.

    > i'll try to instrument the nagiosgraph scripts so that debugging is more explicit for more corner cases.  but i'm having a hard time helping with the diagnoses of this one because i cannot duplicate the spotty graph behavior.  i had it once on one of my nagios installations, but switching from immediate mode to batch mode cured that one.

    Funny, switching from batch to immediate solved the problem for us. But now our Nagios server is so heavily loaded that it falls behind. I'll try to switch over to batch processing using the strategy you outlined above.

     
  • M. Litwin

    M. Litwin - 2010-10-26

    I actually was running insert.sh manually to see the output and occasionally I saw this:

    Use of uninitialized value in concatenation (.) or string at /usr/local/nagios/nagiosgraph/etc/ngshared.pm line 2757.
    Use of uninitialized value in concatenation (.) or string at /usr/local/nagios/nagiosgraph/etc/ngshared.pm line 2757.
    Use of uninitialized value in concatenation (.) or string at /usr/local/nagios/nagiosgraph/etc/ngshared.pm line 2757.
    Use of uninitialized value in concatenation (.) or string at /usr/local/nagios/nagiosgraph/etc/ngshared.pm line 2757.
    Use of uninitialized value in concatenation (.) or string at /usr/local/nagios/nagiosgraph/etc/ngshared.pm line 2757.
    Use of uninitialized value in concatenation (.) or string at /usr/local/nagios/nagiosgraph/etc/ngshared.pm line 2757.
    Unimplemented: POSIX::free() is C-specific, stopped at (eval 6) line 53
    that might be the problem?

     
  • Matthew Wall

    Matthew Wall - 2010-10-27

    it looks like you're getting empty perfdata and/or output from one or more plugins.  modify ngshared.pm by inserting these lines into the processdata subroutine:

    before:

    my @data = split /\|\|/, $line;
    my $debug = $config{debug};
    

    after:

    my @data = split /\|\|/, $line;
    $data[0] ||= 0;
    $data[1] ||= 'undefined_host';
    $data[2] ||= 'undefined_service';
    $data[3] ||= 'undefined_output';
    $data[4] ||= 'undefined_perfdata';
    my $debug = $config{debug};
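
To check whether a perfdata log actually contains the empty fields that the patch above guards against, a quick scan along these lines may help. It is a sketch, not part of nagiosgraph. It assumes the `||`-separated five-field template shown earlier in the thread, and the host and service names in the sample are hypothetical.

```python
from collections import Counter
from typing import Iterable

# Field order of the service_perfdata_file_template quoted earlier in the thread.
FIELD_NAMES = ("time", "host", "service", "output", "perfdata")


def count_empty_fields(lines: Iterable[str]) -> Counter:
    """Count, per field name, how many lines have that field missing or blank."""
    counts: Counter = Counter()
    for line in lines:
        parts = line.rstrip("\n").split("||")
        # Pad short lines so missing trailing fields count as empty.
        parts += [""] * (len(FIELD_NAMES) - len(parts))
        for name, value in zip(FIELD_NAMES, parts):
            if not value.strip():
                counts[name] += 1
    return counts


# Hypothetical sample: one complete line, one with empty perfdata, one truncated.
sample_lines = [
    "1278975249||hostA||Disk||DISK OK||/=217MB;737;895;0;1054\n",
    "1278975250||hostA||Ping||PING OK||\n",
    "1278975251||hostB||Custom\n",
]
empty = count_empty_fields(sample_lines)
print(dict(empty))
```

An open file handle can be passed in place of `sample_lines`. In batch mode the log is emptied regularly, so scanning a copy of the file is probably safer.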
    
     
  • Lars Jørgensen

    Lars Jørgensen - 2010-10-27

    > 1) configure to collect data for only one service on one host.  this means you must turn off perfdata processing for all other hosts and services.  use the 'process_perf_data' directive in a service definition template (turn perf data processing off globally, then enable it only for a single service).

    To clarify this step:

    You still need to have "process_performance_data=1" in nagios.cfg. If you disable it globally, nothing happens.

    Then go through your templates and make sure "process_perf_data" is set to "0" for all of them. Create a new template (or modify an existing one) and have "process_perf_data" set to "1" in that template. Use that template in one or more services.

    I did this yesterday, making sure that only services providing performance data use that template. Now batch processing seems to work.

    This indicates that batch processing stops when it encounters a line without performance data in it.

     
  • Lars Jørgensen

    Lars Jørgensen - 2010-10-27

    > it looks like you're getting empty perfdata and/or output from one or more plugins.  modify ngshared.pm by inserting these lines into the processdata subroutine

    Great! That solved my problem with spotty or missing graphs running batch mode.

     
  • M. Litwin

    M. Litwin - 2010-10-27

    process_perf_data is set to 1 for all my templates already. Also process_performance_data=1 is set in nagios.cfg. I still am not getting anything. nagiosgraph and perfdata.log seem to be getting data just fine, but the RRD files simply do not get data loaded into them.

     
  • M. Litwin

    M. Litwin - 2010-10-27

    I will also add that manual runs of insert.pl don't indicate a problem, however they do not process the data to RRD either. :-/

     
  • M. Litwin

    M. Litwin - 2010-10-27

    Ok, I have only 1 service on 1 host collecting. Here is an example of what goes in nagiosgraph.log:

    srwp01mon001:var$ tail -f nagiosgraph.log
    Wed Oct 27 19:01:51 2010 insert.pl debug insert.pl processing started
    Wed Oct 27 19:01:51 2010 insert.pl debug getrules(/usr/local/nagios/nagiosgraph/etc/map)
    Wed Oct 27 19:01:51 2010 insert.pl debug readperfdata: /usr/local/nagios/var/perfdata.log
    Wed Oct 27 19:01:51 2010 insert.pl info readperfdata: empty perflog /usr/local/nagios/var/perfdata.log
    Wed Oct 27 19:01:51 2010 insert.pl debug insert.pl processing complete
    Wed Oct 27 19:02:21 2010 insert.pl debug insert.pl processing started
    Wed Oct 27 19:02:21 2010 insert.pl debug getrules(/usr/local/nagios/nagiosgraph/etc/map)
    Wed Oct 27 19:02:21 2010 insert.pl debug readperfdata: /usr/local/nagios/var/perfdata.log
    Wed Oct 27 19:02:21 2010 insert.pl info readperfdata: empty perflog /usr/local/nagios/var/perfdata.log
    Wed Oct 27 19:02:21 2010 insert.pl debug insert.pl processing complete
    Wed Oct 27 19:02:51 2010 insert.pl debug insert.pl processing started
    Wed Oct 27 19:02:51 2010 insert.pl debug getrules(/usr/local/nagios/nagiosgraph/etc/map)
    Wed Oct 27 19:02:51 2010 insert.pl debug readperfdata: /usr/local/nagios/var/perfdata.log
    Wed Oct 27 19:02:51 2010 insert.pl debug processdata(1)
    Wed Oct 27 19:02:51 2010 insert.pl debug getdebug(insert, srwp01sws001, Apache_Status)
    Wed Oct 27 19:02:51 2010 insert.pl debug getdebug found debug_insert
    Wed Oct 27 19:02:51 2010 insert.pl debug processdata data = [
      '1288206145',
      'srwp01sws001',
      'Apache_Status',
      'OK 0.053297 seconds response time. Idle 224, busy 32, open slots 1792',
      '224;0;0;1;31;0;0;0;0;0;1792'
    ];
    Wed Oct 27 19:02:51 2010 insert.pl warn perfdata not recognized:
    hostname:srwp01sws001
    servicedesc:Apache_Status
    output:OK 0.053297 seconds response time. Idle 224, busy 32, open slots 1792
    perfdata:224;0;0;1;31;0;0;0;0;0;1792
    Wed Oct 27 19:02:51 2010 insert.pl debug insert.pl processing complete
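
One possible reading of the "perfdata not recognized" warning above: the Apache_Status perfdata is a bare list of numbers, while the usual Nagios plugin convention is `label=value[UOM];warn;crit;min;max`, as in the `/=217MB;737;895;0;1054` example earlier in the thread. nagiosgraph decides what it recognizes through its map rules, which this sketch does not replicate. It only checks the general convention, with a deliberately simple pattern that does not handle quoted labels containing spaces.

```python
import re

# Simplified pattern for one perfdata item: label=value[UOM] plus up to four
# optional ';'-separated fields (warn, crit, min, max).
PERF_ITEM = re.compile(r"^'?[^'=]+'?=-?\d+(\.\d+)?[A-Za-z%]*(;[^;]*){0,4}$")


def looks_like_plugin_perfdata(perfdata: str) -> bool:
    """Return True if every whitespace-separated item looks like label=value."""
    items = perfdata.split()
    return bool(items) and all(PERF_ITEM.match(item) for item in items)


# Both strings are taken from the log excerpts quoted in this thread.
disk_ok = looks_like_plugin_perfdata("/=217MB;737;895;0;1054")
apache_ok = looks_like_plugin_perfdata("224;0;0;1;31;0;0;0;0;0;1792")
print(disk_ok, apache_ok)
```

If the plugin cannot be changed to emit labels, a custom rule in the map file would presumably be needed for this service.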

     