I have set up nagiosgraph, and it appears that only a small amount of the collected data is being stored in RRD, in a completely random distribution. I feel like I had this problem before, where the perfdata was being purged sporadically so that the 30-second runs of insert.pl were only getting a fraction of all the collected data. I was looking at an old thread and was advised to remove the command that was calling insert.pl with every check, but I am not sure how to do that. I tried it, and when I did, no data was being collected. Is there a guide on how to do this properly? Maybe I misread the INSTALL doc, but if this is an obvious problem, please tell me what I did wrong and I will do it again. Thanks so much.
check your nagios configuration files. ensure that you are using only the batch mode configuration as described in the 'Configuring Data Processing' section of INSTALL.
I am doing batch mode and as instructed I have service_perfdata_file commented out of nagios.cfg but it is behaving like something else is doing the processing. Very odd.
I verified that nagios.cfg and nagiosgraph.conf are set up according to INSTALL. service_perfdata_command is definitely not defined in nagios.cfg; I went through it 3 times. But I am still only getting partial RRD data, like something else is processing the perf data. Is there anything else that might be doing this?
Me again. This is really confusing. nagiosgraph is definitely behaving like service_perfdata_command is defined in nagios.cfg, since I am only getting fragments of my data parsed, but I made sure that it was not defined. Is there any other possible reason data is being expunged before batch can get to it? I am totally stumped.
what is the service_perfdata_file_processing_interval?
what are the heartbeat and step for one of the fragmented rrd files?
is nagios invoking insert.pl? if so, how often?
are data going into the nagios perfdata.log file?
what do you see in the nagiosgraph log from insert.pl when you set debug_insert=5?
what configuration files have you modified, and how?
what perl scripts have you modified, and how?
service_perfdata_file_processing_interval=30
heartbeat is 600, step is 300
insert.pl is only called by process-service-perfdata which is only specified in nagios.cfg as: "service_perfdata_file_processing_command=process-service-perfdata"
perfdata.log fills up merrily
at debug=5, an entry looks like this:
All I have modified is the map file. Otherwise I just configured the latest tarball from sourceforge for my environment. Oh, I did change this in the cgis:
use lib '/usr/local/nagios/nagiosgraph/etc';
That's pretty much it, I think.
Okay, I am pretty stumped here.
Let me review what I have in nagios.cfg:
This is what my command.cfg looks like
Data is going into /usr/local/nagios/var/perfdata.log, and I am seeing activity in nagiosgraph.log, as I showed a smidge of above, but it clearly is not getting everything.
I wish I knew more about exactly how the data gets taken from nagios in batch mode so I could provide better forensics, but I am at a loss.
Since I have all but given up on getting this working, can someone point me to the 1.3.1 documentation so I can just revert?
I seem to be having the exact same problem. Did you solve it? Did reverting to 1.3.1 work?
Never got this fixed, so they ended up taking the time to move us off of nagios onto some commercial one. It's sad.
Okay, I got that one figured out. When nagiosgraph is running in batch mode, it discards huge amounts of data. As far as I can see, it only reads a few lines of data from the perflog.
I switched it to immediate mode (we monitor a little less than a thousand services, so the load is not big), and everything made it into the RRDs right away.
Thank you very much lgord6, I will try that.
We do have about 7000 services so I will have to be modest about what I add.
BTW, did you file a bug for this behavior?
I have moved my configs over to immediate processing and I am still getting N/A for most of the data points in the RRD files. This is really confusing, as the nagiosgraph.log has data being processed that corresponds to the checks I am running. I did notice that my checks are "falling behind schedule" by about 5-10 minutes, which I am trying to resolve separately, but if I am collecting data every 10 minutes, the 5 minute averages in RRD should be showing something, right?
Okay, so I moved from batch to immediate, but the latency became so bad that I wasn't checking often enough to put data in there, so the graphs looked all broken up and spotty. So it looks like if batch mode will not work, I may have to just find some other way to graph, which is really a drag after all the time I have invested in trying to fix it. :(
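On the 5 minute average question, here is a rough sketch in Python of how the heartbeat decides whether an update counts. It assumes the usual RRDtool rule that when more than `heartbeat` seconds pass between two updates, the interval in between is recorded as unknown. The function name and the sample timestamps are made up for illustration; the heartbeat of 600 is the value reported earlier in this thread.

```python
def unknown_gaps(update_times, heartbeat):
    """Return (start, end) pairs of consecutive updates that are more than
    `heartbeat` seconds apart. RRDtool treats the data in such an interval
    as unknown, which shows up as N/A or a gap in the graph."""
    gaps = []
    for prev, cur in zip(update_times, update_times[1:]):
        if cur - prev > heartbeat:
            gaps.append((prev, cur))
    return gaps


HEARTBEAT = 600  # seconds, as reported for the RRD files in this thread

# Hypothetical check times: nominally every 10 minutes, but two checks
# arrive late because the scheduler is falling behind.
updates = [0, 600, 1230, 1800, 2460]

for start, end in unknown_gaps(updates, HEARTBEAT):
    print(f"{start} -> {end}: {end - start}s apart, exceeds heartbeat")
```

If checks land even slightly more than 10 minutes apart, a heartbeat of 600 leaves no margin, so a larger heartbeat or a shorter check interval may be worth trying.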
try the following:
1) configure to collect data for only one service on one host. this means you must turn off perfdata processing for all other hosts and services. use the 'process_perf_data' directive in a service definition template (turn perf data processing off globally, then enable it only for a single service).
2) verify that data are being collected for the single host-service.
- watch the nagios perfdata log file. ensure that it is generated and that it is periodically cleared (by nagiosgraph)
- watch the nagiosgraph log file with logging set to 5 for insert. ensure that data are parsed correctly and no errors.
- watch the RRD file. ensure that data are added properly.
- watch the timing across all three of these
3) check the validity of the RRD file.
- check the stepsize, heartbeat in the created file
- if necessary, delete the RRD file and let nagiosgraph create a new one
- watch the contents of the RRD file ('rrdtool dump filename')
4) using nagiosgraph, graph the data for the single service on the single host. ensure there are no gaps
once you get it working for a single service on a single host, enable collection for all services on that host. if that looks ok, enable for multiple hosts.
finally, watch the nagios latencies as you do all of this, whether you use batch mode or immediate mode. you should find that batch mode is the more robust way to go. i have found that immediate mode is ok only for small sites - no more than a couple hundred services. more than that and the latency interferes too much with the nagios monitoring.
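For step 3, a small sketch of pulling the step and heartbeat out of `rrdtool info` output. It assumes the output contains lines of the form `step = 300` and `ds[NAME].minimal_heartbeat = 600`; the function name and the sample text are illustrative only and do not come from a real file.

```python
import re


def step_and_heartbeats(info_text):
    """Extract the step and per-data-source heartbeats from the text
    printed by 'rrdtool info FILE' (passed in as a string)."""
    step = None
    heartbeats = {}
    for line in info_text.splitlines():
        line = line.strip()
        m = re.match(r"^step = (\d+)$", line)
        if m:
            step = int(m.group(1))
            continue
        m = re.match(r"^ds\[(.+?)\]\.minimal_heartbeat = (\d+)$", line)
        if m:
            heartbeats[m.group(1)] = int(m.group(2))
    return step, heartbeats


# Illustrative sample, shaped like 'rrdtool info' output (not a real file).
sample = """\
filename = "example.rrd"
step = 300
ds[data].type = "GAUGE"
ds[data].minimal_heartbeat = 600
"""

step, heartbeats = step_and_heartbeats(sample)
print(step, heartbeats)
```

To run it against a real file, the text could come from something like `subprocess.run(["rrdtool", "info", path], capture_output=True, text=True).stdout`.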
> Okay, I got that one figured out. When nagiosgraph is running in batch mode, it discards
> huge amounts of data. As far as I can see, it only reads a few lines of data from the perflog.
it would really help to know why you get this behavior. if insert.pl is reading only a few lines from perflog, then either it is dying prematurely, it is being killed, or something else odd is happening.
i'll try to instrument the nagiosgraph scripts so that debugging is more explicit for more corner cases. but i'm having a hard time helping with the diagnoses of this one because i cannot duplicate the spotty graph behavior. i had it once on one of my nagios installations, but switching from immediate mode to batch mode cured that one.
If I knew why, I wouldn't post to this list :-)
The logs didn't show anything, just the few services being graphed showed up in the logs. My (wild) guess is that insert.pl stops processing when it encounters a line without perfdata in it.
Funny, switching from batch to immediate solved the problem for us. But now our Nagios server is so heavily loaded that it falls behind. I'll try to switch over to batch processing using the strategy you outlined above.
I actually was running insert.sh manually to see the output and occasionally I saw this:
Use of uninitialized value in concatenation (.) or string at /usr/local/nagios/nagiosgraph/etc/ngshared.pm line 2757.
Use of uninitialized value in concatenation (.) or string at /usr/local/nagios/nagiosgraph/etc/ngshared.pm line 2757.
Use of uninitialized value in concatenation (.) or string at /usr/local/nagios/nagiosgraph/etc/ngshared.pm line 2757.
Use of uninitialized value in concatenation (.) or string at /usr/local/nagios/nagiosgraph/etc/ngshared.pm line 2757.
Use of uninitialized value in concatenation (.) or string at /usr/local/nagios/nagiosgraph/etc/ngshared.pm line 2757.
Use of uninitialized value in concatenation (.) or string at /usr/local/nagios/nagiosgraph/etc/ngshared.pm line 2757.
Unimplemented: POSIX::free() is C-specific, stopped at (eval 6) line 53
that might be the problem?
it looks like you're getting empty perfdata and/or output from one or more plugins. modify ngshared.pm by inserting these lines into the processdata subroutine:
before:
after:
To clarify this step:
You still need to have "process_performance_data=1" in nagios.cfg. If you disable it globally, nothing happens.
Then go through your templates and make sure "process_perf_data" is set to "0" for all of them. Create a new template (or modify an existing one) and have "process_perf_data" set to "1" in that template. Use that template in one or more services.
I did this yesterday, making sure that only services providing performance data use that template. Now batch processing seems to work.
This indicates that batch processing stops when it encounters a line without performance data in it.
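To test that guess, one could list the lines in the perfdata log that carry no performance data. A sketch in Python, assuming the log uses a five-field layout (time, host, service, output, perfdata) separated by `||`; the separator, the function name, and the sample lines are assumptions, so adjust them to whatever your service_perfdata_file_template actually produces.

```python
def lines_without_perfdata(lines, sep="||"):
    """Return (host, service) for every log line whose fifth field,
    the perfdata, is empty or missing."""
    missing = []
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        if len(fields) < 3:
            continue  # not enough fields to identify host and service
        perfdata = fields[4].strip() if len(fields) > 4 else ""
        if not perfdata:
            missing.append((fields[1], fields[2]))
    return missing


# Made-up sample lines; hostnames and services are placeholders.
sample = [
    "1288206145||web01||HTTP||HTTP OK||time=0.05s;;;0.00",
    "1288206150||web01||SSH||SSH OK||",
    "1288206155||db01||PING||PING OK||rta=0.3ms;100;500;0",
]

for host, service in lines_without_perfdata(sample):
    print(f"no perfdata: {host} / {service}")
```

Since nagiosgraph clears the log after each batch run, it is probably easiest to run this on a copy of perfdata.log taken between runs.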
Great! That solved my problem with spotty or missing graphs running batch mode.
process_perf_data is set to 1 for all my templates already. Also process_performance_data=1 is set in nagios.cfg. I still am not getting anything. nagiosgraph and perfdata.log seem to be getting data just fine, but the RRD files simply do not get data loaded into them.
I will also add that manual runs of insert.pl don't indicate a problem, however they do not process the data to RRD either. :-/
Ok, I have only 1 service on 1 host collecting. Here is an example of what goes in nagiosgraph.log:
srwp01mon001:var$ tail -f nagiosgraph.log
Wed Oct 27 19:01:51 2010 insert.pl debug insert.pl processing started
Wed Oct 27 19:01:51 2010 insert.pl debug getrules(/usr/local/nagios/nagiosgraph/etc/map)
Wed Oct 27 19:01:51 2010 insert.pl debug readperfdata: /usr/local/nagios/var/perfdata.log
Wed Oct 27 19:01:51 2010 insert.pl info readperfdata: empty perflog /usr/local/nagios/var/perfdata.log
Wed Oct 27 19:01:51 2010 insert.pl debug insert.pl processing complete
Wed Oct 27 19:02:21 2010 insert.pl debug insert.pl processing started
Wed Oct 27 19:02:21 2010 insert.pl debug getrules(/usr/local/nagios/nagiosgraph/etc/map)
Wed Oct 27 19:02:21 2010 insert.pl debug readperfdata: /usr/local/nagios/var/perfdata.log
Wed Oct 27 19:02:21 2010 insert.pl info readperfdata: empty perflog /usr/local/nagios/var/perfdata.log
Wed Oct 27 19:02:21 2010 insert.pl debug insert.pl processing complete
Wed Oct 27 19:02:51 2010 insert.pl debug insert.pl processing started
Wed Oct 27 19:02:51 2010 insert.pl debug getrules(/usr/local/nagios/nagiosgraph/etc/map)
Wed Oct 27 19:02:51 2010 insert.pl debug readperfdata: /usr/local/nagios/var/perfdata.log
Wed Oct 27 19:02:51 2010 insert.pl debug processdata(1)
Wed Oct 27 19:02:51 2010 insert.pl debug getdebug(insert, srwp01sws001, Apache_Status)
Wed Oct 27 19:02:51 2010 insert.pl debug getdebug found debug_insert
Wed Oct 27 19:02:51 2010 insert.pl debug processdata data = [
'1288206145',
'srwp01sws001',
'Apache_Status',
'OK 0.053297 seconds response time. Idle 224, busy 32, open slots 1792',
'224;0;0;1;31;0;0;0;0;0;1792'
];
Wed Oct 27 19:02:51 2010 insert.pl warn perfdata not recognized:
hostname:srwp01sws001
servicedesc:Apache_Status
output:OK 0.053297 seconds response time. Idle 224, busy 32, open slots 1792
perfdata:224;0;0;1;31;0;0;0;0;0;1792
Wed Oct 27 19:02:51 2010 insert.pl debug insert.pl processing complete
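The warning near the end of that log is worth a closer look. The perfdata string is a bare list of numbers separated by semicolons, with no label=value pairs, whereas the Nagios plugin guidelines describe perfdata as 'label'=value[UOM];[warn];[crit];[min];[max]. Below is a rough Python check for that shape. The regex is a simplification written for illustration (it does not handle quoted labels containing spaces) and is not how nagiosgraph itself matches data, which is driven by the rules in the map file.

```python
import re

# Simplified pattern for one perfdata item: label=value[UOM], optionally
# followed by up to four ';'-separated fields (warn, crit, min, max).
ITEM = re.compile(r"^[^\s=]+=-?[\d.]+[^\s;]*(?:;[^\s;]*){0,4}$")


def looks_like_standard_perfdata(perfdata):
    """True if every whitespace-separated item has the label=value form."""
    items = perfdata.split()
    return bool(items) and all(ITEM.match(item) for item in items)


# The perfdata string from the log above.
print(looks_like_standard_perfdata("224;0;0;1;31;0;0;0;0;0;1792"))

# A hypothetical labelled version of the same kind of data.
print(looks_like_standard_perfdata("idle=224 busy=32"))
```

If the plugin cannot be changed to emit labels, a custom rule in the map file that matches this particular output would presumably be needed.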