From: Kelly J. <kel...@gm...> - 2008-10-06 19:34:08
|
What's the best way to simulate (not schedule) downtime in nagios?

I want to "pretend" a service is down for a certain amount of time to see what alerts nagios sends, etc.

I've come up w/ two bad ways to do this:

% Edit the config file to change the test to "check_dummy". I want to run these "fire drills" via cron, and editing a file and restarting nagios seems a little ugly.

% Submit a passive check saying the service is down, and reschedule the next check 4 hours later, so the service is 'down' for 4 hours. This can be done using the nagios named pipe, so it's easy to cron. Problem: doing things this way suppresses the alerts (when you don't test a service, it doesn't send an alert).

Thoughts?

--
We're just a Bunch Of Regular Guys, a collective group that's trying to understand and assimilate technology. We feel that resistance to new ideas and technology is unwise and ultimately futile. |
From: Andy S. <and...@ne...> - 2008-10-06 20:09:38
|
Hi Kelly,

When I've done this in the past for network services (e.g. http/smtp checks), I've actually blocked the target port on the Nagios server, which gives a better simulation of the service being down (e.g. for HTTP checks, block the Nagios server's outbound port 80). This works for us because, as well as the router firewalls, each server runs a local software firewall, so it's easy to block outbound packets to a particular port on the Nagios server without affecting the service itself, simulating the effect of a network/service failure.

However, when it comes to checks such as disk space, it can be a bit trickier! I've done things like changing the thresholds for a failure (e.g. if disk space is currently 15% capacity, I set my warning alert to be 20%, restart Nagios and wait for the alerts to come, and the same for critical, then reset back to 90% when complete).

I have also done as you suggested: change the service's check and retry intervals in Nagios to something lengthy (e.g. an hour), then submit a passive 'failure' check result and wait until Nagios re-checks the service. This method also tests how Nagios alerts you when the service returns to OK.

Hope this helps, it'd be interesting to hear how/if others do it!

Andy |
From: Hugo v. d. K. <hvd...@va...> - 2008-10-07 00:09:17
|
Kelly Jones wrote:
> What's the best way to simulate (not schedule) downtime in nagios?

Why do you want to do this in a live environment? I think you should consider these points:

1. Duplicate your production environment (nagios server) into a test environment and play all you want.
2. Tell us what you suspect is not working and how you think this simulation will help you solve it.

Hugo.

--
hvd...@va... http://hugo.vanderkooij.org/
PGP/GPG? Use: http://hugo.vanderkooij.org/0x58F19981.asc |
From: Tom T. <th...@gm...> - 2008-10-07 02:17:24
|
On Oct 06 18:57, Kelly Jones wrote:
> Thanks, Tom.
>
> Yes, I'm trying to simulate a host/service outage, not scheduled downtime.
>
> The problem w/ submitting a passive check is that the next ACTIVE check will
> invalidate it. Example: you tell nagios that machine foo is down. That's soft
> alert 1, not enough to generate any emails. Nagios then active checks foo and
> sees that it's up. Of course, you can submit another passive check, but
> you'll ping-pong (flap) between up and down states.

OK, so it sounds like you want to be able to have Nagios temporarily stop managing the service check scheduling for this service, long enough for you to inject some bogus results.

Seems like rescheduling the next active check (SCHEDULE_FORCED_SVC_CHECK) would do the right thing as far as pushing the next scheduled check into the future. Or maybe you want to disable active checks for the service (DISABLE_SVC_CHECK), run your simulation, and then re-enable them...?

-tt |
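The two options Tom mentions can both be driven through the external command file. Here is a minimal sketch, assuming the usual `[epoch] COMMAND;arg;arg` line format and using the `SCHEDULE_FORCED_SVC_CHECK`, `DISABLE_SVC_CHECK` and `ENABLE_SVC_CHECK` commands. The helper names are made up for illustration, and the path of the command file depends on your installation, so it is passed in rather than hard-coded.

```python
import time


def format_command(name, *args, now=None):
    """Build one line in the external-command format: [epoch] NAME;arg;arg."""
    timestamp = int(time.time() if now is None else now)
    fields = ";".join(str(field) for field in (name, *args))
    return f"[{timestamp}] {fields}\n"


def postpone_active_check(host, service, delay_seconds, now=None):
    """Option 1: push the next active check delay_seconds into the future."""
    now = int(time.time() if now is None else now)
    return format_command(
        "SCHEDULE_FORCED_SVC_CHECK", host, service, now + delay_seconds, now=now
    )


def disable_active_checks(host, service, now=None):
    """Option 2, start of the drill: stop active checks for this service."""
    return format_command("DISABLE_SVC_CHECK", host, service, now=now)


def enable_active_checks(host, service, now=None):
    """Option 2, end of the drill: turn active checks back on."""
    return format_command("ENABLE_SVC_CHECK", host, service, now=now)


def send(command_line, command_file):
    """Write one command line to the Nagios command file (a named pipe)."""
    with open(command_file, "w") as pipe:
        pipe.write(command_line)
```

For a cron-driven drill, one job could call `send(disable_active_checks("foo", "HTTP"), path)` at the start and a second job could send the matching `enable_active_checks` line at the end, where `"foo"`, `"HTTP"` and `path` are placeholders for your own host, service and command file.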
From: Tom T. <th...@gm...> - 2008-10-07 02:19:45
|
On Oct 06 12:29, Kelly Jones wrote:
> What's the best way to simulate (not schedule) downtime in nagios?
>
> I want to "pretend" a service is down for a certain amount of time to
> see what alerts nagios sends, etc.

Just to clarify, are you trying to simulate a service outage (as opposed to simulating a scheduled downtime) so you can test alerts, and perhaps notifications, in order to validate your configuration?

> I've come up w/ two bad ways to do this:
>
> % Edit the config file to change the test to "check_dummy". I want to
> run these "fire drills" via cron, and editing a file and restarting
> nagios seems a little ugly.
>
> % Submit a passive check saying the service is down, and reschedule
> the next check 4 hours later, so the service is 'down' for 4
> hours. This can be done using the nagios named pipe, so it's easy to
> cron. Problem: doing things this way suppresses the alerts (when you
> don't test a service, it doesn't send an alert).
>
> Thoughts?

I use something similar to the second method to do ad hoc validation of alerts/notifications, by submitting passive results via an external command, though without diddling the service check scheduling. I'm a little confused by your last statement though...

If you're only submitting a single passive check and then rescheduling the next check, of course there will be no alerts (and you'll likely never reach $max_check_attempts) - is there some reason you can't submit multiple passive check results?

-tt |
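Tom's point about a single passive result never reaching $max_check_attempts can be illustrated with a toy model. This is only a simplified sketch of the SOFT/HARD behaviour described in this thread, not Nagios's actual logic: it ignores flap detection, notification intervals, and changes between different non-OK states. The function name `drill_timeline` is hypothetical.

```python
def drill_timeline(return_codes, max_check_attempts):
    """Return (state, state_type, notification) for each submitted result.

    return_codes uses plugin conventions: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.
    """
    names = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}
    timeline = []
    attempt = 0       # consecutive non-OK results seen so far
    notified = False  # True once a problem notification has gone out
    for code in return_codes:
        if code == 0:
            # A recovery notification only makes sense after a HARD problem.
            notification = "RECOVERY" if notified else None
            state_type = "HARD" if (notified or attempt == 0) else "SOFT"
            attempt, notified = 0, False
        else:
            attempt = min(attempt + 1, max_check_attempts)
            if attempt < max_check_attempts:
                state_type, notification = "SOFT", None
            else:
                state_type = "HARD"
                notification = None if notified else "PROBLEM"
                notified = True
        timeline.append((names[code], state_type, notification))
    return timeline
```

With `max_check_attempts` set to 3, a single CRITICAL result stays SOFT and notifies nobody, while three in a row reach HARD and trigger the problem notification, which is why one passive result followed by a long reschedule stays silent.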
From: Kelly J. <kel...@gm...> - 2008-10-07 02:18:18
|
Thanks, Tom.

Yes, I'm trying to simulate a host/service outage, not scheduled downtime.

The problem w/ submitting a passive check is that the next ACTIVE check will invalidate it. Example: you tell nagios that machine foo is down. That's soft alert 1, not enough to generate any emails. Nagios then active checks foo and sees that it's up. Of course, you can submit another passive check, but you'll ping-pong (flap) between up and down states. |
From: Kelly J. <kel...@gm...> - 2008-10-08 03:30:36
|
On 10/6/08, Tom Throckmorton <th...@gm...> wrote:
> OK, so it sounds like you want to be able to have Nagios temporarily stop
> managing the service check scheduling for this service, long enough for you
> to inject some bogus results. Seems like rescheduling the next active check
> (SCHEDULE_FORCED_SVC_CHECK) would do the right thing as far as pushing the
> next scheduled check into the future. Or maybe you want to disable active
> checks for the service (DISABLE_SVC_CHECK), run your simulation, and then
> re-enable them...?

I may've done it wrong, but SCHEDULE_FORCED_SVC_CHECK means that nagios won't send any alerts at all.

Basically, messing with nagios' check schedule also screws up its notification schedule. And, since I'm testing notifications, that's not useful.

I've written several nagios tests myself, and they're all in one Perl program (each subroutine = one test). For these, simulating downtime is easy. The script reads downtime from a file and automatically exits w/ 1 or 2 during downtime instead of running the subroutine.

I'm tempted to run ALL nagios tests in a wrapper, but that seems so ugly for such a simple? problem. |
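The wrapper idea Kelly describes might look roughly like the following. Kelly's scripts are Perl and the post doesn't show the downtime file's format, so this is a Python sketch under an assumed, hypothetical format: one window per line, written as `start_epoch end_epoch exit_code`. The function names are likewise invented for illustration.

```python
import subprocess
import time


def simulated_exit_code(schedule_lines, now):
    """Return the exit code to fake at time `now`, or None outside any window.

    Assumed (hypothetical) line format: "<start_epoch> <end_epoch> <exit_code>".
    Blank lines and lines starting with '#' are ignored.
    """
    for line in schedule_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        start, end, code = line.split()
        if int(start) <= now < int(end):
            return int(code)
    return None


def run_check(schedule_path, plugin_argv):
    """Fake a failure during a drill window, otherwise run the real plugin."""
    try:
        with open(schedule_path) as schedule:
            lines = schedule.readlines()
    except FileNotFoundError:
        lines = []  # no schedule file means no drill is planned
    code = simulated_exit_code(lines, time.time())
    if code is not None:
        print(f"FIRE DRILL: simulated failure (exit {code})")
        return code
    # Outside a drill window, pass the real plugin's exit code through unchanged.
    return subprocess.call(plugin_argv)
```

Because the active check itself reports the failure, check scheduling and notifications run exactly as they would in a real outage. The cost is the one Kelly notes: every check command has to be routed through the wrapper.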
From: Tom T. <th...@gm...> - 2008-10-08 14:29:27
|
On Oct 07 20:30, Kelly Jones wrote:
> I may've done it wrong, but SCHEDULE_FORCED_SVC_CHECK means that
> nagios won't send any alerts at all.

This command manipulates the check scheduling queue for _active_ checks; it has no direct impact on alerts:

http://www.nagios.org/developerinfo/externalcommands/commandinfo.php?command_id=129

...so forcing a service check to some time in the future will delay the active checks, but you can still submit passive checks (and generate alerts, assuming the result you're submitting is different than the current state of the service).

> Basically, messing with nagios' check schedule also screws up its
> notification schedule.
>
> And, since I'm testing notifications, that's not useful.

I must be missing something here. If, for example, I do the following for a given service which is currently OK, and for which active checks are normally accepted:

- delay the next check via SCHEDULE_FORCED_SVC_CHECK to now + 1 hour
- submit a passive result with a state of CRITICAL (PROCESS_SERVICE_CHECK_RESULT) x $max_check_attempts

As expected, I see:

- an alert for each result I've submitted
- the status changes to SOFT/CRITICAL after the first result, and HARD/CRITICAL after $max_check_attempts has been reached
- a notification about the problem

The next scheduled check remains at the time + 1 hour. If I submit an OK result, the status changes from CRITICAL to OK, and I get a recovery notification. And I can repeat this as often as I like within the time before the next scheduled active check.

How is this different than what you're trying to achieve?

-tt |
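The sequence Tom describes could be scripted along these lines. This is a sketch rather than a tested recipe: it assumes passive checks are accepted for the service, and the function names, plugin output text and command file path are placeholders chosen for illustration.

```python
import time


def fire_drill(host, service, max_check_attempts, delay_seconds=3600, now=None):
    """Return the external-command lines for one drill, in submission order."""
    now = int(time.time() if now is None else now)
    target = f"{host};{service}"
    # Step 1: push the next active check into the future (one hour by default).
    lines = [f"[{now}] SCHEDULE_FORCED_SVC_CHECK;{target};{now + delay_seconds}\n"]
    # Step 2: submit CRITICAL (return code 2) once per allowed check attempt,
    # so the service goes from SOFT to HARD and a notification is sent.
    critical = (
        f"[{now}] PROCESS_SERVICE_CHECK_RESULT;{target};2;"
        "fire drill: simulated CRITICAL\n"
    )
    lines.extend([critical] * max_check_attempts)
    return lines


def recovery(host, service, now=None):
    """Return the line that ends the drill by submitting an OK result."""
    now = int(time.time() if now is None else now)
    return f"[{now}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};0;fire drill over\n"


def submit(lines, command_file):
    """Write command lines to the Nagios command file (a named pipe)."""
    # Opening a named pipe for writing blocks until a reader (Nagios) has it open.
    with open(command_file, "w") as pipe:
        pipe.writelines(lines)
```

A cron job could call `submit(fire_drill("foo", "HTTP", 3), path)` to start the drill and `submit([recovery("foo", "HTTP")], path)` to end it before the postponed active check runs. `"foo"`, `"HTTP"`, `3` and `path` stand in for your own host, service, `max_check_attempts` value and command file location.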