Re: [Nagios-devel] Re: Perceived problem with host checks

It is my personal opinion that if a host should be considered to be 
in an "UP" state, at least one of the services you're monitoring 
should be in an OK state (or at least change between non-OK states 
when the host recovers).  That model should work for most everyone 
using Nagios.

However, if you really want the status of the host re-verified after 
service checks even if all services associated with the host stay in 
a non-OK state when it recovers, enable the aggressive host check 
option in the main config file.  Performance will suffer if you 
enable this option, but that's a tradeoff you'll have to be willing 
to accept.
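
For reference, a minimal sketch of what enabling that option looks like.
The directive name matches the use_aggressive_host_checking variable
tested in the checks.c code; the file location is an assumption and
varies by installation.

```
# nagios.cfg (main config file; path depends on your install)
use_aggressive_host_checking=1
```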

A snippet from base/checks.c starting at line 888:
---

		/*******************************************/
		/******* SERVICE CHECK PROBLEM LOGIC *******/
		/*******************************************/

		/* hey, something's not working quite like it should... */
		else{

			/* reset the recovery notification flag (it may get set again though) */
			temp_service->no_recovery_notification=FALSE;

			/* check the route to the host if its supposed to be up right now... */
			if(temp_host->status==HOST_UP)
				route_result=verify_route_to_host(temp_host);

			/* else the host is either down or unreachable, so recheck it if necessary */
			else{

				/* we're using agressive host checking, so really do recheck the host... */
				if(use_aggressive_host_checking==TRUE)
					route_result=verify_route_to_host(temp_host);

				/* the service wobbled between non-OK states, so check the host... */
				else if(state_change==TRUE && temp_service->last_hard_state!=STATE_OK)
					route_result=verify_route_to_host(temp_host);

				/* else fake the host check, but (possibly) resend host notifications to contacts... */
				else{

---
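
Boiled down, the decision in that snippet can be sketched as a small
standalone function.  This is only an illustration, not the actual
Nagios code: host_check_would_run() and the constant values are made up
for the example, and the real code calls verify_route_to_host() rather
than returning a flag.

```c
/* Hypothetical stand-ins for the Nagios constants; values are illustrative. */
#define TRUE 1
#define FALSE 0
#define HOST_UP 0
#define HOST_DOWN 1
#define STATE_OK 0
#define STATE_WARNING 1
#define STATE_CRITICAL 2

/* Returns TRUE if a real host check would run after a failed service
   check, FALSE if the previous host state would simply be reused. */
int host_check_would_run(int host_status, int aggressive,
                         int state_change, int last_hard_state)
{
	/* host is believed to be up, so verify the route to it */
	if(host_status==HOST_UP)
		return TRUE;

	/* aggressive host checking: always recheck a down/unreachable host */
	if(aggressive==TRUE)
		return TRUE;

	/* the service moved between non-OK states, so recheck the host */
	if(state_change==TRUE && last_hard_state!=STATE_OK)
		return TRUE;

	/* otherwise the host check is faked */
	return FALSE;
}
```

With a host already marked down and a service that stays CRITICAL with
no state change, host_check_would_run(HOST_DOWN, FALSE, FALSE,
STATE_CRITICAL) returns FALSE; passing TRUE for the aggressive argument
makes it return TRUE.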

On 18 Sep 2002 at 21:44, SyBase wrote:

> you are completely missing my point.. Ping (icmp) is controlled by the 
> OS. If your OS is down (i.e. never booted) icmp will not come back, and 
> neither will your service. However, if the host (operating system) 
> booted but for some reason services were never started, then the problem 
> is completely different. If all nagios cares about is services, and a 
> box with no available monitored services is considered down, then what 
> is the point of the host check to begin with? Let's remove that 
> completely and make people add their own icmp service if they want to 
> check that. I hope you see my point. I would think accuracy in showing 
> what is really happening (i.e. not saying host down when the host is 
> really up) would be very important. If you do not agree, that is 
> fine.. Simply a suggestion.
> 
> Kenneth.ray wrote:
> 
> > Dear Sir,
> > Thank you for your email; however, I believe your test is flawed.
> > You are working from the premise that the host is the important
> > piece of your network, when the service running on the host is what
> > really matters. What good is a host that has no services available?
> > Ping is a really basic service and only helps in determining that
> > the network interface is accessible. In some cases it is quite
> > possible for ping to work while the box is down. If this is a real
> > problem for you, add a separate service called "alive", "pingable",
> > or something similar, and run the ping as a service. This will
> > change your host to "up" even if no other services are available
> > from it. But again, in my own humble opinion, this proves nothing
> > other than that you can ping the interface. You're not even pinging
> > the server; you're pinging the network card that is hooked to the
> > box.
> >
> > Think of it this way: a host is not an actual entity but a
> > container/conduit for your services. Whether or not the entity
> > serving your services is available, the real issue is not the
> > entity but the service provided by the container. IMHO you can
> > replace the word "host" with "container" and the logic of Netsaint
> > still holds up. Netsaint is service based, not server based, and
> > uses the logic of "what good is a host without something running on
> > it?" Ping is a good way to determine whether a server's network
> > card is accessible from the network, but that is all it proves. I
> > have personally had a server that was pingable but had no services
> > available because the system was pegged. The network interface was
> > the only thing responding on one system, and once I was physically
> > in front of the system I could see why.
> >
> > So I have to strongly disagree with the statement "the logic is
> > flawed". Using your same situation: a particular problem brings
> > down the box and causes a reboot. The service you run on this box
> > never comes up, and it's the only one you are monitoring. But let's
> > say the host reports as up because you can ping it (because you set
> > up another service called "alive", as in my suggestion). Let's say
> > this system is your FTP server. FTP is not working; you try
> > telnetting, and that is not working either.
> >
> > You ping.
> > You ping.
> > You ping.
> >
> > Nice!!! Is the box up? Are the services available? Can you TRULY
> > say the container is in a state that would allow the service to
> > pass through? Are the people accessing this FTP server able to do
> > so? Nope, guess not, so for all intents and purposes your host is
> > totally unusable. So is it up? That is the logic of Netsaint. The
> > host is not the thing Netsaint/Nagios cares about; it is only a
> > conduit to the service you need, a toolbox to hold your tools. If
> > the toolbox is empty, can you work? The fact that the toolbox
> > exists IS important, because that is where you keep your tools, and
> > without your tools nothing gets done. Can you use your toolbox as a
> > tool? Sure, but only if you redefine what a tool is. There is the
> > logic you seek: not flawed but perfectly logical.
> >
> > If you want to emphasize your host as important, all you need to do
> > is create a service called "alive", give it the same parameters as
> > your host, and your host will "magically" be considered up.
> >
> > Hope this helps in understanding the logic.
> > Ken
> >
> >>  
> >> --__--__--
> >>
> >> Message: 1
> >> Date: Sun, 15 Sep 2002 02:40:41 -0500
> >> From: SyBase <sy...@va...>
> >> To: Russell Scibetti <ru...@qu...>
> >> CC: John Fox <jj...@mi...>,  nag...@li...,
> >>  nag...@li...
> >> Subject: Re: [Nagios-devel] Re: [Nagios-users] Perceived problem with host checks
> >>
> >> This may be by design, but IMHO this logic is flawed. Say you have a
> >> problem with a particular host service that first caused the box to
> >> reboot (setting host state to down) then when the box comes back up the
> >> service fails to start. Your monitoring tool will falsely continue to
> >> report the host as down when this is not the case. If shooting for
> >> accuracy is the idea (which I would think it would be) then maybe this
> >> should be changed?
> >>
> >> Russell Scibetti wrote:
> >>
> >> > We were confused by this at first too, but believe it or not, the
> >> > behavior you saw is what is expected.  You said that when you
> >> > eventually turned HTTP back on, both the host and service came back
> >> > up.  The way the nagios logic works is:
> >> >
> >> > 1.  check the service - if it fails...
> >> > 2.  check the host - if it fails (incl. the retries)...
> >> > 3.  host and service are now in a Hard non-OK state
> >> > 4.  Wait the service's normal_check_interval
> >> > 5.  Run the service (NOT the host) check
> >> > 6.  If the service is still down, then the host must still be down.
> >> > 7.  Wait the service's check interval.....repeat endlessly
> >> >
> >> > The Nagios logic appears to be "well, if the host is down, we can tell
> >> > it's back up when any of the services are running again" - similar
> >> > logic to "if a service is running, the host must be fine."  Also,
> >> > you'll see there is no check_interval for hosts.  This is because it
> >> > uses the service checks as the basis for the monitoring logic.
> >> >
> >> > This is at least what we can tell.  If someone knows something else,
> >> > please share.
> >> >
> >> > -Russell Scibetti
> >> >
> >> > John Fox wrote:
> >> >
> >> >> Hello,
> >> >>
> >> >> I'm configuring a Nagios 1.0b4 installation.  It's the first time
> >> >> I've used this product, and I've run into somewhat of a stumbling
> >> >> block.
> >> >>
> >> >> Both hosts used in my tests are running FreeBSD 4.6-STABLE and nagios
> >> >> is installed via the ports system.
> >> >>
> >> >> That said, here are the details:
> >> >>
> >> >> I've configured nagios to do host checks for host A and service
> >> >> checks for HTTPD on A.
> >> >>
> >> >> I start HTTPD on host A and fire up nagios (in daemon mode) on host B.
> >> >>
> >> >> Everything is fine.  Host and service are both marked as UP.
> >> >>
> >> >> I use ipfw to disable ICMP on host A. This is done with the intent of
> >> >> provoking a host check, knowing that the host-check test makes use of
> >> >> ping.
> >> >>
> >> >> Host continues to remain marked as up.  This makes sense to me, given
> >> >> that HTTPD is still running and accessible there.
> >> >>
> >> >> I kill HTTPD on A.
> >> >> Both host and service become marked as 'down' and I begin to
> >> >> receive problem notifications.
> >> >>
> >> >> I enable ICMP on A, knowing that the check-host-alive command
> >> >> makes use of the 'check_ping' plugin, and expecting that host A will
> >> >> soon be marked as 'UP'.
> >> >>
> >> >> But that does not happen; the host continues to be marked as down.  I
> >> >> watch the various status screens and see multiple host tests
> >> >> performed. I receive multiple problem notifications.
> >> >>
> >> >> I'm flummoxed by this, and log in to host B (the nagios machine) and
> >> >> verify that I can ping A from there.  I can.  I then run
> >> >> check-host-alive's "check_ping" plugin from the command line.  It
> >> >> instantly returns with a "PING OK" response. (Note: I used the exact
> >> >> same command structure as nagios would -- I took it from the
> >> >> 'check-host-alive' definition found in 'checkcommands.cfg'.)
> >> >>
> >> >> Yet the 'Host Information' page shows the Status info as
> >> >> "Critical -- Plugin timed out after 10 seconds".
> >> >>
> >> >> So to all appearances, nagios and I are getting different results
> >> >> from the exact same command line.  I don't believe this is what's
> >> >> really going on, because it seems absurd to me.  So I go to the FAQ.
> >> >>
> >> >> I see a question that seems to apply: "Hosts are incorrectly listed
> >> >> as being DOWN or UNREACHABLE".  But after reading it, I'm not sure
> >> >> that it does apply.
> >> >>
> >> >> The way I read it, nagios didn't perform any host checks on A until
> >> >> A's HTTPD went down.  Makes sense.
> >> >>
> >> >> At which point a host check is performed -- if the host check doesn't
> >> >> return 'OK', it is run again and again until it has made
> >> >> max_check_attempts (from the host definition) attempts OR received
> >> >> an 'OK' response.
> >> >>
> >> >> My max_check_attempts is set to 3.  But in observing the various
> >> >> status screens, I saw the "Last Status Check" value changing every 3
> >> >> minutes.  In the course of this test, I allowed the downtime to reach
> >> >> 46 minutes, which to me indicates that 15 host checks were run.
> >> >> Obviously, this is a much larger number than 3.  And certainly it
> >> >> seems that the plugin never received an 'OK' response.  This is quite
> >> >> a conundrum to me!
> >> >>
> >> >> I then restarted HTTPD on host A.  Within three minutes, this service
> >> >> was once again marked as 'UP' and the host, too, was again marked as
> >> >> 'UP', with the 'Host Information' page's "Status Information" field
> >> >> reading "PING OK...".
> >> >>
> >> >> On the off chance that my IPFW/ping machinations were somehow causing
> >> >> weirdness, I repeated the same basic experiment, but rather than
> >> >> disabling ICMP, I ifconfig'd my network card down.  And rather than
> >> >> re-enabling ICMP, I ifconfig'd the interface back up.  This resulted
> >> >> in the same behavior as the previous test.
> >> >>
> >> >> I don't see this as a major issue, given that a successful service
> >> >> check causes the host to be again considered 'UP'.  But it troubles
> >> >> me to not understand the behavior I'm seeing, as I'm simply unable to
> >> >> account for it.
> >> >>
> >> >> Any advice or thoughts would be very much welcomed!
> >> >>
> >> >>
> >> >> Thanks in advance,
> >> >>
> >> >>
> >> >> John
> >> >>
> >> >
> >>

Ethan Galstad,
Nagios Developer
---
Email: na...@na...
Website: http://www.nagios.org
