From: Ethan G. <na...@na...> - 2002-09-19 03:12:02
It is my personal opinion that if a host is to be considered to be in an "UP" state, at least one of the services you're monitoring should be in an OK state (or should at least change between non-OK states when the host recovers). That model should work for almost everyone using Nagios.

However, if you really want the status of the host re-verified after service checks even if all services associated with the host stay in a non-OK state when it recovers, enable the aggressive host check option in the main config file. Performance will suffer if you enable this option, but that's a tradeoff you'll have to be willing to accept.

A snippet from base/checks.c starting at line 888:

---
    /*******************************************/
    /******* SERVICE CHECK PROBLEM LOGIC *******/
    /*******************************************/

    /* hey, something's not working quite like it should... */
    else{

        /* reset the recovery notification flag (it may get set again though) */
        temp_service->no_recovery_notification=FALSE;

        /* check the route to the host if its supposed to be up right now... */
        if(temp_host->status==HOST_UP)
            route_result=verify_route_to_host(temp_host);

        /* else the host is either down or unreachable, so recheck it if necessary */
        else{

            /* we're using agressive host checking, so really do recheck the host... */
            if(use_aggressive_host_checking==TRUE)
                route_result=verify_route_to_host(temp_host);

            /* the service wobbled between non-OK states, so check the host... */
            else if(state_change==TRUE && temp_service->last_hard_state!=STATE_OK)
                route_result=verify_route_to_host(temp_host);

            /* else fake the host check, but (possibly) resend host notifications to contacts... */
            else{
---

On 18 Sep 2002 at 21:44, SyBase wrote:

> You are completely missing my point. Ping (ICMP) is controlled by the
> OS. If your OS is down (i.e. never booted), ICMP will not come back, and
> neither will your service.
> However, if the host (operating system) booted but for some reason the
> services were never started, then the problem is completely different.
> If all Nagios cares about is services, and a box with no available
> monitored services is considered down, then what is the point of the
> host check to begin with? Let's remove it completely and make people
> add their own ICMP service if they want to check that. I hope you see
> my point. I would think accuracy in showing what is really happening
> (i.e. not saying the host is down when it is really up) would be very
> important. If you do not agree, that is fine. Simply a suggestion.
>
> Kenneth.ray wrote:
>
> > Dear Sir,
> > Thank you for your email; however, I believe your test is flawed.
> > You are working from the premise that the host is the important
> > piece of your network, but actually the service running on the host
> > is the most important issue. What good is a host that has no
> > services available? Ping is a really basic service and only helps in
> > determining that the network interface is accessible. In some cases
> > it is quite possible for the ping to work and the box to be down.
> > If this is a real problem for you, add a separate service called
> > "alive", "pingable" or something related, and run the ping as a
> > service. This will change your host to "up" even if no other
> > services are available from it. But again, in my own humble opinion,
> > this proves nothing other than that you can ping the interface.
> > You're not even pinging the server, you're pinging the network card
> > that is hooked to the box.
> >
> > Think of the logic in this sense: a host is not an actual entity
> > but really a container/conduit for your services. So logic would
> > dictate that, regardless of whether the entity serving your services
> > is available, the real issue is not the entity but the service
> > provided by the container.
> > IMHO you can actually replace the word "host" with "container" and
> > the logic of Netsaint still holds up. Netsaint is service based, not
> > server based, and uses the logic of "what good is a host without
> > something running on it?" Though ping is a good conduit for
> > determining whether the network card of a particular server is
> > accessible from the network, that is the only thing it proves. I
> > personally have had a server that was pingable but had no services
> > available because the system was pegged. The network interface was
> > actually the only thing responding on that system, and after being
> > physically in front of the system I could see why.
> >
> > So I would have to strongly disagree with you on the statement "the
> > logic is flawed". Using your same situation: you have a particular
> > problem which brings down the box and causes a reboot. The service
> > that you run on this box never comes up. It's the only one you are
> > monitoring, but let's say the host reports as up (because you set up
> > another service called "alive", like in my suggestion), because you
> > can ping it. Let's say that this system is your FTP server. FTP is
> > not working; you try telnetting and that is not working either.
> >
> > You ping
> > you ping
> > you ping
> >
> > Nice!!! Is the box up? Are the services available? Can you TRULY say
> > the container is in a state that would allow the service to pass
> > through? Are the people accessing this FTP server able to do so?
> > Nope, guess not, so for all intents and purposes your host is
> > totally unusable. So is it up? That is the logic of Netsaint. The
> > host is not the thing Netsaint/Nagios cares about; it is only a
> > conduit to the service you need, a toolbox to hold your tools. If
> > the toolbox is empty, can you work?
> > The fact that the toolbox exists IS important, because that is where
> > you keep your tools, and without your tools nothing gets done. Can
> > you use your toolbox as a tool? Sure, but only if you redefine what
> > a tool is. There is the logic you seek: not flawed, but perfectly
> > logical.
> >
> > If you want to emphasize your host as important, all you need to do
> > is create a service called "alive", give it the same parameters as
> > your host, and your host will "magically" be considered up.
> >
> > Hope this helps in understanding the logic.
> > Ken
> >
> >>
> >> --__--__--
> >>
> >> Message: 1
> >> Date: Sun, 15 Sep 2002 02:40:41 -0500
> >> From: SyBase <sy...@va...>
> >> To: Russell Scibetti <ru...@qu...>
> >> CC: John Fox <jj...@mi...>, nag...@li...,
> >>     nag...@li...
> >> Subject: Re: [Nagios-devel] Re: [Nagios-users] Perceived problem with
> >>     host checks
> >>
> >> This may be by design, but IMHO this logic is flawed. Say you have a
> >> problem with a particular host service that first caused the box to
> >> reboot (setting host state to down), then when the box comes back up
> >> the service fails to start. Your monitoring tool will falsely
> >> continue to report the host as down when this is not the case. If
> >> shooting for accuracy is the idea (which I would think it would be)
> >> then maybe this should be changed?
> >>
> >> Russell Scibetti wrote:
> >>
> >> > We were confused by this at first too, but believe it or not, the
> >> > behavior you saw is what is expected. You said that when you
> >> > eventually turned HTTP back on, both the host and service came back
> >> > up. The way the Nagios logic works is:
> >> >
> >> > 1. Check the service - if it fails...
> >> > 2. Check the host - if it fails (incl. the retries)...
> >> > 3. Host and service are now in a hard non-OK state
> >> > 4. Wait the service's normal_check_interval
> >> > 5. Run the service (NOT the host) check
> >> > 6. If the service is still down, then the host must still be down.
> >> > 7. Wait the service's check interval... repeat endlessly
> >> >
> >> > The Nagios logic appears to be "well, if the host is down, we can
> >> > tell it's back up when any of the services are running again" -
> >> > similar logic to "if a service is running, the host must be fine."
> >> > Also, you'll see there is no check_interval for hosts. This is
> >> > because it uses the service checks as the basis for the monitoring
> >> > logic.
> >> >
> >> > This is at least what we can tell. If someone knows something else,
> >> > please share.
> >> >
> >> > -Russell Scibetti
> >> >
> >> > John Fox wrote:
> >> >
> >> >> Hello,
> >> >>
> >> >> I'm configuring a Nagios 1.0b4 installation. It's the first time
> >> >> I've used this product, and I've run into somewhat of a stumbling
> >> >> block.
> >> >>
> >> >> Both hosts used in my tests are running FreeBSD 4.6-STABLE and
> >> >> nagios is installed via the ports system.
> >> >>
> >> >> That said, here are the details:
> >> >>
> >> >> I've configured nagios to do host checks for host A and service
> >> >> checks for HTTPD on A.
> >> >>
> >> >> I start HTTPD on host A and fire up nagios (in daemon mode) on
> >> >> host B.
> >> >>
> >> >> Everything is fine. Host and service are both marked as UP.
> >> >>
> >> >> I use ipfw to disable ICMP on host A. This is done with the intent
> >> >> of provoking a host check, knowing that the host-check test makes
> >> >> use of ping.
> >> >>
> >> >> Host continues to remain marked as up. This makes sense to me,
> >> >> given that HTTPD is still running and accessible there.
> >> >>
> >> >> I kill HTTPD on A.
> >> >> Both host and service become marked as 'down' and I begin to
> >> >> receive problem notifications.
> >> >>
> >> >> I enable ICMP on A, knowing that the check-host-alive command
> >> >> makes use of the 'check_ping' plugin, and expecting that host A
> >> >> will soon be marked as 'UP'.
> >> >>
> >> >> But that does not happen; the host continues to be marked as down.
> >> >> I watch the various status screens and see multiple host tests
> >> >> performed. I receive multiple problem notifications.
> >> >>
> >> >> I'm flummoxed by this, and log in to host B (the nagios machine)
> >> >> and verify that I can ping A from there. I can. I then run
> >> >> check-host-alive's "check_ping" plugin from the command line. It
> >> >> instantly returns with a "PING OK" response. (Note: I used the
> >> >> exact same command structure as nagios would -- I took it from the
> >> >> 'check-host-alive' definition found in 'checkcommands.cfg'.)
> >> >>
> >> >> Yet the 'Host Information' page shows the Status info as
> >> >> "Critical -- Plugin timed out after 10 seconds".
> >> >>
> >> >> So to all appearances, nagios and I are getting different results
> >> >> from the exact same command line. I don't believe this is what's
> >> >> really going on, because it seems absurd to me. So I go to the FAQ.
> >> >>
> >> >> I see a question that seems to apply: "Hosts are incorrectly listed
> >> >> as being DOWN or UNREACHABLE". But after reading it, I'm not sure
> >> >> that it does apply.
> >> >>
> >> >> The way I read it, nagios didn't perform any host checks on A until
> >> >> A's HTTPD went down. Makes sense.
> >> >>
> >> >> At which point a host check is performed -- if the host check
> >> >> doesn't return 'OK', it is run again and again until it has made
> >> >> max_check_attempts (from the host definition) attempts OR received
> >> >> an 'OK' response.
> >> >>
> >> >> My max_check_attempts is set to 3. But in observing the various
> >> >> status screens, I saw the "Last Status Check" value changing every
> >> >> 3 minutes. In the course of this test, I allowed the downtime to
> >> >> reach 46 minutes, which to me indicates that 15 host checks were
> >> >> run. Obviously, this is a much larger number than 3.
> >> >> And certainly it seems that the plugin never received an 'OK'
> >> >> response. This is quite a conundrum to me!
> >> >>
> >> >> I then restarted HTTPD on host A. Within three minutes, this
> >> >> service was once again marked as 'UP' and the host, too, was again
> >> >> marked as 'UP', with the 'Host Information' page's "Status
> >> >> Information" field reading "PING OK...".
> >> >>
> >> >> On the off chance that my IPFW/ping machinations were somehow
> >> >> causing weirdness, I repeated the same basic experiment, but rather
> >> >> than disabling ICMP, I ifconfig'd my network card down. And rather
> >> >> than re-enabling ICMP, I ifconfig'd the interface back up. This
> >> >> resulted in the same behavior as the previous test.
> >> >>
> >> >> I don't see this as a major issue, given that a successful service
> >> >> check causes the host to be again considered 'UP'. But it troubles
> >> >> me to not understand the behavior I'm seeing, as I'm simply unable
> >> >> to account for it.
> >> >>
> >> >> Any advice or thoughts would be very much welcomed!
> >> >>
> >> >>
> >> >> Thanks in advance,
> >> >>
> >> >>
> >> >> John
> >> >>
> >> >
> >>
> >> --__--__--
> >>
> >> _______________________________________________
> >> Nagios-devel mailing list
> >> Nag...@li...
> >> https://lists.sourceforge.net/lists/listinfo/nagios-devel
> >>
> >> End of Nagios-devel Digest
> >>


Ethan Galstad,
Nagios Developer
---
Email: na...@na...
Website: http://www.nagios.org
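
For anyone following the thread, the decision that the base/checks.c excerpt in this message makes after a failed service check can be restated as a small standalone C function. This is only a sketch under stated assumptions: should_recheck_host and the two enums below are hypothetical names introduced for illustration, not Nagios APIs, and the real code calls verify_route_to_host() instead of returning a boolean.

```c
#include <stdbool.h>

/* Illustrative stand-ins for the Nagios constants; only HOST_UP and
 * STATE_OK appear in the quoted excerpt, the rest are assumptions. */
enum host_status { HOST_UP, HOST_DOWN, HOST_UNREACHABLE };
enum service_state { STATE_OK, STATE_WARNING, STATE_CRITICAL, STATE_UNKNOWN };

/* Hypothetical helper: returns true when a failed service check would
 * lead to a real host check, false when the host check is "faked". */
bool should_recheck_host(enum host_status status,
                         bool use_aggressive_host_checking,
                         bool state_change,
                         enum service_state last_hard_state)
{
    /* Host is believed to be up: always verify the route to it. */
    if (status == HOST_UP)
        return true;

    /* Host is down or unreachable, but aggressive checking is enabled. */
    if (use_aggressive_host_checking)
        return true;

    /* The service moved between two non-OK states, so recheck the host. */
    if (state_change && last_hard_state != STATE_OK)
        return true;

    /* Otherwise the host is not really rechecked. */
    return false;
}
```

In John's experiment the host was already DOWN, aggressive checking was off, and the HTTPD service stayed in the same non-OK state, so this sketch returns false. That is consistent with the host staying down until the service itself recovered.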