Passive checks stop executing

Help
Jeremy
2008-05-09
2013-04-25
  • Jeremy

    Jeremy - 2008-05-09

    We're starting to use NC_NET on some remote machines that we don't have direct connectivity to, so we're having them submit passive checks to our central Nagios server. However, after a day or two the passive checks stop coming in. If we're lucky, restarting the NC_NET service repeatedly and sending "-v CONFIG -l passive_check,true" a few times gets the checks coming in again. So far this is just on a local test machine, so I can call check_nt from our Linux Nagios server to send that passive_check,true command. In the wild we won't have connectivity to do this, so I'm trying to get to the bottom of why this is happening. We have passive_alwayson set to true, so it shouldn't be stopping as far as I can tell.

    Today the checks stopped coming in again. I did a "-v GETALLCHECKS" and could see all the passive results listed, but no timestamp was given. I still have connectivity to the Nagios server, though.

    passive_check    true
    passive_alwayson    true
    lock_passive_config    true
    interval_passive    5
    interval_div_passive    1
    perfdata_format    2
    embedded_send_nsca    false
    port_passive    5667
    host_passive    JEREMY
    pass_passive    xyz
    ip_passive    (our Nagios server's IP)
    Passive_timeout    10
    external_send_nsca    true
    external_send_nsca_app    C:\gsi-tools\nsca\
    external_send_nsca_ip    (our Nagios server's IP)
    external_send_nsca_port    5667
    external_send_nsca_timeout    10

    We're using send_nsca.exe with Triple DES encryption; our send_nsca.cfg is just:
    password=xyz
    encryption_method=3

    Thanks for any help,
    Jeremy

     
    • tony

      tony - 2008-05-10

      Hi Jeremy,

      To check passive.log remotely, run the check_nc_net command COMMAND_PRINT:
      ./check_nc_net -v COMMAND_PRINT -l"..\\config\\passive.log"
      This should give the time and the STDOUT/STDERR of the last passive check run.

      Since you are using DES, you have no choice but to use the external send_nsca.
      The config is fine, since it works for a day or so before issues arise.
      The problem could be anywhere, but I suspect it is not internal to NC_Net.

      Increasing the timeout may help if latency is the cause, particularly if the host is at a remote site reached through the internet. See how ping responds while the issue is occurring.
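Ping only tests ICMP, so it can also help to time a plain TCP connection to the NSCA port from the Windows host. The following is a minimal sketch, not part of NC_Net; the function name is made up for illustration, and the defaults of port 5667 and a 10 second timeout are taken from the config posted above.

```python
import socket
import time


def time_tcp_connect(host, port=5667, timeout=10.0):
    """Attempt one TCP connection and return (ok, seconds, error_text).

    This only proves that the port accepts connections; it does not
    speak the NSCA protocol or submit a check result.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, None
    except OSError as exc:
        # Covers refused connections, timeouts and name resolution failures.
        return False, time.monotonic() - start, str(exc)


# Example call (the host name is a placeholder, not from the thread):
#   ok, seconds, err = time_tcp_connect("nagios.example.com")
#   print(ok, round(seconds, 3), err)
```

If the connection time creeps toward the configured timeout while checks are failing, latency is a plausible culprit; if it stays fast, the problem is more likely on the host itself.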

      If submitting a CHECK_NT command to the active port has problems at the same time, this may be a network issue.
      Other network factors could be firewalls, routers, antivirus, or bad hardware causing latency on the entire subnet. I know some backup systems and other apps put a high load on the host, which may interfere with the system responding.

      Just in case, checking UPTIME may tell you whether any reboots are happening at the remote site.

      If not set up already, configure a few hosts to use NSCA. That way you can differentiate between a single host that stops reporting and a problem with NSCA on the Nagios server submitting to the command pipe.

      I assume you are also on version 4.x of NC_Net, since that uses the app path for the location of send_nsca.

      With passive_alwayson set, there is no point in turning passive checks off and then back on, since that would just introduce a delay of interval_passive/interval_div_passive minutes before any checks are seen.

      If the passive checks hit an error while trying to submit over the network, NC_Net should wait one passive interval before retrying.

      For debugging, some techniques are:

      Checking the NC_Net running config:
      ./check_nc_net -v ENUMCONFIG
      If passive checks turned themselves off, this would show it.
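To make that check less manual, the settings listing can be parsed and the passive flags inspected. This is a rough sketch that assumes the listing looks like the "name, whitespace, value" lines Jeremy posted earlier in the thread; the real ENUMCONFIG output format may differ, and both function names are hypothetical. It also works out the interval_passive/interval_div_passive delay mentioned above.

```python
def parse_config_listing(text):
    """Parse 'name<whitespace>value' lines into a dict with lower-cased names.

    Assumption: one setting per line, as in the listing posted in this thread.
    """
    settings = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if not parts:
            continue  # skip blank lines
        name = parts[0].lower()
        value = parts[1].strip() if len(parts) > 1 else ""
        settings[name] = value
    return settings


def passive_summary(settings):
    """Report whether passive checks are enabled and how often they run."""
    enabled = settings.get("passive_check", "").lower() == "true"
    always_on = settings.get("passive_alwayson", "").lower() == "true"
    interval = float(settings.get("interval_passive", "0"))
    # Guard against a zero divisor by falling back to 1.
    divisor = float(settings.get("interval_div_passive", "1")) or 1.0
    return {
        "enabled": enabled,
        "always_on": always_on,
        "minutes_between_runs": interval / divisor,
    }
```

With the values from this thread (interval_passive 5, interval_div_passive 1), the delay between runs works out to 5 minutes.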

      When passive checks run, they save a copy of the results in memory for GETALLCHECKS to report on. GETALLCHECKS will only have data if passive checks have run at least once.
      ENUMCHECK reports the names of all passive checks. If this does not match what you expect, odds are that either the service check is in the passive config and should be removed, or it was an old check that was removed but ran after the last restart of NC_Net. The latter could be removed via DELCHECK (but if the check is still in the passive log, it will be back after the next run of passive checks).

      GETCHECKTIME is used to get the timestamp for the GETALLCHECKS results.

      Checking the Windows Application log:
      ./check_nc_net -v EVENTLOG_NEW -l "Application^^60^Nc_Net^^"
      Most passive check error messages have been removed from NC_Net, since passive checks are normally reliable and to prevent flooding the event log.

      Sorry for the length; I wrote it from the bottom to the top.
      Tony

       
      • Jeremy

        Jeremy - 2008-05-12

        Thanks for the response (and for NC_NET in the first place)!

        I'm using NC_Net v4.1a.

        It's definitely not a problem with NSCA on our Nagios server, as we use a distributed setup with about 10 distributed Nagios servers. We use the check_freshness settings so we'd get about 15000 alerts if NSCA died and no checks were reported for anything ;-)

        The last time this broke, passive.log was no longer being updated, even after restarting the NC_NET service several times and waiting 10+ minutes. I'm curious to see what -v ENUMCONFIG will show if this breaks again. I will also call send_nsca.exe manually, just to make sure it can still send a message on TCP 5667 and to rule out a problem with it.
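For the manual send_nsca.exe test, a small helper can build the input line and the command line. This is a sketch under assumptions: it uses the standard send_nsca input format (host, service, return code and output separated by tabs, ending in a newline) and the standard -H, -p, -to and -c options, which the Windows build is assumed to share. The service name "manual_test" and the file paths in the comments are placeholders.

```python
def build_nsca_line(host, service, return_code, output, delim="\t"):
    """Build one passive service check result in send_nsca's input format."""
    if return_code not in (0, 1, 2, 3):
        raise ValueError("return_code must be 0 (OK), 1 (WARNING), "
                         "2 (CRITICAL) or 3 (UNKNOWN)")
    return delim.join([host, service, str(return_code), output]) + "\n"


def build_send_nsca_argv(exe, server, port=5667, timeout=10, config=None):
    """Build the argument list for a send_nsca invocation."""
    argv = [exe, "-H", server, "-p", str(port), "-to", str(timeout)]
    if config:
        argv += ["-c", config]
    return argv


# Example (placeholder paths and address; run on the Windows host):
#   import subprocess
#   line = build_nsca_line("JEREMY", "manual_test", 0, "OK - manual test")
#   argv = build_send_nsca_argv(r"C:\path\to\send_nsca.exe", "192.0.2.10",
#                               config=r"C:\path\to\send_nsca.cfg")
#   result = subprocess.run(argv, input=line, text=True, capture_output=True)
#   print(result.returncode, result.stdout, result.stderr)
```

The host name passed in should match host_passive, and the service must exist as a passive service on the Nagios side, otherwise Nagios will discard the result even though the send succeeds.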

        It has not stopped working again yet. I'll write more when/if I run into this again. It has happened twice now but maybe no more, knock on wood.

        Cheers
        Jeremy

         
        • tony

          tony - 2008-05-12

          That sounds like a good plan.
          However, based on the details of your response, it sounds like something external to NC_Net may be accessing the config or log file when the problem occurs.

          It is possible that some other activity on the Windows host is interfering.
          To be explicit: backups, antivirus, some other application, or a user browsing around. To explain:

          NC_Net requires read/write access to the passive.log and passive.cfg files (write access to passive.cfg is needed when it is pushing passive.cfg settings into the passive config). NC_Net uses a .NET mutex to determine whether another part of NC_Net is currently using these files, and then serializes the access. HOWEVER, NC_Net does not detect an external resource preventing passive.log or passive.cfg from being opened. Thus, if some other app or service is accessing the file, the passive checks may fail, and they will not run properly until the files have been released by the external resource.

          (Note: when a file is opened by most apps, it is still available for reading and occasionally for writing, as with Notepad.exe. However, some apps, like Word, put a lock on the file that prevents reading until it is released.)

          So make sure backups and/or antivirus exclude the NC_Net config folder. Backups can always either:
          A) Stop NC_Net, then restart it when done, via net start|stop NC_Net, or
          B) Back up NC_Net/Config via a file copy or some other CVS, since the files are relatively static after a rollout.

          The NC_Net COMMAND_PRINT command may be able to shed some light on whether the file is locked for reading (however, I did not test it for file access):
          ./check_nc_net -v COMMAND_PRINT -l"..\\config\\passive.log"
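The external-lock theory can also be tested directly on the Windows host while the problem is occurring. The sketch below is not part of NC_Net and the function name is made up; it simply tries to open a file for reading and for appending and reports any error. On Windows, a file held with an exclusive lock by another process typically fails here with a PermissionError.

```python
def probe_file_access(path):
    """Try to open path for reading and for appending.

    Returns (ok, error_text). Nothing is written to the file, and a
    missing file is reported as an error rather than being created,
    because the read attempt runs first.
    """
    try:
        with open(path, "rb"):
            pass
        with open(path, "ab"):
            pass
    except OSError as exc:
        return False, f"{type(exc).__name__}: {exc}"
    return True, None


# Example (run locally on the monitored host; the paths are placeholders):
#   for name in (r"C:\path\to\config\passive.log",
#                r"C:\path\to\config\passive.cfg"):
#       print(name, probe_file_access(name))
```

Running this from a scheduled task every minute and logging the result would show whether failures line up with backup or antivirus scan windows.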

          Good luck,

          Tony
          Please remember: donations for NC_Net are accepted at montitech.com

           
