Re: [Shinken-devel] Problem with timeouts
From: Felipe o. <openglx@StarByte.net> - 2015-05-15 10:43:24
Sorry, David, I'm out of ideas. The only difference between your config and mine is the use of modules on the poller: I don't use any.

Have you tried upgrading to 2.4, just in case? Or maybe downgrading to 2.0.3 (that's the version I'm on)?

Regards

On 14 May 2015 at 23:24, David Good <dg...@wi...> wrote:

> Here's the poller.ini file I'm using:
>
> [daemon]
>
> #-- Global Configuration
> #user=shinken ; if not set then by default it's the current user.
> #group=shinken ; if not set then by default it's the current group.
> # Set to 0 if you want to make this daemon NOT run
> daemon_enabled=1
>
> # Larger configurations need more threads (default is 8?)
> daemon_thread_pool_size=50
>
> #-- Path Configuration
> # The daemon will chdir into the directory workdir when launched
> # paths variables values, if not absolute paths, are relative to workdir.
> # using default values for following config variables value:
> workdir = /var/run/shinken
> logdir = /var/log/shinken
> pidfile=%(workdir)s/pollerd.pid
>
> #-- Network configuration
> # host=0.0.0.0
> # port=7771
> # http_backend=auto
> # idontcareaboutsecurity=0
>
> #-- SSL configuration --
> use_ssl=0
> # WARNING : Put full paths for certs
> #ca_cert=/etc/shinken/certs/ca.pem
> #server_cert=/etc/shinken/certs/server.cert
> #server_key=/etc/shinken/certs/server.key
> #hard_ssl_name_check=0
>
> #-- Local log management --
> # Enabled by default to ease troubleshooting
> use_local_log=1
> local_log=%(logdir)s/pollerd.log
> # accepted log level values= DEBUG,INFO,WARNING,ERROR,CRITICAL
> log_level=INFO
> #log_level=DEBUG
>
> And here's the poller.cfg file:
>
> #===============================================================================
> # POLLER (S1_Poller)
> #===============================================================================
> # Description: The poller is responsible for:
> # - Active data acquisition
> # - Local passive data acquisition
> # https://shinken.readthedocs.org/en/latest/08_configobjects/poller.html
> #===============================================================================
> define poller {
>     poller_name     poller-1
>     address         shinken1.dc1.example.com
>     port            7771
>
>     ## Optional
>     spare               0   ; 1 = is a spare, 0 = is not a spare
>     manage_sub_realms   0   ; Does it take jobs from schedulers of sub-Realms?
>     min_workers         0   ; Starts with N processes (0 = 1 per CPU)
>     max_workers         0   ; No more than N processes (0 = 1 per CPU)
>     processes_by_worker 256 ; Each worker manages N checks
>     polling_interval    1   ; Get jobs from schedulers each N seconds
>     timeout             3   ; Ping timeout
>     data_timeout        120 ; Data send timeout
>     max_check_attempts  3   ; If ping fails N or more, then the node is dead
>     check_interval      60  ; Ping node every N seconds
>
>     ## Interesting modules that can be used:
>     # - booster-nrpe = Replaces the check_nrpe binary. Therefore it
>     #                  enhances performances when there are lot of NRPE
>     #                  calls.
>     # - named-pipe   = Allow the poller to read a nagios.cmd named pipe.
>     #                  This permits the use of distributed check_mk checks
>     #                  should you desire it.
>     # - SnmpBooster  = Snmp bulk polling module
>     modules     named-pipe, booster-nrpe
>
>     ## Advanced Features
>     #passive    0   ; For DMZ monitoring, set to 1 so the connections
>                     ; will be from scheduler -> poller.
>
>     # Poller tags are the tag that the poller will manage. Use None as tag name to manage
>     # untaggued checks
>     #poller_tags    None
>
>     # Enable https or not
>     use_ssl     0
>     # enable certificate/hostname check, will avoid man in the middle attacks
>     hard_ssl_name_check     0
>
>     realm   All
> }
>
> On 5/14/15 3:13 PM, David Good wrote:
>
> Here's another example of what I'm seeing. In the arbiter log I'll see
> something like this:
>
> [1431641122] INFO: [Shinken] [All] Trying to send configuration to poller poller-1
> [1431641242] ERROR: [Shinken] Failed sending configuration for poller-1: Connexion error to http://shinken1.dc1.example.com:7771/ <http://shinken1.dc1.eharmony.com:7771/> : Operation timed out after 120001 milliseconds with 0 bytes received
>
> And then just a few seconds later:
>
> [1431641291] INFO: [Shinken] [All] Trying to send configuration to poller poller-1
> [1431641291] INFO: [Shinken] [All] Dispatch OK of configuration 1 to poller poller-1
>
> And this poller is on the same server as the arbiter. I see this
> happening sporadically for pretty much every daemon, causing the
> configuration to be constantly in the process of being re-dispatched. This
> is especially frustrating as I'm trying to test out some new configs, adding
> and removing hosts and services from monitoring. If it can't finish
> dispatching, it makes it hard to test :-/
>
> On 5/14/15 2:49 PM, David Good wrote:
>
> I doubt that was the case. I was careful to make sure everything was
> stopped before restarting.
>
> And now my problems have started up again. I may be forced to upgrade to
> 2.4 to see if it helps any. Very frustrating. If that doesn't fix it, I
> may be forced to fall back to nagios and gearman. I'd hate to do that, as
> we had promised that Shinken would scale better than Nagios.
>
> On 5/13/15 2:50 PM, Felipe openglx wrote:
>
> Play the lotto just in case ;)
> My suspicion would be that your previous "restart" to adjust the thread
> pool (or other testing) didn't kill all threads, hence why you had some
> very unusual situations going on.
> Let us know how it goes, best luck on getting the project delivered!
>
> Regards
>
> On 13 May 2015 at 22:18, David Good <dg...@wi...> wrote:
>
>> It was all hosts, but I just reloaded with a new config, so we'll see if
>> my luck holds :-)
>>
>> On 5/13/15 2:00 PM, Felipe openglx wrote:
>>
>> I've noticed that Shinken 2 doesn't go easily with kill. I've always
>> done "pkill -9 -f shinken-" when needing to restart them.
>>
>> Glad to hear you got something working, David. All hosts or just a
>> fraction of them?
>>
>> Regards
>>
>> On 13 May 2015 at 21:43, David Good <dg...@wi...> wrote:
>>
>>> OK, things seem to be stable now. I discovered that several of the
>>> schedulers were using massive amounts of memory (over 30GB) causing the
>>> kernel to try to kill them or their children. I restarted them, then
>>> restarted anything that showed up as a problem in the arbiter log and
>>> since then it's been stable.
>>>
>>> One odd thing though is that some of the daemons wouldn't die normally
>>> -- I had to use 'kill -KILL' on them.
>>>
>>> ------------------------------------------------------------------------------
>>> One dashboard for servers and applications across Physical-Virtual-Cloud
>>> Widest out-of-the-box monitoring support with 50+ applications
>>> Performance metrics, stats and reports that give you Actionable Insights
>>> Deep dive visibility with transaction tracing using APM Insight.
>>> http://ad.doubleclick.net/ddm/clk/290420510;117567292;y
>>> _______________________________________________
>>> Shinken-devel mailing list
>>> Shi...@li...
>>> https://lists.sourceforge.net/lists/listinfo/shinken-devel
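
One observation on the arbiter log excerpt quoted in this thread: the failed dispatch was logged at 1431641242, exactly 120 seconds after the attempt started at 1431641122. That matches both the data_timeout of 120 in the quoted poller.cfg and the "120001 milliseconds" in the error text. The following is a minimal Python sketch, not part of Shinken, for measuring that gap across a whole arbiter log. It assumes the log line format shown in the excerpt; the function name dispatch_durations and the sample lines (abridged from the thread) are illustrative only.

```python
import re

# These patterns assume the arbiter log format quoted in the thread:
#   [<epoch seconds>] LEVEL: [Shinken] ... message ...
LINE_RE = re.compile(r"^\[(\d+)\]\s+(\w+):\s+(.*)$")
TRY_RE = re.compile(r"Trying to send configuration to (\w+) (\S+)")
FAIL_RE = re.compile(r"Failed sending configuration for (\S+?):")
OK_RE = re.compile(r"Dispatch OK of configuration \d+ to (\w+) (\S+)")


def dispatch_durations(lines):
    """Yield (daemon_name, outcome, seconds) for each finished dispatch attempt."""
    started = {}  # daemon name -> epoch second of the last "Trying to send"
    for line in lines:
        parsed = LINE_RE.match(line.strip())
        if not parsed:
            continue
        ts, msg = int(parsed.group(1)), parsed.group(3)

        attempt = TRY_RE.search(msg)
        if attempt:
            started[attempt.group(2)] = ts
            continue

        failure = FAIL_RE.search(msg)
        if failure and failure.group(1) in started:
            name = failure.group(1)
            yield name, "failed", ts - started.pop(name)
            continue

        success = OK_RE.search(msg)
        if success and success.group(2) in started:
            name = success.group(2)
            yield name, "ok", ts - started.pop(name)


# Sample lines abridged from the log excerpt in this thread.
SAMPLE = [
    "[1431641122] INFO: [Shinken] [All] Trying to send configuration to poller poller-1",
    "[1431641242] ERROR: [Shinken] Failed sending configuration for poller-1: "
    "Connexion error to http://shinken1.dc1.example.com:7771/ : "
    "Operation timed out after 120001 milliseconds with 0 bytes received",
    "[1431641291] INFO: [Shinken] [All] Trying to send configuration to poller poller-1",
    "[1431641291] INFO: [Shinken] [All] Dispatch OK of configuration 1 to poller poller-1",
]

if __name__ == "__main__":
    for name, outcome, seconds in dispatch_durations(SAMPLE):
        print(f"{name}: {outcome} after {seconds}s")
```

To run it against a real log, pass an open file instead of SAMPLE, for example dispatch_durations(open(path)) with path pointing at the arbiter log. If failures cluster at the data_timeout value while successes complete in about a second, that may point to the receiving daemon accepting the connection but not answering, rather than to a slow network.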