You can subscribe to this list here.
2001 |
Jan
|
Feb
|
Mar
|
Apr
|
May
|
Jun
|
Jul
|
Aug
|
Sep
|
Oct
|
Nov
(4) |
Dec
|
---|---|---|---|---|---|---|---|---|---|---|---|---|
2002 |
Jan
(9) |
Feb
(11) |
Mar
(37) |
Apr
(50) |
May
(51) |
Jun
(31) |
Jul
(35) |
Aug
(30) |
Sep
(39) |
Oct
(85) |
Nov
(91) |
Dec
(30) |
2003 |
Jan
(65) |
Feb
(56) |
Mar
(80) |
Apr
(73) |
May
(84) |
Jun
(108) |
Jul
(62) |
Aug
(53) |
Sep
(46) |
Oct
(44) |
Nov
(59) |
Dec
(91) |
2004 |
Jan
(71) |
Feb
(25) |
Mar
(87) |
Apr
(67) |
May
(65) |
Jun
(60) |
Jul
(31) |
Aug
(100) |
Sep
(97) |
Oct
(133) |
Nov
(87) |
Dec
(65) |
2005 |
Jan
(84) |
Feb
(133) |
Mar
(67) |
Apr
(74) |
May
(108) |
Jun
(92) |
Jul
(54) |
Aug
(82) |
Sep
(22) |
Oct
(10) |
Nov
(69) |
Dec
(65) |
2006 |
Jan
(89) |
Feb
(133) |
Mar
(100) |
Apr
(69) |
May
(138) |
Jun
(85) |
Jul
(45) |
Aug
(56) |
Sep
(33) |
Oct
(69) |
Nov
(33) |
Dec
(78) |
2007 |
Jan
(120) |
Feb
(120) |
Mar
(136) |
Apr
(153) |
May
(75) |
Jun
(65) |
Jul
(64) |
Aug
(129) |
Sep
(122) |
Oct
(231) |
Nov
(145) |
Dec
(93) |
2008 |
Jan
(163) |
Feb
(157) |
Mar
(60) |
Apr
(96) |
May
(92) |
Jun
(67) |
Jul
(90) |
Aug
(75) |
Sep
(87) |
Oct
(81) |
Nov
(158) |
Dec
(71) |
2009 |
Jan
(83) |
Feb
(77) |
Mar
(94) |
Apr
(78) |
May
(210) |
Jun
(146) |
Jul
(104) |
Aug
(208) |
Sep
(80) |
Oct
(54) |
Nov
(55) |
Dec
(84) |
2010 |
Jan
(21) |
Feb
(67) |
Mar
(85) |
Apr
(20) |
May
(29) |
Jun
(11) |
Jul
(13) |
Aug
(48) |
Sep
(56) |
Oct
(70) |
Nov
(100) |
Dec
(83) |
2011 |
Jan
(59) |
Feb
(36) |
Mar
(59) |
Apr
(21) |
May
(96) |
Jun
(27) |
Jul
(17) |
Aug
(58) |
Sep
(12) |
Oct
(15) |
Nov
(27) |
Dec
(20) |
2012 |
Jan
(36) |
Feb
(3) |
Mar
(19) |
Apr
(9) |
May
(22) |
Jun
(34) |
Jul
(7) |
Aug
(31) |
Sep
(65) |
Oct
(18) |
Nov
(78) |
Dec
(24) |
2013 |
Jan
(46) |
Feb
(73) |
Mar
(50) |
Apr
(11) |
May
(32) |
Jun
(10) |
Jul
(6) |
Aug
(18) |
Sep
(36) |
Oct
|
Nov
|
Dec
|
From: AL13N <al...@rm...> - 2013-09-04 10:55:37
|
> > On Sep 3, 2013, at 5:34 PM, AL13N <al...@rm...> wrote: > >> Op dinsdag 3 september 2013 17:22:32 schreef Andreas Ericsson: >>> On 2013-09-02 21:20, AL13N wrote: >>>> Hi, >>>> >>>> i'm a Mageia distribution packager, and i'm maintaining some event >>>> broker >>>> modules. >>>> >>>> since these are packaged seperately we're trying to get things to work >>>> out-of- the-box for our users, specifically adding event brokers in >>>> the >>>> configuration file for nagios. >>>> >>>> the problem we're facing is that: We don't want a nagios addon package >>>> to >>>> rewrite the configuration file itself. however, we would like to add >>>> the >>>> event broker (or other settings) and then reload/restart nagios. >>>> >>>> a beneficial thing would be a configuration directory (much like it >>>> exists >>>> with hosts and commands and stuff like that) but for the main nagios >>>> configuration file. >>>> >>>> and if while we're at it, to have multiple directories, where >>>> /usr/share/nagios/nagios.d/ has the "defaults" which can be overridden >>>> in >>>> /etc/nagios.d/ (for instance) (this is a convention that's growing in >>>> use) >>>> >>>> I can try to start making a patch for this, but i'm asking about this >>>> first, in order to find out if it makes a chance of getting accepted. >>> >>> Not only does it stand a chance of getting accepted; It's on the >>> roadmap. If you start working on it I'll help as much as I can and >>> I'm willing to accept less-than-awesome quality code, although I'll >>> polish such later if it turns out to be an issue. >>> >>> The only real requirement is that any "main" config file option >>> should be settable from any file in the dropdir. >> >> >> the idea i thought might be doable, would be to allow multiple -d >> options to >> nagios, and allow those to be a file or a directory. >> >> this would keep working: >> >> [ ]# /usr/sbin/nagios -d /etc/nagios/nagios.cfg >> >> and allow >> >> [ ]# /usr/sbin/nagios -d /usr/share/nagios/conf.d/ -d >> /etc/nagios/conf.d/ >> >> to work too. >> >> internally, instead of one file, we'll have to have an array of >> files/dirs, so >> that reloading nagios would also expand the config files/dirs into files >> and >> parse them all in order... >> >> > > remember too that -d is daemonize mode so probably need another option… yes, i guess i meant [ ]# nagios -d /usr/share/nagios/conf.d/ /etc/nagios/conf.d/ but it seems we're looking into it differently |
From: Andreas E. <ae...@op...> - 2013-09-04 10:54:12
|
On 2013-09-04 11:37, Jochen Bern wrote: > On 04.09.2013 11:03, Andreas Ericsson wrote: >> On 2013-09-04 10:31, Jonas Meurer wrote: >>> The indisputable part of this change is, that users are allowed to see >>> hostgroups and servicegroups with at least one authorized host or >>> service. Unclear is, whether this means "group and all its group >>> members", or "group and only authorized group members". >> >> It should mean "group and only authorized group members, except also >> hosts for services where one is authorized to see the service". > [...] >> Well, it *was* by design, but now I'm changing the design. It's a good >> time for it, since 4.0 is about to come out. I think the security teams >> can move on and we'll consider this "changed" rather than "fixed" for >> 4.0, where we do some security tightening. > > Since you do seem to be willing to ponder the system of access rights > and its security implications: I haven't checked the 4.x prereleases > yet, does being authorized to see a host's information still necessarily > provide access to *all* services on it? > AFAIK, yes. Please understand that I'm very uninterested in changes to the UI though, and I'd be much (much) happier if UI and core were split into two different components. > In the "customers accessing provider's Nagios" scenario, I suppose that > the customer might be interested in seeing "application is running" but > not, say, "the snmpd that ties this machine to the provider's NMS is > acting up" ... > I agree. The problem is that with access to the host comes access to commands that affect all services on that host as well, so it's not necessarily as clearcut as "disable viewing here and we're done", if one wants to do things properly. -- Andreas Ericsson and...@op... OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 Considering the successes of the wars on alcohol, poverty, drugs and terror, I think we should give some serious thought to declaring war on peace. |
From: AL13N <al...@rm...> - 2013-09-04 10:51:21
|
> On 2013-09-04 00:34, AL13N wrote: >> Op dinsdag 3 september 2013 17:22:32 schreef Andreas Ericsson: >>> On 2013-09-02 21:20, AL13N wrote: >>>> Hi, >>>> >>>> i'm a Mageia distribution packager, and i'm maintaining some event >>>> broker >>>> modules. >>>> >>>> since these are packaged seperately we're trying to get things to work >>>> out-of- the-box for our users, specifically adding event brokers in >>>> the >>>> configuration file for nagios. >>>> >>>> the problem we're facing is that: We don't want a nagios addon package >>>> to >>>> rewrite the configuration file itself. however, we would like to add >>>> the >>>> event broker (or other settings) and then reload/restart nagios. >>>> >>>> a beneficial thing would be a configuration directory (much like it >>>> exists >>>> with hosts and commands and stuff like that) but for the main nagios >>>> configuration file. >>>> >>>> and if while we're at it, to have multiple directories, where >>>> /usr/share/nagios/nagios.d/ has the "defaults" which can be overridden >>>> in >>>> /etc/nagios.d/ (for instance) (this is a convention that's growing in >>>> use) >>>> >>>> I can try to start making a patch for this, but i'm asking about this >>>> first, in order to find out if it makes a chance of getting accepted. >>> >>> Not only does it stand a chance of getting accepted; It's on the >>> roadmap. If you start working on it I'll help as much as I can and >>> I'm willing to accept less-than-awesome quality code, although I'll >>> polish such later if it turns out to be an issue. >>> >>> The only real requirement is that any "main" config file option >>> should be settable from any file in the dropdir. >> >> >> the idea i thought might be doable, would be to allow multiple -d >> options to >> nagios, and allow those to be a file or a directory. >> > > I'd much rather see this as "include=/some/path.d" where we read all > files ending with .cfg and .conf if it's a directory, or the file if > it's a fail, and then parse each include'd file in order of > include-statement -> lexicographical order > so that > > include=/etc/nagios.conf.d/ > include=/etc/nagios.perfdata.d/ > > would cause all files from /etc/nagios.conf.d to be parsed before all > files from /etc/nagios.perfdata.d. > > The included files can include other files, if they're so inclined. so, you mean to keep one nagios.cfg file and have include=/path/to/foo. would this mean that if you have setting1=value setting2=value include=/path/to/foo setting3=value that being parsed would: 1) setting1 2) setting2 3) all from /path/to/foo if it's a file or directory 4) setting3 is this the priority that you aim for? |
From: Jochen B. <Jochen.Bern@LINworks.de> - 2013-09-04 09:37:42
|
On 04.09.2013 11:03, Andreas Ericsson wrote: > On 2013-09-04 10:31, Jonas Meurer wrote: >> The indisputable part of this change is, that users are allowed to see >> hostgroups and servicegroups with at least one authorized host or >> service. Unclear is, whether this means "group and all its group >> members", or "group and only authorized group members". > > It should mean "group and only authorized group members, except also > hosts for services where one is authorized to see the service". [...] > Well, it *was* by design, but now I'm changing the design. It's a good > time for it, since 4.0 is about to come out. I think the security teams > can move on and we'll consider this "changed" rather than "fixed" for > 4.0, where we do some security tightening. Since you do seem to be willing to ponder the system of access rights and its security implications: I haven't checked the 4.x prereleases yet, does being authorized to see a host's information still necessarily provide access to *all* services on it? In the "customers accessing provider's Nagios" scenario, I suppose that the customer might be interested in seeing "application is running" but not, say, "the snmpd that ties this machine to the provider's NMS is acting up" ... Regards, J. Bern -- *NEU* - NEC IT-Infrastruktur-Produkte im <http://www.linworks-shop.de/>: Server--Storage--Virtualisierung--Management SW--Passion for Performance Jochen Bern, Systemingenieur --- LINworks GmbH <http://www.LINworks.de/> Postfach 100121, 64201 Darmstadt | Robert-Koch-Str. 9, 64331 Weiterstadt PGP (1024D/4096g) FP = D18B 41B1 16C0 11BA 7F8C DCF7 E1D5 FAF4 444E 1C27 Tel. +49 6151 9067-231, Zentr. -0, Fax -299 - Amtsg. Darmstadt HRB 85202 Unternehmenssitz Weiterstadt, Geschäftsführer Metin Dogan, Oliver Michel |
From: Jonas M. <jo...@fr...> - 2013-09-04 09:08:39
|
Hey list and fellow Nagios developers, as you might have noticed, there's a discussion ongoing on oss-security[1] regarding bug report #456[2]. I'm the one who discovered the described issue, and I still believe that it's a bug with security implications, even though not everyone seems to be convinced. I'll try to give a brief description of the issue: The Nagios status.cgi (at all 3.4* and 4.0* versions I checked) leaks hostnames to unauthorized users as part of servicegroups. All of servicegroup overview, summary and grid list each and every hostname that is part of a servicegroup, regardless whether the HTTP user is listed in contacts/contactgroups for this host. In my opinion this is a security issue - at least on multi-user (e.g. multi-customer) Nagios-setups. I guess that most ISPs which give their customers access to the Nagios CGIs don't want to provide a full list of monitored hosts to their customers as a side-effect. One reason for confusion is the following entry from Nagios3 changelog[3]: 3.4.0 - 05/04/2012 ENHANCEMENTS [...] - Users can now see hostgroups and servicegroups that contain at least one host or service they are authorized for, instead of having to be authorized for them all (Ethan Galstad) The indisputable part of this change is, that users are allowed to see hostgroups and servicegroups with at least one authorized host or service. Unclear is, whether this means "group and all its group members", or "group and only authorized group members". Unfortunately, no Nagios developer speaked up yet about this issue. Thus there's still a lot confusion about it. You can find my patch at the Nagios Issue Tracker. This patch changes status.cgi behaviour to show only group members (hosts/services) that the user is authorized to see. A comment about this issue by the Nagios Developers whould be highly appreciated. In case that the described (and critizised) behaviour of status.cgi is intended, the distribution security teams can move on. If on the other hand you agree with me, that this issue should be fixed, I'll continue to work with the security teams in order to provide patched Nagios packages for their distributions. Thanks for your work on Nagios, it's a very valuable piece of software! Kind regards, jonas [1] http://www.openwall.com/lists/oss-security/2013/06/26/6 [2] http://tracker.nagios.org/view.php?id=456 [3] http://www.nagios.org/projects/nagioscore/history/core-3x |
From: Andreas E. <ae...@op...> - 2013-09-04 09:03:20
|
On 2013-09-04 10:31, Jonas Meurer wrote: > Hey list and fellow Nagios developers, > > as you might have noticed, there's a discussion ongoing on oss-security[1] > regarding bug report #456[2]. > > I'm the one who discovered the described issue, and I still believe that > it's a bug with security implications, even though not everyone seems to > be convinced. > > I'll try to give a brief description of the issue: > > The Nagios status.cgi (at all 3.4* and 4.0* versions I checked) leaks > hostnames to unauthorized users as part of servicegroups. All of > servicegroup overview, summary and grid list each and every hostname that > is part of a servicegroup, regardless whether the HTTP user is listed in > contacts/contactgroups for this host. > > In my opinion this is a security issue - at least on multi-user (e.g. > multi-customer) Nagios-setups. I guess that most ISPs which give their > customers access to the Nagios CGIs don't want to provide a full list > of monitored hosts to their customers as a side-effect. > > One reason for confusion is the following entry from Nagios3 changelog[3]: > > 3.4.0 - 05/04/2012 > ENHANCEMENTS > [...] > - Users can now see hostgroups and servicegroups that contain at least > one host or service they are authorized for, instead of having to > be authorized for them all (Ethan Galstad) > > > The indisputable part of this change is, that users are allowed to see > hostgroups and servicegroups with at least one authorized host or > service. Unclear is, whether this means "group and all its group > members", or "group and only authorized group members". > It should mean "group and only authorized group members, except also hosts for services where one is authorized to see the service". > Unfortunately, no Nagios developer speaked up yet about this issue. Thus > there's still a lot confusion about it. > Well, now I have, so confusion dispelled. > You can find my patch at the Nagios Issue Tracker. Ah, right. Care to provide a link? Mostly, I prefer to get patches to this mailing list, since I don't spend a lot of time hunting them down from the (underused) tracker. > This patch changes > status.cgi behaviour to show only group members (hosts/services) that > the user is authorized to see. > > A comment about this issue by the Nagios Developers whould be highly > appreciated. In case that the described (and critizised) behaviour of > status.cgi is intended, the distribution security teams can move on. > Well, it *was* by design, but now I'm changing the design. It's a good time for it, since 4.0 is about to come out. I think the security teams can move on and we'll consider this "changed" rather than "fixed" for 4.0, where we do some security tightening. > If on the other hand you agree with me, that this issue should be > fixed, I'll continue to work with the security teams in order to > provide patched Nagios packages for their distributions. > > Thanks for your work on Nagios, it's a very valuable piece of software! > Thanks for enjoying it. -- Andreas Ericsson and...@op... OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 Considering the successes of the wars on alcohol, poverty, drugs and terror, I think we should give some serious thought to declaring war on peace. |
From: Andreas E. <ae...@op...> - 2013-09-04 08:57:17
|
On 2013-09-04 00:34, AL13N wrote: > Op dinsdag 3 september 2013 17:22:32 schreef Andreas Ericsson: >> On 2013-09-02 21:20, AL13N wrote: >>> Hi, >>> >>> i'm a Mageia distribution packager, and i'm maintaining some event broker >>> modules. >>> >>> since these are packaged seperately we're trying to get things to work >>> out-of- the-box for our users, specifically adding event brokers in the >>> configuration file for nagios. >>> >>> the problem we're facing is that: We don't want a nagios addon package to >>> rewrite the configuration file itself. however, we would like to add the >>> event broker (or other settings) and then reload/restart nagios. >>> >>> a beneficial thing would be a configuration directory (much like it exists >>> with hosts and commands and stuff like that) but for the main nagios >>> configuration file. >>> >>> and if while we're at it, to have multiple directories, where >>> /usr/share/nagios/nagios.d/ has the "defaults" which can be overridden in >>> /etc/nagios.d/ (for instance) (this is a convention that's growing in use) >>> >>> I can try to start making a patch for this, but i'm asking about this >>> first, in order to find out if it makes a chance of getting accepted. >> >> Not only does it stand a chance of getting accepted; It's on the >> roadmap. If you start working on it I'll help as much as I can and >> I'm willing to accept less-than-awesome quality code, although I'll >> polish such later if it turns out to be an issue. >> >> The only real requirement is that any "main" config file option >> should be settable from any file in the dropdir. > > > the idea i thought might be doable, would be to allow multiple -d options to > nagios, and allow those to be a file or a directory. > I'd much rather see this as "include=/some/path.d" where we read all files ending with .cfg and .conf if it's a directory, or the file if it's a fail, and then parse each include'd file in order of include-statement -> lexicographical order so that include=/etc/nagios.conf.d/ include=/etc/nagios.perfdata.d/ would cause all files from /etc/nagios.conf.d to be parsed before all files from /etc/nagios.perfdata.d. The included files can include other files, if they're so inclined. -- Andreas Ericsson and...@op... OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 Considering the successes of the wars on alcohol, poverty, drugs and terror, I think we should give some serious thought to declaring war on peace. |
From: Daniel W. <dwi...@gm...> - 2013-09-03 23:55:07
|
On Sep 3, 2013, at 5:34 PM, AL13N <al...@rm...> wrote: > Op dinsdag 3 september 2013 17:22:32 schreef Andreas Ericsson: >> On 2013-09-02 21:20, AL13N wrote: >>> Hi, >>> >>> i'm a Mageia distribution packager, and i'm maintaining some event broker >>> modules. >>> >>> since these are packaged seperately we're trying to get things to work >>> out-of- the-box for our users, specifically adding event brokers in the >>> configuration file for nagios. >>> >>> the problem we're facing is that: We don't want a nagios addon package to >>> rewrite the configuration file itself. however, we would like to add the >>> event broker (or other settings) and then reload/restart nagios. >>> >>> a beneficial thing would be a configuration directory (much like it exists >>> with hosts and commands and stuff like that) but for the main nagios >>> configuration file. >>> >>> and if while we're at it, to have multiple directories, where >>> /usr/share/nagios/nagios.d/ has the "defaults" which can be overridden in >>> /etc/nagios.d/ (for instance) (this is a convention that's growing in use) >>> >>> I can try to start making a patch for this, but i'm asking about this >>> first, in order to find out if it makes a chance of getting accepted. >> >> Not only does it stand a chance of getting accepted; It's on the >> roadmap. If you start working on it I'll help as much as I can and >> I'm willing to accept less-than-awesome quality code, although I'll >> polish such later if it turns out to be an issue. >> >> The only real requirement is that any "main" config file option >> should be settable from any file in the dropdir. > > > the idea i thought might be doable, would be to allow multiple -d options to > nagios, and allow those to be a file or a directory. > > this would keep working: > > [ ]# /usr/sbin/nagios -d /etc/nagios/nagios.cfg > > and allow > > [ ]# /usr/sbin/nagios -d /usr/share/nagios/conf.d/ -d /etc/nagios/conf.d/ > > to work too. > > internally, instead of one file, we'll have to have an array of files/dirs, so > that reloading nagios would also expand the config files/dirs into files and > parse them all in order... > > remember too that -d is daemonize mode so probably need another option… Dan |
From: AL13N <al...@rm...> - 2013-09-03 22:34:50
|
Op dinsdag 3 september 2013 17:22:32 schreef Andreas Ericsson: > On 2013-09-02 21:20, AL13N wrote: > > Hi, > > > > i'm a Mageia distribution packager, and i'm maintaining some event broker > > modules. > > > > since these are packaged seperately we're trying to get things to work > > out-of- the-box for our users, specifically adding event brokers in the > > configuration file for nagios. > > > > the problem we're facing is that: We don't want a nagios addon package to > > rewrite the configuration file itself. however, we would like to add the > > event broker (or other settings) and then reload/restart nagios. > > > > a beneficial thing would be a configuration directory (much like it exists > > with hosts and commands and stuff like that) but for the main nagios > > configuration file. > > > > and if while we're at it, to have multiple directories, where > > /usr/share/nagios/nagios.d/ has the "defaults" which can be overridden in > > /etc/nagios.d/ (for instance) (this is a convention that's growing in use) > > > > I can try to start making a patch for this, but i'm asking about this > > first, in order to find out if it makes a chance of getting accepted. > > Not only does it stand a chance of getting accepted; It's on the > roadmap. If you start working on it I'll help as much as I can and > I'm willing to accept less-than-awesome quality code, although I'll > polish such later if it turns out to be an issue. > > The only real requirement is that any "main" config file option > should be settable from any file in the dropdir. the idea i thought might be doable, would be to allow multiple -d options to nagios, and allow those to be a file or a directory. this would keep working: [ ]# /usr/sbin/nagios -d /etc/nagios/nagios.cfg and allow [ ]# /usr/sbin/nagios -d /usr/share/nagios/conf.d/ -d /etc/nagios/conf.d/ to work too. internally, instead of one file, we'll have to have an array of files/dirs, so that reloading nagios would also expand the config files/dirs into files and parse them all in order... |
From: Andreas E. <ae...@op...> - 2013-09-03 15:22:43
|
On 2013-09-02 21:20, AL13N wrote: > Hi, > > i'm a Mageia distribution packager, and i'm maintaining some event broker > modules. > > since these are packaged seperately we're trying to get things to work out-of- > the-box for our users, specifically adding event brokers in the configuration > file for nagios. > > the problem we're facing is that: We don't want a nagios addon package to > rewrite the configuration file itself. however, we would like to add the event > broker (or other settings) and then reload/restart nagios. > > a beneficial thing would be a configuration directory (much like it exists with > hosts and commands and stuff like that) but for the main nagios configuration > file. > > and if while we're at it, to have multiple directories, where > /usr/share/nagios/nagios.d/ has the "defaults" which can be overridden in > /etc/nagios.d/ (for instance) (this is a convention that's growing in use) > > I can try to start making a patch for this, but i'm asking about this first, in > order to find out if it makes a chance of getting accepted. > Not only does it stand a chance of getting accepted; It's on the roadmap. If you start working on it I'll help as much as I can and I'm willing to accept less-than-awesome quality code, although I'll polish such later if it turns out to be an issue. The only real requirement is that any "main" config file option should be settable from any file in the dropdir. -- Andreas Ericsson and...@op... OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 Considering the successes of the wars on alcohol, poverty, drugs and terror, I think we should give some serious thought to declaring war on peace. |
From: AL13N <al...@rm...> - 2013-09-02 19:37:23
|
Hi, i'm a Mageia distribution packager, and i'm maintaining some event broker modules. since these are packaged seperately we're trying to get things to work out-of- the-box for our users, specifically adding event brokers in the configuration file for nagios. the problem we're facing is that: We don't want a nagios addon package to rewrite the configuration file itself. however, we would like to add the event broker (or other settings) and then reload/restart nagios. a beneficial thing would be a configuration directory (much like it exists with hosts and commands and stuff like that) but for the main nagios configuration file. and if while we're at it, to have multiple directories, where /usr/share/nagios/nagios.d/ has the "defaults" which can be overridden in /etc/nagios.d/ (for instance) (this is a convention that's growing in use) I can try to start making a patch for this, but i'm asking about this first, in order to find out if it makes a chance of getting accepted. Regards, Maarten |
From: Mattias R. <mat...@op...> - 2013-08-30 12:41:03
|
On Fri, Aug 30, 2013 at 2:18 PM, Andreas Ericsson <ae...@op...> wrote: > On 2013-08-29 14:42, Mattias Ryrlén wrote: > >> commit f0054ac856059183f6c7bc3f9e8a80**965a8c0368 (HEAD, master) >> Author: Mattias Ryrlén <mat...@op...> >> Date: Thu Aug 29 14:34:39 2013 +0200 >> >> YES! test: build on machines with only one cpu >> >> There might be someone out there who still have only one cpu/core, >> let >> those >> build too. >> >> Signed-off-by: Mattias Ryrlén <mat...@op...> >> >> > While I sort of agree that tests should pass on single-cpu machines as > well, this change renders the whole test useless, since we *always* > return > 0 from online_cpus(). > > Adding a function (real_online_cpus()) which returns 0 when it can't > reliably test for number of online cpus, and letting that be wrapped > by online_cpus() so that the latter can retain its current behaviour > would solve the problem. online_cpus() should then look like this: > > unsigned int online_cpus(void) > { > unsigned int cpus = real_online_cpus(); > return cpus ? cpus : 1; > } > > while the new function real_online_cpus() must take the body of the > current online_cpus() function but be changed so that the fallback is > to return 0. The test must then be changed to use real_online_cpus() > instead. > > Would you like to reroll, or should I make the necessary changes and > credit you with an "Reported-by"? I reroll, have enough lvl80 chars anyway.. > > > diff --git a/lib/test-nsutils.c b/lib/test-nsutils.c >> index 1a23ee1..f835d67 100644 >> --- a/lib/test-nsutils.c >> +++ b/lib/test-nsutils.c >> @@ -31,10 +31,10 @@ int main(int argc, char **argv) >> asprintf(&s1, "arg varg foo %d", 12); >> s2 = mkstr("arg varg foo %d", 12); >> ok_str(s1, s2, "mkstr() must build proper strings"); >> - if (online_cpus() > 1) { >> + if (online_cpus() > 0) { >> t_pass("%d online cpus detected", online_cpus()); >> } else { >> - t_fail("Do you really have only one cpu core?"); >> + t_fail("No online cpus detected"); >> } >> return t_end(); >> } >> >> > > -- > Andreas Ericsson and...@op... > OP5 AB www.op5.se > Tel: +46 8-230225 Fax: +46 8-230231 > > Considering the successes of the wars on alcohol, poverty, drugs and > terror, I think we should give some serious thought to declaring war > on peace. > -- Vänliga hälsningar / Best Regards Mattias Ryrlén __________________________ op5 AB Första Långgatan 19 SE-413 27 Göteborg Mobil: +46 735-17 70 99 Support: +46 31-774 09 24 www.op5.com |
From: Andreas E. <ae...@op...> - 2013-08-30 12:19:08
|
On 2013-08-29 14:42, Mattias Ryrlén wrote: > commit f0054ac856059183f6c7bc3f9e8a80965a8c0368 (HEAD, master) > Author: Mattias Ryrlén <mat...@op...> > Date: Thu Aug 29 14:34:39 2013 +0200 > > YES! test: build on machines with only one cpu > > There might be someone out there who still have only one cpu/core, let > those > build too. > > Signed-off-by: Mattias Ryrlén <mat...@op...> > While I sort of agree that tests should pass on single-cpu machines as well, this change renders the whole test useless, since we *always* return > 0 from online_cpus(). Adding a function (real_online_cpus()) which returns 0 when it can't reliably test for number of online cpus, and letting that be wrapped by online_cpus() so that the latter can retain its current behaviour would solve the problem. online_cpus() should then look like this: unsigned int online_cpus(void) { unsigned int cpus = real_online_cpus(); return cpus ? cpus : 1; } while the new function real_online_cpus() must take the body of the current online_cpus() function but be changed so that the fallback is to return 0. The test must then be changed to use real_online_cpus() instead. Would you like to reroll, or should I make the necessary changes and credit you with an "Reported-by"? > diff --git a/lib/test-nsutils.c b/lib/test-nsutils.c > index 1a23ee1..f835d67 100644 > --- a/lib/test-nsutils.c > +++ b/lib/test-nsutils.c > @@ -31,10 +31,10 @@ int main(int argc, char **argv) > asprintf(&s1, "arg varg foo %d", 12); > s2 = mkstr("arg varg foo %d", 12); > ok_str(s1, s2, "mkstr() must build proper strings"); > - if (online_cpus() > 1) { > + if (online_cpus() > 0) { > t_pass("%d online cpus detected", online_cpus()); > } else { > - t_fail("Do you really have only one cpu core?"); > + t_fail("No online cpus detected"); > } > return t_end(); > } > -- Andreas Ericsson and...@op... OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 Considering the successes of the wars on alcohol, poverty, drugs and terror, I think we should give some serious thought to declaring war on peace. |
From: Max S. <max...@op...> - 2013-08-27 11:39:59
|
From: Max Sikström <msi...@op...> Some per-object logging was removed with previous commit, which can be useful in many cases. This commit reimplements that logging in the related callbacks. Using logging in the callbacks instead of centrally means that more information can be outputted in the log, because each callback has much bigger domain knowledge than what the worker process itself has. Using this structure also keeps the worker API as clean as possible from domain-specific information. Signed-off-by: Max Sikström <msi...@op...> --- base/checks.c | 7 +++++ base/notifications.c | 18 ++++-------- base/sehandlers.c | 73 +++++++++++++++++++++++-------------------------- base/workers.c | 64 +++++++++++++++++++++++++++++-------------- include/workers.h | 10 +++++++ xdata/xpddefault.c | 16 +++++++--- 6 files changed, 111 insertions(+), 77 deletions(-) diff --git a/base/checks.c b/base/checks.c index c5c1ff2..e81ec35 100644 --- a/base/checks.c +++ b/base/checks.c @@ -68,6 +68,13 @@ int reap_check_results(void) { static void check_complete_callback(struct wproc_result *wpres, void *data, int flags) { check_result *cr = (check_result*)data; if(wpres) { + if (cr->object_check_type == SERVICE_CHECK) { + wproc_handle_err(wpres, WPROC_ERROR_ALL & ~WPROC_ERROR_EXITCODE, "Service check", "Host: %s Service: %s", cr->host_name, cr->service_description); + } + else if(cr->object_check_type == HOST_CHECK) { + wproc_handle_err(wpres, WPROC_ERROR_ALL & ~WPROC_ERROR_EXITCODE, "Host check", "Host: %s", cr->host_name); + } + /* Got a result. If wpres == NULL, only clean up */ memcpy(&cr->rusage, &wpres->rusage, sizeof(wpres->rusage)); diff --git a/base/notifications.c b/base/notifications.c index e9b6dee..7f85c83 100644 --- a/base/notifications.c +++ b/base/notifications.c @@ -73,21 +73,15 @@ const char *notification_reason_name(unsigned int reason_type) static void notification_callback(struct wproc_result *wpres, void *data, int flags) { struct notification_job *oj = (struct notification_job *)data; if(wpres) { - double runtime = ((double)wpres->runtime.tv_sec) + ((double)wpres->runtime.tv_usec) / 1000000.0; - if (wpres->early_timeout) { - if (oj->service_description) { - logit(NSLOG_RUNTIME_WARNING, TRUE, "Warning: Notifying contact '%s' of service '%s' on host '%s' by command '%s' timed out after %.2f seconds\n", - oj->contact_name, oj->service_description, - oj->host_name, wpres->command, runtime); - } else { - logit(NSLOG_RUNTIME_WARNING, TRUE, "Warning: Notifying contact '%s' of host '%s' by command '%s' timed out after %.2f seconds\n", - oj->contact_name, oj->host_name, - wpres->command, runtime); + if (oj->service_description) { + wproc_handle_err(wpres, WPROC_ERROR_ALL, "Service notification", "Host: %s Service: %s Contact: %s\n", oj->host_name, oj->service_description, oj->contact_name ); + } + else { + wproc_handle_err(wpres, WPROC_ERROR_ALL, "Host notification", "Host: %s Contact: %s\n", oj->host_name, oj->contact_name ); } } - } free(oj); -} + } /* notify contacts about a service problem or recovery */ int service_notification(service *svc, int type, char *not_author, char *not_data, int options) { diff --git a/base/sehandlers.c b/base/sehandlers.c index 8df5f25..558eeb8 100644 --- a/base/sehandlers.c +++ b/base/sehandlers.c @@ -36,22 +36,42 @@ #endif /******************************************************************/ -/************* OBSESSIVE COMPULSIVE HANDLER FUNCTIONS *************/ +/********************* JOB COMPLETE CALLBACKS *********************/ /******************************************************************/ -static void ocsp_complete_callback(struct wproc_result *wpres, void *data, int flags) { - service *svc = (service *)data; +struct eventhandler_job { + service *svc; + host *hst; + const char *msg; +}; + +static void eventhandler_job_complete(struct wproc_result *wpres, void *data, int flags) { + struct eventhandler_job *job = (struct eventhandler_job*)data; if(wpres) { - if (wpres->early_timeout) { - double runtime = ((double)wpres->runtime.tv_sec) + ((double)wpres->runtime.tv_usec) / 1000000.0; - logit(NSLOG_RUNTIME_WARNING, TRUE, "Warning: OCSP command '%s' for service '%s' on host '%s' timed out after %.2f seconds\n", - wpres->command, svc->description, svc->host_name, - runtime); + if( job->svc ) { + wproc_handle_err(wpres, WPROC_ERROR_ALL, job->msg, "Host: %s Service: %s\n", job->svc->host_name, job->svc->description ); + } + else if( job->hst ) { + wproc_handle_err(wpres, WPROC_ERROR_ALL, job->msg, "Host: %s\n", job->hst->name ); + } } + free(job); } + +static int eventhandler_run_job(char *command_line, int timeout, host *hst, service *svc, char *msg, nagios_macros *mac) { + struct eventhandler_job *job = calloc(1, sizeof(struct eventhandler_job)); + job->hst = hst; + job->svc = svc; + job->msg = msg; + /* run the command through a worker */ + return wproc_run_callback(command_line, timeout, eventhandler_job_complete, (void*)job, mac); } +/******************************************************************/ +/************* OBSESSIVE COMPULSIVE HANDLER FUNCTIONS *************/ +/******************************************************************/ + /* handles service check results in an obsessive compulsive manner... */ int obsessive_compulsive_service_check_processor(service *svc) { char *raw_command = NULL; @@ -102,7 +122,7 @@ int obsessive_compulsive_service_check_processor(service *svc) { log_debug_info(DEBUGL_CHECKS, 2, "Processed obsessive compulsive service processor command line: %s\n", processed_command); /* run the command through a worker */ - wproc_run_callback(processed_command, ocsp_timeout, ocsp_complete_callback, (void*)svc, &mac); + eventhandler_run_job(processed_command, ocsp_timeout, svc->host_ptr, svc, "OCSP", &mac); /* free memory */ clear_volatile_macros_r(&mac); @@ -112,16 +132,6 @@ int obsessive_compulsive_service_check_processor(service *svc) { } -static void ochp_complete_callback(struct wproc_result *wpres, void *data, int flags) { - host *hst = (host *)data; - if(wpres) { - if (wpres->early_timeout) { - double runtime = ((double)wpres->runtime.tv_sec) + ((double)wpres->runtime.tv_usec) / 1000000.0; - logit(NSLOG_RUNTIME_WARNING, TRUE, "Warning: OCHP command '%s' for host '%s' timed out after %.2f seconds\n", - wpres->command, hst->name, runtime); - } - } -} /* handles host check results in an obsessive compulsive manner... */ int obsessive_compulsive_host_check_processor(host *hst) { @@ -167,7 +177,7 @@ int obsessive_compulsive_host_check_processor(host *hst) { log_debug_info(DEBUGL_CHECKS, 2, "Processed obsessive compulsive host processor command line: %s\n", processed_command); /* run the command through a worker */ - wproc_run_callback(processed_command, ochp_timeout, ochp_complete_callback, hst, &mac); + eventhandler_run_job(processed_command, ochp_timeout, hst, NULL, "OCHP", &mac); /* free memory */ clear_volatile_macros_r(&mac); @@ -183,17 +193,6 @@ int obsessive_compulsive_host_check_processor(host *hst) { /**************** SERVICE EVENT HANDLER FUNCTIONS *****************/ /******************************************************************/ -static void evthandler_complete_with_msg(struct wproc_result *wpres, void *data, int flags) { - char *msg = (char *)data; - if(wpres) { - if (wpres->early_timeout) { - double runtime = ((double)wpres->runtime.tv_sec) + ((double)wpres->runtime.tv_usec) / 1000000.0; - logit(NSLOG_EVENT_HANDLER | NSLOG_RUNTIME_WARNING, TRUE, msg, - wpres->command, runtime); - } - } -} - /* handles changes in the state of a service */ int handle_service_event(service *svc) { @@ -313,8 +312,7 @@ int run_global_service_event_handler(nagios_macros *mac, service *svc) { #endif /* run the command through a worker */ - result = wproc_run_callback(processed_command, event_handler_timeout, evthandler_complete_with_msg, - "Warning: Global service event handler command '%s' timed out after %.2f seconds\n" , mac); + result = eventhandler_run_job(processed_command, event_handler_timeout, svc->host_ptr, svc, "global service event handler", mac); #ifdef USE_EVENT_BROKER /* get end time */ @@ -406,8 +404,7 @@ int run_service_event_handler(nagios_macros *mac, service *svc) { #endif /* run the command through a worker */ - result = wproc_run_callback(processed_command, event_handler_timeout, evthandler_complete_with_msg, - "Warning: Service event handler command '%s' timed out after %.2f seconds\n" , mac); + result = eventhandler_run_job(processed_command, event_handler_timeout, svc->host_ptr, svc, "service event handler", mac); #ifdef USE_EVENT_BROKER /* get end time */ @@ -543,8 +540,7 @@ int run_global_host_event_handler(nagios_macros *mac, host *hst) { #endif /* run the command through a worker */ - result = wproc_run_callback(processed_command, event_handler_timeout, evthandler_complete_with_msg, - "Warning: Global host event handler command '%s' timed out after %.2f seconds\n" , mac); + result = eventhandler_run_job(processed_command, event_handler_timeout, hst, NULL, "global host event handler", mac); #ifdef USE_EVENT_BROKER /* get end time */ @@ -634,8 +630,7 @@ int run_host_event_handler(nagios_macros *mac, host *hst) { #endif /* run the command through a worker */ - result = wproc_run_callback(processed_command, event_handler_timeout, evthandler_complete_with_msg, - "Warning: Host event handler command '%s' timed out after %.2f seconds\n" , mac); + result = eventhandler_run_job(processed_command, event_handler_timeout, hst, NULL, "host event handler", mac); #ifdef USE_EVENT_BROKER /* get end time */ diff --git a/base/workers.c b/base/workers.c index a6db4f5..2c773b3 100644 --- a/base/workers.c +++ b/base/workers.c @@ -492,7 +492,7 @@ static void fo_reassign_wproc_job(void *job_) static int handle_worker_result(int sd, int events, void *arg) { - char *buf, *error_reason = NULL; + char *buf; unsigned long size; int ret; static struct kvvec kvv = KVVEC_INITIALIZER; @@ -563,26 +563,6 @@ static int handle_worker_result(int sd, int events, void *arg) if (wpres.error_code == ETIME) { wpres.early_timeout = TRUE; } - if (wpres.early_timeout) { - asprintf(&error_reason, "timed out after %.2fs", tv_delta_f(&wpres.start, &wpres.stop)); - } - else if (WIFSIGNALED(wpres.wait_status)) { - asprintf(&error_reason, "died by signal %d%s after %.2f seconds", - WTERMSIG(wpres.wait_status), - WCOREDUMP(wpres.wait_status) ? " (core dumped)" : "", - tv_delta_f(&wpres.start, &wpres.stop)); - } - - if (error_reason) { - logit(NSLOG_RUNTIME_ERROR, TRUE, "wproc: job %d from worker %s %s", - job->id, wp->name, error_reason); - logit(NSLOG_RUNTIME_ERROR, TRUE, "wproc: command: %s\n", job->command); - logit(NSLOG_RUNTIME_ERROR, TRUE, "wproc: early_timeout=%d; exited_ok=%d; wait_status=%d; error_code=%d;\n", - wpres.early_timeout, wpres.exited_ok, wpres.wait_status, wpres.error_code); - wproc_logdump_buffer(NSLOG_RUNTIME_ERROR, TRUE, "wproc: stderr", wpres.outerr); - wproc_logdump_buffer(NSLOG_RUNTIME_ERROR, TRUE, "wproc: stdout", wpres.outstd); - } - my_free(error_reason); /* Run the callback with the wpres struct as argument */ run_job_callback(job, &wpres, 0); @@ -781,6 +761,48 @@ int init_workers(int desired_workers) return 0; } +int wproc_handle_err(struct wproc_result *wpres, int errflags, const char *job_type, const char *fmt, ...) { + char *error_reason = NULL; + char *buffer = NULL; + va_list ap; + + if (wpres->early_timeout && (errflags & WPROC_ERROR_EARLYTIMEOUT)) { + asprintf(&error_reason, "timed out after %.2fs", tv_delta_f(&wpres->start, &wpres->stop)); + } + else if (WIFSIGNALED(wpres->wait_status) && (errflags & WPROC_ERROR_SIGNAL)) { + asprintf(&error_reason, "died by signal %d%s after %.2f seconds", + WTERMSIG(wpres->wait_status), + WCOREDUMP(wpres->wait_status) ? " (core dumped)" : "", + tv_delta_f(&wpres->start, &wpres->stop)); + } + else if (WEXITSTATUS(wpres->wait_status) != 0 && (errflags & WPROC_ERROR_EXITCODE)) { + asprintf(&error_reason, "Exited with return code %d", + WEXITSTATUS(wpres->wait_status)); + } + + if (error_reason) { + logit(NSLOG_RUNTIME_ERROR, TRUE, "wproc: %s job %d: %s", job_type, wpres->job_id, error_reason); + logit(NSLOG_RUNTIME_ERROR, TRUE, "wproc: command: %s\n", wpres->command); + logit(NSLOG_RUNTIME_ERROR, TRUE, "wproc: early_timeout=%d; exited_ok=%d; wait_status=%d; error_code=%d;\n", + wpres->early_timeout, wpres->exited_ok, wpres->wait_status, wpres->error_code); + + if(fmt) { + va_start(ap, fmt); + if(vasprintf(&buffer, fmt, ap) > 0) { + logit(NSLOG_RUNTIME_ERROR, TRUE, "wproc: %s\n", buffer); + free(buffer); + } + va_end(ap); + } + + wproc_logdump_buffer(NSLOG_RUNTIME_ERROR, TRUE, "wproc: stderr", wpres->outerr); + wproc_logdump_buffer(NSLOG_RUNTIME_ERROR, TRUE, "wproc: stdout", wpres->outstd); + my_free(error_reason); + return 1; + } + return 0; + } + /* * Handles adding the command and macros to the kvvec, * as well as shipping the command off to a designated diff --git a/include/workers.h b/include/workers.h index cf28287..172e7b4 100644 --- a/include/workers.h +++ b/include/workers.h @@ -5,6 +5,12 @@ #define WPROC_FORCE (1 << 0) +#define WPROC_ERROR_EXITCODE (1 << 0) +#define WPROC_ERROR_EARLYTIMEOUT (1 << 1) +#define WPROC_ERROR_SIGNAL (1 << 3) + +#define WPROC_ERROR_ALL ((1 << 4)-1) + NAGIOS_BEGIN_DECL; /* FIXME: Needs to be somewhere better... */ @@ -36,6 +42,10 @@ extern int wproc_can_spawn(struct load_control *lc); extern void free_worker_memory(int flags); extern int workers_alive(void); extern int init_workers(int desired_workers); + +extern int wproc_handle_err(struct wproc_result *wpres, int errflags, const char *job_type, const char *fmt, ...) + __attribute__((__format__(__printf__, 4, 5))); + extern int wproc_run_callback(char *command_line, int timeout, void (*cb)(struct wproc_result *, void *, int), void *data, nagios_macros *mac); NAGIOS_END_DECL; diff --git a/xdata/xpddefault.c b/xdata/xpddefault.c index 94bddbf..b8913ef 100644 --- a/xdata/xpddefault.c +++ b/xdata/xpddefault.c @@ -42,11 +42,17 @@ static int host_perfdata_fd = -1; static int service_perfdata_fd = -1; /******************************************************************/ -/********************** CHECK COMPLETE CALLBACK *******************/ +/********* PERFORMANCE DATA PROCESSING COMPLETE CALLBACK **********/ /******************************************************************/ -static void perfdata_command_callback(struct wproc_result *wpres, void *data, int flags) { - /* Dummy method as callback. we don't care about the result for now */ +static void perfdata_service_callback(struct wproc_result *wpres, void *data, int flags) { + service *svc = (service*)data; + wproc_handle_err(wpres, WPROC_ERROR_ALL, "Service perfdata processing", "Host: %s Service: %s\n", svc->host_name, svc->description ); +} + +static void perfdata_host_callback(struct wproc_result *wpres, void *data, int flags) { + host *hst = (host*)data; + wproc_handle_err(wpres, WPROC_ERROR_ALL, "Host perfdata processing", "Host: %s\n", hst->name ); } /******************************************************************/ @@ -345,7 +351,7 @@ int xpddefault_run_service_performance_data_command(nagios_macros *mac, service log_debug_info(DEBUGL_PERFDATA, 2, "Processed service performance data command line: %s\n", processed_command_line); /* run the command */ - wproc_run_callback(processed_command_line, perfdata_timeout, perfdata_command_callback, NULL, NULL); + wproc_run_callback(processed_command_line, perfdata_timeout, perfdata_service_callback, (void*)svc, NULL); /* free memory */ my_free(processed_command_line); @@ -386,7 +392,7 @@ int xpddefault_run_host_performance_data_command(nagios_macros *mac, host *hst) log_debug_info(DEBUGL_PERFDATA, 2, "Processed host performance data command line: %s\n", processed_command_line); /* run the command */ - wproc_run_callback(processed_command_line, perfdata_timeout, perfdata_command_callback, NULL, NULL); + wproc_run_callback(processed_command_line, perfdata_timeout, perfdata_host_callback, (void*)hst, NULL); /* free memory */ my_free(processed_command_line); -- 1.7.1 |
From: Andreas E. <ae...@op...> - 2013-08-26 09:35:23
|
On 2013-08-23 10:28, Max Sikstrom wrote: > From: Max Sikström <msi...@op...> > > A worker is a closed defined system that executes commands, and distributes it > over several processes, and does it quite well. > > Because of the nature of a job to be asyncrhonous, the simplest way is to let > worker handler job types to know what should be done when finished; start job, > do other things, do some stuff when job is done, dependent on what job type it > was. > > But that concept doesn't scale well when new kinds of jobs is needed, for > example through nebmods, or new systems in the system. The worker code gets > cluttered with non-worker related functionality, and some logic about the other > parts functionality needs to be done within the worker module. > This part of the message doesn't make much sense without changes being isolated mostly to lib/worker.c. Everything else relates to how the scheduler handles communication with the workers, and has nothing to do with actual workers as such. > The concept of callbacks and pointers to functions is a much cleaner way to > handle job results. Each module that uses the worker can define how to handle > the job result itself instead. No special cases is needed in the worker code. > Except there's no single-point-of-entry where we can log errors in case we do that, so each handler would need its own messages for that. I'm not exactly thrilled about that. So while we might be gaining cleaner code, we're losing error logging of helpers that for various reasons fail to do their job properly, or get killed by various signals. That's a real benefit to have, and "realler" than "clean code", so NAK in its current shape and form. -- Andreas Ericsson and...@op... OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 Considering the successes of the wars on alcohol, poverty, drugs and terror, I think we should give some serious thought to declaring war on peace. |
From: Eric S. <est...@na...> - 2013-08-24 15:05:08
|
I've created a tarball with a release candidate for Nagios Core 3.5.1. The tarball can be downloaded from http://sourceforge.net/projects/nagios/files/nagios-3.x/nagios-3.5.1/nagios-3.5.1rc1.tar.gz/download. Feel free to compile it and try it (on a test system, please). If I don't hear of any serious issues by the end of next week, I'll release it. The change log is as follows: * Added handler for SIGXFSZ signal (Eric Stanley) * Fixed bug #444: Nagios 3.5.0 problem with macro $ADMINEMAIL$ : @ is converted to %40 after 2 nagios reload (Duplicate of bug #407) * Fixed bug #407: Reloading nagios config causes spaces in notifications to become plus signs (Alexey Dvoryanchikov) * Fixed bug #445: Adding triggered downtime for child hosts causes a SIGSEGV on restart/reload (Eric Stanley) * Fixed bug #375: Freshness expiration never reached and bug #427: freshness threshold doesn't work if it is set long (Scott Wilkerson, Eric Stanley) * Fixed bug #432: Downtime scheduled as "Nagios Process" and not the Users name (Sam Lansing, Eric Stanley) -- Eric Stanley ___ Developer Nagios Enterprises, LLC Email: est...@na... Web: www.nagios.com |
From: Max S. <max...@op...> - 2013-08-23 08:29:24
|
From: Max Sikström <msi...@op...> A worker is a closed defined system that executes commands, and distributes it over several processes, and does it quite well. Because of the nature of a job to be asyncrhonous, the simplest way is to let worker handler job types to know what should be done when finished; start job, do other things, do some stuff when job is done, dependent on what job type it was. But that concept doesn't scale well when new kinds of jobs is needed, for example through nebmods, or new systems in the system. The worker code gets cluttered with non-worker related functionality, and some logic about the other parts functionality needs to be done within the worker module. The concept of callbacks and pointers to functions is a much cleaner way to handle job results. Each module that uses the worker can define how to handle the job result itself instead. No special cases is needed in the worker code. Reviewed-by: Robin Sonefors <rso...@op...> Signed-off-by: Max Sikström <msi...@op...> --- base/checks.c | 45 +++++++- base/notifications.c | 38 ++++++- base/sehandlers.c | 96 +++++++++-------- base/workers.c | 301 +++---------------------------------------------- include/workers.h | 24 +---- xdata/xpddefault.c | 11 ++- 6 files changed, 160 insertions(+), 355 deletions(-) diff --git a/base/checks.c b/base/checks.c index 76a9b7d..c5c1ff2 100644 --- a/base/checks.c +++ b/base/checks.c @@ -61,6 +61,47 @@ int reap_check_results(void) { +/******************************************************************/ +/********************** CHECK COMPLETE CALLBACK *******************/ +/******************************************************************/ + +static void check_complete_callback(struct wproc_result *wpres, void *data, int flags) { + check_result *cr = (check_result*)data; + if(wpres) { + /* Got a result. If wpres == NULL, only clean up */ + + memcpy(&cr->rusage, &wpres->rusage, sizeof(wpres->rusage)); + cr->start_time.tv_sec = wpres->start.tv_sec; + cr->start_time.tv_usec = wpres->start.tv_usec; + cr->finish_time.tv_sec = wpres->stop.tv_sec; + cr->finish_time.tv_usec = wpres->stop.tv_usec; + if (WIFEXITED(wpres->wait_status)) { + cr->return_code = WEXITSTATUS(wpres->wait_status); + } else { + cr->return_code = STATE_UNKNOWN; + } + + if (wpres->outstd && *wpres->outstd) { + cr->output = strdup(wpres->outstd); + } else if (wpres->outerr) { + asprintf(&cr->output, "(No output on stdout) stderr: %s", wpres->outerr); + } else { + cr->output = NULL; + } + + cr->early_timeout = wpres->early_timeout; + cr->exited_ok = wpres->exited_ok; + cr->engine = NULL; + cr->source = "something"; /* wp->name; */ + + process_check_result(cr); + } + + /* Always cleanup */ + free_check_result(cr); + free(cr); +} + /******************************************************************/ /****************** SERVICE MONITORING FUNCTIONS ******************/ @@ -293,7 +334,7 @@ int run_async_service_check(service *svc, int check_options, double latency, int svc->latency = old_latency; /* paw off the check to a worker to run */ - runchk_result = wproc_run_check(cr, processed_command, &mac); + runchk_result = wproc_run_callback(processed_command, service_check_timeout, check_complete_callback, (void*)cr, &mac); if (runchk_result == ERROR) { logit(NSLOG_RUNTIME_ERROR, TRUE, "Unable to run check for service '%s' on host '%s'\n", svc->description, svc->host_name); } @@ -2078,7 +2119,7 @@ int run_async_host_check(host *hst, int check_options, double latency, int sched /* reset latency (permanent value for this check will get set later) */ hst->latency = old_latency; - runchk_result = wproc_run_check(cr, processed_command, &mac); + runchk_result = wproc_run_callback(processed_command, service_check_timeout, check_complete_callback, (void*)cr, &mac); if (runchk_result == ERROR) { logit(NSLOG_RUNTIME_ERROR, TRUE, "Unable to send check for host '%s' to worker (ret=%d)\n", hst->name, runchk_result); } else { diff --git a/base/notifications.c b/base/notifications.c index 13ae651..e9b6dee 100644 --- a/base/notifications.c +++ b/base/notifications.c @@ -30,6 +30,13 @@ #include "../include/neberrors.h" #include "../include/workers.h" + +struct notification_job { + char *contact_name; + char *host_name; + char *service_description; +}; + /*** silly helpers ****/ static contact *find_contact_by_name_or_alias(const char *name) { @@ -59,11 +66,28 @@ const char *notification_reason_name(unsigned int reason_type) return "(unknown)"; } - /******************************************************************/ /***************** SERVICE NOTIFICATION FUNCTIONS *****************/ /******************************************************************/ +static void notification_callback(struct wproc_result *wpres, void *data, int flags) { + struct notification_job *oj = (struct notification_job *)data; + if(wpres) { + double runtime = ((double)wpres->runtime.tv_sec) + ((double)wpres->runtime.tv_usec) / 1000000.0; + if (wpres->early_timeout) { + if (oj->service_description) { + logit(NSLOG_RUNTIME_WARNING, TRUE, "Warning: Notifying contact '%s' of service '%s' on host '%s' by command '%s' timed out after %.2f seconds\n", + oj->contact_name, oj->service_description, + oj->host_name, wpres->command, runtime); + } else { + logit(NSLOG_RUNTIME_WARNING, TRUE, "Warning: Notifying contact '%s' of host '%s' by command '%s' timed out after %.2f seconds\n", + oj->contact_name, oj->host_name, + wpres->command, runtime); + } + } + } + free(oj); +} /* notify contacts about a service problem or recovery */ int service_notification(service *svc, int type, char *not_author, char *not_data, int options) { @@ -798,7 +822,11 @@ int notify_contact_of_service(nagios_macros *mac, contact *cntct, service *svc, } /* run the notification command */ - wproc_notify(cntct->name, svc->host_name, svc->description, processed_command, mac); + struct notification_job *oj = calloc(1, sizeof(struct notification_job)); + oj->contact_name = cntct->name; + oj->host_name = svc->host_name; + oj->service_description = svc->description; + wproc_run_callback(processed_command, notification_timeout, notification_callback, (void*)oj, mac); /* free memory */ my_free(command_name); @@ -1692,7 +1720,11 @@ int notify_contact_of_host(nagios_macros *mac, contact *cntct, host *hst, int ty } /* run the notification command */ - wproc_notify(cntct->name, hst->name, NULL, processed_command, mac); + struct notification_job *oj = calloc(1, sizeof(struct notification_job)); + oj->contact_name = cntct->name; + oj->host_name = hst->name; + oj->service_description = NULL; + wproc_run_callback(processed_command, notification_timeout, notification_callback, (void*)oj, mac); /* @todo Handle nebmod stuff when getting results from workers */ diff --git a/base/sehandlers.c b/base/sehandlers.c index c51d904..8df5f25 100644 --- a/base/sehandlers.c +++ b/base/sehandlers.c @@ -39,6 +39,18 @@ /************* OBSESSIVE COMPULSIVE HANDLER FUNCTIONS *************/ /******************************************************************/ +static void ocsp_complete_callback(struct wproc_result *wpres, void *data, int flags) { + service *svc = (service *)data; + if(wpres) { + if (wpres->early_timeout) { + double runtime = ((double)wpres->runtime.tv_sec) + ((double)wpres->runtime.tv_usec) / 1000000.0; + logit(NSLOG_RUNTIME_WARNING, TRUE, "Warning: OCSP command '%s' for service '%s' on host '%s' timed out after %.2f seconds\n", + wpres->command, svc->description, svc->host_name, + runtime); + } + } +} + /* handles service check results in an obsessive compulsive manner... */ int obsessive_compulsive_service_check_processor(service *svc) { @@ -90,7 +102,7 @@ int obsessive_compulsive_service_check_processor(service *svc) { log_debug_info(DEBUGL_CHECKS, 2, "Processed obsessive compulsive service processor command line: %s\n", processed_command); /* run the command through a worker */ - wproc_run_service_job(WPJOB_OCSP, ocsp_timeout, svc, processed_command, &mac); + wproc_run_callback(processed_command, ocsp_timeout, ocsp_complete_callback, (void*)svc, &mac); /* free memory */ clear_volatile_macros_r(&mac); @@ -100,6 +112,16 @@ int obsessive_compulsive_service_check_processor(service *svc) { } +static void ochp_complete_callback(struct wproc_result *wpres, void *data, int flags) { + host *hst = (host *)data; + if(wpres) { + if (wpres->early_timeout) { + double runtime = ((double)wpres->runtime.tv_sec) + ((double)wpres->runtime.tv_usec) / 1000000.0; + logit(NSLOG_RUNTIME_WARNING, TRUE, "Warning: OCHP command '%s' for host '%s' timed out after %.2f seconds\n", + wpres->command, hst->name, runtime); + } + } +} /* handles host check results in an obsessive compulsive manner... */ int obsessive_compulsive_host_check_processor(host *hst) { @@ -145,7 +167,7 @@ int obsessive_compulsive_host_check_processor(host *hst) { log_debug_info(DEBUGL_CHECKS, 2, "Processed obsessive compulsive host processor command line: %s\n", processed_command); /* run the command through a worker */ - wproc_run_host_job(WPJOB_OCHP, ochp_timeout, hst, processed_command, &mac); + wproc_run_callback(processed_command, ochp_timeout, ochp_complete_callback, hst, &mac); /* free memory */ clear_volatile_macros_r(&mac); @@ -161,6 +183,17 @@ int obsessive_compulsive_host_check_processor(host *hst) { /**************** SERVICE EVENT HANDLER FUNCTIONS *****************/ /******************************************************************/ +static void evthandler_complete_with_msg(struct wproc_result *wpres, void *data, int flags) { + char *msg = (char *)data; + if(wpres) { + if (wpres->early_timeout) { + double runtime = ((double)wpres->runtime.tv_sec) + ((double)wpres->runtime.tv_usec) / 1000000.0; + logit(NSLOG_EVENT_HANDLER | NSLOG_RUNTIME_WARNING, TRUE, msg, + wpres->command, runtime); + } + } +} + /* handles changes in the state of a service */ int handle_service_event(service *svc) { @@ -212,7 +245,6 @@ int run_global_service_event_handler(nagios_macros *mac, service *svc) { char *raw_logentry = NULL; char *processed_logentry = NULL; char *command_output = NULL; - int early_timeout = FALSE; double exectime = 0.0; int result = 0; #ifdef USE_EVENT_BROKER @@ -269,7 +301,7 @@ int run_global_service_event_handler(nagios_macros *mac, service *svc) { /* send event data to broker */ end_time.tv_sec = 0L; end_time.tv_usec = 0L; - neb_result = broker_event_handler(NEBTYPE_EVENTHANDLER_START, NEBFLAG_NONE, NEBATTR_NONE, GLOBAL_SERVICE_EVENTHANDLER, (void *)svc, svc->current_state, svc->state_type, start_time, end_time, exectime, event_handler_timeout, early_timeout, result, global_service_event_handler, processed_command, NULL, NULL); + neb_result = broker_event_handler(NEBTYPE_EVENTHANDLER_START, NEBFLAG_NONE, NEBATTR_NONE, GLOBAL_SERVICE_EVENTHANDLER, (void *)svc, svc->current_state, svc->state_type, start_time, end_time, exectime, event_handler_timeout, FALSE, result, global_service_event_handler, processed_command, NULL, NULL); /* neb module wants to override (or cancel) the event handler - perhaps it will run the eventhandler itself */ if(neb_result == NEBERROR_CALLBACKOVERRIDE) { @@ -281,21 +313,15 @@ int run_global_service_event_handler(nagios_macros *mac, service *svc) { #endif /* run the command through a worker */ - /* XXX FIXME make base/workers.c handle the eventbroker stuff below */ - result = wproc_run(WPJOB_GLOBAL_SVC_EVTHANDLER, processed_command, event_handler_timeout, mac); - - /* check to see if the event handler timed out */ - if(early_timeout == TRUE) - logit(NSLOG_EVENT_HANDLER | NSLOG_RUNTIME_WARNING, TRUE, "Warning: Global service event handler command '%s' timed out after %d seconds\n", processed_command, event_handler_timeout); + result = wproc_run_callback(processed_command, event_handler_timeout, evthandler_complete_with_msg, + "Warning: Global service event handler command '%s' timed out after %.2f seconds\n" , mac); #ifdef USE_EVENT_BROKER /* get end time */ gettimeofday(&end_time, NULL); -#endif -#ifdef USE_EVENT_BROKER /* send event data to broker */ - broker_event_handler(NEBTYPE_EVENTHANDLER_END, NEBFLAG_NONE, NEBATTR_NONE, GLOBAL_SERVICE_EVENTHANDLER, (void *)svc, svc->current_state, svc->state_type, start_time, end_time, exectime, event_handler_timeout, early_timeout, result, global_service_event_handler, processed_command, command_output, NULL); + broker_event_handler(NEBTYPE_EVENTHANDLER_END, NEBFLAG_NONE, NEBATTR_NONE, GLOBAL_SERVICE_EVENTHANDLER, (void *)svc, svc->current_state, svc->state_type, start_time, end_time, exectime, event_handler_timeout, FALSE, result, global_service_event_handler, processed_command, command_output, NULL); #endif /* free memory */ @@ -316,7 +342,6 @@ int run_service_event_handler(nagios_macros *mac, service *svc) { char *raw_logentry = NULL; char *processed_logentry = NULL; char *command_output = NULL; - int early_timeout = FALSE; double exectime = 0.0; int result = 0; #ifdef USE_EVENT_BROKER @@ -369,7 +394,7 @@ int run_service_event_handler(nagios_macros *mac, service *svc) { /* send event data to broker */ end_time.tv_sec = 0L; end_time.tv_usec = 0L; - neb_result = broker_event_handler(NEBTYPE_EVENTHANDLER_START, NEBFLAG_NONE, NEBATTR_NONE, SERVICE_EVENTHANDLER, (void *)svc, svc->current_state, svc->state_type, start_time, end_time, exectime, event_handler_timeout, early_timeout, result, svc->event_handler, processed_command, NULL, NULL); + neb_result = broker_event_handler(NEBTYPE_EVENTHANDLER_START, NEBFLAG_NONE, NEBATTR_NONE, SERVICE_EVENTHANDLER, (void *)svc, svc->current_state, svc->state_type, start_time, end_time, exectime, event_handler_timeout, FALSE, result, svc->event_handler, processed_command, NULL, NULL); /* neb module wants to override (or cancel) the event handler - perhaps it will run the eventhandler itself */ if(neb_result == NEBERROR_CALLBACKOVERRIDE) { @@ -381,21 +406,15 @@ int run_service_event_handler(nagios_macros *mac, service *svc) { #endif /* run the command through a worker */ - /* XXX FIXME make base/workers.c handle the eventbroker stuff below */ - result = wproc_run(WPJOB_SVC_EVTHANDLER, processed_command, event_handler_timeout, mac); - - /* check to see if the event handler timed out */ - if(early_timeout == TRUE) - logit(NSLOG_EVENT_HANDLER | NSLOG_RUNTIME_WARNING, TRUE, "Warning: Service event handler command '%s' timed out after %d seconds\n", processed_command, event_handler_timeout); + result = wproc_run_callback(processed_command, event_handler_timeout, evthandler_complete_with_msg, + "Warning: Service event handler command '%s' timed out after %.2f seconds\n" , mac); #ifdef USE_EVENT_BROKER /* get end time */ gettimeofday(&end_time, NULL); -#endif -#ifdef USE_EVENT_BROKER /* send event data to broker */ - broker_event_handler(NEBTYPE_EVENTHANDLER_END, NEBFLAG_NONE, NEBATTR_NONE, SERVICE_EVENTHANDLER, (void *)svc, svc->current_state, svc->state_type, start_time, end_time, exectime, event_handler_timeout, early_timeout, result, svc->event_handler, processed_command, command_output, NULL); + broker_event_handler(NEBTYPE_EVENTHANDLER_END, NEBFLAG_NONE, NEBATTR_NONE, SERVICE_EVENTHANDLER, (void *)svc, svc->current_state, svc->state_type, start_time, end_time, exectime, event_handler_timeout, FALSE, result, svc->event_handler, processed_command, command_output, NULL); #endif /* free memory */ @@ -457,7 +476,6 @@ int run_global_host_event_handler(nagios_macros *mac, host *hst) { char *raw_logentry = NULL; char *processed_logentry = NULL; char *command_output = NULL; - int early_timeout = FALSE; double exectime = 0.0; int result = 0; #ifdef USE_EVENT_BROKER @@ -513,7 +531,7 @@ int run_global_host_event_handler(nagios_macros *mac, host *hst) { /* send event data to broker */ end_time.tv_sec = 0L; end_time.tv_usec = 0L; - neb_result = broker_event_handler(NEBTYPE_EVENTHANDLER_START, NEBFLAG_NONE, NEBATTR_NONE, GLOBAL_HOST_EVENTHANDLER, (void *)hst, hst->current_state, hst->state_type, start_time, end_time, exectime, event_handler_timeout, early_timeout, result, global_host_event_handler, processed_command, NULL, NULL); + neb_result = broker_event_handler(NEBTYPE_EVENTHANDLER_START, NEBFLAG_NONE, NEBATTR_NONE, GLOBAL_HOST_EVENTHANDLER, (void *)hst, hst->current_state, hst->state_type, start_time, end_time, exectime, event_handler_timeout, FALSE, result, global_host_event_handler, processed_command, NULL, NULL); /* neb module wants to override (or cancel) the event handler - perhaps it will run the eventhandler itself */ if(neb_result == NEBERROR_CALLBACKOVERRIDE) { @@ -525,21 +543,15 @@ int run_global_host_event_handler(nagios_macros *mac, host *hst) { #endif /* run the command through a worker */ - /* XXX FIXME make base/workers.c handle the eventbroker stuff below */ - wproc_run(WPJOB_GLOBAL_HOST_EVTHANDLER, processed_command, event_handler_timeout, mac); - - /* check for a timeout in the execution of the event handler command */ - if(early_timeout == TRUE) - logit(NSLOG_EVENT_HANDLER | NSLOG_RUNTIME_WARNING, TRUE, "Warning: Global host event handler command '%s' timed out after %d seconds\n", processed_command, event_handler_timeout); + result = wproc_run_callback(processed_command, event_handler_timeout, evthandler_complete_with_msg, + "Warning: Global host event handler command '%s' timed out after %.2f seconds\n" , mac); #ifdef USE_EVENT_BROKER /* get end time */ gettimeofday(&end_time, NULL); -#endif -#ifdef USE_EVENT_BROKER /* send event data to broker */ - broker_event_handler(NEBTYPE_EVENTHANDLER_END, NEBFLAG_NONE, NEBATTR_NONE, GLOBAL_HOST_EVENTHANDLER, (void *)hst, hst->current_state, hst->state_type, start_time, end_time, exectime, event_handler_timeout, early_timeout, result, global_host_event_handler, processed_command, command_output, NULL); + broker_event_handler(NEBTYPE_EVENTHANDLER_END, NEBFLAG_NONE, NEBATTR_NONE, GLOBAL_HOST_EVENTHANDLER, (void *)hst, hst->current_state, hst->state_type, start_time, end_time, exectime, event_handler_timeout, FALSE, result, global_host_event_handler, processed_command, command_output, NULL); #endif /* free memory */ @@ -559,7 +571,6 @@ int run_host_event_handler(nagios_macros *mac, host *hst) { char *raw_logentry = NULL; char *processed_logentry = NULL; char *command_output = NULL; - int early_timeout = FALSE; double exectime = 0.0; int result = 0; #ifdef USE_EVENT_BROKER @@ -611,7 +622,7 @@ int run_host_event_handler(nagios_macros *mac, host *hst) { /* send event data to broker */ end_time.tv_sec = 0L; end_time.tv_usec = 0L; - neb_result = broker_event_handler(NEBTYPE_EVENTHANDLER_START, NEBFLAG_NONE, NEBATTR_NONE, HOST_EVENTHANDLER, (void *)hst, hst->current_state, hst->state_type, start_time, end_time, exectime, event_handler_timeout, early_timeout, result, hst->event_handler, processed_command, NULL, NULL); + neb_result = broker_event_handler(NEBTYPE_EVENTHANDLER_START, NEBFLAG_NONE, NEBATTR_NONE, HOST_EVENTHANDLER, (void *)hst, hst->current_state, hst->state_type, start_time, end_time, exectime, event_handler_timeout, FALSE, result, hst->event_handler, processed_command, NULL, NULL); /* neb module wants to override (or cancel) the event handler - perhaps it will run the eventhandler itself */ if(neb_result == NEBERROR_CALLBACKOVERRIDE) { @@ -623,20 +634,15 @@ int run_host_event_handler(nagios_macros *mac, host *hst) { #endif /* run the command through a worker */ - result = wproc_run(WPJOB_HOST_EVTHANDLER, processed_command, event_handler_timeout, mac); - - /* check to see if the event handler timed out */ - if(early_timeout == TRUE) - logit(NSLOG_EVENT_HANDLER | NSLOG_RUNTIME_WARNING, TRUE, "Warning: Host event handler command '%s' timed out after %d seconds\n", processed_command, event_handler_timeout); + result = wproc_run_callback(processed_command, event_handler_timeout, evthandler_complete_with_msg, + "Warning: Host event handler command '%s' timed out after %.2f seconds\n" , mac); #ifdef USE_EVENT_BROKER /* get end time */ gettimeofday(&end_time, NULL); -#endif -#ifdef USE_EVENT_BROKER /* send event data to broker */ - broker_event_handler(NEBTYPE_EVENTHANDLER_END, NEBFLAG_NONE, NEBATTR_NONE, HOST_EVENTHANDLER, (void *)hst, hst->current_state, hst->state_type, start_time, end_time, exectime, event_handler_timeout, early_timeout, result, hst->event_handler, processed_command, command_output, NULL); + broker_event_handler(NEBTYPE_EVENTHANDLER_END, NEBFLAG_NONE, NEBATTR_NONE, HOST_EVENTHANDLER, (void *)hst, hst->current_state, hst->state_type, start_time, end_time, exectime, event_handler_timeout, FALSE, result, hst->event_handler, processed_command, command_output, NULL); #endif /* free memory */ diff --git a/base/workers.c b/base/workers.c index 881e434..a6db4f5 100644 --- a/base/workers.c +++ b/base/workers.c @@ -19,10 +19,10 @@ struct wproc_worker; struct wproc_job { unsigned int id; - unsigned int type; unsigned int timeout; char *command; - void *arg; + void (*callback)(struct wproc_result *, void *, int); + void *data; struct wproc_worker *wp; }; @@ -52,40 +52,9 @@ static struct wproc_list workers = {0, 0, NULL}; static dkhash_table *specialized_workers; static struct wproc_list *to_remove = NULL; -typedef struct wproc_callback_job { - void *data; - void (*callback)(struct wproc_result *, void *, int); -} wproc_callback_job; - -typedef struct wproc_object_job { - char *contact_name; - char *host_name; - char *service_description; -} wproc_object_job; - unsigned int wproc_num_workers_online = 0, wproc_num_workers_desired = 0; unsigned int wproc_num_workers_spawned = 0; -#define tv2float(tv) ((float)((tv)->tv_sec) + ((float)(tv)->tv_usec) / 1000000.0) - -static const char *wpjob_type_name(unsigned int type) -{ - switch (type) { - case WPJOB_CHECK: return "CHECK"; - case WPJOB_NOTIFY: return "NOTIFY"; - case WPJOB_OCSP: return "OCSP"; - case WPJOB_OCHP: return "OCHP"; - case WPJOB_GLOBAL_SVC_EVTHANDLER: return "GLOBAL SERVICE EVENTHANDLER"; - case WPJOB_SVC_EVTHANDLER: return "SERVICE EVENTHANDLER"; - case WPJOB_GLOBAL_HOST_EVTHANDLER: return "GLOBAL HOST EVENTHANDLER"; - case WPJOB_HOST_EVTHANDLER: return "HOST EVENTHANDLER"; - case WPJOB_CALLBACK: return "CALLBACK"; - case WPJOB_HOST_PERFDATA: return "HOST PERFDATA"; - case WPJOB_SVC_PERFDATA: return "SERVICE PERFDATA"; - } - return "UNKNOWN"; -} - static void wproc_logdump_buffer(int level, int show, const char *prefix, char *buf) { char *ptr, *eol; @@ -223,7 +192,7 @@ static struct wproc_worker *get_worker(const char *cmd) return wp_list->wps[wp_list->idx++ % wp_list->len]; } -static struct wproc_job *create_job(int type, void *arg, time_t timeout, const char *cmd) +static struct wproc_job *create_job(void (*callback)(struct wproc_result *, void *, int), void *data, time_t timeout, const char *cmd) { struct wproc_job *job; struct wproc_worker *wp; @@ -240,8 +209,8 @@ static struct wproc_job *create_job(int type, void *arg, time_t timeout, const c job->wp = wp; job->id = get_job_id(wp); - job->type = type; - job->arg = arg; + job->callback = callback; + job->data = data; job->timeout = timeout; if (fanout_add(wp->jobs, job->id, job) < 0 || !(job->command = strdup(cmd))) { free(job); @@ -253,16 +222,10 @@ static struct wproc_job *create_job(int type, void *arg, time_t timeout, const c static void run_job_callback(struct wproc_job *job, struct wproc_result *wpres, int val) { - wproc_callback_job *cj; - - if (!job || !job->arg) + if (!job || !job->callback) return; - cj = (struct wproc_callback_job *)job->arg; - - if (!cj->callback) - return; - cj->callback(wpres, cj->data, val); - cj->callback = NULL; + job->callback(wpres, job->data, val); + job->callback = NULL; } static void destroy_job(struct wproc_job *job) @@ -270,31 +233,8 @@ static void destroy_job(struct wproc_job *job) if (!job) return; - switch (job->type) { - case WPJOB_CHECK: - free_check_result(job->arg); - free(job->arg); - break; - case WPJOB_NOTIFY: - case WPJOB_OCSP: - case WPJOB_OCHP: - free(job->arg); - break; - - case WPJOB_GLOBAL_SVC_EVTHANDLER: - case WPJOB_SVC_EVTHANDLER: - case WPJOB_GLOBAL_HOST_EVTHANDLER: - case WPJOB_HOST_EVTHANDLER: - /* these require nothing special */ - break; - case WPJOB_CALLBACK: - /* call with NULL result to make callback clean things up */ - run_job_callback(job, NULL, 0); - break; - default: - logit(NSLOG_RUNTIME_WARNING, TRUE, "wproc: Unknown job type: %d\n", job->type); - break; - } + /* Run the callback once, but with wpres=NULL if the job is destroyed */ + run_job_callback(job, NULL, 0); my_free(job->command); if (job->wp) { @@ -433,41 +373,6 @@ static int str2timeval(char *str, struct timeval *tv) return 0; } -static int handle_worker_check(wproc_result *wpres, struct wproc_worker *wp, struct wproc_job *job) -{ - int result = ERROR; - check_result *cr = (check_result *)job->arg; - - memcpy(&cr->rusage, &wpres->rusage, sizeof(wpres->rusage)); - cr->start_time.tv_sec = wpres->start.tv_sec; - cr->start_time.tv_usec = wpres->start.tv_usec; - cr->finish_time.tv_sec = wpres->stop.tv_sec; - cr->finish_time.tv_usec = wpres->stop.tv_usec; - if (WIFEXITED(wpres->wait_status)) { - cr->return_code = WEXITSTATUS(wpres->wait_status); - } else { - cr->return_code = STATE_UNKNOWN; - } - - if (wpres->outstd && *wpres->outstd) { - cr->output = strdup(wpres->outstd); - } else if (wpres->outerr) { - asprintf(&cr->output, "(No output on stdout) stderr: %s", wpres->outerr); - } else { - cr->output = NULL; - } - - cr->early_timeout = wpres->early_timeout; - cr->exited_ok = wpres->exited_ok; - cr->engine = NULL; - cr->source = wp->name; - - process_check_result(cr); - free_check_result(cr); - - return result; -} - /* * parses a worker result. We do no strdup()'s here, so when * kvv is destroyed, all references to strings will become @@ -493,7 +398,7 @@ static int parse_worker_result(wproc_result *wpres, struct kvvec *kvv) wpres->job_id = atoi(value); break; case WPRES_type: - wpres->type = atoi(value); + /* FIXME: ignored */ break; case WPRES_command: wpres->command = value; @@ -587,7 +492,6 @@ static void fo_reassign_wproc_job(void *job_) static int handle_worker_result(int sd, int events, void *arg) { - wproc_object_job *oj = NULL; char *buf, *error_reason = NULL; unsigned long size; int ret; @@ -650,12 +554,6 @@ static int handle_worker_result(int sd, int events, void *arg) wpres.job_id, wp->name); continue; } - if (wpres.type != job->type) { - logit(NSLOG_RUNTIME_WARNING, TRUE, "wproc: %s claims job %d is type %d, but we think it's type %d\n", - wp->name, job->id, wpres.type, job->type); - break; - } - oj = (wproc_object_job *)job->arg; /* * ETIME ("Timer expired") doesn't really happen @@ -674,24 +572,11 @@ static int handle_worker_result(int sd, int events, void *arg) WCOREDUMP(wpres.wait_status) ? " (core dumped)" : "", tv_delta_f(&wpres.start, &wpres.stop)); } - else if (job->type != WPJOB_CHECK && WEXITSTATUS(wpres.wait_status) != 0) { - asprintf(&error_reason, "is a non-check helper but exited with return code %d", - WEXITSTATUS(wpres.wait_status)); - } + if (error_reason) { - logit(NSLOG_RUNTIME_ERROR, TRUE, "wproc: %s job %d from worker %s %s", - wpjob_type_name(job->type), job->id, wp->name, error_reason); + logit(NSLOG_RUNTIME_ERROR, TRUE, "wproc: job %d from worker %s %s", + job->id, wp->name, error_reason); logit(NSLOG_RUNTIME_ERROR, TRUE, "wproc: command: %s\n", job->command); - if (job->type != WPJOB_CHECK && oj) { - logit(NSLOG_RUNTIME_ERROR, TRUE, "wproc: host=%s; service=%s; contact=%s\n", - oj->host_name ? oj->host_name : "(none)", - oj->service_description ? oj->service_description : "(none)", - oj->contact_name ? oj->contact_name : "(none)"); - } else if (oj) { - struct check_result *cr = (struct check_result *)job->arg; - logit(NSLOG_RUNTIME_ERROR, TRUE, "wproc: host=%s; service=%s;\n", - cr->host_name, cr->service_description); - } logit(NSLOG_RUNTIME_ERROR, TRUE, "wproc: early_timeout=%d; exited_ok=%d; wait_status=%d; error_code=%d;\n", wpres.early_timeout, wpres.exited_ok, wpres.wait_status, wpres.error_code); wproc_logdump_buffer(NSLOG_RUNTIME_ERROR, TRUE, "wproc: stderr", wpres.outerr); @@ -699,74 +584,9 @@ static int handle_worker_result(int sd, int events, void *arg) } my_free(error_reason); - switch (job->type) { - case WPJOB_CHECK: - ret = handle_worker_check(&wpres, wp, job); - break; - case WPJOB_NOTIFY: - if (wpres.early_timeout) { - if (oj->service_description) { - logit(NSLOG_RUNTIME_WARNING, TRUE, "Warning: Notifying contact '%s' of service '%s' on host '%s' by command '%s' timed out after %.2f seconds\n", - oj->contact_name, oj->service_description, - oj->host_name, job->command, - tv2float(&wpres.runtime)); - } else { - logit(NSLOG_RUNTIME_WARNING, TRUE, "Warning: Notifying contact '%s' of host '%s' by command '%s' timed out after %.2f seconds\n", - oj->contact_name, oj->host_name, - job->command, tv2float(&wpres.runtime)); - } - } - break; - case WPJOB_OCSP: - if (wpres.early_timeout) { - logit(NSLOG_RUNTIME_WARNING, TRUE, "Warning: OCSP command '%s' for service '%s' on host '%s' timed out after %.2f seconds\n", - job->command, oj->service_description, oj->host_name, - tv2float(&wpres.runtime)); - } - break; - case WPJOB_OCHP: - if (wpres.early_timeout) { - logit(NSLOG_RUNTIME_WARNING, TRUE, "Warning: OCHP command '%s' for host '%s' timed out after %.2f seconds\n", - job->command, oj->host_name, tv2float(&wpres.runtime)); - } - break; - case WPJOB_GLOBAL_SVC_EVTHANDLER: - if (wpres.early_timeout) { - logit(NSLOG_EVENT_HANDLER | NSLOG_RUNTIME_WARNING, TRUE, - "Warning: Global service event handler command '%s' timed out after %.2f seconds\n", - job->command, tv2float(&wpres.runtime)); - } - break; - case WPJOB_SVC_EVTHANDLER: - if (wpres.early_timeout) { - logit(NSLOG_EVENT_HANDLER | NSLOG_RUNTIME_WARNING, TRUE, - "Warning: Service event handler command '%s' timed out after %.2f seconds\n", - job->command, tv2float(&wpres.runtime)); - } - break; - case WPJOB_GLOBAL_HOST_EVTHANDLER: - if (wpres.early_timeout) { - logit(NSLOG_EVENT_HANDLER | NSLOG_RUNTIME_WARNING, TRUE, - "Warning: Global host event handler command '%s' timed out after %.2f seconds\n", - job->command, tv2float(&wpres.runtime)); - } - break; - case WPJOB_HOST_EVTHANDLER: - if (wpres.early_timeout) { - logit(NSLOG_EVENT_HANDLER | NSLOG_RUNTIME_WARNING, TRUE, - "Warning: Host event handler command '%s' timed out after %.2f seconds\n", - job->command, tv2float(&wpres.runtime)); - } - break; + /* Run the callback with the wpres struct as argument */ + run_job_callback(job, &wpres, 0); - case WPJOB_CALLBACK: - run_job_callback(job, &wpres, 0); - break; - - default: - logit(NSLOG_RUNTIME_WARNING, TRUE, "Worker %d: Unknown jobtype: %d\n", wp->pid, job->type); - break; - } destroy_job(job); } @@ -989,7 +809,7 @@ static int wproc_run_job(struct wproc_job *job, nagios_macros *mac) return ERROR; kvvec_addkv(&kvv, "job_id", (char *)mkstr("%d", job->id)); - kvvec_addkv(&kvv, "type", (char *)mkstr("%d", job->type)); + kvvec_addkv(&kvv, "type", "0"); /* Dummy field for compatibility */ kvvec_addkv(&kvv, "command", job->command); kvvec_addkv(&kvv, "timeout", (char *)mkstr("%u", job->timeout)); kvvb = build_kvvec_buf(&kvv); @@ -1010,94 +830,11 @@ static int wproc_run_job(struct wproc_job *job, nagios_macros *mac) return result; } -static wproc_object_job *create_object_job(char *cname, char *hname, char *sdesc) -{ - wproc_object_job *oj; - - oj = calloc(1, sizeof(*oj)); - if (oj) { - oj->host_name = hname; - if (cname) - oj->contact_name = cname; - if (sdesc) - oj->service_description = sdesc; - } - - return oj; -} - -int wproc_notify(char *cname, char *hname, char *sdesc, char *cmd, nagios_macros *mac) -{ - struct wproc_job *job; - wproc_object_job *oj; - - if (!(oj = create_object_job(cname, hname, sdesc))) - return ERROR; - - job = create_job(WPJOB_NOTIFY, oj, notification_timeout, cmd); - - return wproc_run_job(job, mac); -} - -int wproc_run_service_job(int jtype, int timeout, service *svc, char *cmd, nagios_macros *mac) -{ - struct wproc_job *job; - wproc_object_job *oj; - - if (!(oj = create_object_job(NULL, svc->host_name, svc->description))) - return ERROR; - - job = create_job(jtype, oj, timeout, cmd); - - return wproc_run_job(job, mac); -} - -int wproc_run_host_job(int jtype, int timeout, host *hst, char *cmd, nagios_macros *mac) -{ - struct wproc_job *job; - wproc_object_job *oj; - - if (!(oj = create_object_job(NULL, hst->name, NULL))) - return ERROR; - - job = create_job(jtype, oj, timeout, cmd); - - return wproc_run_job(job, mac); -} - -int wproc_run_check(check_result *cr, char *cmd, nagios_macros *mac) -{ - struct wproc_job *job; - int timeout; - - if (cr->service_description) - timeout = service_check_timeout; - else - timeout = host_check_timeout; - - job = create_job(WPJOB_CHECK, cr, timeout, cmd); - return wproc_run_job(job, mac); -} - -int wproc_run(int jtype, char *cmd, int timeout, nagios_macros *mac) -{ - struct wproc_job *job; - - job = create_job(jtype, NULL, timeout, cmd); - return wproc_run_job(job, mac); -} - -int wproc_run_callback(char *cmd, int timeout, +int wproc_run_callback(char *command_line, int timeout, void (*cb)(struct wproc_result *, void *, int), void *data, nagios_macros *mac) { struct wproc_job *job; - struct wproc_callback_job *cj; - if (!(cj = calloc(1, sizeof(*cj)))) - return ERROR; - cj->callback = cb; - cj->data = data; - - job = create_job(WPJOB_CALLBACK, cj, timeout, cmd); + job = create_job(cb, data, timeout, command_line); return wproc_run_job(job, mac); } diff --git a/include/workers.h b/include/workers.h index b11229b..cf28287 100644 --- a/include/workers.h +++ b/include/workers.h @@ -2,28 +2,15 @@ #define INCLUDE_workers_h__ #include "lib/libnagios.h" #include "lib/worker.h" -#include "nagios.h" /* for check_result definition */ - -/* different jobtypes. We add more as needed */ -#define WPJOB_CHECK 0 -#define WPJOB_NOTIFY 1 -#define WPJOB_OCSP 2 -#define WPJOB_OCHP 3 -#define WPJOB_GLOBAL_SVC_EVTHANDLER 4 -#define WPJOB_SVC_EVTHANDLER 5 -#define WPJOB_GLOBAL_HOST_EVTHANDLER 6 -#define WPJOB_HOST_EVTHANDLER 7 -#define WPJOB_CALLBACK 8 -#define WPJOB_HOST_PERFDATA 9 -#define WPJOB_SVC_PERFDATA 10 #define WPROC_FORCE (1 << 0) NAGIOS_BEGIN_DECL; +/* FIXME: Needs to be somewhere better... */ typedef struct wproc_result { unsigned int job_id; - unsigned int type; + unsigned int type; /* not used */ time_t timeout; struct timeval start; struct timeval stop; @@ -49,12 +36,7 @@ extern int wproc_can_spawn(struct load_control *lc); extern void free_worker_memory(int flags); extern int workers_alive(void); extern int init_workers(int desired_workers); -extern int wproc_run_check(check_result *cr, char *cmd, nagios_macros *mac); -extern int wproc_notify(char *cname, char *hname, char *sdesc, char *cmd, nagios_macros *mac); -extern int wproc_run(int job_type, char *cmd, int timeout, nagios_macros *mac); -extern int wproc_run_service_job(int jtype, int timeout, service *svc, char *cmd, nagios_macros *mac); -extern int wproc_run_host_job(int jtype, int timeout, host *hst, char *cmd, nagios_macros *mac); -extern int wproc_run_callback(char *cmt, int timeout, void (*cb)(struct wproc_result *, void *, int), void *data, nagios_macros *mac); +extern int wproc_run_callback(char *command_line, int timeout, void (*cb)(struct wproc_result *, void *, int), void *data, nagios_macros *mac); NAGIOS_END_DECL; #endif diff --git a/xdata/xpddefault.c b/xdata/xpddefault.c index 62bfe56..94bddbf 100644 --- a/xdata/xpddefault.c +++ b/xdata/xpddefault.c @@ -41,6 +41,13 @@ static FILE *service_perfdata_fp = NULL; static int host_perfdata_fd = -1; static int service_perfdata_fd = -1; +/******************************************************************/ +/********************** CHECK COMPLETE CALLBACK *******************/ +/******************************************************************/ + +static void perfdata_command_callback(struct wproc_result *wpres, void *data, int flags) { + /* Dummy method as callback. we don't care about the result for now */ +} /******************************************************************/ /************** INITIALIZATION & CLEANUP FUNCTIONS ****************/ @@ -338,7 +345,7 @@ int xpddefault_run_service_performance_data_command(nagios_macros *mac, service log_debug_info(DEBUGL_PERFDATA, 2, "Processed service performance data command line: %s\n", processed_command_line); /* run the command */ - wproc_run(WPJOB_SVC_PERFDATA, processed_command_line, perfdata_timeout, NULL); + wproc_run_callback(processed_command_line, perfdata_timeout, perfdata_command_callback, NULL, NULL); /* free memory */ my_free(processed_command_line); @@ -379,7 +386,7 @@ int xpddefault_run_host_performance_data_command(nagios_macros *mac, host *hst) log_debug_info(DEBUGL_PERFDATA, 2, "Processed host performance data command line: %s\n", processed_command_line); /* run the command */ - wproc_run(WPJOB_HOST_PERFDATA, processed_command_line, perfdata_timeout, NULL); + wproc_run_callback(processed_command_line, perfdata_timeout, perfdata_command_callback, NULL, NULL); /* free memory */ my_free(processed_command_line); -- 1.7.1 |
From: Andreas E. <ae...@op...> - 2013-08-19 14:16:42
|
On 2013-08-19 14:59, Robin Sonefors wrote: > When a check - host or service - fails to exit properly, it always > becomes CRITICAL. When an active host check times out, it goes into > UNKNOWN. When a service check times out, nothing is done to the state, > which led to my system telling me that a check that timed out was an OK > check. > > This thus sets the state to UNKNOWN when the check didn't exit in time, > because that seems to make more sense and is analogous with what's done > for host checks. > > Still, the whole CRITICAL vs UNKNOWN descibed above makes me a bit less > confident in my fix than I'd like - does this need more work? > Well, it needs to honor "service_check_timeout_state", which is a global variable. Apart from that, it might be nice to save the output, if there is any such, or create the output in base/workers.c where we handle all non-ok helper exit codes anyway. I think the check for jobs of type WPJOB_CHECK already has their own case label, so the effort shouldn't be huge. Saving output will only be useful in very (very) rare cases though, so concatenating it with the non-null plugin_output should work ok. -- Andreas Ericsson and...@op... OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 Considering the successes of the wars on alcohol, poverty, drugs and terror, I think we should give some serious thought to declaring war on peace. |
From: Robin S. <rob...@op...> - 2013-08-19 13:15:34
|
When a check - host or service - fails to exit properly, it always becomes CRITICAL. When an active host check times out, it goes into UNKNOWN. When a service check times out, nothing is done to the state, which led to my system telling me that a check that timed out was an OK check. This thus sets the state to UNKNOWN when the check didn't exit in time, because that seems to make more sense and is analogous with what's done for host checks. Still, the whole CRITICAL vs UNKNOWN descibed above makes me a bit less confident in my fix than I'd like - does this need more work? Signed-off-by: Robin Sonefors <rob...@op...> --- base/checks.c | 1 + 1 file changed, 1 insertion(+) diff --git a/base/checks.c b/base/checks.c index 76a9b7d..c601b9f 100644 --- a/base/checks.c +++ b/base/checks.c @@ -416,6 +416,7 @@ int handle_async_service_check_result(service *temp_service, check_result *queue if(queued_check_result->early_timeout == TRUE) { logit(NSLOG_RUNTIME_WARNING, TRUE, "Warning: Check of service '%s' on host '%s' timed out after %.3fs!\n", temp_service->description, temp_service->host_name, temp_service->execution_time); asprintf(&temp_service->plugin_output, "(Service check timed out after %.2lf seconds)\n", temp_service->execution_time); + temp_service->current_state = STATE_UNKNOWN; } /* if there was some error running the command, just skip it (this shouldn't be happening) */ else if(queued_check_result->exited_ok == FALSE) { -- 1.8.3.1 |
From: Andreas E. <ae...@op...> - 2013-08-08 15:52:34
|
On 2013-08-08 15:17, Ton Voon wrote: > > On 7 Aug 2013, at 08:51, Andreas Ericsson wrote: > >> On 2013-08-06 19:16, Ton Voon wrote: >>> Hi! >>> >>> We've published a list of patches for Nagios 4: >>> http://www.opsview.com/whats-new/blog/opsview-patches-nagios-4 >>> >>> We'd be happy if you could review if these are acceptable for >>> future inclusion or if anyone else finds them useful. >>> >> >> I'd like to get patches with commit messages and proper author and >> signed-off-by info. Since we're using git for Nagios now, it'd go a >> long way in making sure everyone gets credit for the work they've >> done. >> >> The patches also need to apply cleanly to the latest master. >> >> You may want to clone the official Nagios repo, apply your patches >> on top of it and then send me a pull request for github or some >> such. >> >> git clone git://git.code.sf.net/p/nagios/nagios nagios-core >> >> should get you the very latest. If you apply your patches on top >> of 'master' and make sure to always do "git pull --rebase" when you >> want to get the latest and greatest you'll quickly see which >> patches either have been applied or which no longer *can* be >> applied. Then you can create a separate repository on github or >> some such and push the changes there. > > OK, we'll convert them into git changes based off master. > However.... > > I'd like an assurance that the changes will be merged before we > promise to convert. Style changes are fine, won't take much time and > will not require retesting, but if we need to refactor object changes > or make larger changes to logic, this will require retesting on our > side. I'd like a reassurance that the time invested will result in a > merge upstream, otherwise we're just wasting time for all of us. > > For instance, your assistance in the "environment macros per-command" > was greatly appreciated and we've coded to the design agreed in the > email conversations, but it hasn't been merged yet. > It has, but it hasn't. The original patch no longer applies, due to extensive remodeling of the worker code. I still have it hanging around though, and I'm fairly inclined towards using it for services and hosts as well, so that environment macros can be set on a service level as well as on a command level. I've also been investigating the chances of doing this without resorting to setenv() (in essence, building a one-off block of the load-time environment macros, and then extending that whenever we hit an environment variable). But yea, I've got that one already. > So I'd like to reverse the question and ask "which changes are most > likely to be accepted, based on the amount of changes required" and > we'll work through that list in order. > Thread-safe calls It's harmless, and I'd apply it straight away if I had a sensible commit message for it, but I don't know why you need this so I can't really write one. While we're on the topic; Please write commit messages in imperative form and present tense, as if you're giving orders to the code for how it should change. Also give a short rationale for why it should change that way. The rules about not commenting out code you want to get rid of still applies though. It just looks terribly hackish when patches meant for upstream contains things like that. Slice services within hosts So long as it's configurable from cgi.cfg and the default stays the same as it is today I'll apply it immediately. It has no impact on loadable modules or other headaches, so that's a nobrainer, really. Check command by time period I feel this is somewhat lacking in efficiency and flexibility, and a much cleaner solution would be to add a filtering functionality to NERD so that checkresults can be shipped off to a third-party addon which can transform checkresults and plugin output as much as it likes. Failing that (which would be enormously cool but also a lot of work), I think it would most likely be best off as a separate module, with custom variables or separate configuration to support it. Supporting patches to run events when a timeperiod becomes active or inactive would still be welcome, obviously. Escalation via notification levels This is best off written as a module, using custom variables to configure it. If core support is required to block notifications to a particular contact, then such patches will naturally be accepted. The normal NEBCALLBACK_CANCEL return code signalling should work just fine for that. Synchronising state data Pretty invasive for quite a small benefit, and with enormous complexities to deal with to make this work properly. How do you handle reading the synced file while checks are in-flight and awaiting being returned to the mothership? They may have been spawned as a result of a bad checkresult on one node, but if the sync status returns them to OK and the in-flight result sets them as bad again, would that mean starting over on the retry-attempts? I see nothing in the patch that handles situations like that, so I'm forced to believe you haven't considered it. Considering the fact that full reloads now take less than a second for your recommended 250 hosts per system, I'm inclined to drop this patch in favour of the regular retention data file unless you can persuade me otherwise. Dependency failure states The patch is a bit hackish in that it internally submits a passive check to make sure everything gets updated. That's not entirely accurate though, and will cause statistical errors. Rewrite it to just update the states as necessary and make sure notifications and whatnot are sent if they should be. That will also fix the problem of bugging out performance data handling routines that expect to always get the same kind of data from each check every time. It will have to be configurable though. I'm thinking "service_dependency_exec_fail_state=(warning,critical,unknown)" and "service_dependency_exec_fail_output=Dependency failure". I'd be all for changing the default, but for now the exec_fail_state thing will have to be -1, meaning "leave it as-is", and I'll make it default to UNKNOWN after 4.0 is released. The same should go for service parents, btw. Passive notifications respecting renotification interval I'm inclined to making this default. It's not documented anywhere which of those options take precedence. If notification_interval is 0, then every notification should be sent for is_volatile services, and if it's anything else we should respect the interval no matter how many notifications happen to arrive. Make those changes and I'll apply the patch and call it a bugfix. Reducing SQL queries part 1: No. Just no. This is a perfect place where you can use an array sized after num_objects.services *inside the module* and look up the latest version of the processed commandline there. Free it when you update it and you'll be sure to always have it handy. Anyone who doesn't use NDOUtils will otherwise be penalized for its shortcomings. Reducing SQL queries part 2: As above, but num_objects.hosts Conditional debugging: I'm more inclined towards making the debug-logging function a C99-style variadic macro (or a gcc style one), while falling back to using the current structure of a lot of unnecessary calls on too old compilers. All that iffing should really be kept at arms length from the coder, or debug-logging won't be used. I'm thinking dbg(DEBUGL_CHECKS, 0, "** Running async check of foo ... "); so we get it short and sweet while we're at it. FIFO processing of passive checkresults I have several objections to this patch. First off; You're sorting on alphanumerical name only, instead of mtime first and then name in case of a tiebreak. That's just plain dumb and has to be amended if this is to be anywhere near useful. Secondly; This doesn't scale. I spent the better part of last year removing everything that touches the filesystem on a scheduled and frequent basis for those precise reasons. Putting crap like that back in is not something I'll do without a very good reason. Thirdly; It builds on obsolete code which is designated for oblivion, so the code would be (relatively) shortlived anyway. Fourthly; *Nothing* can be built on top of such a patch that doesn't absolutely reek of hackishness and bad design. A much more useful patch would be adding support for external commands to the query handler, which doesn't have the limitations of the named pipe (and can give responses to your requests so you know if they were accepted or not) and use a secondary program to ship in passive results any old way you want to the clean, nice and responsive interface which is meant to be extended for things exactly like this. Tolerant time jumps I'm unsure how useful this is. How often does a system clock go backwards except for daylight savings time? I also worry slightly about the consequences. OTOH, most of the worry comes as a result from the very lax default for time_change_threshold (900 seconds). A time jump of 899 seconds wouldn't trigger a rescheduling, but would wreak serious havoc with the checks. Listing related contact groups The patch needs to be changed so that the (possibly very long) list of contactgroups isn't computed until it's requested instead of before we even start macro substitution. Otherwise we'll be computing that macro for each and every host and service check we execute, which stupid beyond belief. Particularly considering the fact that object config can't change during runtime (yet), so the list could just as well be cached if one would so desire. For bonus points, also add CONTACTLIST, which fetches contacts from both the object directly and the contactgroups assigned. Environment macros per-command I wish you'd stop talking BS on that site of yours. Removal of environment macros has very little to do with the speedups in Nagios 4. Those improvements are the result of good design and proper implementation of good design. Now to technical issues; I have the patch (as I've said) and I do intend to push it as soon as I've had time to polish it and bring it forward to the current "master". It's quite invasive though, and I've been trying to figure out what it is about it that feels wrong. Then I've also forgotten about it, so this reminder was pretty good. Thanks. Allowing semi-colons in commands Well, it's a neat idea, but the patch is so incredibly ugly that I just can't bring myself to accept it. If you did something to keep state and allowed semicolons in strings or as escaped creatures (with a backslash, pretty please; There's enough home-cooked escaping in Nagios already), I'd be willing to accept it. Tilde symbol in commands Absolutely out of the question. This patch deliberately breaks a library to exploit its caller's fallback, which happens to suit your particular needs in this case. No. Nonono. You can just configure your way around it by using command_line /bin/sh -c '$USER1$/plugin --arg=lala' for the command in question, or implement it separately but making it configurable as "expand_tilde_in_commands=<bool>" in nagios.cfg and doing the replacement manually. It's not that hard, really. Fix max_concurrent_checks decrementing Obviously correct, although the patch is incomplete. In case we fail to launch a job, it will have no reference in the scheduling queue but will be marked as "is_executing" which will cause it to not even get freshness checked. I'll amend the patch with everything it needs and apply. Good catch though. Extra external commands This looks clean, correct and useful. I'll need a commit message, an author and a signed-off-by before I can accept it though. Thanks -- Andreas Ericsson and...@op... OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 Considering the successes of the wars on alcohol, poverty, drugs and terror, I think we should give some serious thought to declaring war on peace. |
From: Ton V. <ton...@op...> - 2013-08-08 13:38:55
|
On 7 Aug 2013, at 08:51, Andreas Ericsson wrote: > On 2013-08-06 19:16, Ton Voon wrote: >> Hi! >> >> We've published a list of patches for Nagios 4: >> http://www.opsview.com/whats-new/blog/opsview-patches-nagios-4 >> >> We'd be happy if you could review if these are acceptable for future >> inclusion or if anyone else finds them useful. >> > > I'd like to get patches with commit messages and proper author and > signed-off-by info. Since we're using git for Nagios now, it'd go > a long way in making sure everyone gets credit for the work they've > done. > > The patches also need to apply cleanly to the latest master. > > You may want to clone the official Nagios repo, apply your patches on > top of it and then send me a pull request for github or some such. > > git clone git://git.code.sf.net/p/nagios/nagios nagios-core > > should get you the very latest. If you apply your patches on top of > 'master' and make sure to always do "git pull --rebase" when you want > to get the latest and greatest you'll quickly see which patches either > have been applied or which no longer *can* be applied. Then you can > create a separate repository on github or some such and push the changes > there. OK, we'll convert them into git changes based off master. However.... I'd like an assurance that the changes will be merged before we promise to convert. Style changes are fine, won't take much time and will not require retesting, but if we need to refactor object changes or make larger changes to logic, this will require retesting on our side. I'd like a reassurance that the time invested will result in a merge upstream, otherwise we're just wasting time for all of us. For instance, your assistance in the "environment macros per-command" was greatly appreciated and we've coded to the design agreed in the email conversations, but it hasn't been merged yet. So I'd like to reverse the question and ask "which changes are most likely to be accepted, based on the amount of changes required" and we'll work through that list in order. > > > On a first inspection though; > * Don't comment out code. Bringing back dead code is what the VCS is > for, and keeping it around is just plain dumb. If it shouldn't be in > there, just remove it. In the same vein; Don't add commented-out code. > > * Avoid C++ comments. I know they're supported in C99, which I'm rooting > for as the least version supported, but it's against the current style. > > * Don't mix spaces and tabs for indentation, unless it's continuation- > indentation of a multi-line statement or condition. Stick to the style > in the file you're editing, and *look* at the patch before you send it > somewhere. > > * Avoid comments saying things like "Opsview specific foobar" if you > want to have the patches included. If you *don't* want those patches > included, don't send them to me or point me to where they are. It takes > up a lot of time to remove crap like that, and I have no interest in > cleaning up after anyone else. I'm messy enough as it is on my own. > > * Don't augment objects (such as hosts, services, commands) with new > variables. Doing so means Nagios 4.1, and I can't take patches like > that until Nagios 4 is at least released as stable. All objects have > an 'id' field which means you can look up any extra data you want in > O(1) time, provided you just initialize an array of size > num_objects.$desired_object_type_in_plural before we enter the event > loop. > > * Make patches the most scalable you can. For the "check_time_period" > thing, you'd be far better off adding code to detect timeperiod changes, > notice which timeperiods are used to change commands and make a one-off > swap for all the affected commands as the desired timeperiod comes > either in or out of effect. > > * Don't build on deprecated technology, such as external files for > commands and/or check results |
From: Andreas E. <ae...@op...> - 2013-08-07 07:51:49
|
On 2013-08-06 19:16, Ton Voon wrote: > Hi! > > We've published a list of patches for Nagios 4: > http://www.opsview.com/whats-new/blog/opsview-patches-nagios-4 > > We'd be happy if you could review if these are acceptable for future > inclusion or if anyone else finds them useful. > I'd like to get patches with commit messages and proper author and signed-off-by info. Since we're using git for Nagios now, it'd go a long way in making sure everyone gets credit for the work they've done. The patches also need to apply cleanly to the latest master. You may want to clone the official Nagios repo, apply your patches on top of it and then send me a pull request for github or some such. git clone git://git.code.sf.net/p/nagios/nagios nagios-core should get you the very latest. If you apply your patches on top of 'master' and make sure to always do "git pull --rebase" when you want to get the latest and greatest you'll quickly see which patches either have been applied or which no longer *can* be applied. Then you can create a separate repository on github or some such and push the changes there. On a first inspection though; * Don't comment out code. Bringing back dead code is what the VCS is for, and keeping it around is just plain dumb. If it shouldn't be in there, just remove it. In the same vein; Don't add commented-out code. * Avoid C++ comments. I know they're supported in C99, which I'm rooting for as the least version supported, but it's against the current style. * Don't mix spaces and tabs for indentation, unless it's continuation- indentation of a multi-line statement or condition. Stick to the style in the file you're editing, and *look* at the patch before you send it somewhere. * Avoid comments saying things like "Opsview specific foobar" if you want to have the patches included. If you *don't* want those patches included, don't send them to me or point me to where they are. It takes up a lot of time to remove crap like that, and I have no interest in cleaning up after anyone else. I'm messy enough as it is on my own. * Don't augment objects (such as hosts, services, commands) with new variables. Doing so means Nagios 4.1, and I can't take patches like that until Nagios 4 is at least released as stable. All objects have an 'id' field which means you can look up any extra data you want in O(1) time, provided you just initialize an array of size num_objects.$desired_object_type_in_plural before we enter the event loop. * Make patches the most scalable you can. For the "check_time_period" thing, you'd be far better off adding code to detect timeperiod changes, notice which timeperiods are used to change commands and make a one-off swap for all the affected commands as the desired timeperiod comes either in or out of effect. * Don't build on deprecated technology, such as external files for commands and/or check results. -- Andreas Ericsson and...@op... OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 Considering the successes of the wars on alcohol, poverty, drugs and terror, I think we should give some serious thought to declaring war on peace. |
From: Daniel W. <dwi...@gm...> - 2013-08-07 02:45:02
|
Looks like some good additions. Since I'm in the middle of doing some in-depth performance tuning/comparison between 3.2.3 and 4.0 I might to throw a couple of these on and see how things change. Thanks! Dan On Aug 6, 2013, at 12:16 PM, Ton Voon <ton...@op...> wrote: > Hi! > > We've published a list of patches for Nagios 4: http://www.opsview.com/whats-new/blog/opsview-patches-nagios-4 > > We'd be happy if you could review if these are acceptable for future inclusion or if anyone else finds them useful. > > Ton > > > ------------------------------------------------------------------------------ > Get 100% visibility into Java/.NET code with AppDynamics Lite! > It's a free troubleshooting tool designed for production. > Get down to code-level detail for bottlenecks, with <2% overhead. > Download for free and get started troubleshooting in minutes. > http://pubads.g.doubleclick.net/gampad/clk?id=48897031&iu=/4140/ostg.clktrk > _______________________________________________ > Nagios-devel mailing list > Nag...@li... > https://lists.sourceforge.net/lists/listinfo/nagios-devel |
From: Ton V. <ton...@op...> - 2013-08-06 17:36:38
|
Hi! We've published a list of patches for Nagios 4: http://www.opsview.com/whats-new/blog/opsview-patches-nagios-4 We'd be happy if you could review if these are acceptable for future inclusion or if anyone else finds them useful. Ton |
From: Max S. <max...@op...> - 2013-08-05 12:24:46
|
From: Max Sikström <msi...@op...> Default behaviour is still that everything should be possible to urlencode. So keep it that way. When zeroing clean_option (=flags for which types of escaping the macro is allowed to pass), it should actually be set to url encode, not zero. Signed-off-by: Max Sikström <msi...@op...> --- common/macros.c | 3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/common/macros.c b/common/macros.c index fa22233..e30634a 100644 --- a/common/macros.c +++ b/common/macros.c @@ -407,8 +407,9 @@ int grab_macro_value_r(nagios_macros *mac, char *macro_buffer, char **output, in if(macro_buffer == NULL || free_macro == NULL) return ERROR; + /* Everything (if not stated otherwise) should be possible to urlencode */ if(clean_options) - *clean_options = 0; + *clean_options = URL_ENCODE_MACRO_CHARS; /* * We handle argv and user macros first, since those are by far -- 1.7.1 |