From: Hiren P. <hir...@gm...> - 2010-05-06 06:01:09
On Wed, 05 May 2010 15:18:35 -0600 Eric Schoeller <esc...@us...> wrote:

> I think I've stumbled across the icinga project before, the website looks
> familiar. How does the feature set compare with Ninja/Merlin or the "op5
> monitor appliance" hosted at op5.org?

hi, I've just joined the project. icinga is a fork of nagios 3, hoping to
bring in feature requests and improvements at a faster pace than nagios. I'm
not too familiar with the op5 product, but those implemented as nagios neb
modules will work with icinga, as the core hasn't changed enough yet.

> Your feature list for icinga includes a brief blurb about "redundancy with
> distributed monitoring" ... how exactly is that implemented now? I assume
> that you're looking to integrate DNX with icinga to improve upon what
> method you're already using? Does icinga provide redundancy/load-balancing
> for the master nagios server?

currently icinga core is mostly the nagios 3 core with some bug fixes; no
other major features (aside from the ido database stuff) have gone in. there
are discussions on the devel list about improving core performance and
bringing in distributed monitoring as a core feature (instead of the official
active/passive solution, or by using modules).

> Have you tried using DNX with icinga yet? Depending on your fork of nagios,
> it might just work :)

personally not as yet, but I'm confident that it will work; the core hasn't
changed enough to break neb module support. we like what dnx does, and were
thinking of bringing those features into the core, instead of neb modules, as
the default way for icinga to do checks. if we were to do this, we'd
definitely prefer dnx developers joining in, instead of constantly watching
for developments in dnx and then bringing those back into icinga. that was
the reason I thought I'd see what the dnx developers think. thanks for the
replies thus far.

> Eric Schoeller
>
> William Leibzon wrote:
>> DNX is all open-source and GNU license so you should feel free to
>> integrate it into your package if you believe it to be of interest to
>> your users. But DNX working preferentially with you would imply support
>> for your forked package over nagios, which is not what I expect people
>> would like to see. My opinion of course.
>>
>> On Wed, Apr 28, 2010 at 12:51 PM, Hiren Patel <hir...@gm...> wrote:
>>> would the team be interested in merging features/functions of dnx into
>>> icinga?
From: Eric S. <eri...@co...> - 2010-05-05 21:15:42
I think I've stumbled across the icinga project before, the website looks
familiar. How does the feature set compare with Ninja/Merlin or the "op5
monitor appliance" hosted at op5.org?

Your feature list for icinga includes a brief blurb about "redundancy with
distributed monitoring" ... how exactly is that implemented now? I assume
that you're looking to integrate DNX with icinga to improve upon what method
you're already using? Does icinga provide redundancy/load-balancing for the
master nagios server?

Eric Schoeller

William Leibzon wrote:
> DNX is all open-source and GNU license so you should feel free to
> integrate it into your package if you believe it to be of interest to
> your users. But DNX working preferentially with you would imply
> support for your forked package over nagios which is not what I expect
> people like to see. My opinion of course.
>
> On Wed, Apr 28, 2010 at 12:51 PM, Hiren Patel <hir...@gm...> wrote:
>> would the team be interested in merging features/functions of dnx into
>> icinga?
>>
>> --
>> Hiren Patel <hir...@gm...>
>>
>> _______________________________________________
>> Dnx-devel mailing list
>> Dnx...@li...
>> https://lists.sourceforge.net/lists/listinfo/dnx-devel
From: William L. <wi...@le...> - 2010-05-04 18:29:41
DNX is all open-source and GNU license so you should feel free to integrate
it into your package if you believe it to be of interest to your users. But
DNX working preferentially with you would imply support for your forked
package over nagios which is not what I expect people like to see. My
opinion of course.

On Wed, Apr 28, 2010 at 12:51 PM, Hiren Patel <hir...@gm...> wrote:
> would the team be interested in merging features/functions of dnx into
> icinga?
>
> --
> Hiren Patel <hir...@gm...>
From: Hiren P. <hir...@gm...> - 2010-04-28 19:52:18
would the team be interested in merging features/functions of dnx into
icinga?

--
Hiren Patel <hir...@gm...>
From: John C. <joh...@gm...> - 2010-04-13 21:45:51
Announcing DNX 0.20.1. This is a minor release that fixes a few compile
issues on BSD Unix. Thanks to Eric Cable for helping us sort out the issues.

See the DNX website or project site for download links:

http://dnx.sourceforge.net
http://sourceforge.net/projects/dnx

Enjoy!
John
From: John C. <joh...@gm...> - 2010-04-01 18:09:15
Everyone,

I'm pleased to announce the release of DNX version 0.20. Much time and
effort has gone into making this the most stable and solid DNX release ever.
Check out the DNX web site for more details, and download links:

http://dnx.sourceforge.net

Thanks!
John

DNX 0.20 Release Notes (NEWS and relevant ChangeLog entries):

Version 0.20
============

- Fixed client-side clock skew issue caused by ntpd moving the real-time
  clock backward. Especially prevalent on VMware and other virtual systems.
- Replaced filesystem-based results processing with faster in-memory
  results processing.
- Added new split plugin/server model code.
- Removed cruft from Nagios config.h include file to avoid unnecessary
  build environment requirements.
- Streamlined server-side node-centric statistics for efficiency.
- Fixed descriptor leaks in pfopen code on clients.
- Updated INSTALL document to provide DNX-specific installation
  instructions.
- Added option to logging subsystem to log to file or to syslog.
- Added code to allow transports to track messages in/out. (Haven't
  surfaced this code yet in the interface.)
- Replaced configure's --with-nagios3x command-line option with a new
  Nagios 2.x-specific --with-nagios2x option. The default configuration
  now targets Nagios 3.x.

*** Revision 368: Released dnx-0.20: 01-APR-10

* Revision 368: 01-APR-10 jcalcote
  Updated ChangeLog with release information.
* Revision 367: 30-MAR-10 jcalcote
  Update ChangeLog and NEWS file in preparation for release 0.20.
* Revision 366: 23-MAR-10 jcalcote
  Fix defect in job timer wherein clock skew can cause large job times.
* Revision 365: 22-MAR-10 jcalcote
  Add code to merge nagios and dnx result lists, rather than just assume
  the nagios list is empty.
* Revisions 363-364: 20-MAR-10 jcalcote
  Update AUTHORS file. Implement Jason's direct post to results list during
  reaper event handler algorithm.
* Revision 362: 14-MAR-10 jcalcote
  Only show active node information in server-side stats.
* Revisions 360-361: 09-MAR-10 jcalcote
  Fixed bug in nagios3xPostResult where the queue file was getting named
  incorrectly, and the error path was not correct. Remove the dependency on
  PATH_MAX in plugin and server NebMain files.
* Revisions 358-359: 08-MAR-10 jcalcote
  Modularize server and plugin/server management agent code. Aesthetic
  changes to configure.ac and the client's help output.
* Revisions 356-357: 07-MAR-10 jcalcote
  Fixed missing stat for jobs_rejected_no_nodes in integrated server.
  Reorganize post results for Nagios 3 so that umask is not called by
  secondary threads. Update server unit tests.
* Revisions 353-355: 05-MAR-10 jcalcote
  Fix a few minor stats defects in server and plugin/server. Change echo
  statements to AS_ECHO in configure.ac. Incorporated server stats into
  node stats. Renamed sources from node to stats. Transferred new server
  and node stats module to integrated server. Replaced original stats
  listener with an agent listener as in the plugin/stats. Updated nagios
  README file.
* Revision 352: 04-MAR-10 jcalcote
  Update comments for doxygen.
* Revisions 350-351: 03-MAR-10 jcalcote
  Removed obsolete AC_PROG_RANLIB macro. Removed commented-out cruft in
  Nagios interface config.h files. Updated configure.ac for new system
  calls, types, and headers. Updated INSTALL slightly. Removed autogen.sh -
  use autoreconf now. Updated NEWS file.
* Revision 349: 02-MAR-10 jcalcote
  Cleaned up file pipe/fork management in plugin/dnxNebMain.c.
* Revision 348: 01-MAR-10 jcalcote
  Finished reimplementation of node-based server-side stats. Fixed a minor
  issue with make dist in client.
* Revision 347: 28-FEB-10 jcalcote
  Switched from select to poll in dnxPlugin on client; should allow more
  than 1024 file descriptors to be used.
* Revision 346: 26-FEB-10 jcalcote
  Fixed file descriptor leaks in pfopen.
* Revision 345: 22-FEB-10 jcalcote
  Add more descriptive text to the maxRequestNodes parameter in
  dnxServer.cfg.in.
* Revisions 343-344: 21-FEB-10 jcalcote
  Cleanup nagios interface config.h files. Updated copyrights on all plugin
  and stand-alone server source files. Enhanced plugin results listener for
  robustness.
* Revisions 340-342: 19-FEB-10 jcalcote
  Updated ChangeLog, TODO, INSTALL, and NEWS files. Changed installation
  location of dnxServer to libexec dir; made all requisite surrounding
  changes. Added installation instructions to INSTALL. Fixed bug in plugin
  that caused Nagios to segfault on debug print. Fixed bug in sa server
  that caused server to hang on shutdown.
* Revision 339: 18-FEB-10 jcalcote
  Added stats back into plugin/server. Updated README. Tweaked mockNagios a
  bit. Modified configure command line to automatically build against
  Nagios 3.x, with cmdline option to build against 2.x.
* Revisions 335-338: 17-FEB-10 jcalcote
  Backoff on the client worker thread sleep time - 2 rather than 10
  seconds. Add a delay to mockNagios to ensure that the client knows the
  server is up before running tests. Enhanced comm protocol between plugin
  and server to allow for pre-allocation of node requests with early out
  for no-node conditions. Moved server exec into post-nagios configuration
  stage so the service list would be accurately sized. Added a command line
  option to the server to accept the job queue size - configured the
  server's job queue size accordingly. Enhanced plugin debug messages for
  rejected service checks. Fixed build order. Update build system to
  reflect new products. Enhance doxygen build in doc directory to be
  dependent on installation of doxygen.
* Revisions 332-334: 15-FEB-10 jcalcote
  Clean up doxygen comments in all new sources. Get mockNagios to work with
  nagios 3.x. Many changes to get the split server to work.
* Revision 331: 10-FEB-10 jcalcote
  Enhanced logging a bit more. Updated etc config templates. Added new
  options for 2-part server plugin module.
* Revisions 329-330: 09-FEB-10 jcalcote
  Enhanced logging to send to file or syslog with options. Enhance plugin
  test to read env variables. Fixed test config files. Fixed infinite
  timeout issue in client command loop. Enhancements to plugin test.
* Revisions 326-328: 08-FEB-10 jcalcote
  Updated working version to 0.20. Add ack channels to plugin/server comm.
  Added new split server/plugin architecture - in plugin directory.
From: Jason B. <ja...@ba...> - 2010-03-26 01:23:29
On Mar 26, 2010, at 10:06 AM, Daniel Tuecks wrote:

> Hello Jason,
>
> I see your point. Let me make it more clear: I can totally live with
> your idea of adding workers to all hostgroups. The primary goal should
> be the integration of your affinity patch into dnx. Just wanted to throw
> in some additional ideas :)
>
> But I am wondering: I have two customer DMZs and both contain webservers.
> I want to put two workers in the DMZ of customer "A" and two workers in
> the DMZ of "B". Furthermore I have a NOC team for which I create another
> hostgroup called "webservers". Members of this hostgroup are not hosts,
> but hostgroups. For example:
>
> define hostgroup {
>     hostgroup_name    dmz_custA
>     members           webserver1,webserver2,dnxworker1,dnxworker2
> }
>
> define hostgroup {
>     hostgroup_name    dmz_custB
>     members           webserver3,webserver4,dnxworker3,dnxworker4
> }
>
> define hostgroup {
>     hostgroup_name    webservers
>     hostgroup_members dmz_custA,dmz_custB
> }

Well, the UI and configuration tool I have didn't support groups of groups
(even though I want to use them too), so I didn't have to deal with that
specific issue. I would have to see exactly how nagios treats groups of
groups; if there is some way for me to detect that, I could easily deal
with this condition, although people will also want the exact opposite to
work as well (add a dnxClient just to the group of groups to service all
the sub-hosts). I will look at it next week; I'm focusing on some other
related projects right now.

The truth is, when I started hacking on DNX I was trying to maintain
backwards compatibility, and it constrained a lot of my design decisions.
At some point it didn't look like anyone was testing my patches or
interested in what I was doing, and I went way off the reservation. At this
point the reintegration of my code into the main trunk would be pretty
difficult, and the changes I made modify a lot of proven code, so I really
don't know what the right way is for the project to go at this point.

> What would happen? Could checks for "dmz_custB" be executed by
> "dmz_custA"'s dnxworkers (1+2) as they are combined in the hostgroup
> "webservers"?
>
> I think I could fix this by doing the following (not tested):
>
> define hostgroup {
>     hostgroup_name    dmz_custA_dnxworkers
>     members           dnxworker1,dnxworker2
> }
>
> define hostgroup {
>     hostgroup_name    dmz_custB_dnxworkers
>     members           dnxworker1,dnxworker2
> }
>
> define hostgroup {
>     hostgroup_name    webservers
>     hostgroup_members webservers_dmz_custA,webservers_dmz_custB,!dmz_custA_dnxworkers,!dmz_custB_dnxworkers ; don't include dnx_workers
> }
>
> Unfortunately I have some "overlapping" hostgroups, so defining the set
> of workernodes per host would more suit our configuration style. The
> above example is only a simple one, but when you have multiple
> overlapping groups even the "fix" might be quite challenging to maintain.
>
> Daniel
From: Daniel T. <dan...@t-...> - 2010-03-26 01:06:31
Hello Jason,

I see your point. Let me make it more clear: I can totally live with your
idea of adding workers to all hostgroups. The primary goal should be the
integration of your affinity patch into dnx. Just wanted to throw in some
additional ideas :)

But I am wondering: I have two customer DMZs and both contain webservers. I
want to put two workers in the DMZ of customer "A" and two workers in the
DMZ of "B". Furthermore I have a NOC team for which I create another
hostgroup called "webservers". Members of this hostgroup are not hosts, but
hostgroups. For example:

define hostgroup {
    hostgroup_name    dmz_custA
    members           webserver1,webserver2,dnxworker1,dnxworker2
}

define hostgroup {
    hostgroup_name    dmz_custB
    members           webserver3,webserver4,dnxworker3,dnxworker4
}

define hostgroup {
    hostgroup_name    webservers
    hostgroup_members dmz_custA,dmz_custB
}

What would happen? Could checks for "dmz_custB" be executed by
"dmz_custA"'s dnxworkers (1+2) as they are combined in the hostgroup
"webservers"?

I think I could fix this by doing the following (not tested):

define hostgroup {
    hostgroup_name    dmz_custA_dnxworkers
    members           dnxworker1,dnxworker2
}

define hostgroup {
    hostgroup_name    dmz_custB_dnxworkers
    members           dnxworker1,dnxworker2
}

define hostgroup {
    hostgroup_name    webservers
    hostgroup_members webservers_dmz_custA,webservers_dmz_custB,!dmz_custA_dnxworkers,!dmz_custB_dnxworkers ; don't include dnx_workers
}

Unfortunately I have some "overlapping" hostgroups, so defining the set of
workernodes per host would more suit our configuration style. The above
example is only a simple one, but when you have multiple overlapping groups
even the "fix" might be quite challenging to maintain.

Daniel

On 03/25/2010 05:31 AM, Jason Benguerel wrote:
> I think his issue was that a specific dnxClient wasn't guaranteed to be
> able to see all the hosts in a single hostgroup. I knew there could be
> this possibility, but figured that I could just re-factor my hostgroups
> if it became an issue. It's unfair to force people to do that if I can
> create a convenient host-level configuration as he proposed. I think for
> many the way I originally did it will work well, and it is painless to
> configure, but no reason to not support power users if possible.
>
>> "The DNX workers shouldn't be members of my "Windows Active Directory
>> Servers" group."
>
> The idea that dnxClients are in the hostgroups that they service is a
> feature, as I wanted it to be very clear where to look when there are
> problems, so I kind of disagree with this. As I said, the GUI I have
> makes it clear what clients are servicing a hostgroup via this
> mechanism, so my end users don't do something dumb like powercycle all
> the clients in a hostgroup at the same time and so forth. But, I don't
> want to dictate how people use things.
From: Daniel T. <dan...@t-...> - 2010-03-26 01:02:13
Hi John!

No, unfortunately Nagios doesn't inherit this way. Hostgroups are 'dumb'
and take only definitions of hostgroup_name, alias, and members; you can't
add custom vars here. Inheritance is only achieved via so-called templates.
We could do the following:

define host {
    name        dnxworker_dmz001
    use         standardhost
    _dnxworkers dmz001-1,dmz001-2
    register    0 ; this is a template
}

define host {
    name        dnxworker_dmz002
    use         standardhost
    _dnxworkers dmz002-1,dmz002-2,dmz003-2
    register    0 ; this is a template
}

define host {
    use        dnxworker_dmz001 ; _dnxworkers of dnxworker_dmz001 is inherited
    host_name  some-host
    hostgroups webservers,customer-a,linux
}

define host {
    use        dnxworker_dmz002 ; _dnxworkers of dnxworker_dmz002 is inherited
    host_name  anotherhost
    hostgroups webservers,customer-b,linux
}

This inherits _dnxworkers to all hosts that use the "dnxworker_dmz001"
template (which itself inherits from "standardhost"). That would be the
closest to your idea, I think.

Unfortunately Nagios supports "on-demand macros" (like
$HOSTGROUPMEMBERS:mydnxworkers1$) only when it's executing external
commands/checks, so this is not an option.

Daniel

On 03/25/2010 05:14 AM, John Calcote wrote:
> Hi Daniel,
>
> Just thinking about this feature a little tonight. It won't be added
> till after the 0.20 release of DNX, which will happen soon. This may
> sound strange, but I'm not the Nagios expert I should be. My lack of
> experience is centered around configuration and use of Nagios.
> Nevertheless, I do have some understanding of the way Nagios works
> internally.
>
> I'm wondering whether such a custom variable can be added to a
> hostgroup, and subsequently inherited by the hosts that derive from that
> hostgroup. It would make setting up DNX affinity variables pretty simple
> if a custom variable can be added to a hostgroup like this:
>
> define hostgroup {
>     name        dnx-group1
>     _dnxworkers dmz001, dmz002
> }
>
> define host {
>     use        standardhost
>     name       some-host
>     hostgroups webservers,customer-a,linux,dnx-group1
> }
>
> Thoughts?
>
> John
>
> On 3/22/2010 5:02 PM, Daniel Tuecks wrote:
>> Hello,
>>
>> this would indeed be my most-wanted feature in DNX, too. John, Jason, I
>> really hope you get this done :)
>> I tested affinity some time ago (when it was first discussed on this
>> list) and I liked it very much.
>>
>> Back then affinity was controlled by putting dnx-workers in my existing
>> hostgroups. That was the only thing I did not like that much. It's a
>> little confusing to have one or more dnx workers in every hostgroup.
>> Besides, I think this is not what a "hostgroup" is intended for. The
>> DNX workers shouldn't be members of my "Windows Active Directory
>> Servers" group.
>>
>> What do you think about controlling affinity via a custom host variable
>> (http://nagios.sourceforge.net/docs/3_0/customobjectvars.html)?
>> Something like this:
>>
>> define host {
>>     use             standardhost
>>     name            test-host
>>     hostgroups      webservers,customer-a,linux
>>     _DNXworkergroup dmz001
>> }
>>
>> We could tag each host with a worker(group) name. DNX/affinity would
>> work more transparently/flexibly and hostgroups would be "more correct".
>>
>> What do you think?
>>
>> Daniel
>>
>> On 03/18/2010 07:52 AM, Thomas Wollner wrote:
>>> Hello,
>>>
>>> did the affinity patch from Jason Benguerel find its way into the
>>> current DNX version? If not, are there any plans to incorporate it or
>>> something similar? This topic was discussed on the list some time ago.
>>>
>>> cheers,
>>>
>>> Tom
From: Jason B. <ja...@ba...> - 2010-03-25 04:31:26
|
I think his issue was that a specific dnxClient wasn't guaranteed to be able to see all the hosts in a single hostgroup. I knew there could be this possibility, but figured that I could just re-factor my hostgroups if it became an issue. It's unfair to force people to do that if I can create a convenient host level configuration as he proposed. I think for many the way I originally did it will work well, and it is painless to configure, but no reason to not support power users if possible. > "The DNX workers shouldn't be members of my "Windows Active Directory Servers" group." The idea that dnxClients are in the hostgroups that they service is a feature, as I wanted it to be very clear where to look when there are problems, so I kind of disagree with this. As I said, the GUI I have makes it clear what clients are servicing a hostgroup via this mechanism, so my end users don't do something dumb like powercycle all the clients in a hostgroup at the same time and so forth. But, I don't want to dictate how people use things. > Hi Daniel, > > Just thinking about this feature a little tonight. It won't be added > till after the 0.20 release of DNX, which will happen soon. This may > sound strange, but I'm not the Nagios expert I should be. My lack of > experience is centered around configuration and use of Nagios. > Nevertheless, I do have some understanding of the way Nagios works > internally. > > I'm wondering whether such a custom variable can be added to a > hostgroup, and subsequently inherited by the hosts that derive from that > hostgroup. It would make setting up DNX affinity variables pretty simple > if a custom variable can be added to a hostgroup like this: > > define hostgroup { > name dnx-group1 > _dnxworkers dmz001, dmz002 > } > > define host { > use standardhost > name some-host > hostgroups webservers,customer-a,linux,dnx-group1 > } > > Thoughts? 
> > John > > On 3/22/2010 5:02 PM, Daniel Tuecks wrote: >> Hello, >> >> this would indeed be my most-wanted feature in DNX, too. John, Jason, I >> really hope you to get this done :) >> I tested affinity some time ago (when it was first discussed on this >> list) and I liked it very much. >> >> Back then affinity was controlled by putting dnx-workers in my existing >> hostgroups. >> That was the only thing I did not like that much. It's a little >> confusing to have one or more dnx workers in every hostgroup. Besides I >> think this is not what a "hostgroup" is intended for. The DNX workers >> shouldn't be members of my "Windows Active Directory Servers" group. >> >> What do you think about controlling affinity via a custom host variable >> (http://nagios.sourceforge.net/docs/3_0/customobjectvars.html)? >> Something like this: >> >> define host { >> use standardhost >> name test-host >> hostgroups webservers,customer-a,linux >> _DNXworkergroup dmz001 >> } >> >> We could tag each Host with a worker(group) name. DNX/affinity would >> work more transparent/flexible and hostgroups would be "more correct". >> >> What do you think? >> >> Daniel >> >> >> On 03/18/2010 07:52 AM, Thomas Wollner wrote: >> >>> -----BEGIN PGP SIGNED MESSAGE----- >>> Hash: SHA1 >>> >>> Hello, >>> >>> >>> did the affinity patch from Jason Bengeruel (hope I spelled the name >>> correctly) find his way to the current DNX version? If not, are there >>> any plans to incorporate them or something similar? This topic was >>> discussed on the list some time ago. 
>>> >>> cheers, >>> >>> Tom >>> >>> -----BEGIN PGP SIGNATURE----- >>> Version: GnuPG v1.4.2 (MingW32) >>> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ >>> >>> iD8DBQFLoc3BTCCRT+dccOYRAgjcAKC2//G79UvgMunjOhPq8dW47KAnGACg+SJp >>> FCSgQHvg4LjwbGiRek1/m+w= >>> =Gvn5 >>> -----END PGP SIGNATURE----- |
From: John C. <joh...@gm...> - 2010-03-25 04:14:54
|
Hi Daniel, Just thinking about this feature a little tonight. It won't be added till after the 0.20 release of DNX, which will happen soon. This may sound strange, but I'm not the Nagios expert I should be. My lack of experience is centered around configuration and use of Nagios. Nevertheless, I do have some understanding of the way Nagios works internally. I'm wondering whether such a custom variable can be added to a hostgroup, and subsequently inherited by the hosts that derive from that hostgroup. It would make setting up DNX affinity variables pretty simple if a custom variable can be added to a hostgroup like this: define hostgroup { name dnx-group1 _dnxworkers dmz001, dmz002 } define host { use standardhost name some-host hostgroups webservers,customer-a,linux,dnx-group1 } Thoughts? John On 3/22/2010 5:02 PM, Daniel Tuecks wrote: > Hello, > > this would indeed be my most-wanted feature in DNX, too. John, Jason, I > really hope you to get this done :) > I tested affinity some time ago (when it was first discussed on this > list) and I liked it very much. > > Back then affinity was controlled by putting dnx-workers in my existing > hostgroups. > That was the only thing I did not like that much. It's a little > confusing to have one or more dnx workers in every hostgroup. Besides I > think this is not what a "hostgroup" is intended for. The DNX workers > shouldn't be members of my "Windows Active Directory Servers" group. > > What do you think about controlling affinity via a custom host variable > (http://nagios.sourceforge.net/docs/3_0/customobjectvars.html)? > Something like this: > > define host { > use standardhost > name test-host > hostgroups webservers,customer-a,linux > _DNXworkergroup dmz001 > } > > We could tag each Host with a worker(group) name. DNX/affinity would > work more transparent/flexible and hostgroups would be "more correct". > > What do you think? 
> > Daniel > > > On 03/18/2010 07:52 AM, Thomas Wollner wrote: > >> -----BEGIN PGP SIGNED MESSAGE----- >> Hash: SHA1 >> >> Hello, >> >> >> did the affinity patch from Jason Bengeruel (hope I spelled the name >> correctly) find his way to the current DNX version? If not, are there >> any plans to incorporate them or something similar? This topic was >> discussed on the list some time ago. >> >> cheers, >> >> Tom >> >> -----BEGIN PGP SIGNATURE----- >> Version: GnuPG v1.4.2 (MingW32) >> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ >> >> iD8DBQFLoc3BTCCRT+dccOYRAgjcAKC2//G79UvgMunjOhPq8dW47KAnGACg+SJp >> FCSgQHvg4LjwbGiRek1/m+w= >> =Gvn5 >> -----END PGP SIGNATURE----- |
From: Eric S. <esc...@us...> - 2010-03-24 18:41:47
|
Hi everyone, I've been testing revision 365 using the plugin/server method. At this point I've gotten about 41 hours of runtime with no major problems. So, stability issues with the main nagios daemon were resolved. The nagios "Service Execution Time" stat is also working correctly now. During this runtime period I observed 388 results_timed_out and 57 results_failed. This is very minimal in comparison to the 3,245,080 jobs that were results_ok :) These timed_out/results_failed were probably legitimate problems detected, and not something awry with DNX. I'm going to switch to 366, switch to the integrated server and re-introduce the check_rand service definitions into my configuration. After that runs for awhile I'll start upping poolMax on my worker node(s) to determine if there is still an issue there. I had a memory problem with our trending server last night and as a result my testing results may be delayed while I work with SUN (or, I guess Oracle) to resolve this issue. Thanks for all the hard work! Eric John Calcote wrote: > Jason, Eric, and everyone, > > Eric, the debug exception you were experiencing in revision 364 was an > assertion I added to the code to ensure that dnx wouldn't tromp on any > results already in the Nagios result list when the timed event handler > fired. I erroneously assumed the nagios result list would be empty when > the dnx timed event handler was called before the result reaper was > executed because of the way the result reaper function works. It reads > all the result files into the list, then processes the entire list. > Given this, there shouldn't really be anything in the list the next time > the timer fires because all the results were read in last time. However, > there are apparently other ways of adding results to the results list > (presumably on the same thread, since there are no locks protecting the > list), so the list isn't necessarily empty when the results reaper is > called. 
> > Anyway, to fix this, I did what I should have done to begin with - I > implemented a sorted-list merge routine that merges dnx's list with > nagios's list. This is still done on the result reaper thread, so it's > all thread-safe, but it's a little bit slower than it would have been > had nagios's list been empty to begin with. Regardless, it's still an > order of magnitude faster than what I was doing with filesystem-based > results. > > Please try revision 365. > > I've completely removed the assertion - no need for it now that I'm > merging lists (incidentally, if the nagios list happens to be empty, the > merge is very efficient, degenerating into approximately the same code I > had before when just pointing nagios to the dnx list). > > John > > > ------------------------------------------------------------------------------ > Download Intel® Parallel Studio Eval > Try the new software tools for yourself. Speed compiling, find bugs > proactively, and fine-tune applications for parallel performance. > See why Intel Parallel Studio got high marks during beta. > http://p.sf.net/sfu/intel-sw-dev > _______________________________________________ > Dnx-users mailing list > Dnx...@li... > https://lists.sourceforge.net/lists/listinfo/dnx-users > |
From: John C. <joh...@gm...> - 2010-03-24 16:54:01
|
Hi Eric, This underscores the fact that we need to do a release of 0.20 pretty soon. Any feedback for me on revision 365-366? Revision 366 is a fix for a client-side time defect, which I mentioned on the list yesterday. This will probably fix the weird job times you're seeing from DNX jobs. Thanks to Steven for pointing it out to me. John On 3/24/2010 10:20 AM, Eric Schoeller wrote: > Yes, that is the correct flag. Sorry I wasn't in front of a real machine > at the moment to verify it. In future versions of DNX that option will > be the default, and you'll have to specify --with-nagios2x to build DNX > against a Nagios 2 environment. > > You may run into some stability issues with 0.19.4. We've made a lot of > changes to the DNX codebase recently and I'd encourage you to try out > the latest revision from the subversion repository. If you need help > checking out the latest version or have any other build questions, don't > hesitate to ask! > > Eric > > > Roman Yakovenko wrote: > >> On Wed, Mar 24, 2010 at 5:03 PM, Eric Schoeller >> <eri...@co...> wrote: >> >> >>> Hello, >>> >>> The version of Nagios you're using doesn't require the patch. Also >>> make sure you're configuring DNX with the nagios3 option. >>> >>> >> Thanks for soooooooo quick reply. >> >> The option I passed was --with-nagios3x. Is this a right flag? Anyway >> I am going to recheck myself one more time. >> >> Thank you. >> >> >> >> >> > ------------------------------------------------------------------------------ > Download Intel® Parallel Studio Eval > Try the new software tools for yourself. Speed compiling, find bugs > proactively, and fine-tune applications for parallel performance. > See why Intel Parallel Studio got high marks during beta. > http://p.sf.net/sfu/intel-sw-dev > _______________________________________________ > Dnx-users mailing list > Dnx...@li... > https://lists.sourceforge.net/lists/listinfo/dnx-users > > |
From: John C. <joh...@gm...> - 2010-03-23 21:50:41
|
Hi Steven, Thanks for the report. Your suggestion is really good, but I took it one step farther:

jobstart = clock();
dnxPluginExecute(job.cmd, &result.resCode, resData, sizeof resData - 1,
      job.timeout, iwlm->cfg.showNodeAddr ? iwlm->myipaddrstr : 0);
jobstop = clock();
result.delta = (unsigned)((jobstop > jobstart ? jobstop - jobstart : 0) / CLOCKS_PER_SEC);

I implemented your technique, but I also switched from time(0) to clock() because clock() isn't subject to real-time clock skew. You still have to do the magnitude check before subtraction because (on some systems) clock_t can wrap even more often than time_t. This just means that one or two jobs a day will appear to execute *very* fast. :) svn revision 366. Regards, John On 3/23/2010 1:05 PM, Morrey, Steven wrote:
> Hello,
> I've found a bug in dnxWLM.c that is causing headaches on systems where the clock can skew backwards, such as VMs.
> The problem stems from the delta calculation in the dnxWorker function (where the thread spends most of its life).
> I believe the source to be here...
>
> jobstart = time(0);
> dnxPluginExecute(job.cmd, &result.resCode, resData, sizeof resData - 1,
>       job.timeout, iwlm->cfg.showNodeAddr ? iwlm->myipaddrstr : 0);
> result.delta = time(0) - jobstart;
>
> result.delta is an unsigned int, so if that bottom calculation returns a negative number because of a clock skew then you end up with an overflow issue.
>
> A better solution might look like:
>
> jobstart = time(0);
> dnxPluginExecute(job.cmd, &result.resCode, resData, sizeof resData - 1,
>       job.timeout, iwlm->cfg.showNodeAddr ? iwlm->myipaddrstr : 0);
> jobstop = time(0);
>
> if (jobstop > jobstart) {
>    result.delta = jobstop - jobstart;
> } else {
>    result.delta = 0;
> }
>
> Any thoughts?
>
> Respectfully;
> Steven D. Morrey
> OneNeck IT Services
> Shared Services - Monitoring
> GDC: 480-539-2242
>
> Privileged/Confidential Information may be contained in this message or attachments hereto. 
Please advise immediately if you or your employer do not consent to Internet email for messages of this kind. Opinions, conclusions and other information in this message that do not relate to the official business of this company shall be understood as neither given nor endorsed by it. |
From: Morrey, S. <Steven.Morrey@OneNeck.com> - 2010-03-23 19:06:09
|
Hello, I've found a bug in dnxWLM.c that is causing headaches on systems where the clock can skew backwards, such as VMs. The problem stems from the delta calculation in the dnxWorker function (where the thread spends most of its life). I believe the source to be here...

jobstart = time(0);
dnxPluginExecute(job.cmd, &result.resCode, resData, sizeof resData - 1,
      job.timeout, iwlm->cfg.showNodeAddr ? iwlm->myipaddrstr : 0);
result.delta = time(0) - jobstart;

result.delta is an unsigned int, so if that bottom calculation returns a negative number because of a clock skew then you end up with an overflow issue. A better solution might look like:

jobstart = time(0);
dnxPluginExecute(job.cmd, &result.resCode, resData, sizeof resData - 1,
      job.timeout, iwlm->cfg.showNodeAddr ? iwlm->myipaddrstr : 0);
jobstop = time(0);

if (jobstop > jobstart) {
   result.delta = jobstop - jobstart;
} else {
   result.delta = 0;
}

Any thoughts? Respectfully; Steven D. Morrey OneNeck IT Services Shared Services - Monitoring GDC: 480-539-2242 Privileged/Confidential Information may be contained in this message or attachments hereto. Please advise immediately if you or your employer do not consent to Internet email for messages of this kind. Opinions, conclusions and other information in this message that do not relate to the official business of this company shall be understood as neither given nor endorsed by it. |
From: Jason B. <ja...@ba...> - 2010-03-23 01:59:51
|
> Hello, > > this would indeed be my most-wanted feature in DNX, too. John, Jason, I > really hope you to get this done :) > I tested affinity some time ago (when it was first discussed on this > list) and I liked it very much. > > Back then affinity was controlled by putting dnx-workers in my existing > hostgroups. Heh, it still is. I didn't realize anyone ever tried the patch, nice to know someone gave it a spin. > That was the only thing I did not like that much. It's a little > confusing to have one or more dnx workers in every hostgroup. Besides I > think this is not what a "hostgroup" is intended for. The DNX workers > shouldn't be members of my "Windows Active Directory Servers" group. Well, I can understand that. The thinking behind it is that I am supporting very low skill-set NOC guys that use the open source Monarch Nagios configuration utility. It doesn't have a lot of flexibility about local Nagios configuration extensions and I didn't want to have to make lots of modifications to everything. Also, right now, DNX has it's own configuration file independent of Nagios, so having to start parsing and dealing with Nagios config files or local configuration variables adds another layer of complexity. I obviously extended the concept of what a hostgroup means, but in our context it was appropriate. Our monitoring front end differentiates between DNX clients and hosts, so it is really clear to the NOC guys what are DNX clients and what are monitored hosts and how many Clients are servicing a hostgroup, and what their status is. The hostgroup membership method was easily understood by my staff and didn't require any significant code or configuration changes for us. And Nagios can remain ignorant of what's happening. But the custom objects are a cool idea, so I will look at how to leverage them. Obviously, not everyone is going to want to rearrange their infrastructure to map to this concept, so some other means of configuration would be useful. 
Internally Affinity is done by matching a 64bit flag assigned to each hostgroup to each DNXclient job request. It would not be difficult to extend this to your Custom configuration below, as long as this custom object data is passed to the NEB module through the Nagios plugin interface. I will investigate this soonish, I have to do some work on the affinity section of code anyway as it is limited to 64 hostgroups at the moment. > > What do you think about controlling affinity via a custom host variable > (http://nagios.sourceforge.net/docs/3_0/customobjectvars.html)? > Something like this: > > define host { > use standardhost > name test-host > hostgroups webservers,customer-a,linux > _DNXworkergroup dmz001 > } > > We could tag each Host with a worker(group) name. DNX/affinity would > work more transparent/flexible and hostgroups would be "more correct". The two methods would work in parallel, so we should be able to do it either way. The latest Affinity code is at: http://github.com/Bakafish/DNX_Affinity But I have to put a fix in for the result linklist tromping bug that has been recently discussed. I should have the fix done in a day or two. > > What do you think? > > Daniel > > > On 03/18/2010 07:52 AM, Thomas Wollner wrote: >> -----BEGIN PGP SIGNED MESSAGE----- >> Hash: SHA1 >> >> Hello, >> >> >> did the affinity patch from Jason Bengeruel (hope I spelled the name >> correctly) find his way to the current DNX version? If not, are there >> any plans to incorporate them or something similar? This topic was >> discussed on the list some time ago. 
>> >> cheers, >> >> Tom >> >> -----BEGIN PGP SIGNATURE----- >> Version: GnuPG v1.4.2 (MingW32) >> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ >> >> iD8DBQFLoc3BTCCRT+dccOYRAgjcAKC2//G79UvgMunjOhPq8dW47KAnGACg+SJp >> FCSgQHvg4LjwbGiRek1/m+w= >> =Gvn5 >> -----END PGP SIGNATURE----- |
From: Daniel T. <dan...@t-...> - 2010-03-22 23:02:04
|
Hello, this would indeed be my most-wanted feature in DNX, too. John, Jason, I really hope you get this done :) I tested affinity some time ago (when it was first discussed on this list) and I liked it very much. Back then affinity was controlled by putting dnx-workers in my existing hostgroups. That was the only thing I did not like that much. It's a little confusing to have one or more dnx workers in every hostgroup. Besides, I think this is not what a "hostgroup" is intended for. The DNX workers shouldn't be members of my "Windows Active Directory Servers" group. What do you think about controlling affinity via a custom host variable (http://nagios.sourceforge.net/docs/3_0/customobjectvars.html)? Something like this:

define host {
    use             standardhost
    name            test-host
    hostgroups      webservers,customer-a,linux
    _DNXworkergroup dmz001
}

We could tag each host with a worker(group) name. DNX/affinity would work more transparently/flexibly and hostgroups would be "more correct". What do you think? Daniel On 03/18/2010 07:52 AM, Thomas Wollner wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Hello, > > > did the affinity patch from Jason Bengeruel (hope I spelled the name > correctly) find his way to the current DNX version? If not, are there > any plans to incorporate them or something similar? This topic was > discussed on the list some time ago. > > cheers, > > Tom > > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.2 (MingW32) > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ > > iD8DBQFLoc3BTCCRT+dccOYRAgjcAKC2//G79UvgMunjOhPq8dW47KAnGACg+SJp > FCSgQHvg4LjwbGiRek1/m+w= > =Gvn5 > -----END PGP SIGNATURE----- > > ------------------------------------------------------------------------------ > Download Intel® Parallel Studio Eval > Try the new software tools for yourself. Speed compiling, find bugs > proactively, and fine-tune applications for parallel performance. > See why Intel Parallel Studio got high marks during beta. 
> http://p.sf.net/sfu/intel-sw-dev > _______________________________________________ > Dnx-devel mailing list > Dnx...@li... > https://lists.sourceforge.net/lists/listinfo/dnx-devel |
From: John C. <joh...@gm...> - 2010-03-22 16:22:21
|
Jason, Eric, and everyone, Eric, the debug exception you were experiencing in revision 364 was an assertion I added to the code to ensure that dnx wouldn't tromp on any results already in the Nagios result list when the timed event handler fired. I erroneously assumed the nagios result list would be empty when the dnx timed event handler was called before the result reaper was executed because of the way the result reaper function works. It reads all the result files into the list, then processes the entire list. Given this, there shouldn't really be anything in the list the next time the timer fires because all the results were read in last time. However, there are apparently other ways of adding results to the results list (presumably on the same thread, since there are no locks protecting the list), so the list isn't necessarily empty when the results reaper is called. Anyway, to fix this, I did what I should have done to begin with - I implemented a sorted-list merge routine that merges dnx's list with nagios's list. This is still done on the result reaper thread, so it's all thread-safe, but it's a little bit slower than it would have been had nagios's list been empty to begin with. Regardless, it's still an order of magnitude faster than what I was doing with filesystem-based results. Please try revision 365. I've completely removed the assertion - no need for it now that I'm merging lists (incidentally, if the nagios list happens to be empty, the merge is very efficient, degenerating into approximately the same code I had before when just pointing nagios to the dnx list). John |
From: John C. <joh...@gm...> - 2010-03-21 19:16:42
|
Hi Eric, You'll have to go back to revision 361 for a day or two. I need to make a fix to this code. The crash comes from a debug assertion that I put into the new code I wrote. I assumed that Nagios's list would always be empty when the reaper timed event handler was called. This apparently is not the case, so that means I'll have to write a routine to merge DNX's list with Nagos's list, rather than just set Nagios's list to point to DNX's. John On 3/20/2010 11:24 PM, Eric Schoeller wrote: > John, > > Unfortunately I get about 1-2 minutes of runtime before a nagios crash. > Last log lines in dnxsrv.debug.log: > > [Sat Mar 20 23:05:53.1 2010] dnxJobListDispatch: BEFORE: Head=193, > DHead=584, Tail=584. > [Sat Mar 20 23:05:53.1 2010] Reaper handler called. > > And the last lines in nagios.debug ... looks familiar ;) > > [1269147957.086182] [016.2] [pid=15287] Moving temp check result file > '/dev/shm/nagios/var/spool/checkresults/checkF0xL2e' to queue file > '/dev/shm/nagios/var/spool/checkresults/c7FmBmf'... > [1269147957.092212] [016.2] [pid=15292] Moving temp check result file > '/dev/shm/nagios/var/spool/checkresults/checkePb2K2' to queue file > '/dev/shm/nagios/var/spool/checkresults/ccuU052'... > > This appears to happen with either the integrated server or the > plugin/server. I removed all the check_rand service definitions to help > simplify things for nagios, but even with only check_ping services I see > this behavior. I'll run nagios with no registered worker nodes overnight > to make sure the daemon will run OK on its own - I'm sure it will, but I > can't think of anything else to tweak at the moment. > > The next question you'll ask ... where is your core file! Great > question, I still can't get one. Rather frustrating. I'll keep poking > around to see why it's not dropping one. 
> > Eric > > > John Calcote wrote: > >> Hi all, >> >> For those who are testing the pre-release of DNX version 0.20, I've just >> finished an implementation of Jason Benguerel's Direct Result Post >> algorithm (Thanks Jason!). This algorithm bypasses the filesystem-based >> results queue entirely, and writes check_result objects directly to the >> result reaper's check_result_list. >> >> The result reaper runs as a timed event in Nagios. Each time around, the >> reaper processes all result files in the result queue by reading the >> files and posting result objects to the check result list. Then it >> traverses the result list and processes each result in the list. DNX >> writes results to its own internal results list. It hooks the reaper >> broker event, called before the reaper gets to run. When this event >> runs, DNX transfers it's entire sorted result list directly into the >> nagios result list (no race conditions because it's all done on the >> nagios reaper thread. When the Nagios reaper runs, it simply adds any >> remaining results generated by itself to the list, and then processes >> the entire list. >> >> Check out revision 364 (or higher) and let me know what you think. DNX >> should be significantly more responsive with respect to result posting >> using this method. >> >> John >> >> ------------------------------------------------------------------------------ >> Download Intel® Parallel Studio Eval >> Try the new software tools for yourself. Speed compiling, find bugs >> proactively, and fine-tune applications for parallel performance. >> See why Intel Parallel Studio got high marks during beta. >> http://p.sf.net/sfu/intel-sw-dev >> _______________________________________________ >> Dnx-users mailing list >> Dnx...@li... >> https://lists.sourceforge.net/lists/listinfo/dnx-users >> >> > ------------------------------------------------------------------------------ > Download Intel® Parallel Studio Eval > Try the new software tools for yourself. 
Speed compiling, find bugs > proactively, and fine-tune applications for parallel performance. > See why Intel Parallel Studio got high marks during beta. > http://p.sf.net/sfu/intel-sw-dev > _______________________________________________ > Dnx-devel mailing list > Dnx...@li... > https://lists.sourceforge.net/lists/listinfo/dnx-devel > > |
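[Editor's note: the merge routine John describes — combining DNX's time-sorted result list with a Nagios list that may not be empty — might look like the following sketch. The `result` type and `merge_sorted` name are stand-ins for illustration, not Nagios's actual `check_result` structure; the real code would compare the appropriate timestamp fields. Since this runs on the single reaper thread, no locking is needed here.]

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for Nagios's check_result; only the fields the merge needs. */
typedef struct result {
    double finish_time;       /* sort key, analogous to the real finish time */
    struct result *next;
} result;

/* Merge two lists already sorted by finish_time; returns the new head.
 * Runs entirely on the reaper thread, so no mutex is required. */
static result *merge_sorted(result *a, result *b)
{
    result head = { 0.0, NULL };
    result *tail = &head;
    while (a && b) {
        if (a->finish_time <= b->finish_time) {
            tail->next = a;
            a = a->next;
        } else {
            tail->next = b;
            b = b->next;
        }
        tail = tail->next;
    }
    tail->next = (a != NULL) ? a : b;  /* append whichever list remains */
    return head.next;
}
```

Merging preserves the sorted order both lists already have, so the reaper can still process results oldest-first — which simply overwriting the head pointer (the revision 364 approach that triggered the assertion) could not guarantee.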
From: Eric S. <esc...@us...> - 2010-03-21 05:24:28
|
John, Unfortunately I get about 1-2 minutes of runtime before a nagios crash. Last log lines in dnxsrv.debug.log: [Sat Mar 20 23:05:53.1 2010] dnxJobListDispatch: BEFORE: Head=193, DHead=584, Tail=584. [Sat Mar 20 23:05:53.1 2010] Reaper handler called. And the last lines in nagios.debug ... looks familiar ;) [1269147957.086182] [016.2] [pid=15287] Moving temp check result file '/dev/shm/nagios/var/spool/checkresults/checkF0xL2e' to queue file '/dev/shm/nagios/var/spool/checkresults/c7FmBmf'... [1269147957.092212] [016.2] [pid=15292] Moving temp check result file '/dev/shm/nagios/var/spool/checkresults/checkePb2K2' to queue file '/dev/shm/nagios/var/spool/checkresults/ccuU052'... This appears to happen with either the integrated server or the plugin/server. I removed all the check_rand service definitions to help simplify things for nagios, but even with only check_ping services I see this behavior. I'll run nagios with no registered worker nodes overnight to make sure the daemon will run OK on its own - I'm sure it will, but I can't think of anything else to tweak at the moment. The next question you'll ask ... where is your core file! Great question, I still can't get one. Rather frustrating. I'll keep poking around to see why it's not dropping one. Eric John Calcote wrote: > Hi all, > > For those who are testing the pre-release of DNX version 0.20, I've just > finished an implementation of Jason Benguerel's Direct Result Post > algorithm (Thanks Jason!). This algorithm bypasses the filesystem-based > results queue entirely, and writes check_result objects directly to the > result reaper's check_result_list. > > The result reaper runs as a timed event in Nagios. Each time around, the > reaper processes all result files in the result queue by reading the > files and posting result objects to the check result list. Then it > traverses the result list and processes each result in the list. DNX > writes results to its own internal results list. 
It hooks the reaper > broker event, called before the reaper gets to run. When this event > runs, DNX transfers it's entire sorted result list directly into the > nagios result list (no race conditions because it's all done on the > nagios reaper thread. When the Nagios reaper runs, it simply adds any > remaining results generated by itself to the list, and then processes > the entire list. > > Check out revision 364 (or higher) and let me know what you think. DNX > should be significantly more responsive with respect to result posting > using this method. > > John > > ------------------------------------------------------------------------------ > Download Intel® Parallel Studio Eval > Try the new software tools for yourself. Speed compiling, find bugs > proactively, and fine-tune applications for parallel performance. > See why Intel Parallel Studio got high marks during beta. > http://p.sf.net/sfu/intel-sw-dev > _______________________________________________ > Dnx-users mailing list > Dnx...@li... > https://lists.sourceforge.net/lists/listinfo/dnx-users > |
From: John C. <joh...@gm...> - 2010-03-20 22:27:11
|
Hi all, For those who are testing the pre-release of DNX version 0.20, I've just finished an implementation of Jason Benguerel's Direct Result Post algorithm (Thanks Jason!). This algorithm bypasses the filesystem-based results queue entirely, and writes check_result objects directly to the result reaper's check_result_list. The result reaper runs as a timed event in Nagios. Each time around, the reaper processes all result files in the result queue by reading the files and posting result objects to the check result list. Then it traverses the result list and processes each result in the list. DNX writes results to its own internal results list. It hooks the reaper broker event, called before the reaper gets to run. When this event runs, DNX transfers its entire sorted result list directly into the nagios result list (no race conditions, because it's all done on the nagios reaper thread). When the Nagios reaper runs, it simply adds any remaining results generated by itself to the list, and then processes the entire list. Check out revision 364 (or higher) and let me know what you think. DNX should be significantly more responsive with respect to result posting using this method. John |
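[Editor's note: the constant-time hand-off John describes — splicing DNX's whole private list into the Nagios list from the reaper broker hook — can be sketched as below. The `node` type, `nagios_list` global, and `splice_dnx_results` function are illustrative stand-ins, not the actual DNX or Nagios symbols; they only model the pointer surgery.]

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in types: the real code operates on Nagios's check_result list. */
typedef struct node { int id; struct node *next; } node;

static node *nagios_list = NULL;   /* what Nagios's reaper will process */

/* Splice DNX's private list onto the front of the Nagios list in O(1).
 * Called from the reaper broker hook, i.e. on the reaper's own thread,
 * so nothing else touches nagios_list concurrently. */
static void splice_dnx_results(node **dnx_head, node **dnx_tail)
{
    if (*dnx_head == NULL)
        return;                       /* nothing to post this cycle */
    (*dnx_tail)->next = nagios_list;  /* chain Nagios's results behind ours */
    nagios_list = *dnx_head;          /* publish the combined list */
    *dnx_head = NULL;                 /* DNX's list is now empty */
    *dnx_tail = NULL;
}
```

Note the key point from the thread: the splice *chains* onto any results Nagios has already queued rather than overwriting the head pointer, which is exactly the assumption that had to be corrected after revision 364.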
From: Jason B. <ja...@ba...> - 2010-03-20 17:08:13
|
On Mar 21, 2010, at 1:10 AM, John Calcote wrote: > No, I'm the one who should apologize. I should have been offering > solutions instead of telling you why your approach would not work. But > it takes time to mull the problem around in your head before you come up > with an idea. I was working on other code, and I didn't want to deep-dive back into DNX as I was under time constraints. I just felt bad that I was proposing stuff without actually looking to see if it would work :-) > > While you could do what you propose - empty the queue for Nagios in the > broker handler - a better solution would be less intrusive on Nagios. > Here's a proposal: > > DNX only writes to the queue, it never removes anything. Nagios does all > of this in a single thread; fills the queue, and then empties it all on > the same thread - on a timer in fact. What DNX could do is maintain it's > own linked list of check results. Register for the TIMEDEVENT/REAPER > call. In your handler, add the entire list to Nagios's list - you can do > it in one operation. Since you're doing it on Nagios's thread, you're > not in danger of a race condition, and it's a fast operation O(1), so it > won't slow Nagios down. When the broker call returns, Nagios will add > file-based results to the list, and then reap the entire list. Yeah, I like that. I already have the results in a nice queue, I will just push them in during this reap callback. > > Because you're only adding results to the DNX list, and it's a > singly-linked list, you can do this without locks if you write it carefully. > > Jason, I really appreciate this conversation. I think we've found a > mutual solution that is far better than what I've been doing, and at > least somewhat better than what you've been doing. Thanks so much. I probably never would have found this bug myself, just would have wondered why the occasional check was being lost or garbled. Thank you! 
> > John > > On 3/20/2010 9:00 AM, Jason Benguerel wrote: >> I should have given a solution or rescinded that code, sorry to make you go through this. >> >> >>> Not all broker calls allow a handler to hijack the operation. >>> Unfortunately, the TIMEDEVENT check is one of those that does not allow >>> it. You can perform the operation from within the event handler, but >>> nothing you can't return will stop Nagios from doing it also. >>> >> Okay, but if we just made the call and emptied all the pending results files, and it is single threaded, so no more results have been written in between our call and when the Nagios call immediately following ours is made, the Nagios call will never actually have any files to process nor write to that structure, so by wrapping our call in the mutex to clean out the results files, it would be safe right? I know it sounds hacky, but will it work so people don't have to patch? >> >> >>> Before DNX, none of the Nagios brokered events allowed a handler to >>> hijack the operation. It was the DNX patch that first introduced the >>> concept into the Nagios code base, and it took us from version 2.0 to >>> version 3.1 to get them to finally accept the patch as part of their >>> code base. Most of the Nagios broker calls are just meant to be used as >>> a sort of logging or statistics-gathering mechanism for brokered event >>> handlers (that is, if any thought at all went into potential uses for >>> this mechanism - which I seriously doubt). >>> >> Right, I remember the patches. >> >> >>> John >>> >>> On 3/18/2010 9:35 PM, Jason Benguerel wrote: >>> >>>> On Mar 19, 2010, at 12:20 PM, John Calcote wrote: >>>> >>>> >>>> >>>>> There are two problems here, as you've noticed. First, clashes between >>>>> dnx's two result poster threads, and nagios's single result poster (the >>>>> routine that pulls the current set of result files into memory and adds >>>>> them to the result list), which runs serially in the timed event loop. 
>>>>> Second, clashes between dnx's two result poster threads and nagios's >>>>> result reader, which also runs serially with the result file processor >>>>> in the same nagios timed event handler thread. A set of results are read >>>>> into the list, and then those results are processed one at a time by >>>>> removing them from the front of the list. In nagios, this is all done >>>>> serially, so there's not potential for corruption. >>>>> >>>>> The problem with using the TIMEDEVENT broker call is that >>>>> NEBTYPE_TIMEDEVENT_EXECUTE is called before the event is executed, but >>>>> not after, so you'd be able to acquire your mutex, but you wouldn't be >>>>> able to release it. >>>>> >>>>> >>>> I was thinking we'd take the callback and execute the function ourselves inside the mutex. Nagios will no longer make that call, DNX will in a thread safe fashion. Am I missing something? >>>> >>>> >>>> >>>>> John >>>>> >>>>> On 3/18/2010 9:00 PM, Jason Benguerel wrote: >>>>> >>>>> >>>>>> Ahh, I think my way of not running more than about 5 local checks hid this contention from surfacing. I think registering the callbacks and implementing the DNX mutex around those calls should be effective though. What do you think? I will try and modify my code to see how it works... >>>>>> >>>>>> Jason >>>>>> >>>>>> >>>>>> On Mar 19, 2010, at 11:54 AM, John Calcote wrote: >>>>>> >>>>>> >>>>>> >>>>>> >>>>>>> One other point I should have mentioned is that there are only two dnx threads that access the list - the collector and the timer. The collector listens on the dnx job socket for results, and processes each one serially as they're received. If it weren't for the timer, which pulls out expired jobs from the dispatch list and submits a "timed-out" result using the same method, dnx wouldn't need a mutex either. >>>>>>> >>>>>>> John >>>>>>> >>>>>>> On 3/18/2010 8:44 PM, Jason Benguerel wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>>> Ahh, that's a good question. 
It is my own MUTEX, and Nagios in my application runs very few local checks so I may have not seen a clash. The function itself seems to insert things based on execution time which also may be insulating me from a clash. But you are correct in that there is danger there. Is there a MUTEX for the reaper? >>>>>>>> >>>>>>>> Jason >>>>>>>> >>>>>>>> On Mar 19, 2010, at 11:29 AM, John Calcote wrote: >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> Jason, >>>>>>>>> >>>>>>>>> I'm looking at the add_check_result_to_list function in nagios 3.2.0 and I see there's no such mutex as submitCheckMutex. This probably means your DNX code created this mutex. That's not a problem, the mutex clearly keeps DNX threads from stepping on each other as they try to add items to the list. What is a problem is that DNX doesn't synchronized it's own access to the list with those of Nagios. Nagios doesn't need a mutex because it has a single queue reaper thread that accesses the list serially, but how do you keep nagios's and DNX's simultaneous updates from stomping on each other? The only way I can see this would work is if Nagios *never* submitted anything to the list. If that's the case, then you've offloaded all service and host checks from Nagios. Is this true? >>>>>>>>> >>>>>>>>> John >>>>>>>>> >>>>>>>>> On 3/18/2010 8:15 PM, Jason Benguerel wrote: >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>>> Well it really wasn't my intention to fork, I just never got enough buy in to the idea from the user base and I wasn't confident enough in my C skills to want to inject garbage into the Trunk. I also am not really great about working with others (good comments, tests, documentation and regular checkins :-) ) Anyway, I hope that this will help with some of the temp file issues you've been having. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>>> Jason, >>>>>>>>>>> >>>>>>>>>>> Thanks for the info. 
I guess I just never looked at what Nagios was doing on the back end with the result files. I'll add this fix right away. >>>>>>>>>>> >>>>>>>>>>> I wish I had time to study your code base to see what other cool features you've added. The problem is that we software people don't take enough advantage of the synergy that's a primary attribute of "the open source way". I wish more folks would submit patches upstream, rather than just forking and modifying. There's nothing wrong with a fork, but when good ideas like this come along, they're often hidden by such forks. >>>>>>>>>>> >>>>>>>>>>> Thanks for your efforts, >>>>>>>>>>> John >>>>>>>>>>> >>>>>>>>>>> On 3/18/2010 7:51 PM, Jason Benguerel wrote: >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>> On Mar 19, 2010, at 12:24 AM, John Calcote wrote: >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>> Hi Jason, >>>>>>>>>>>>> >>>>>>>>>>>>> I've actually been moving towards a two-part server system in an effort to get a bit closer to where you are, so I could consider your mods. However, I'm wondering how you managed to not use any result files. Do you patch Nagios? 
>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> No patch, here's what I'm doing (based on code from Bronx): >>>>>>>>>>>> >>>>>>>>>>>> int dnxSubmitCheck(DnxNewJob * Job, DnxResult * sResult, time_t check_time) >>>>>>>>>>>> { >>>>>>>>>>>> DNX_PT_MUTEX_LOCK(&submitCheckMutex); >>>>>>>>>>>> >>>>>>>>>>>> check_result *chk_result; >>>>>>>>>>>> chk_result = (check_result *)malloc(sizeof(check_result)); >>>>>>>>>>>> /* Set the default values in the check result structure */ >>>>>>>>>>>> init_check_result(chk_result); >>>>>>>>>>>> >>>>>>>>>>>> /* >>>>>>>>>>>> * Set up the check result structure with information that we were passed >>>>>>>>>>>> * Nagios normally reads the check results from a diskfile specified in >>>>>>>>>>>> * output_file member. But since we can directly access nagios result list, >>>>>>>>>>>> * we bypass the diskfile creation. We set output_file to NULL and >>>>>>>>>>>> * the fd to -1, hoping that nagios will have a NULL check. >>>>>>>>>>>> */ >>>>>>>>>>>> chk_result->output_file = NULL; >>>>>>>>>>>> chk_result->output_file_fd = -1; >>>>>>>>>>>> chk_result->host_name = xstrdup(Job->host_name); >>>>>>>>>>>> >>>>>>>>>>>> ... more check result loading ... >>>>>>>>>>>> >>>>>>>>>>>> /* Call the nagios function to insert the result into the result linklist */ >>>>>>>>>>>> add_check_result_to_list(chk_result); >>>>>>>>>>>> DNX_PT_MUTEX_UNLOCK(&submitCheckMutex); >>>>>>>>>>>> return 0; >>>>>>>>>>>> } >>>>>>>>>>>> >>>>>>>>>>>> This is one of the first changes I made to my codebase, and it has never given me any issues. I'd say it's a very reliable alternative to what's currently being done. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> I'm not really sold on the two part module solution. To be frank, if I was going to break out all the logic from the NEB module as you are doing, I would completely replace all the DNX dispatching logic with a RestMQ or a AMQP (RabbitMQ) solution. 
I guess I'm more comfortable with having DNX as a fully embedded NEB so if it get's wonky it takes out Nagios. For me, DNX being down is the same thing as Nagios being down. Nagios without DNX in my environment is useless, and I don't want to have to worry about a separate daemon's state. As we all know from experience there's a lot of things that can go wrong without it being obvious, and I already have mechanisms to know if Nagios is functioning. Adding another process to monitor is not appealing to me. I'm not sure what it's trying to solve anyway, as the thread locking conditions that you guys are experiencing are not happening to me, so I believe they are solvable via a NEB module if you are careful with your MUTEX's and get rid of writing files yourself. >>>>>>>>>>>> >>>>>>>>>>>> J >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>> John >>>>>>>>>>>>> >>>>>>>>>>>>> On 3/18/2010 1:31 AM, Jason Benguerel wrote: >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>> I ended up hacking the code pretty severely, it's currently an experimental fork. You can download my current version at: >>>>>>>>>>>>>> >>>>>>>>>>>>>> http://github.com/Bakafish/DNX_Affinity >>>>>>>>>>>>>> >>>>>>>>>>>>>> I'm not sure how the rest of the DNX users want to get these changes merged, or how much interest there is to do so. It's been working well for me, and I think due to side effects of my changes I'm not suffering the locking and race conditions that the trunk seems to. I don't write any check results temp files for example, and I dealt with mutexes a bit differently in places. I also implemented basic acknowledgments to deal with UDP packet loss. Anyway, let me know how it works for you and if you have trouble configuring it. It's designed for Nagios 3.x, so if you are using 1 or 2, this isn't for you. 
>>>>>>>>>>>>>> >>>>>>>>>>>>>> Jason >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Mar 18, 2010, at 3:52 PM, Thomas Wollner wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE----- >>>>>>>>>>>>>>> Hash: SHA1 >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Hello, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> did the affinity patch from Jason Bengeruel (hope I spelled the name >>>>>>>>>>>>>>> correctly) find his way to the current DNX version? If not, are there >>>>>>>>>>>>>>> any plans to incorporate them or something similar? This topic was >>>>>>>>>>>>>>> discussed on the list some time ago. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> cheers, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Tom >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> -----BEGIN PGP SIGNATURE----- >>>>>>>>>>>>>>> Version: GnuPG v1.4.2 (MingW32) >>>>>>>>>>>>>>> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> iD8DBQFLoc3BTCCRT+dccOYRAgjcAKC2//G79UvgMunjOhPq8dW47KAnGACg+SJp >>>>>>>>>>>>>>> FCSgQHvg4LjwbGiRek1/m+w= >>>>>>>>>>>>>>> =Gvn5 >>>>>>>>>>>>>>> -----END PGP SIGNATURE----- >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> ------------------------------------------------------------------------------ >>>>>>>>>>>>>>> Download Intel® Parallel Studio Eval >>>>>>>>>>>>>>> Try the new software tools for yourself. Speed compiling, find bugs >>>>>>>>>>>>>>> proactively, and fine-tune applications for parallel performance. >>>>>>>>>>>>>>> See why Intel Parallel Studio got high marks during beta. >>>>>>>>>>>>>>> http://p.sf.net/sfu/intel-sw-dev >>>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>>> Dnx-devel mailing list >>>>>>>>>>>>>>> Dnx...@li... 
>>>>>>>>>>>>>>> https://lists.sourceforge.net/lists/listinfo/dnx-devel >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>> ------------------------------------------------------------------------------ >>>>>>>>>>>>>> Download Intel® Parallel Studio Eval >>>>>>>>>>>>>> Try the new software tools for yourself. Speed compiling, find bugs >>>>>>>>>>>>>> proactively, and fine-tune applications for parallel performance. >>>>>>>>>>>>>> See why Intel Parallel Studio got high marks during beta. >>>>>>>>>>>>>> http://p.sf.net/sfu/intel-sw-dev >>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>> Dnx-devel mailing list >>>>>>>>>>>>>> Dnx...@li... >>>>>>>>>>>>>> https://lists.sourceforge.net/lists/listinfo/dnx-devel >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>>> >>>>> >>>>> ------------------------------------------------------------------------------ >>>>> Download Intel® Parallel Studio Eval >>>>> Try the new software tools for yourself. Speed compiling, find bugs >>>>> proactively, and fine-tune applications for parallel performance. >>>>> See why Intel Parallel Studio got high marks during beta. >>>>> http://p.sf.net/sfu/intel-sw-dev >>>>> _______________________________________________ >>>>> Dnx-devel mailing list >>>>> Dnx...@li... >>>>> https://lists.sourceforge.net/lists/listinfo/dnx-devel >>>>> >>>>> >>>> >>>> >>> >>> >>> ------------------------------------------------------------------------------ >>> Download Intel® Parallel Studio Eval >>> Try the new software tools for yourself. Speed compiling, find bugs >>> proactively, and fine-tune applications for parallel performance. 
>>> See why Intel Parallel Studio got high marks during beta. >>> http://p.sf.net/sfu/intel-sw-dev >>> _______________________________________________ >>> Dnx-devel mailing list >>> Dnx...@li... >>> https://lists.sourceforge.net/lists/listinfo/dnx-devel >>> >> >> > > > > ------------------------------------------------------------------------------ > Download Intel® Parallel Studio Eval > Try the new software tools for yourself. Speed compiling, find bugs > proactively, and fine-tune applications for parallel performance. > See why Intel Parallel Studio got high marks during beta. > http://p.sf.net/sfu/intel-sw-dev > _______________________________________________ > Dnx-devel mailing list > Dnx...@li... > https://lists.sourceforge.net/lists/listinfo/dnx-devel |
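[Editor's note: John remarks that the DNX-side list could be maintained without locks "if you write it carefully." One careful shape for that — a sketch only, not DNX's actual code, which per this thread guards its collector and timer threads with a mutex — is a Treiber-style push from the producer threads plus a single atomic exchange that lets the reaper hook detach the whole pending list in one operation. All names here (`rnode`, `dnx_pending`, `dnx_push`, `dnx_take_all`) are hypothetical.]

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical result node; real code would carry a check_result payload. */
typedef struct rnode { int id; struct rnode *next; } rnode;

/* Shared head of DNX's pending-result stack. */
static _Atomic(rnode *) dnx_pending = NULL;

/* Producer side (collector/timer threads): push one result, lock-free.
 * The CAS loop retries if another producer pushed concurrently. */
static void dnx_push(rnode *n)
{
    rnode *old = atomic_load(&dnx_pending);
    do {
        n->next = old;
    } while (!atomic_compare_exchange_weak(&dnx_pending, &old, n));
}

/* Consumer side (reaper hook): detach the entire list in one operation,
 * leaving the shared head empty for the producers. */
static rnode *dnx_take_all(void)
{
    return atomic_exchange(&dnx_pending, NULL);
}
```

One caveat worth noting: a push-to-head stack yields the detached list in LIFO order, so the consumer would still need to reverse it (or merge by timestamp, as discussed elsewhere in this thread) before handing it to Nagios.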
From: John C. <joh...@gm...> - 2010-03-20 16:10:47
|
No, I'm the one who should apologize. I should have been offering solutions instead of telling you why your approach would not work. But it takes time to mull the problem around in your head before you come up with an idea. While you could do what you propose - empty the queue for Nagios in the broker handler - a better solution would be less intrusive on Nagios. Here's a proposal: DNX only writes to the queue, it never removes anything. Nagios does all of this in a single thread; fills the queue, and then empties it all on the same thread - on a timer in fact. What DNX could do is maintain its own linked list of check results. Register for the TIMEDEVENT/REAPER call. In your handler, add the entire list to Nagios's list - you can do it in one operation. Since you're doing it on Nagios's thread, you're not in danger of a race condition, and it's a fast operation O(1), so it won't slow Nagios down. When the broker call returns, Nagios will add file-based results to the list, and then reap the entire list. Because you're only adding results to the DNX list, and it's a singly-linked list, you can do this without locks if you write it carefully. Jason, I really appreciate this conversation. I think we've found a mutual solution that is far better than what I've been doing, and at least somewhat better than what you've been doing. Thanks so much. John On 3/20/2010 9:00 AM, Jason Benguerel wrote: > I should have given a solution or rescinded that code, sorry to make you go through this. > > >> Not all broker calls allow a handler to hijack the operation. >> Unfortunately, the TIMEDEVENT check is one of those that does not allow >> it. You can perform the operation from within the event handler, but >> nothing you can't return will stop Nagios from doing it also. 
>> > Okay, but if we just made the call and emptied all the pending results files, and it is single threaded, so no more results have been written in between our call and when the Nagios call immediately following ours is made, the Nagios call will never actually have any files to process nor write to that structure, so by wrapping our call in the mutex to clean out the results files, it would be safe right? I know it sounds hacky, but will it work so people don't have to patch? > > >> Before DNX, none of the Nagios brokered events allowed a handler to >> hijack the operation. It was the DNX patch that first introduced the >> concept into the Nagios code base, and it took us from version 2.0 to >> version 3.1 to get them to finally accept the patch as part of their >> code base. Most of the Nagios broker calls are just meant to be used as >> a sort of logging or statistics-gathering mechanism for brokered event >> handlers (that is, if any thought at all went into potential uses for >> this mechanism - which I seriously doubt). >> > Right, I remember the patches. > > >> John >> >> On 3/18/2010 9:35 PM, Jason Benguerel wrote: >> >>> On Mar 19, 2010, at 12:20 PM, John Calcote wrote: >>> >>> >>> >>>> There are two problems here, as you've noticed. First, clashes between >>>> dnx's two result poster threads, and nagios's single result poster (the >>>> routine that pulls the current set of result files into memory and adds >>>> them to the result list), which runs serially in the timed event loop. >>>> Second, clashes between dnx's two result poster threads and nagios's >>>> result reader, which also runs serially with the result file processor >>>> in the same nagios timed event handler thread. A set of results are read >>>> into the list, and then those results are processed one at a time by >>>> removing them from the front of the list. In nagios, this is all done >>>> serially, so there's not potential for corruption. 
>>>> >>>> The problem with using the TIMEDEVENT broker call is that >>>> NEBTYPE_TIMEDEVENT_EXECUTE is called before the event is executed, but >>>> not after, so you'd be able to acquire your mutex, but you wouldn't be >>>> able to release it. >>>> >>>> >>> I was thinking we'd take the callback and execute the function ourselves inside the mutex. Nagios will no longer make that call, DNX will in a thread safe fashion. Am I missing something? >>> >>> >>> >>>> John >>>> >>>> On 3/18/2010 9:00 PM, Jason Benguerel wrote: >>>> >>>> >>>>> Ahh, I think my way of not running more than about 5 local checks hid this contention from surfacing. I think registering the callbacks and implementing the DNX mutex around those calls should be effective though. What do you think? I will try and modify my code to see how it works... >>>>> >>>>> Jason >>>>> >>>>> >>>>> On Mar 19, 2010, at 11:54 AM, John Calcote wrote: >>>>> >>>>> >>>>> >>>>> >>>>>> One other point I should have mentioned is that there are only two dnx threads that access the list - the collector and the timer. The collector listens on the dnx job socket for results, and processes each one serially as they're received. If it weren't for the timer, which pulls out expired jobs from the dispatch list and submits a "timed-out" result using the same method, dnx wouldn't need a mutex either. >>>>>> >>>>>> John >>>>>> >>>>>> On 3/18/2010 8:44 PM, Jason Benguerel wrote: >>>>>> >>>>>> >>>>>> >>>>>>> Ahh, that's a good question. It is my own MUTEX, and Nagios in my application runs very few local checks so I may have not seen a clash. The function itself seems to insert things based on execution time which also may be insulating me from a clash. But you are correct in that there is danger there. Is there a MUTEX for the reaper? 
>>>>>>> >>>>>>> Jason >>>>>>> >>>>>>> On Mar 19, 2010, at 11:29 AM, John Calcote wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>>> Jason, >>>>>>>> >>>>>>>> I'm looking at the add_check_result_to_list function in nagios 3.2.0 and I see there's no such mutex as submitCheckMutex. This probably means your DNX code created this mutex. That's not a problem, the mutex clearly keeps DNX threads from stepping on each other as they try to add items to the list. What is a problem is that DNX doesn't synchronized it's own access to the list with those of Nagios. Nagios doesn't need a mutex because it has a single queue reaper thread that accesses the list serially, but how do you keep nagios's and DNX's simultaneous updates from stomping on each other? The only way I can see this would work is if Nagios *never* submitted anything to the list. If that's the case, then you've offloaded all service and host checks from Nagios. Is this true? >>>>>>>> >>>>>>>> John >>>>>>>> >>>>>>>> On 3/18/2010 8:15 PM, Jason Benguerel wrote: >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> Well it really wasn't my intention to fork, I just never got enough buy in to the idea from the user base and I wasn't confident enough in my C skills to want to inject garbage into the Trunk. I also am not really great about working with others (good comments, tests, documentation and regular checkins :-) ) Anyway, I hope that this will help with some of the temp file issues you've been having. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>>> Jason, >>>>>>>>>> >>>>>>>>>> Thanks for the info. I guess I just never looked at what Nagios was doing on the back end with the result files. I'll add this fix right away. >>>>>>>>>> >>>>>>>>>> I wish I had time to study your code base to see what other cool features you've added. The problem is that we software people don't take enough advantage of the synergy that's a primary attribute of "the open source way". 
I wish more folks would submit patches upstream, rather than just forking and modifying. There's nothing wrong with a fork, but when good ideas like this come along, they're often hidden by such forks. >>>>>>>>>> >>>>>>>>>> Thanks for your efforts, >>>>>>>>>> John >>>>>>>>>> >>>>>>>>>> On 3/18/2010 7:51 PM, Jason Benguerel wrote: >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>>> On Mar 19, 2010, at 12:24 AM, John Calcote wrote: >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>> Hi Jason, >>>>>>>>>>>> >>>>>>>>>>>> I've actually been moving towards a two-part server system in an effort to get a bit closer to where you are, so I could consider your mods. However, I'm wondering how you managed to not use any result files. Do you patch Nagios? >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> No patch, here's what I'm doing (based on code from Bronx): >>>>>>>>>>> >>>>>>>>>>> int dnxSubmitCheck(DnxNewJob * Job, DnxResult * sResult, time_t check_time) >>>>>>>>>>> { >>>>>>>>>>> DNX_PT_MUTEX_LOCK(&submitCheckMutex); >>>>>>>>>>> >>>>>>>>>>> check_result *chk_result; >>>>>>>>>>> chk_result = (check_result *)malloc(sizeof(check_result)); >>>>>>>>>>> /* Set the default values in the check result structure */ >>>>>>>>>>> init_check_result(chk_result); >>>>>>>>>>> >>>>>>>>>>> /* >>>>>>>>>>> * Set up the check result structure with information that we were passed >>>>>>>>>>> * Nagios normally reads the check results from a diskfile specified in >>>>>>>>>>> * output_file member. But since we can directly access nagios result list, >>>>>>>>>>> * we bypass the diskfile creation. We set output_file to NULL and >>>>>>>>>>> * the fd to -1, hoping that nagios will have a NULL check. >>>>>>>>>>> */ >>>>>>>>>>> chk_result->output_file = NULL; >>>>>>>>>>> chk_result->output_file_fd = -1; >>>>>>>>>>> chk_result->host_name = xstrdup(Job->host_name); >>>>>>>>>>> >>>>>>>>>>> ... 
more check result loading ... >>>>>>>>>>> >>>>>>>>>>> /* Call the nagios function to insert the result into the result linklist */ >>>>>>>>>>> add_check_result_to_list(chk_result); >>>>>>>>>>> DNX_PT_MUTEX_UNLOCK(&submitCheckMutex); >>>>>>>>>>> return 0; >>>>>>>>>>> } >>>>>>>>>>> >>>>>>>>>>> This is one of the first changes I made to my codebase, and it has never given me any issues. I'd say it's a very reliable alternative to what's currently being done. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> I'm not really sold on the two part module solution. To be frank, if I was going to break out all the logic from the NEB module as you are doing, I would completely replace all the DNX dispatching logic with a RestMQ or a AMQP (RabbitMQ) solution. I guess I'm more comfortable with having DNX as a fully embedded NEB so if it get's wonky it takes out Nagios. For me, DNX being down is the same thing as Nagios being down. Nagios without DNX in my environment is useless, and I don't want to have to worry about a separate daemon's state. As we all know from experience there's a lot of things that can go wrong without it being obvious, and I already have mechanisms to know if Nagios is functioning. Adding another process to monitor is not appealing to me. I'm not sure what it's trying to solve anyway, as the thread locking conditions that you guys are experiencing are not happening to me, so I believe they are solvable via a NEB module if you are careful with your MUTEX's and get rid of writing files yourself. >>>>>>>>>>> >>>>>>>>>>> J >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>> John >>>>>>>>>>>> >>>>>>>>>>>> On 3/18/2010 1:31 AM, Jason Benguerel wrote: >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>> I ended up hacking the code pretty severely, it's currently an experimental fork. 
You can download my current version at: >>>>>>>>>>>>> >>>>>>>>>>>>> http://github.com/Bakafish/DNX_Affinity >>>>>>>>>>>>> >>>>>>>>>>>>> I'm not sure how the rest of the DNX users want to get these changes merged, or how much interest there is to do so. It's been working well for me, and I think due to side effects of my changes I'm not suffering the locking and race conditions that the trunk seems to. I don't write any check results temp files for example, and I dealt with mutexes a bit differently in places. I also implemented basic acknowledgments to deal with UDP packet loss. Anyway, let me know how it works for you and if you have trouble configuring it. It's designed for Nagios 3.x, so if you are using 1 or 2, this isn't for you. >>>>>>>>>>>>> >>>>>>>>>>>>> Jason >>>>>>>>>>>>> >>>>>>>>>>>>> On Mar 18, 2010, at 3:52 PM, Thomas Wollner wrote: >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE----- >>>>>>>>>>>>>> Hash: SHA1 >>>>>>>>>>>>>> >>>>>>>>>>>>>> Hello, >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> did the affinity patch from Jason Bengeruel (hope I spelled the name >>>>>>>>>>>>>> correctly) find his way to the current DNX version? If not, are there >>>>>>>>>>>>>> any plans to incorporate them or something similar? This topic was >>>>>>>>>>>>>> discussed on the list some time ago. 
>>>>>>>>>>>>>> >>>>>>>>>>>>>> cheers, >>>>>>>>>>>>>> >>>>>>>>>>>>>> Tom >>>>>>>>>>>>>> >>>>>>>>>>>>>> -----BEGIN PGP SIGNATURE----- >>>>>>>>>>>>>> Version: GnuPG v1.4.2 (MingW32) >>>>>>>>>>>>>> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ >>>>>>>>>>>>>> >>>>>>>>>>>>>> iD8DBQFLoc3BTCCRT+dccOYRAgjcAKC2//G79UvgMunjOhPq8dW47KAnGACg+SJp >>>>>>>>>>>>>> FCSgQHvg4LjwbGiRek1/m+w= >>>>>>>>>>>>>> =Gvn5 >>>>>>>>>>>>>> -----END PGP SIGNATURE----- >>>>>>>>>>>>>> >>>>>>>>>>>>>> ------------------------------------------------------------------------------ >>>>>>>>>>>>>> Download Intel® Parallel Studio Eval >>>>>>>>>>>>>> Try the new software tools for yourself. Speed compiling, find bugs >>>>>>>>>>>>>> proactively, and fine-tune applications for parallel performance. >>>>>>>>>>>>>> See why Intel Parallel Studio got high marks during beta. >>>>>>>>>>>>>> http://p.sf.net/sfu/intel-sw-dev >>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>> Dnx-devel mailing list >>>>>>>>>>>>>> Dnx...@li... >>>>>>>>>>>>>> https://lists.sourceforge.net/lists/listinfo/dnx-devel
|
From: Jason B. <ja...@ba...> - 2010-03-20 15:00:57
|
I should have given a solution or rescinded that code, sorry to make you go through this. > Not all broker calls allow a handler to hijack the operation. > Unfortunately, the TIMEDEVENT check is one of those that does not allow > it. You can perform the operation from within the event handler, but > no return value will stop Nagios from doing it as well. Okay, but suppose we made the call ourselves and emptied all the pending result files. Since the event loop is single-threaded, no new results can be written between our call and the Nagios call that immediately follows ours, so the Nagios call will never have any files to process, nor anything to write to that structure. By wrapping our call in the mutex while we clean out the result files, it would be safe, right? I know it sounds hacky, but will it work so people don't have to patch? > > Before DNX, none of the Nagios brokered events allowed a handler to > hijack the operation. It was the DNX patch that first introduced the > concept into the Nagios code base, and it took us from version 2.0 to > version 3.1 to get them to finally accept the patch as part of their > code base. Most of the Nagios broker calls are just meant to be used as > a sort of logging or statistics-gathering mechanism for brokered event > handlers (that is, if any thought at all went into potential uses for > this mechanism - which I seriously doubt). Right, I remember the patches. > > John > > On 3/18/2010 9:35 PM, Jason Benguerel wrote: >> On Mar 19, 2010, at 12:20 PM, John Calcote wrote: >> >> >>> There are two problems here, as you've noticed. First, clashes between >>> dnx's two result poster threads, and nagios's single result poster (the >>> routine that pulls the current set of result files into memory and adds >>> them to the result list), which runs serially in the timed event loop. 
>>> Second, clashes between dnx's two result poster threads and nagios's >>> result reader, which also runs serially with the result file processor >>> in the same nagios timed event handler thread. A set of results is read >>> into the list, and then those results are processed one at a time by >>> removing them from the front of the list. In nagios, this is all done >>> serially, so there's no potential for corruption. >>> >>> The problem with using the TIMEDEVENT broker call is that >>> NEBTYPE_TIMEDEVENT_EXECUTE is called before the event is executed, but >>> not after, so you'd be able to acquire your mutex, but you wouldn't be >>> able to release it. >>> >> I was thinking we'd take the callback and execute the function ourselves inside the mutex. Nagios will no longer make that call; DNX will, in a thread-safe fashion. Am I missing something? >> >> >>> John >>> >>> On 3/18/2010 9:00 PM, Jason Benguerel wrote: >>> >>>> Ahh, I think my way of not running more than about 5 local checks hid this contention from surfacing. I think registering the callbacks and implementing the DNX mutex around those calls should be effective, though. What do you think? I will try to modify my code to see how it works... >>>> >>>> Jason >>>> >>>> >>>> On Mar 19, 2010, at 11:54 AM, John Calcote wrote: >>>> >>>> >>>> >>>>> One other point I should have mentioned is that there are only two dnx threads that access the list - the collector and the timer. The collector listens on the dnx job socket for results, and processes each one serially as they're received. If it weren't for the timer, which pulls expired jobs out of the dispatch list and submits a "timed-out" result using the same method, dnx wouldn't need a mutex either. >>>>> >>>>> John >>>>> >>>>> On 3/18/2010 8:44 PM, Jason Benguerel wrote: >>>>> >>>>> >>>>>> Ahh, that's a good question. It is my own MUTEX, and Nagios in my application runs very few local checks, so I may not have seen a clash. 
The function itself seems to insert things based on execution time, which also may be insulating me from a clash. But you are correct in that there is danger there. Is there a MUTEX for the reaper? >>>>>> >>>>>> Jason >>>>>> >>>>>> On Mar 19, 2010, at 11:29 AM, John Calcote wrote: >>>>>> >>>>>> >>>>>> >>>>>>> Jason, >>>>>>> >>>>>>> I'm looking at the add_check_result_to_list function in nagios 3.2.0 and I see there's no such mutex as submitCheckMutex. This probably means your DNX code created this mutex. That's not a problem; the mutex clearly keeps DNX threads from stepping on each other as they try to add items to the list. What is a problem is that DNX doesn't synchronize its own access to the list with those of Nagios. Nagios doesn't need a mutex because it has a single queue reaper thread that accesses the list serially, but how do you keep nagios's and DNX's simultaneous updates from stomping on each other? The only way I can see this would work is if Nagios *never* submitted anything to the list. If that's the case, then you've offloaded all service and host checks from Nagios. Is this true? >>>>>>> >>>>>>> John >>>>>>> >>>>>>> On 3/18/2010 8:15 PM, Jason Benguerel wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>>> Well, it really wasn't my intention to fork; I just never got enough buy-in to the idea from the user base, and I wasn't confident enough in my C skills to want to inject garbage into the trunk. I also am not really great about working with others (good comments, tests, documentation and regular checkins :-) ). Anyway, I hope that this will help with some of the temp file issues you've been having. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> Jason, >>>>>>>>> >>>>>>>>> Thanks for the info. I guess I just never looked at what Nagios was doing on the back end with the result files. I'll add this fix right away. >>>>>>>>> >>>>>>>>> I wish I had time to study your code base to see what other cool features you've added. 
The problem is that we software people don't take enough advantage of the synergy that's a primary attribute of "the open source way". I wish more folks would submit patches upstream, rather than just forking and modifying. There's nothing wrong with a fork, but when good ideas like this come along, they're often hidden by such forks. >>>>>>>>> >>>>>>>>> Thanks for your efforts, >>>>>>>>> John >>>>>>>>> >>>>>>>>> On 3/18/2010 7:51 PM, Jason Benguerel wrote: >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>>> On Mar 19, 2010, at 12:24 AM, John Calcote wrote: >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>>> Hi Jason, >>>>>>>>>>> >>>>>>>>>>> I've actually been moving towards a two-part server system in an effort to get a bit closer to where you are, so I could consider your mods. However, I'm wondering how you managed to not use any result files. Do you patch Nagios? >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> No patch, here's what I'm doing (based on code from Bronx): >>>>>>>>>> >>>>>>>>>> int dnxSubmitCheck(DnxNewJob * Job, DnxResult * sResult, time_t check_time) >>>>>>>>>> { >>>>>>>>>> DNX_PT_MUTEX_LOCK(&submitCheckMutex); >>>>>>>>>> >>>>>>>>>> check_result *chk_result; >>>>>>>>>> chk_result = (check_result *)malloc(sizeof(check_result)); >>>>>>>>>> /* Set the default values in the check result structure */ >>>>>>>>>> init_check_result(chk_result); >>>>>>>>>> >>>>>>>>>> /* >>>>>>>>>> * Set up the check result structure with information that we were passed >>>>>>>>>> * Nagios normally reads the check results from a diskfile specified in >>>>>>>>>> * output_file member. But since we can directly access nagios result list, >>>>>>>>>> * we bypass the diskfile creation. We set output_file to NULL and >>>>>>>>>> * the fd to -1, hoping that nagios will have a NULL check. 
>>>>>>>>>> */ >>>>>>>>>> chk_result->output_file = NULL; >>>>>>>>>> chk_result->output_file_fd = -1; >>>>>>>>>> chk_result->host_name = xstrdup(Job->host_name); >>>>>>>>>> >>>>>>>>>> ... more check result loading ... >>>>>>>>>> >>>>>>>>>> /* Call the nagios function to insert the result into the result linklist */ >>>>>>>>>> add_check_result_to_list(chk_result); >>>>>>>>>> DNX_PT_MUTEX_UNLOCK(&submitCheckMutex); >>>>>>>>>> return 0; >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> This is one of the first changes I made to my codebase, and it has never given me any issues. I'd say it's a very reliable alternative to what's currently being done. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> I'm not really sold on the two part module solution. To be frank, if I was going to break out all the logic from the NEB module as you are doing, I would completely replace all the DNX dispatching logic with a RestMQ or a AMQP (RabbitMQ) solution. I guess I'm more comfortable with having DNX as a fully embedded NEB so if it get's wonky it takes out Nagios. For me, DNX being down is the same thing as Nagios being down. Nagios without DNX in my environment is useless, and I don't want to have to worry about a separate daemon's state. As we all know from experience there's a lot of things that can go wrong without it being obvious, and I already have mechanisms to know if Nagios is functioning. Adding another process to monitor is not appealing to me. I'm not sure what it's trying to solve anyway, as the thread locking conditions that you guys are experiencing are not happening to me, so I believe they are solvable via a NEB module if you are careful with your MUTEX's and get rid of writing files yourself. 
>>>>>>>>>> >>>>>>>>>> J >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>>> John >>>>>>>>>>> >>>>>>>>>>> On 3/18/2010 1:31 AM, Jason Benguerel wrote: >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>> I ended up hacking the code pretty severely, it's currently an experimental fork. You can download my current version at: >>>>>>>>>>>> >>>>>>>>>>>> http://github.com/Bakafish/DNX_Affinity >>>>>>>>>>>> >>>>>>>>>>>> I'm not sure how the rest of the DNX users want to get these changes merged, or how much interest there is to do so. It's been working well for me, and I think due to side effects of my changes I'm not suffering the locking and race conditions that the trunk seems to. I don't write any check results temp files for example, and I dealt with mutexes a bit differently in places. I also implemented basic acknowledgments to deal with UDP packet loss. Anyway, let me know how it works for you and if you have trouble configuring it. It's designed for Nagios 3.x, so if you are using 1 or 2, this isn't for you. >>>>>>>>>>>> >>>>>>>>>>>> Jason >>>>>>>>>>>> >>>>>>>>>>>> On Mar 18, 2010, at 3:52 PM, Thomas Wollner wrote: >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE----- >>>>>>>>>>>>> Hash: SHA1 >>>>>>>>>>>>> >>>>>>>>>>>>> Hello, >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> did the affinity patch from Jason Bengeruel (hope I spelled the name >>>>>>>>>>>>> correctly) find his way to the current DNX version? If not, are there >>>>>>>>>>>>> any plans to incorporate them or something similar? This topic was >>>>>>>>>>>>> discussed on the list some time ago. 
>>>>>>>>>>>>> >>>>>>>>>>>>> cheers, >>>>>>>>>>>>> >>>>>>>>>>>>> Tom
|
From: John C. <joh...@gm...> - 2010-03-19 15:34:52
|
Not all broker calls allow a handler to hijack the operation. Unfortunately, the TIMEDEVENT check is one of those that does not allow it. You can perform the operation from within the event handler, but no return value will stop Nagios from doing it as well. Before DNX, none of the Nagios brokered events allowed a handler to hijack the operation. It was the DNX patch that first introduced the concept into the Nagios code base, and it took us from version 2.0 to version 3.1 to get them to finally accept the patch as part of their code base. Most of the Nagios broker calls are just meant to be used as a sort of logging or statistics-gathering mechanism for brokered event handlers (that is, if any thought at all went into potential uses for this mechanism - which I seriously doubt). John On 3/18/2010 9:35 PM, Jason Benguerel wrote: > On Mar 19, 2010, at 12:20 PM, John Calcote wrote: > > >> There are two problems here, as you've noticed. First, clashes between >> dnx's two result poster threads, and nagios's single result poster (the >> routine that pulls the current set of result files into memory and adds >> them to the result list), which runs serially in the timed event loop. >> Second, clashes between dnx's two result poster threads and nagios's >> result reader, which also runs serially with the result file processor >> in the same nagios timed event handler thread. A set of results is read >> into the list, and then those results are processed one at a time by >> removing them from the front of the list. In nagios, this is all done >> serially, so there's no potential for corruption. >> >> The problem with using the TIMEDEVENT broker call is that >> NEBTYPE_TIMEDEVENT_EXECUTE is called before the event is executed, but >> not after, so you'd be able to acquire your mutex, but you wouldn't be >> able to release it. >> > I was thinking we'd take the callback and execute the function ourselves inside the mutex. 
Nagios will no longer make that call, DNX will in a thread safe fashion. Am I missing something? > > >> John >> >> On 3/18/2010 9:00 PM, Jason Benguerel wrote: >> >>> Ahh, I think my way of not running more than about 5 local checks hid this contention from surfacing. I think registering the callbacks and implementing the DNX mutex around those calls should be effective though. What do you think? I will try and modify my code to see how it works... >>> >>> Jason >>> >>> >>> On Mar 19, 2010, at 11:54 AM, John Calcote wrote: >>> >>> >>> >>>> One other point I should have mentioned is that there are only two dnx threads that access the list - the collector and the timer. The collector listens on the dnx job socket for results, and processes each one serially as they're received. If it weren't for the timer, which pulls out expired jobs from the dispatch list and submits a "timed-out" result using the same method, dnx wouldn't need a mutex either. >>>> >>>> John >>>> >>>> On 3/18/2010 8:44 PM, Jason Benguerel wrote: >>>> >>>> >>>>> Ahh, that's a good question. It is my own MUTEX, and Nagios in my application runs very few local checks so I may have not seen a clash. The function itself seems to insert things based on execution time which also may be insulating me from a clash. But you are correct in that there is danger there. Is there a MUTEX for the reaper? >>>>> >>>>> Jason >>>>> >>>>> On Mar 19, 2010, at 11:29 AM, John Calcote wrote: >>>>> >>>>> >>>>> >>>>> >>>>>> Jason, >>>>>> >>>>>> I'm looking at the add_check_result_to_list function in nagios 3.2.0 and I see there's no such mutex as submitCheckMutex. This probably means your DNX code created this mutex. That's not a problem, the mutex clearly keeps DNX threads from stepping on each other as they try to add items to the list. What is a problem is that DNX doesn't synchronized it's own access to the list with those of Nagios. 
Nagios doesn't need a mutex because it has a single queue reaper thread that accesses the list serially, but how do you keep nagios's and DNX's simultaneous updates from stomping on each other? The only way I can see this would work is if Nagios *never* submitted anything to the list. If that's the case, then you've offloaded all service and host checks from Nagios. Is this true? >>>>>> >>>>>> John >>>>>> >>>>>> On 3/18/2010 8:15 PM, Jason Benguerel wrote: >>>>>> >>>>>> >>>>>> >>>>>>> Well it really wasn't my intention to fork, I just never got enough buy in to the idea from the user base and I wasn't confident enough in my C skills to want to inject garbage into the Trunk. I also am not really great about working with others (good comments, tests, documentation and regular checkins :-) ) Anyway, I hope that this will help with some of the temp file issues you've been having. >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>>> Jason, >>>>>>>> >>>>>>>> Thanks for the info. I guess I just never looked at what Nagios was doing on the back end with the result files. I'll add this fix right away. >>>>>>>> >>>>>>>> I wish I had time to study your code base to see what other cool features you've added. The problem is that we software people don't take enough advantage of the synergy that's a primary attribute of "the open source way". I wish more folks would submit patches upstream, rather than just forking and modifying. There's nothing wrong with a fork, but when good ideas like this come along, they're often hidden by such forks. 
>>>>>>>> >>>>>>>> Thanks for your efforts, >>>>>>>> John >>>>>>>> >>>>>>>> On 3/18/2010 7:51 PM, Jason Benguerel wrote: >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> On Mar 19, 2010, at 12:24 AM, John Calcote wrote: >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>>> Hi Jason, >>>>>>>>>> >>>>>>>>>> I've actually been moving towards a two-part server system in an effort to get a bit closer to where you are, so I could consider your mods. However, I'm wondering how you managed to not use any result files. Do you patch Nagios? >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>> No patch, here's what I'm doing (based on code from Bronx): >>>>>>>>> >>>>>>>>> int dnxSubmitCheck(DnxNewJob * Job, DnxResult * sResult, time_t check_time) >>>>>>>>> { >>>>>>>>> DNX_PT_MUTEX_LOCK(&submitCheckMutex); >>>>>>>>> >>>>>>>>> check_result *chk_result; >>>>>>>>> chk_result = (check_result *)malloc(sizeof(check_result)); >>>>>>>>> /* Set the default values in the check result structure */ >>>>>>>>> init_check_result(chk_result); >>>>>>>>> >>>>>>>>> /* >>>>>>>>> * Set up the check result structure with information that we were passed >>>>>>>>> * Nagios normally reads the check results from a diskfile specified in >>>>>>>>> * output_file member. But since we can directly access nagios result list, >>>>>>>>> * we bypass the diskfile creation. We set output_file to NULL and >>>>>>>>> * the fd to -1, hoping that nagios will have a NULL check. >>>>>>>>> */ >>>>>>>>> chk_result->output_file = NULL; >>>>>>>>> chk_result->output_file_fd = -1; >>>>>>>>> chk_result->host_name = xstrdup(Job->host_name); >>>>>>>>> >>>>>>>>> ... more check result loading ... 
>>>>>>>>> >>>>>>>>> /* Call the nagios function to insert the result into the result linklist */ >>>>>>>>> add_check_result_to_list(chk_result); >>>>>>>>> DNX_PT_MUTEX_UNLOCK(&submitCheckMutex); >>>>>>>>> return 0; >>>>>>>>> } >>>>>>>>> >>>>>>>>> This is one of the first changes I made to my codebase, and it has never given me any issues. I'd say it's a very reliable alternative to what's currently being done. >>>>>>>>> >>>>>>>>> >>>>>>>>> I'm not really sold on the two part module solution. To be frank, if I was going to break out all the logic from the NEB module as you are doing, I would completely replace all the DNX dispatching logic with a RestMQ or a AMQP (RabbitMQ) solution. I guess I'm more comfortable with having DNX as a fully embedded NEB so if it get's wonky it takes out Nagios. For me, DNX being down is the same thing as Nagios being down. Nagios without DNX in my environment is useless, and I don't want to have to worry about a separate daemon's state. As we all know from experience there's a lot of things that can go wrong without it being obvious, and I already have mechanisms to know if Nagios is functioning. Adding another process to monitor is not appealing to me. I'm not sure what it's trying to solve anyway, as the thread locking conditions that you guys are experiencing are not happening to me, so I believe they are solvable via a NEB module if you are careful with your MUTEX's and get rid of writing files yourself. >>>>>>>>> >>>>>>>>> J >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>>> John >>>>>>>>>> >>>>>>>>>> On 3/18/2010 1:31 AM, Jason Benguerel wrote: >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>>> I ended up hacking the code pretty severely, it's currently an experimental fork. 
You can download my current version at: >>>>>>>>>>> >>>>>>>>>>> http://github.com/Bakafish/DNX_Affinity >>>>>>>>>>> >>>>>>>>>>> I'm not sure how the rest of the DNX users want to get these changes merged, or how much interest there is to do so. It's been working well for me, and I think due to side effects of my changes I'm not suffering the locking and race conditions that the trunk seems to. I don't write any check results temp files for example, and I dealt with mutexes a bit differently in places. I also implemented basic acknowledgments to deal with UDP packet loss. Anyway, let me know how it works for you and if you have trouble configuring it. It's designed for Nagios 3.x, so if you are using 1 or 2, this isn't for you. >>>>>>>>>>> >>>>>>>>>>> Jason >>>>>>>>>>> >>>>>>>>>>> On Mar 18, 2010, at 3:52 PM, Thomas Wollner wrote: >>>>>>>>>>> >>>>>>>>>>>> Hello, >>>>>>>>>>>> >>>>>>>>>>>> did the affinity patch from Jason Bengeruel (hope I spelled the name >>>>>>>>>>>> correctly) find his way to the current DNX version? If not, are there >>>>>>>>>>>> any plans to incorporate them or something similar? This topic was >>>>>>>>>>>> discussed on the list some time ago. >>>>>>>>>>>> >>>>>>>>>>>> cheers, >>>>>>>>>>>> >>>>>>>>>>>> Tom
|