From: Glenn H. <gh...@gw...> - 2010-12-20 20:02:42
|
>> I've done some organoleptic analysis of the patch (looked at it). The trouble with organoleptic analysis is that it's axiomatic in the benchmarking business that your guess (or mine) as to where time is being spent is simply wrong. The software+hardware computational model is much too complex for people to have accurate intuition about how all the elements interact. Actual measurements are needed. > I'll respond in more detail in a few days. GroundWork developed the circular dependency patch by profiling the code while running preflight operations on a large customer configuration. A pre-flight operation lists the following counts: Checked 35471 services. Checked 3362 hosts. Checked 116 host groups. Checked 0 service groups. Checked 290 contacts. Checked 225 contact groups. Checked 0 service escalations. Checked 51031 service dependencies. Checked 0 host escalations. Checked 0 host dependencies. Checked 475 commands. Checked 256 time periods. Total size of the set of Nagios config files is under 18 MB. To skip immediately to the final results, here are the timings from test runs while successive stages of this patch were being developed: baseline: pre-flight: 859.113 seconds stage 1: pre-flight: 185.139 seconds stage 2: pre-flight: 42.856 seconds stage 3: pre-flight: 3.107 seconds The baseline run was taken using a stock Nagios 3.2.0 release. The remaining runs were taken using patched versions of Nagios 3.2.1, which does not differ from 3.2.0 in its calculations for circular dependency analysis. The timings shown above are wall-clock measurements for end-to-end pre-flight operations. We also took detailed call-graph profile timings (using gprof) during this development effort to guide the code changes, but those reports were not retained afterward, so I don't have corresponding function-level timings available now to document each stage of patch development. The claim of a speedup of around 20,000x for the circular dependency analysis is based on improvements seen in those function-level timings during the development. ======================================================== Stage 1: Initial profiling showed that most of the time was being spent in pre_flight_circular_check(). There's no i/o in that routine or the ones it calls, so right away we know the problem we face is computational, not due to any system file-cacheing effects. Also, there's not much code in that routine that could be the source of the problem, so the nature of the bad performance was fairly obvious: an O(n^^2) clearing of the circular_path_checked values within independent data structures in a linked list. To address that, we made the following changes for the first stage of patching: * make a contiguous array for servicedependency_circular_path_checked, effectively removing the circular_path_checked flags from the individual data structures and also changing the boolean flags from 4 bytes down to 1 byte * use a call to memset() to clear the entire array in one operation, rather than walking a linked list Extra time is now taken to prepare the new contiguous array, but the timing results clearly show the benefit far outweighs the cost. The revised code is still O(n^^2) (as we still have to clear just as many flags, and just as often), but one dimension of this square has been radically shrunk. Modern processors are highly optimized for clearing a continuous swath of memory to a fixed value, since clearing large areas to zero is such a commonly needed action. So we use that mechanism instead of jumping randomly around in memory. (Aside from benefits of using probably-microcoded processor capabilities for the contiguous clearing, such as highly-efficient multimedia instructions, we also avoid a tremendous amount of potential cache-line aliasing and attendant inefficiency in memory accesses, and possibly extra page faults.) With contiguous boolean flags, we were also able to change their type from int to unsigned char, reducing the space and thus further reducing the amount of time needed for the clearing action. These changes resulted in a 4.64x speedup in overall pre-flight time from the baseline (down from 859.113 seconds to 185.139 seconds). ======================================================== Stage 2: Profiling now showed check_for_circular_servicedependency_path() as the bottleneck. So we made the following changes for the second stage of patching: * make a contiguous array for servicedependency_list, so it can be walked more efficiently, similar to above * factor some repeated pointer dereferencing out of loops * re-order servicedependency_struct members, based on timing tests and examination of the compiled assembly code * simplify some boolean tests (e.g., no need to test for (found==TRUE) [a specific non-zero value] when (found!=FALSE) or just (found) compares against zero [a highly optimized operation] and gives the same useful result; and no need to even store a transient boolean value in a memory location when it can be tested directly after production) * drop some redundant NULL-pointer tests Once again in this stage, extra time is now taken to prepare the new contiguous array, but the timing results clearly show the benefit far outweighs the cost. This stage of patching reduced the critical bottleneck in the processing to a single loop, compiled to 6 assembly instructions, one of which was a no-op filler instruction, with the loop having only one external memory reference per iteration. Basically, you can't get any tighter than that and do any useful work. The remaining performance problem was that this loop was being executed for 1.3 billion iterations when analyzing the customer dataset, which (even at only about 27.6 nanoseconds per iteration) still adds up to macroscopic time. (Remember what an O(n^^2) algorithm is all about, at the scale of 51031 service dependencies, half of which are execution dependencies and half of which are notification dependencies. (51031/2)^^2 + (51031/2)^^2 = 1.3 billion.) These changes resulted in a 20.0x speedup in overall pre-flight time from baseline (down from 859.113 seconds to 42.856 seconds), and a 4.32x speedup from the previous stage (down from 185.139 seconds to 42.856 seconds). ======================================================== Stage 3: check_for_circular_servicedependency_path() remained as the bottleneck, so we analyzed why we had so many iterations of its internal looping. That turned out to be because a lot of data in essentially random order (with respect to what we care about in this analysis) was being searched linearly, and again by jumping around in memory while walking the full length of a linked list -- another clear opportunity for structure+algorithm optimization. So we made the following changes for the third stage of patching: * sort the servicedependency_list contiguous array * use a specialty binary search to find the first dependent service we care about in the sorted list, if present * change the search for dependent services to examine only a short sequence of actual candidates, starting at that first dependent service * re-order more servicedependency_struct members, again based on a timing study * perform some trivial code cleanup from the previous patch stages One might wonder whether the additional O(n*log(n)) time now taken to sort might be a significant overhead to bear, but the timing results clearly show the benefit far outweighs the cost. Why do we use a specialty search instead of some standard library function? With respect to the usual suspects: * lfind(3) would be slow; its linear search of an array is exactly what we're trying to avoid here. * bsearch(3) [binary search of a sorted array] and tfind(3) [find a matching element in a binary tree] both return an unspecified element if there are multiple elements that match the key. What we need instead is to always find the first element in a contiguous series of matching elements, if any such elements exist, so that we can walk just the full length of that segment of the sorted array and then be done. So we need to implement our own routine to perform the search. This stage of patching avoids walking the full list of dependencies when only a few (now in contiguous sequence, and easy to quickly identify) need to be analyzed. Profiling showed that the dependency analysis, which was taking many hundreds of seconds in the original code, had now been reduced to only a few hundredths of a second; hence the claimed 20,000x speedup. These changes resulted in a 276.5x speedup in overall pre-flight time from baseline (down from 859.113 seconds to 3.107 seconds), and a 13.79x speedup from the previous stage (down from 42.856 seconds to 3.107 seconds). ======================================================== The discussion above reflects improvements in circular service dependency analysis. Exact corresponding changes were also made to improve circular host dependency analysis. That accounts for the nearly-identical code in the patch, as different data structures for hosts and services are being handled. In this regard, we are simply mirroring the structure of the code we are patching, following the requirements of C-language type handling. ======================================================== An additional part of testing involved purposely introducing a number of circular dependencies of various types into the test dataset, to establish that the revised code still found exactly the same errors as the original baseline code. ======================================================== What's the practical impact of this patch? Well, it can prevent lengthy monitoring outages when Nagios is bounced. For example, I would be completely unsurprised if the effect that Rodney Ramos reported a couple of weeks ago ("NSCA dropping packet with stale timestamp") is due to the existing dependency-analysis performance, if his configuration involves a lot of dependencies. While Nagios is spending time analyzing dependencies, it won't be reading its command pipe, and pipe writers such as NSCA will quickly fill the pipe and be unable to write more. That could stall its reading of its own client connections to the point that when it finally does read them, the packets it finds are then stale. Look back in the archives and you'll see that nobody responded to his request for help. This patch may be the solution for his situation. ======================================================== So what's the upshot of these changes, with regard to the previously published critique? * The several stages of patch development make it clear that algorithm changes and structure changes can go hand-in-hand to achieve significant performance improvements. * All of the structure changes were made based on observed timing improvements at the stage when they were implemented, not on guesswork about what the compiler might or might not be doing. * When we re-order certain structure members, there is a potential tradeoff: some member accesses will become much faster, while others that were presumably formerly fast will now become slower. This tradeoff seemed reasonable on the following grounds. The accesses we care about for circular dependency analysis are grouped together at a singular point in time, and have a very visible effect then, during process startup. Slower accesses for other members will be spread out during normal operation across the entire day, and should therefore be essentially unnoticeable, causing imperceptible and acceptable latency by eating very slightly into the processor idle time. No, we haven't analyzed the rest of the code, nor have we made timing measurements to prove this hypothesis, but that's our gut feeling. Yes, I know how I started this writeup. * All tests were run in an x86_64 environment, as these days it is the sweet spot of the computing industry. Five years ago, I would not have said that, but in the last couple of years, it has become difficult to purchase a serious machine which is not multi-core and 64-bit capable. That's clearly where the market is today and in the future, for anyone who cares about performance at scale (which is what this patch is intended to address). The x86_64 architecture suffers slightly with respect to 32-bit x86 because it uses more memory bandwidth to fetch wider objects, but it more than makes up for this by architectural extensions that make more registers available. * An i386 (32-bit x86) environment was not tested, nor did we run tests on UltraSPARC, PowerPC, Itanium, or any other architectures. Naturally, we have no guarantees that improvements targeted to the x86_64 won't slightly decrease performance on other machines. But given the wide popularity of the x86_64 architecture, targeting the most commonly used machines as the basis for optimization seems more than reasonable. Whatever small degradation might appear on these other platforms should be much more than made up for by the dramatic overall performance improvement afforded by the patch. Rather than fighting this patch because the structure member changes will break NEB compatibility in the 3.2.X releases, and wasting everyone's time trying to disentangle the synergistic effect of both algorithm and structure changes, perhaps a different approach is in order. The commentator has already expressed a strong desire to move to the 3.3.X release series: >> Do you have any ETA about the release date of the 3.3 series ? > No. Ethan is still the only one who can push the release button. > Otherwise I'd have cut a release quite a while ago. So perhaps adoption of this patch should be treated as the triggering event to move forward and declare this new minor release level. ======================================================== Glenn |