Patch RFC - Nagios 3.2 - permanently remove sleep on run_event == FALSE in main loop (events.c) or conditionally remove using nagios.cfg configuration parameter?

Max perldork at webwizarddesign.com
Sun Nov 1 23:20:33 CET 2009


Keep in mind there are two sleep sections: one that happens when a
non-runnable event is encountered (that is the one we commented out)
and another when the schedule is empty (we left that one alone).
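
To make the two sleep paths concrete, here is a minimal sketch in C. It
is NOT the code from events.c: the struct, the function names and the
flat array standing in for the event queue are all made up for
illustration. Only the idea is taken from this thread (sleep when a due
event cannot run, and sleep when nothing is scheduled).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for a scheduled event; not the Nagios struct. */
struct sketch_event {
    time_t run_time;   /* when the event is due */
    bool   runnable;   /* false when e.g. a dependency rule blocks it */
};

/* Sleep for a fractional number of seconds, like sleep_time in nagios.cfg. */
static void sketch_sleep(double seconds)
{
    struct timespec ts;

    if (seconds <= 0.0)
        return;
    ts.tv_sec  = (time_t)seconds;
    ts.tv_nsec = (long)((seconds - (double)ts.tv_sec) * 1e9);
    if (ts.tv_nsec > 999999999L)      /* guard against rounding up to 1 s */
        ts.tv_nsec = 999999999L;
    nanosleep(&ts, NULL);
}

/*
 * One pass over the queue. Returns how many times the pass slept.
 * sleep_on_blocked == false models the patched behaviour.
 */
int sketch_pass(const struct sketch_event *queue, size_t n, time_t now,
                double sleep_time, bool sleep_on_blocked)
{
    int sleeps = 0;
    size_t i;

    assert(queue != NULL || n == 0);

    if (n == 0) {
        /* sleep path 2: the schedule is empty (left alone by the patch) */
        sketch_sleep(sleep_time);
        return 1;
    }
    for (i = 0; i < n; i++) {
        if (queue[i].run_time > now)
            continue;                 /* not due yet */
        if (!queue[i].runnable) {
            /* sleep path 1: due but not runnable (the one commented out) */
            if (sleep_on_blocked) {
                sketch_sleep(sleep_time);
                sleeps++;
            }
            continue;
        }
        /* a real scheduler would execute the event here */
    }
    return sleeps;
}
```

With roughly 800 non-runnable services, the first path is where the
extra sleeps pile up in a polling cycle. The conditional variant
proposed in the quoted message below would amount to deriving
sleep_on_blocked from use_large_installation_tweaks (sleep only when it
is set to 0).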

I have seen that some nanosleep implementations drive the CPU harder
than usleep, as nanosleep can do a busy wait.

I have no idea why this is the case on RHEL, though.
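
One way to check whether nanosleep is really burning CPU on a given box
would be to compare wall-clock time against process CPU time across a
batch of short sleeps. The following is only a sketch of that
measurement; the struct and function names are made up, and it has not
been run on the RHEL systems discussed here.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical result type: time spent in a batch of sleeps. */
struct sleep_cost {
    long long wall_ns;   /* elapsed monotonic time */
    long long cpu_ns;    /* CPU time charged to this process */
};

/* Difference b - a in nanoseconds. */
static long long ts_diff_ns(const struct timespec *a, const struct timespec *b)
{
    return (long long)(b->tv_sec - a->tv_sec) * 1000000000LL
         + (long long)(b->tv_nsec - a->tv_nsec);
}

/*
 * Call nanosleep(request_ns) `iterations` times and record the cost.
 * Returns 0 on success, -1 if a clock read or a sleep failed.
 */
int measure_sleep_cost(long request_ns, int iterations, struct sleep_cost *out)
{
    struct timespec req, wall0, wall1, cpu0, cpu1;
    int i;

    assert(out != NULL);
    assert(request_ns >= 0 && request_ns < 1000000000L);

    req.tv_sec  = 0;
    req.tv_nsec = request_ns;

    if (clock_gettime(CLOCK_MONOTONIC, &wall0) != 0 ||
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpu0) != 0)
        return -1;

    for (i = 0; i < iterations; i++) {
        if (nanosleep(&req, NULL) != 0)
            return -1;            /* interrupted or invalid request */
    }

    if (clock_gettime(CLOCK_MONOTONIC, &wall1) != 0 ||
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpu1) != 0)
        return -1;

    out->wall_ns = ts_diff_ns(&wall0, &wall1);
    out->cpu_ns  = ts_diff_ns(&cpu0, &cpu1);
    return 0;
}
```

If cpu_ns comes out close to wall_ns, the sleep is effectively spinning;
if it is a small fraction, the process is genuinely blocked and the load
must be coming from somewhere else. On older glibc versions
clock_gettime may need linking with -lrt.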

Pre-patch we went about 12 hours on RHEL 5.4 before latency grew enough
to require a restart ... Post-patch we went almost 3 days on our test box
(1400+ hosts, about 8500 services).
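
For reference, "enough latency to require a restart" is the 10% rule
from the original message quoted below: a check whose observed interval
reaches 330 seconds against a configured 300. A minimal sketch of that
test, with a made-up function name and integer arithmetic to avoid
floating-point rounding at the boundary:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * True when the observed check interval has reached the configured
 * interval plus tolerance_pct percent (e.g. 330 s for 300 s at 10%).
 * Hypothetical helper; not part of Nagios.
 */
bool interval_exceeds_tolerance(long observed_s, long configured_s,
                                long tolerance_pct)
{
    assert(configured_s > 0 && tolerance_pct >= 0);
    return observed_s * 100 >= configured_s * (100 + tolerance_pct);
}
```

So interval_exceeds_tolerance(330, 300, 10) is true, while 329 seconds
against the same interval is still within tolerance.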

Pretty big difference :).

Max

On 11/1/09, Christoph Maser <cmr at financial.com> wrote:
> On Friday, 30.10.2009, at 16:32 +0100, Max wrote:
>> Hi,
>>
>> We have been working on reducing the scheduling skew for Nagios
>> service checks through a number of different techniques; yesterday we
>> were looking through the main event loop in events.c and saw that when
>> an event is encountered that is *NOT* scheduled to run, Nagios sleeps
>> the sleep_time amount configured in nagios.cfg with a comment about
>> not hogging CPU.
>>
>> While this certainly can be a useful thing to do for environments with
>> less powerful hardware or where performance data intervals are not as
>> critical as 'playing nice' is, it adds a lot of scheduling skew to
>> Nagios for environments (like ours) that have requirements to get
>> performance data into other systems at very regular intervals and if
>> nanosleep is used, it actually drives the load up on the system over
>> time ( on RHEL 5.1, 5.2, and 5.4 at least).
>>
>> We commented out that code in our environment yesterday and noticed that:
>> * Our latency increase over time decreased significantly
>> * System load decreased noticeably as nanosleep is not being called
>> thousands of times in a polling cycle (test env has 9000 active
>> services on ~ 1400 hosts with ~ 800 not runnable due to service
>> dependency rules)
>>
>> To give real numbers, our latency pre-patch was going from 0 to 12
>> seconds within about 10 hours; post patch latency has only increased
>> to about 1 second after 14 hours of running on this build.  We measure
>> when latency is too high by when our SNMP counter-based check
>> intervals increase to the point that we are 10% more than the
>> configured interval (e.g. 330 seconds if the interval is 300 seconds)
>> as that then causes gaps in the time series data warehouse we send our
>> performance data to.
>>
>> Pre patch load after 12-14 hours was increasing to 7, post patch after
>> 14 hours system load has levelled off around 3-4 .. this is on a dual
>> quad core intel system with 8 GB RAM.  Service check performance /
>> minute is around 2k checks.
>>
>> So while this was a trivial thing to change, for a larger environment
>> it makes a very noticeable difference in performance and we would like
>> to contribute it as a performance patch.
>>
>> So I am thinking that we could conditionally perform that additional
>> sleep if use_large_installation_tweaks in nagios.cfg is set to 0
>> instead of just removing the code and submit that as our patch.
>>
>> Thoughts / opinions?
>>
>> - Max
>
> Isn't that the whole point of the sleep_time config value? You could set
> that to 0.01 maybe even to 0. But zero really has the problem that you
> basically run a nearly empty infinite loop on smaller systems.
> About the nanosleep RHEL issue, do you have some more information on
> that? Why does it drive up the load over time?
>
> Chris
>
>
> financial.com AG
>
> Munich head office/Hauptsitz München: Maria-Probst-Str. 19 | 80939 München |
> Germany
> Frankfurt branch office/Niederlassung Frankfurt: Messeturm |
> Friedrich-Ebert-Anlage 49 | 60327 Frankfurt | Germany
> Management board/Vorstand: Dr. Steffen Boehnert | Dr. Alexis Eisenhofer |
> Dr. Yann Samson | Matthias Wiederwach
> Supervisory board/Aufsichtsrat: Dr. Dr. Ernst zur Linden
> (chairman/Vorsitzender)
> Register court/Handelsregister: Munich – HRB 128 972 | Sales tax ID
> number/St.Nr.: DE205 370 553
>
> ------------------------------------------------------------------------------
> Come build with us! The BlackBerry(R) Developer Conference in SF, CA
> is the only developer event you need to attend this year. Jumpstart your
> developing skills, take BlackBerry mobile applications to market and stay
> ahead of the curve. Join us from November 9 - 12, 2009. Register now!
> http://p.sf.net/sfu/devconference
> _______________________________________________
> Nagios-devel mailing list
> Nagios-devel at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-devel
>
