check scheduling when checks are inhibited.
Paul M. Dubuc
work at paul.dubuc.org
Tue Nov 30 20:11:11 CET 2010
Andreas,
Thanks for your reply to my earlier message. I've done some testing and some
more thinking on this since then:
On 11/23/2010 03:50 AM, Andreas Ericsson wrote:
> On 11/22/2010 10:41 PM, Paul M. Dubuc wrote:
>> We're using Nagios 3.2.3 for simulation of monitoring load in a load test
>> environment as well as for monitoring production services. I've noticed some
>> interesting behavior in the way Nagios schedules checks when checks are
>> inhibited either through the CGI Process Commands or by setting a check_period
>> timeperiod that inhibits checks during regularly scheduled down times.
>>
>> Normally Nagios seems to spread out host and service checks evenly over time
>> but when checks are stopped with the Process Command, Nagios seems to
>> reschedule checks so that they are "bunched up" much closer together. This
>> creates alternating periods of densely scheduled and more sparsely scheduled
>> checks that seem to persist when checks are turned on again. It has a
>> noticeable effect in our load testing. The only way--or the quickest way--to
>> get Nagios to smooth out the schedule again is to stop the process completely
>> until all the scheduled check times have passed.
>>
>> In testing Nagios monitoring of our production services, if I use the
>> check_period to inhibit checks during our down times, I notice that as the
>> downtime approaches, ALL checks are rescheduled for the exact time that the
>> downtime ends (according to the check_period). This creates a big spike in
>> monitoring activity after the downtime. One way to avoid this, I think, is to
>> let checks run during the down times but inhibit notifications instead by
>> using the timeperiod to define a notification_period. But I wonder if this
>> "bunching" up of the schedule when using check_periods is ever a desirable
>> behavior.
>>
>
> I have some plans to make Nagios spread the checks with a randomized interleave
> factor so that a check scheduled to run once every 5 minutes can be run anywhere
> between 4m 30s and 5m 0s after it last ran. The 30 second random-spread would be
> the default and it would otherwise be configurable.
>
> Another thing worth looking into is to make services to the same host not run
> simultaneously, in case the checked server is expected to be loaded heavily
> it may not play nicely with 30-40 checks fired at it at once.
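To make the randomized spread idea concrete, here is a minimal Python sketch. It is only an illustration of the plan as described above, not Nagios code; the function name next_check_delay and its parameters are hypothetical, and the 300-second interval and 30-second spread are taken from the example numbers in the quoted text.

```python
import random

def next_check_delay(interval=300.0, spread=30.0, rng=random):
    """Return a delay in seconds somewhere in [interval - spread, interval].

    With the defaults, a check nominally run every 5 minutes is scheduled
    anywhere between 4m 30s and 5m 0s after its last run.
    """
    return interval - rng.uniform(0.0, spread)

# Draw a handful of delays with a seeded generator so runs are repeatable.
rng = random.Random(42)
delays = [next_check_delay(rng=rng) for _ in range(5)]
print([round(d, 1) for d in delays])
```

Because each check draws its own offset, checks that start out on the same second would tend to drift apart over successive runs.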
Here's another suggestion: an option that would tell Nagios to stagger the
scheduling of service checks when the check_period resumes. Instead of
scheduling all the checks for the exact time that the next check_period
begins, offset each check from that time by however far past the end of the
check_period its next run would have fallen had checks not been disabled.
For example, suppose I have a check_period that runs from 9:00 to 17:00 every
day. A service checked every 5 minutes that last ran at 16:57:14 would
normally run again at 17:02:14 if the check_period did not end at 17:00. With
this option, the check would be scheduled to run at 9:02:14 the next day
instead of 9:00:00. This should keep all checks staggered by the same amount
of time in the schedule once the check_period resumes.
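The arithmetic behind this suggestion can be sketched in a few lines of Python. This is only an illustration of the proposal, not anything Nagios does today; the function name staggered_resume_time and its arguments are hypothetical, and the dates are arbitrary.

```python
from datetime import datetime, timedelta

def staggered_resume_time(last_run, interval, period_end, next_period_start):
    """Return the first run time for a check after its check_period resumes.

    If the next regular run would fall past the end of the period, carry the
    overshoot forward and add it to the start of the next period.
    """
    would_have_run = last_run + interval
    if would_have_run <= period_end:
        # The next run still falls inside the current period.
        return would_have_run
    overshoot = would_have_run - period_end
    return next_period_start + overshoot

# The example from the text: 9:00-17:00 period, 5-minute check, last run 16:57:14.
last_run = datetime(2010, 11, 30, 16, 57, 14)
period_end = datetime(2010, 11, 30, 17, 0, 0)
next_start = datetime(2010, 12, 1, 9, 0, 0)
resume = staggered_resume_time(last_run, timedelta(minutes=5), period_end, next_start)
print(resume)  # 2010-12-01 09:02:14
```

Two checks that were 10 seconds apart before the period ended would still be 10 seconds apart when it resumes, which is the point of the suggestion.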
I think this would be an ideal solution to the problem. Using the
auto_rescheduling options (discussed below) seems to help a little bit but not
as much as I'd hoped.
>
> You really should be using scheduled downtime for regular downtime though. There
> are pre-hacked solutions to automagically reschedule re-occurring downtime. Ninja
> supports it out of the box as of the latest version (or possibly latest git).
There are some cases where we really should not be running the checks during
down times because of the extra load they put on our system when they fail.
(Checks are still run during down times, if I'm not mistaken; only
notifications are inhibited.) Many of our checks fail in this case by timing
out, and they use relatively scarce (shared) and resource-intensive processes
(web browser sessions run under SeleniumRC). Timeouts tend to be long for
these checks, so contention for these processes increases when all the checks
using them start failing. The failing checks are also retried more often until
they all reach a 'hard' failure state. Maybe we can live with this, but it would
be easier on the system to just inhibit checks we know are going to fail
during certain regularly scheduled down times.
>
>> These aren't critical issues for us since we can work around them
>> procedurally.
>
> That's good to hear.
>
>> But I wonder if there is a way to prevent the scheduled checks
>> from getting bunched together like this if/when you need to inhibit checks for
>> a time while keeping Nagios running. Maybe the auto_rescheduling options in
>> the nagios.cfg are meant to address this, but they have a potentially negative
>> effect on performance according to the comments around them in the file.
>>
>
> The below text is what I'd call "educated speculation" after having thrown
> a quick glance at the code. I might be completely wrong, but I don't think
> so.
>
> Not potentially; they do have a negative side effect. This is because they
> maintain the scheduling intervals between checks stable over time by adding
> them to the scheduling queue all the time when they're supposed to run, but
> not actually executing them. So if you've scheduled downtime for 4 hours and
> have a default check-interval of 5 minutes, auto_rescheduling will schedule
> the check every 30 seconds (default) that entire time, but not actually run
> the check command unless it's time to do so.
>
> On the one hand, it shouldn't actually cause any major problems since it'll
> still do less than it would do were the checks enabled. On the other hand,
> it should be solvable without such hackery, but with the downside that
> a check executed 3 minutes before downtime started may not be executed again
> until a few minutes after downtime ends. That's how the auto_reschedule
> option works too though, if I'm reading the code correctly.
>
I haven't been able to find anything about how the auto_rescheduling options
actually work but, if they do work this way, why reschedule a check at
30-second intervals if it's not going to be run anyway until its regularly
scheduled time?
From reading the comments in the nagios.cfg (below), I get the impression
that, for these values, every 30 seconds Nagios will look at all the checks
that are scheduled within the next 180 second window and reschedule them to
spread them out equally over time.
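To show what I mean by that reading, here is a rough Python sketch. It models only my impression from the config comments, not the actual Nagios implementation; the function smooth_window and the use of plain numbers of seconds for check times are assumptions made for illustration.

```python
def smooth_window(check_times, now, window=180):
    """Spread checks due within [now, now + window) evenly across that window.

    Checks scheduled outside the window are left untouched. Times are plain
    numbers of seconds; the result is returned in sorted order.
    """
    in_window = [t for t in check_times if now <= t < now + window]
    others = [t for t in check_times if not (now <= t < now + window)]
    if not in_window:
        return sorted(others)
    step = window / len(in_window)
    spread = [now + i * step for i in range(len(in_window))]
    return sorted(others + spread)

# Four checks bunched on the same second, one check well outside the window.
times = [100, 100, 100, 100, 400]
print(smooth_window(times, now=100))  # [100.0, 145.0, 190.0, 235.0, 400]
```

Under this reading, a pass like this would repeat every auto_rescheduling_interval seconds, each time looking only at the next auto_rescheduling_window seconds of the schedule.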
I tried setting these options to watch what will happen when a check_period
ends and begins. When a check_period ends, Nagios still schedules all checks
affected by it at the exact time of the beginning of the next check_period
but, once the next check_period begins, Nagios does seem to spread the checks
out a little bit. Some checks are still bunched together, scheduled to run in
the same second; others run one second later, and still other groups run
concurrently 10 seconds later. I'm going to experiment more with different
interval and window values.
> # AUTO-RESCHEDULING OPTION
> # This option determines whether or not Nagios will attempt to
> # automatically reschedule active host and service checks to
> # "smooth" them out over time. This can help balance the load on
> # the monitoring server.
> # WARNING: THIS IS AN EXPERIMENTAL FEATURE - IT CAN DEGRADE
> # PERFORMANCE, RATHER THAN INCREASE IT, IF USED IMPROPERLY
>
> auto_reschedule_checks=1
>
>
>
> # AUTO-RESCHEDULING INTERVAL
> # This option determines how often (in seconds) Nagios will
> # attempt to automatically reschedule checks. This option only
> # has an effect if the auto_reschedule_checks option is enabled.
> # Default is 30 seconds.
> # WARNING: THIS IS AN EXPERIMENTAL FEATURE - IT CAN DEGRADE
> # PERFORMANCE, RATHER THAN INCREASE IT, IF USED IMPROPERLY
>
> auto_rescheduling_interval=30
>
>
>
> # AUTO-RESCHEDULING WINDOW
> # This option determines the "window" of time (in seconds) that
> # Nagios will look at when automatically rescheduling checks.
> # Only host and service checks that occur in the next X seconds
> # (determined by this variable) will be rescheduled. This option
> # only has an effect if the auto_reschedule_checks option is
> # enabled. Default is 180 seconds (3 minutes).
> # WARNING: THIS IS AN EXPERIMENTAL FEATURE - IT CAN DEGRADE
> # PERFORMANCE, RATHER THAN INCREASE IT, IF USED IMPROPERLY
>
> auto_rescheduling_window=180