Slow scheduled service checks
Tedman Eng
teng at dataway.com
Tue Sep 21 01:45:16 CEST 2004
The (s)mart inter-check delay is based on the premise that all or most checks
run on the same regular schedule (say, every 5 minutes). If you have many
checks that don't run at that interval (say, every 2 hours), those
long-interval checks skew the calculation. The normal formula is "add up the
check intervals of all services, then divide by the number of services". This
gives an average check interval, from which the delay for 's' is derived. The
problem is that the average is pulled toward the long checks, so the
short-interval checks get scheduled later than their desired check interval.
I'm not sure there is an easy fix for this behavior (some sort of weighted
average?). It has not changed in current versions of Nagios. Consider it an
implementation detail. :-)
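
To put rough numbers on the skew, here is a minimal sketch. It assumes the
documented formula is "average check interval divided by the total number of
services", and the service mix (540 checks every 5 minutes, 60 checks every
2 hours) is made up for illustration. The function names and the cutoff
parameter are hypothetical, not anything from Nagios itself.

```python
def smart_delay(intervals):
    """Delay as the 's' method is described above: average interval / service count."""
    if not intervals:
        raise ValueError("need at least one service")
    average = sum(intervals) / len(intervals)
    return average / len(intervals)


def hand_delay(intervals, cutoff):
    """Same formula, but checks with an interval above `cutoff` seconds
    are left out of the average (the hand calculation)."""
    short = [i for i in intervals if i <= cutoff]
    if not short:
        raise ValueError("no checks at or below the cutoff")
    average = sum(short) / len(short)
    # Assumption: still divide by the full service count, so that every
    # check (long-interval ones included) gets a slot in the first pass.
    return average / len(intervals)


# Hypothetical mix: 540 services every 5 minutes, 60 services every 2 hours.
intervals = [300] * 540 + [7200] * 60

for name, delay in [("smart", smart_delay(intervals)),
                    ("hand", hand_delay(intervals, cutoff=600))]:
    spread = delay * len(intervals)  # time needed to schedule every check once
    print(f"{name}: delay {delay:.2f} s, first pass takes {spread:.0f} s")
```

With this mix the averaged delay comes out at 1.65 seconds, so the first pass
through all 600 checks takes 990 seconds, well past the 5-minute interval.
Leaving the 2-hour checks out of the average gives 0.5 seconds and a
300-second pass. A hand-calculated value like this would go in place of the
's' in inter_check_delay_method.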
-----Original Message-----
From: Jeff Engstrom [mailto:jeff.engstrom at fortix.net]
Sent: Monday, September 20, 2004 4:13 PM
To: nagios-users at lists.sourceforge.net
Subject: RE: [Nagios-users] Slow scheduled service checks
That fixed my recheck times! Thanks!!
I find it strange however, that the "smart" setting was not working...
I am running Nagios 1.1 and I wonder if 1.2 fixes this issue??
On Mon, 2004-09-20 at 14:30, Tedman Eng wrote:
> Check latency is indeed very high on your system. It is the time between
> when a check is supposed to run and when it actually gets run. By
> comparison, it should be between 1 and 30 seconds, depending on network
> conditions and Nagios load.
>
> If you have a very large number of down hosts, this can also affect your
> latency, since Nagios "pauses" to check a host and thus skews the
> scheduling queue when this happens. It can usually catch up though if the
> other checks have enough headroom in the scheduling queue.
>
> Look at your scheduling queue (best done right after a restart). The
> checks should be spaced out evenly. If your normal check interval for most
> services is 5 minutes, look to see that all of your services are scheduled
> to complete before that 5 minutes is up.
>
> Try manually setting your inter-check delay.
> Your value should be just below 0.5 (one check every half second, i.e.
> 300 seconds / 600 services) if you have 600 services actively checked.
>
> -----Original Message-----
> From: Jeff Engstrom [mailto:jeff.engstrom at fortix.net]
> Sent: Monday, September 20, 2004 2:01 PM
> To: Nagios-Users
> Cc: teng at dataway.com
> Subject: RE: [Nagios-users] Slow scheduled service checks
>
>
> Here is the servers performance metrics...
>
> Time Frame            Checks Completed
> <= 1 minute:          35 (5.3%)
> <= 5 minutes:         249 (37.5%)
> <= 15 minutes:        664 (100.0%)
> <= 1 hour:            664 (100.0%)
> Since program start:  664 (100.0%)
>
> Metric                 Min.      Max.      Average
> Check Execution Time:  < 1 sec   5 sec     0.396 sec
> Check Latency:         359 sec   476 sec   415.349 sec
> Percent State Change:  0.00%     17.04%    0.03%
>
> I don't have any excessively long check intervals as you might notice
> from the data above. The check latency seems high to me but I don't
> have a complete understanding of what the value represents.
>
> Thanks again!
> Jeff
>
>
> On Mon, 2004-09-20 at 13:24, Tedman Eng wrote:
> > Please let us know your performance metrics
> >
> > Check execution times and check latency (table in the top right).
> > It would also be helpful to see the active check completion rate (table
> > in the top left).
> >
> > These should help pinpoint where the slowdown is.
> >
> >
> > Also, to optimize: if you have some checks that are long-intervalled (run
> > only once every day, etc.), you should consider hand-calculating the
> > inter-check delay rather than using the 's' method. Use the formula from
> > the documentation, but toss out any long-interval checks, since they'll
> > adversely skew the calculations.
> >
> >
> > -----Original Message-----
> > From: Jeff Engstrom [mailto:jeff.engstrom at fortix.net]
> > Sent: Monday, September 20, 2004 10:41 AM
> > To: nagios-users at lists.sourceforge.net
> > Subject: [Nagios-users] Slow scheduled service checks
> >
> >
> > Hello all,
> >
> > I have a server monitoring some 1500 points and it seems for the most
> > part to run quite well. However, for one reason or another the "Last
> > Check" times are off when a service is down. That is not the only
> > problem actually... it appears that it can take some 15mins after the
> > service is restored for the update to reach the interface.
> >
> > The main cfg is detailed below...
> >
> > check_external_commands=1
> > command_check_interval=-1
> > log_rotation_method=d
> > use_syslog=1
> > log_notifications=1
> > log_service_retries=1
> > log_host_retries=1
> > log_event_handlers=1
> > log_initial_states=1
> > log_external_commands=1
> > log_passive_service_checks=1
> > inter_check_delay_method=s
> > service_interleave_factor=s
> > max_concurrent_checks=18
> > service_reaper_frequency=3
> > sleep_time=1
> > service_check_timeout=60
> > host_check_timeout=60
> > event_handler_timeout=30
> > notification_timeout=30
> > ocsp_timeout=5
> > perfdata_timeout=5
> > retain_state_information=1
> > retention_update_interval=60
> > use_retained_program_state=0
> > interval_length=60
> > use_agressive_host_checking=0
> > execute_service_checks=1
> > accept_passive_service_checks=1
> > enable_notifications=1
> > enable_event_handlers=1
> > process_performance_data=0
> > obsess_over_services=1
> > ocsp_command=submit_check_result
> > check_for_orphaned_services=1
> > check_service_freshness=1
> > freshness_check_interval=60
> > aggregate_status_updates=1
> > status_update_interval=15
> > enable_flap_detection=1
> > low_service_flap_threshold=5.0
> > high_service_flap_threshold=20.0
> > low_host_flap_threshold=5.0
> > high_host_flap_threshold=20.0
> >
> > Thanks for any help on this!
> >
> >
> > -------------------------------------------------------
> > This SF.Net email is sponsored by: YOU BE THE JUDGE. Be one of 170
> > Project Admins to receive an Apple iPod Mini FREE for your judgement on
> > who ports your project to Linux PPC the best. Sponsored by IBM.
> > Deadline: Sept. 24. Go here: http://sf.net/ppc_contest.php
> > _______________________________________________
> > Nagios-users mailing list
> > Nagios-users at lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/nagios-users
> > ::: Please include Nagios version, plugin version (-v) and OS when
> > reporting any issue.
> > ::: Messages without supporting info will risk being sent to /dev/null