massive service check latencies
Andreas Ericsson
ae at op5.se
Wed Mar 23 17:45:30 CET 2005
Ben wrote:
> I've been having a horrible time with service check latencies. I've got
> ~6k services so I thought at first maybe my hardware couldn't keep up.
> But after moving to much beefier hardware, things have actually gotten
> worse, not better. So I figured, I'd been running a recent beta...
> maybe one of the new checkins fixed something. I tried to pull down the
> latest from CVS this morning, and it has the same situation.
>
I assume you're running the very latest of the 2.x branch then.
> So now I think I just have a basic misunderstanding of the way nagios
> schedules checks. Here's how I've tweaked my settings to try to make
> things run more frequently:
>
> service_inter_check_delay_method=n
From the Nagios docs, regarding service_inter_check_delay_method:
n = Don't use any delay - schedule all service checks to run immediately
(i.e. at the same time!)
Perhaps this would be better off as 0.3 (a fixed delay in seconds) or s
(s meaning Nagios calculates a "smart" delay between checks on its own).
> max_service_check_spread=60
With this setting you're telling Nagios it may spread its initial service
checks over as much as an entire hour (the value is in minutes). The docs
also say that this overrides service_inter_check_delay_method ("if
necessary", whatever that means).
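To put rough numbers on these two settings, here is a minimal sketch of the arithmetic. It assumes the smart method works as the docs describe it (average check interval divided by the number of services). The 5-minute average check interval is purely an assumption, since the post doesn't state one, and delay_between_checks is a made-up helper name.

```shell
#!/bin/sh
# delay_between_checks WINDOW_SECONDS SERVICE_COUNT
# Prints the spacing (in seconds) that fits SERVICE_COUNT checks evenly
# into a window of WINDOW_SECONDS. Hypothetical helper, for illustration only.
delay_between_checks() {
    awk -v window="$1" -v n="$2" 'BEGIN { printf "%.2f\n", window / n }'
}

services=6000        # "~6k services" from the original post
spread_minutes=60    # max_service_check_spread=60 (value is in minutes)
avg_interval=300     # ASSUMPTION: 5-minute average check interval, in seconds

echo "spacing implied by the 60-minute spread: $(delay_between_checks $(( spread_minutes * 60 )) "$services") s"
echo "spacing from the smart (s) method:       $(delay_between_checks "$avg_interval" "$services") s"
echo "time to start all checks at 0.3 s each:  $(awk -v n="$services" 'BEGIN { printf "%.0f", 0.3 * n }') s"
```

Under those assumptions the spread cap works out to 0.60 s between checks, the smart method to 0.05 s, and a fixed 0.3 s delay needs about 1800 s to get through all 6000 services once.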
> service_interleave_factor=s
Seems correct.
> host_inter_check_delay_method=n
> max_host_check_spread=60
Either you've overconfigured your Nagios, or you have enabled scheduled
host checks without reading the docs about them. Host checks are executed
serially (one at a time), so you'll see some serious service check
latencies if you have them enabled.
> max_concurrent_checks=0
> service_reaper_frequency=5
>
This seems right, but if load isn't high you should set
service_reaper_frequency lower. Try 2 or something.
> What I notice is that checks are queued up several dozen at a time, and
> that they all have to finish before the next batch can begin.
Not true. Service checks are scheduled and run on demand. Scheduled
host checks fuck up the service check scheduling.
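A back-of-the-envelope sketch of why serial host checks can hurt this much, using the figures Ben gives further down (about 2300 hosts, about 3 seconds average execution time). It assumes a worst case where every host is checked once per pass and nothing else runs while a host check is in progress. Real latency depends on the check intervals actually configured, and serial_pass_seconds is a made-up helper name.

```shell
#!/bin/sh
# Worst-case duration of one pass of strictly serial host checks.
# Hypothetical helper, for illustration only.
serial_pass_seconds() {
    # $1 = number of hosts, $2 = average execution time per check (seconds)
    echo $(( $1 * $2 ))
}

hosts=2300       # "about 2300 hosts" from the original post
exec_seconds=3   # "about 3 seconds" average execution time from the post

total=$(serial_pass_seconds "$hosts" "$exec_seconds")
echo "one serial pass over all hosts: ${total} s (~$(( total / 60 )) min)"
```

That comes to 6900 s, or roughly 115 minutes, so even a fraction of a pass is enough to hold service checks back by hundreds of seconds.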
> As far as I
> can tell, there is no way to make the size of the batch grow, or to stop
> waiting for all checks to finish before moving on. The hardware (dual 2.8
> xeon with 2.5GB of ram dedicated to monitoring) is not at all stressed.
>
>
> Interestingly, while my service check latencies average around 500
> seconds, my host check latencies are well under 1 second, which is what I
> would expect. FWIW, I've got about 2300 hosts.
>
> Oh, and the average execution time for both service and host checks is
> about 3 seconds.
>
With Perl checks you can most likely cut that to 20% with this simple
sed line:
sed -i -e 's|^\(#.*/bin/perl\).*|\1|' -e 's/use strict;/# &/' /path/to/plugin.pl
sed 4.0.9 or higher required (for the -i switch). In effect, it comments
out the strict pragma and strips all switches (such as -wT) from the
perl shebang line.
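To see what that kind of edit does before running it on real plugins, here is a small sketch that applies it to a throwaway file. It uses "|" as the delimiter in the first expression so the slashes in /bin/perl need no escaping, and a plain "&" in the second so the matched text is reinserted. The sample plugin and the strip_perl_overhead helper are invented for the demonstration, and -i as used here assumes GNU sed.

```shell
#!/bin/sh
# Hypothetical throwaway plugin, created only for this demonstration.
plugin=$(mktemp)
cat > "$plugin" <<'EOF'
#!/usr/bin/perl -wT
use strict;
print "OK\n";
EOF

strip_perl_overhead() {
    # 1st expression: keep the shebang up to "/bin/perl", drop trailing switches.
    # 2nd expression: comment out "use strict;" ("&" stands for the matched text).
    sed -i -e 's|^\(#.*/bin/perl\).*|\1|' -e 's/use strict;/# &/' "$1"
}

strip_perl_overhead "$plugin"
cat "$plugin"    # show the edited file
rm -f "$plugin"
```

The first line of the sample becomes "#!/usr/bin/perl", the second becomes "# use strict;", and the print line is left alone. Keep a backup of real plugins first (for example with -i.bak), since dropping -T and strict changes how they behave.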
--
Andreas Ericsson andreas.ericsson at op5.se
OP5 AB www.op5.se
Lead Developer