trying to fix problem with excessive latency
Corey Hickey
bugfood-ml at fatooh.org
Wed May 19 03:29:56 CEST 2010
Hello,
I have inherited maintenance of a medium-sized Nagios installation. We
currently have 649 hosts and 5415 services. Our setup works nicely, with
one exception: Nagios falls behind on host/service checks. Our usual
latency once Nagios has been running for a while is about 190-200
seconds. Our Nagios host is reasonably powerful and isn't struggling; it
seems that Nagios itself is limited somehow.
I've searched Google and read every relevant document I could find,
including the tuning page:
http://nagios.sourceforge.net/docs/3_0/tuning.html
So far I haven't been able to find anything wrong with our
configuration, and my experimental tuning hasn't resulted in any
improvement. As far as I can tell, Nagios is scheduling the host/service
checks properly, but not processing the queue aggressively enough.
Some notes:
1. The Nagios host has eight 2 GHz cores and is usually 75-85% idle. Out of 4
GB of memory, 1.2 GB is free, with no swap usage. We don't seem to be
running into any physical limitations.
2. Raising max_concurrent_checks doesn't help; 'nagios -s' recommends a
value of at least 599, so we're using 1200. I've tried absurdly high
values like 6000, with no improvement.
3. Lowering service_reaper_frequency to 2 doesn't seem to help; in any
case, our latency of 190 seconds is far higher than the service_reaper_frequency.
4. I tried setting max_check_result_reaper_time to 30; no change. I
don't know what I should set this to.
5. I tried disabling all host check scheduling (setting check_interval
to 0 in our host template); that may have helped (I'm seeing 173 second
latency instead of 190) but didn't really solve the problem.
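For anyone who wants to compare these settings against their own nagios.cfg, here is a minimal sketch of pulling the directives out of the file. The sample text only mirrors the values mentioned in the notes above (it is not the attached file), and the helper name parse_directives is made up for illustration.

```python
# Minimal sketch: extract selected tuning directives from nagios.cfg text.
# ASSUMPTION: SAMPLE_CFG is illustrative and only repeats the values
# mentioned in the notes above; it is not the full attached nagios.cfg.

TUNING_KEYS = (
    "max_concurrent_checks",
    "service_reaper_frequency",
    "max_check_result_reaper_time",
)


def parse_directives(cfg_text: str) -> dict:
    """Return key -> value for each 'key=value' line, skipping comments.

    Directives that may repeat (such as cfg_file) keep only the last
    value here, which is fine for the single-valued tuning keys.
    """
    directives = {}
    for raw in cfg_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        directives[key.strip()] = value.strip()
    return directives


SAMPLE_CFG = """\
# values tried so far
max_concurrent_checks=1200
service_reaper_frequency=2
max_check_result_reaper_time=30
"""

settings = parse_directives(SAMPLE_CFG)
for key in TUNING_KEYS:
    print(f"{key} = {settings.get(key, '(not set)')}")
```

To run it against a real file, read the file's contents into a string and pass that to parse_directives in place of SAMPLE_CFG.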
I'm attaching our main nagios.cfg file and including the output of
nagiostats below.
The host is running 64-bit CentOS 5.4 with a 2.6.18 kernel.
-----------------------------------------------------------------------
Nagios Stats 3.2.1
Copyright (c) 2003-2008 Ethan Galstad (www.nagios.org)
Last Modified: 03-09-2010
License: GPL
CURRENT STATUS DATA
------------------------------------------------------
Status File: /var/log/nagios/status.log
Status File Age: 0d 0h 0m 6s
Status File Version: 3.2.1
Program Running Time: 0d 0h 18m 22s
Nagios PID: 1556
Used/High/Total Command Buffers: 0 / 0 / 4096
Total Services: 5415
Services Checked: 5415
Services Scheduled: 5415
Services Actively Checked: 5415
Services Passively Checked: 0
Total Service State Change: 0.000 / 30.390 / 0.024 %
Active Service Latency: 5.878 / 197.462 / 194.633 sec
Active Service Execution Time: 0.020 / 120.007 / 0.847 sec
Active Service State Change: 0.000 / 30.390 / 0.024 %
Active Services Last 1/5/15/60 min: 767 / 4236 / 5412 / 5415
Passive Service Latency: 0.000 / 0.000 / 0.000 sec
Passive Service State Change: 0.000 / 0.000 / 0.000 %
Passive Services Last 1/5/15/60 min: 0 / 0 / 0 / 0
Services Ok/Warn/Unk/Crit: 5358 / 6 / 0 / 51
Services Flapping: 1
Services In Downtime: 22
Total Hosts: 649
Hosts Checked: 649
Hosts Scheduled: 649
Hosts Actively Checked: 649
Host Passively Checked: 0
Total Host State Change: 0.000 / 0.000 / 0.000 %
Active Host Latency: 0.000 / 196.614 / 194.274 sec
Active Host Execution Time: 0.020 / 11.019 / 0.069 sec
Active Host State Change: 0.000 / 0.000 / 0.000 %
Active Hosts Last 1/5/15/60 min: 91 / 506 / 649 / 649
Passive Host Latency: 0.000 / 0.000 / 0.000 sec
Passive Host State Change: 0.000 / 0.000 / 0.000 %
Passive Hosts Last 1/5/15/60 min: 0 / 0 / 0 / 0
Hosts Up/Down/Unreach: 646 / 3 / 0
Hosts Flapping: 0
Hosts In Downtime: 0
Active Host Checks Last 1/5/15 min: 101 / 536 / 1609
   Scheduled: 98 / 520 / 1562
   On-demand: 3 / 16 / 47
   Parallel: 99 / 522 / 1566
   Serial: 0 / 0 / 0
   Cached: 3 / 15 / 44
Passive Host Checks Last 1/5/15 min: 0 / 0 / 0
Active Service Checks Last 1/5/15 min: 872 / 4360 / 13101
   Scheduled: 872 / 4360 / 13101
   On-demand: 0 / 0 / 0
   Cached: 0 / 0 / 0
Passive Service Checks Last 1/5/15 min: 0 / 0 / 0
External Commands Last 1/5/15 min: 0 / 0 / 0
-----------------------------------------------------------------------
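The throughput implied by these numbers can be sketched as below. This is back-of-the-envelope arithmetic, not anything Nagios computes itself: the function names are made up for illustration, and the 5-minute check interval is an assumption, since the real intervals are in the object definitions, which are not shown here. Substitute the actual interval before drawing conclusions.

```python
# Rough throughput check using the nagiostats figures quoted above.
# ASSUMPTION: every service uses a 5-minute check interval. The real
# intervals live in the object configuration, which is not shown here.


def required_rate_per_min(total_checks: int, interval_min: float) -> float:
    """Checks per minute needed to run every check once per interval."""
    return total_checks / interval_min


def observed_rate_per_min(checks_in_window: int, window_min: float) -> float:
    """Average checks per minute actually run during the window."""
    return checks_in_window / window_min


TOTAL_SERVICES = 5415             # "Total Services"
SERVICE_CHECKS_LAST_5_MIN = 4360  # "Active Service Checks Last 1/5/15 min"
ASSUMED_INTERVAL_MIN = 5.0        # hypothetical, see the note above

required = required_rate_per_min(TOTAL_SERVICES, ASSUMED_INTERVAL_MIN)
observed = observed_rate_per_min(SERVICE_CHECKS_LAST_5_MIN, 5.0)
shortfall = required - observed

print(f"required:  {required:.0f} checks/min")
print(f"observed:  {observed:.0f} checks/min")
print(f"shortfall: {shortfall:.0f} checks/min")
```

Under that assumed interval, the observed rate falls short of the rate needed to keep every service on schedule, which would match the impression that checks are scheduled properly but not executed quickly enough.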
I have a feeling I'm missing something.... I would appreciate any
suggestions.
Thanks,
Corey
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: nagios.cfg
URL: <https://www.monitoring-lists.org/archive/users/attachments/20100518/e75cc59b/attachment.ksh>