Problem with high latencies after going distributed
Frost, Mark {PBG}
mark.frost1 at pepsi.com
Tue Jan 22 17:34:28 CET 2008
As I mentioned in a previous message, I'm in the process of converting
from a centralized Nagios 2.10 setup, all running on a single host, to a
distributed setup running on at least 3 hosts (3 to start, anyway). The
centralized setup has 572 hosts and 2900 services, 99.9% of which are
active checks.
My approach when going passive was to group the checks so that some run
on the first distributed node and some run on the second node. The
central server does freshness checking and runs an active check itself
if it fails to get a result back from a distributed node within 20
minutes (virtually all checks run at intervals of 15 minutes or less).
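A minimal sketch of what the central server's side of this could look
like as a service template. The template name is hypothetical, and the
check_command that the central server would run for a stale service is
not shown; the 1200-second threshold corresponds to the 20 minutes
mentioned above.

```
define service {
    name                    nagios-central-passive-service  ; hypothetical name
    active_checks_enabled   0     ; no regularly scheduled active checks
    passive_checks_enabled  1     ; accept results from the distributed nodes
    check_freshness         1     ; stale results trigger a forced active check
    freshness_threshold     1200  ; seconds (20 minutes)
    register                0     ; template only
}
```

This assumes service freshness checking is also enabled globally in the
main configuration file.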
Our old centralized server reports the following via nagiostats
(min/max/avg):
Active Service Latency: 0.000 / 9.129 / 0.833 sec
Active Service Execution Time: 0.037 / 10.045 / 0.227 sec
I started noticing a fair number of checks going stale on the new
reporting server, and that server would then run those service checks
actively. I could see no reason for this.
When I had a look at the distributed nodes, I saw:
Distributed Node 1 (min/max/avg)
Active Service Latency: 0.000 / 7267.198 / 4241.019 sec
Active Service Execution Time: 0.000 / 60.014 / 0.651 sec
Distributed Node 2 (min/max/avg)
Active Service Latency: 0.000 / 11475.901 / 6393.641 sec
Active Service Execution Time: 0.000 / 60.018 / 0.593 sec
Wow.
I reviewed the performance doc for Nagios 2.x yet again, and I'm not
finding anything there that I'm not already doing that would affect
latencies this much. These boxes are dedicated to Nagios, so there's no
other application competing for resources. They're on the same subnet.
I run a few Perl checks, but those are a very small percentage of my
checks. The distributed nodes are newer and have more resources (faster
CPU, at least as much memory) than the old standalone box.
The only thing I can think of that could be unusual is that both
distributed nodes know about all hosts and services. I have created a
configuration whereby each service on node 1 gets one of two templates,
depending on whether node 1 is supposed to check it:
define service {
    name                    nagios-dist-check-service
    freshness_threshold     1200
    active_checks_enabled   1
    check_freshness         0
    check_period            24x7
    event_handler_enabled   0
    flap_detection_enabled  0
    notifications_enabled   0
    obsess_over_service     1
    passive_checks_enabled  0
    process_perf_data       0
    register                0
}
define service {
    name                    nagios-dist-nocheck-service
    freshness_threshold     1200
    active_checks_enabled   0
    check_freshness         0
    check_period            none
    event_handler_enabled   0
    flap_detection_enabled  0
    notifications_enabled   0
    obsess_over_service     0
    passive_checks_enabled  0
    process_perf_data       0
    register                0
}
So services on node 1 that are supposed to be run get the
nagios-dist-check-service template, and those that should not be run
get the nagios-dist-nocheck-service template.
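For illustration, a service that node 1 should run might attach the
template as sketched below. The host name, service description, and
check command are placeholders rather than values from the real
configuration, and the other directives a service needs (check
intervals, max_check_attempts, contacts, and so on) are omitted.

```
define service {
    use                  nagios-dist-check-service
    host_name            example-host                    ; placeholder
    service_description  PING                            ; placeholder
    check_command        check_ping!100.0,20%!500.0,60%  ; placeholder
}
```

A service that node 1 should skip would be identical except for
"use nagios-dist-nocheck-service".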
Is there something about Nagios that I don't understand that would cause
a lot of disabled service checks to shoot latencies way up? Is something
else going on here?
Here's my output of nagios -s on one of the nodes (both yield similar
output and are configured similarly):
Nagios 2.10
Copyright (c) 1999-2007 Ethan Galstad (http://www.nagios.org)
Last Modified: 10-21-2007
License: GPL
Projected scheduling information for host and service
checks is listed below. This information assumes that
you are going to start running Nagios with your current
config files.
HOST SCHEDULING INFORMATION
---------------------------
Total hosts: 569
Total scheduled hosts: 0
Host inter-check delay method: SMART
Average host check interval: 0.00 sec
Host inter-check delay: 0.00 sec
Max host check spread: 30 min
First scheduled check: N/A
Last scheduled check: N/A
SERVICE SCHEDULING INFORMATION
-------------------------------
Total services: 2917
Total scheduled services: 1122
Service inter-check delay method: SMART
Average service check interval: 385.13 sec
Inter-check delay: 0.34 sec
Interleave factor method: SMART
Average services per host: 5.13
Service interleave factor: 2
Max service check spread: 30 min
First scheduled check: Tue Jan 22 11:35:47 2008
Last scheduled check: Tue Jan 22 11:42:12 2008
CHECK PROCESSING INFORMATION
----------------------------
Service check reaper interval: 2 sec
Max concurrent service checks: Unlimited
PERFORMANCE SUGGESTIONS
-----------------------
I have no suggestions - things look okay.
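As a sanity check on the service figures above: assuming the SMART
method derives the inter-check delay by dividing the average service
check interval by the number of scheduled services (an assumption about
the scheduler, not something the output states), the reported values are
consistent with each other.

```latex
\[
\text{inter-check delay} \approx \frac{385.13\ \text{s}}{1122} \approx 0.34\ \text{s},
\qquad
1122 \times 0.34\ \text{s} \approx 385\ \text{s} = 6\ \text{min}\ 25\ \text{s}
\]
```

That 6 min 25 s matches the gap between the first and last scheduled
checks (11:35:47 to 11:42:12), so the initial schedule itself looks sane
for 1122 services.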
Any help is greatly appreciated. Thanks.
Mark
_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users