High Service Check Latency
Simone Felici
s.felici at mclink.eu
Tue May 22 09:46:31 CEST 2012
Hello!
Yes, it's a common problem, but I cannot figure out how to debug it.
I have a distributed setup with a master server collecting more than 9,000 passive services sent from other
servers, all with active latencies near 0. The master server actively checks *only* itself,
about 40 services, most of them every 5 minutes. AFAIK passive services should not affect the
"active service check latency" statistics. Looking into the retention.dat file, the high latencies are
all related to the locally executed active services. Current stats:
Nagios Stats 3.2.3
Copyright (c) 2003-2008 Ethan Galstad (www.nagios.org)
Last Modified: 10-03-2010
License: GPL
CURRENT STATUS DATA
------------------------------------------------------
Status File: /usr/local/nagios/var/status.dat
Status File Age: 0d 0h 0m 7s
Status File Version: 3.2.3
Program Running Time: 0d 20h 40m 53s
Nagios PID: 9360
Used/High/Total Command Buffers: 0 / 7 / 10000
Total Services: 9098
Services Checked: 9098
Services Scheduled: 33
Services Actively Checked: 39
Services Passively Checked: 9059
Total Service State Change: 0.000 / 100.000 / 1.351 %
Active Service Latency: 4.156 / 7943.743 / 6163.392 sec <<<<<<<<
Active Service Execution Time: 0.010 / 2.485 / 0.319 sec
Active Service State Change: 0.000 / 22.890 / 2.443 %
Active Services Last 1/5/15/60 min: 0 / 0 / 0 / 0
Passive Service Latency: 0.088 / 7.914 / 1.997 sec
Passive Service State Change: 0.000 / 100.000 / 1.346 %
Passive Services Last 1/5/15/60 min: 1851 / 7501 / 8084 / 8392
Services Ok/Warn/Unk/Crit: 8784 / 78 / 76 / 160
Services Flapping: 4
Services In Downtime: 112
Total Hosts: 1912
Hosts Checked: 1912
Hosts Scheduled: 0
Hosts Actively Checked: 74
Host Passively Checked: 1838
Total Host State Change: 0.000 / 46.910 / 0.135 %
Active Host Latency: 0.000 / 1425.848 / 1104.205 sec
Active Host Execution Time: 0.012 / 0.402 / 0.096 sec
Active Host State Change: 0.000 / 0.000 / 0.000 %
Active Hosts Last 1/5/15/60 min: 0 / 0 / 0 / 0
Passive Host Latency: 0.000 / 639.353 / 1.197 sec
Passive Host State Change: 0.000 / 46.910 / 0.140 %
Passive Hosts Last 1/5/15/60 min: 1 / 12 / 27 / 70
Hosts Up/Down/Unreach: 1850 / 57 / 5
Hosts Flapping: 0
Hosts In Downtime: 35
Active Host Checks Last 1/5/15 min: 42 / 194 / 565
   Scheduled: 0 / 0 / 0
   On-demand: 42 / 194 / 565
   Parallel: 0 / 0 / 0
   Serial: 0 / 0 / 0
   Cached: 42 / 194 / 565
Passive Host Checks Last 1/5/15 min: 1 / 14 / 45
Active Service Checks Last 1/5/15 min: 0 / 0 / 0
   Scheduled: 0 / 0 / 0
   On-demand: 0 / 0 / 0
   Cached: 0 / 0 / 0
Passive Service Checks Last 1/5/15 min: 2311 / 9235 / 12988
External Commands Last 1/5/15 min: 0 / 1 / 1
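To put the latency figure in perspective, here is a minimal Python sketch that pulls the min / max / avg triples out of nagiostats-style lines and expresses the average active service latency in check intervals. The regular expression, the function name parse_triples and the 300-second interval (taken from "most of them every 5 minutes") are assumptions made for illustration; only a few lines of the output above are reproduced as sample input.

```python
import re

# Matches lines of the form "Label: min / max / avg unit" (unit is sec or %).
TRIPLE = re.compile(
    r"^(?P<label>[^:]+):\s*"
    r"(?P<min>[\d.]+)\s*/\s*(?P<max>[\d.]+)\s*/\s*(?P<avg>[\d.]+)\s*"
    r"(?P<unit>sec|%)\s*$"
)


def parse_triples(text):
    """Return {label: (min, max, avg, unit)} for every min/max/avg line."""
    out = {}
    for line in text.splitlines():
        m = TRIPLE.match(line.strip())
        if m:
            out[m["label"].strip()] = (
                float(m["min"]), float(m["max"]), float(m["avg"]), m["unit"])
    return out


# A few lines copied from the stats above; the last one has four values
# and no unit, so it is deliberately not matched.
SAMPLE = """\
Active Service Latency: 4.156 / 7943.743 / 6163.392 sec
Active Service Execution Time: 0.010 / 2.485 / 0.319 sec
Passive Service Latency: 0.088 / 7.914 / 1.997 sec
Active Host Latency: 0.000 / 1425.848 / 1104.205 sec
Services Ok/Warn/Unk/Crit: 8784 / 78 / 76 / 160
"""

stats = parse_triples(SAMPLE)

CHECK_INTERVAL = 300  # seconds; assumed from "most of them every 5 minutes"
_, _, avg, _ = stats["Active Service Latency"]
intervals_behind = avg / CHECK_INTERVAL
print(f"avg active service latency: {avg:.0f} s, "
      f"about {intervals_behind:.1f} check intervals")
```

With these numbers the average latency works out to roughly 20 check intervals, while the execution times stay well under 3 seconds, so the delay is not explained by the checks themselves running slowly.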
I have some broker modules to handle SQL logging and the distributed setup. Other parameters that could be
relevant:
command_check_interval=-1
service_inter_check_delay_method=s
max_concurrent_checks=80
check_result_reaper_frequency=2
max_check_result_reaper_time=30
obsess_over_services=0
obsess_over_hosts=0
Looking at the suggestions given by Nagios itself:
Nagios Core 3.2.3
Copyright (c) 2009-2010 Nagios Core Development Team and Community Contributors
Copyright (c) 1999-2009 Ethan Galstad
Last Modified: 10-03-2010
License: GPL
Website: http://www.nagios.org
Timing information on object configuration processing is listed
below. You can use this information to see if precaching your
object configuration would be useful.
Object Config Source: Config files (uncached)
OBJECT CONFIG PROCESSING TIMES (* = Potential for precache savings with -u option)
----------------------------------
Read: 0.703470 sec
Resolve: 0.018964 sec *
Recomb Contactgroups: 0.454370 sec *
Recomb Hostgroups: 0.010414 sec *
Dup Services: 0.025101 sec *
Recomb Servicegroups: 0.000211 sec *
Duplicate: 0.003912 sec *
Inherit: 0.008386 sec *
Recomb Contacts: 0.000000 sec *
Sort: 0.000003 sec *
Register: 0.050582 sec
Free: 0.006160 sec
============
TOTAL: 1.281574 sec * = 0.521362 sec (40.68%) estimated savings
RETENTION DATA TIMES
----------------------------------
Read and Process: 0.514352 sec
============
TOTAL: 0.514352 sec
Timing information on configuration verification is listed below.
CONFIG VERIFICATION TIMES (* = Potential for speedup with -x option)
----------------------------------
Object Relationships: 0.185991 sec
Circular Paths: 0.020317 sec *
Misc: 0.009450 sec
============
TOTAL: 0.215758 sec * = 0.020317 sec (9.4%) estimated savings
EVENT SCHEDULING TIMES
-------------------------------------
Get service info: 0.014388 sec
Get host info info: 0.002899 sec
Get service params: 0.000010 sec
Schedule service times: 0.000679 sec
Schedule service events: 0.000231 sec
Get host params: 0.000000 sec
Schedule host times: 0.000102 sec
Schedule host events: 0.000051 sec
============
TOTAL: 0.018360 sec
Projected scheduling information for host and service checks
is listed below. This information assumes that you are going
to start running Nagios with your current config files.
HOST SCHEDULING INFORMATION
---------------------------
Total hosts: 1912
Total scheduled hosts: 0
Host inter-check delay method: SMART
Average host check interval: 0.00 sec
Host inter-check delay: 0.00 sec
Max host check spread: 15 min
First scheduled check: N/A
Last scheduled check: N/A
SERVICE SCHEDULING INFORMATION
-------------------------------
Total services: 9098
Total scheduled services: 33
Service inter-check delay method: SMART
Average service check interval: 1770.91 sec
Inter-check delay: 9.09 sec
Interleave factor method: SMART
Average services per host: 4.76
Service interleave factor: 1
Max service check spread: 5 min
First scheduled check: Tue May 22 09:41:22 2012
Last scheduled check: Tue May 22 09:46:12 2012
CHECK PROCESSING INFORMATION
----------------------------
Check result reaper interval: 2 sec
Max concurrent service checks: 80
PERFORMANCE SUGGESTIONS
-----------------------
I have no suggestions - things look okay.
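As a side note on the scheduling figures in this output: the reported inter-check delay of 9.09 sec matches the 5-minute max check spread divided by the 33 scheduled services, not the average check interval divided by 33. The Python sketch below reproduces that arithmetic. The capping rule and the function name smart_inter_check_delay are assumptions inferred from the reported numbers, not taken from the Nagios source.

```python
def smart_inter_check_delay(avg_interval_s, scheduled, max_spread_min):
    """Delay between consecutive initial checks, capped by the max spread.

    Assumption: the uncapped delay is avg_interval / scheduled and the cap
    is max_spread / scheduled. This mirrors the figures reported above; it
    is not copied from the Nagios source.
    """
    if scheduled == 0:
        return 0.0
    uncapped = avg_interval_s / scheduled
    cap = (max_spread_min * 60.0) / scheduled
    return min(uncapped, cap)


# Values from SERVICE SCHEDULING INFORMATION above.
delay = smart_inter_check_delay(1770.91, 33, 5)
spread = delay * (33 - 1)  # time from first to last scheduled check
print(f"inter-check delay: {delay:.2f} s")
print(f"first-to-last spread: {spread:.0f} s")

# Values from HOST SCHEDULING INFORMATION: no scheduled hosts.
host_delay = smart_inter_check_delay(0.0, 0, 15)
```

The computed spread of about 291 seconds agrees with the reported first and last scheduled checks (09:41:22 and 09:46:12, 290 seconds apart). Under these assumptions the planned schedule spaces the active checks only about 9 seconds apart, so the schedule itself does not account for latencies in the thousands of seconds.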
If I force a schedule of an active check, I can see that the forced check is immediately logged in
nagios.log, but it is executed with the same high delay.
Is there a way I can debug this, or which parameter should I tune? Would increasing logging help?
I've already looked at the Nagios tuning page, but it doesn't help me much. Any suggestions based on the
information provided?
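For anyone who wants to repeat the retention.dat inspection mentioned at the top, here is a minimal Python sketch that lists the active checks with the highest latency. It assumes the file consists of "service { ... }" blocks of key=value lines with host_name, service_description, check_type (0 = active) and check_latency fields; the sample records and host names below are invented for illustration, apart from the two latency values borrowed from the stats.

```python
def parse_blocks(text, kind):
    """Yield one dict of key=value pairs per `kind { ... }` block."""
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if current is None:
            if line == f"{kind} {{":
                current = {}
        elif line == "}":
            yield current
            current = None
        elif "=" in line:
            key, _, value = line.partition("=")
            current[key] = value


def slowest_active(text, kind="service", top=5):
    """Active checks (check_type=0) sorted by check_latency, worst first."""
    rows = [b for b in parse_blocks(text, kind) if b.get("check_type") == "0"]
    rows.sort(key=lambda b: float(b.get("check_latency", 0)), reverse=True)
    return [(b.get("host_name"), b.get("service_description"),
             float(b.get("check_latency", 0))) for b in rows[:top]]


# Invented sample data in the assumed retention.dat layout.
SAMPLE = """\
service {
host_name=master
service_description=Load
check_type=0
check_latency=6100.250
}
service {
host_name=poller1
service_description=Disk
check_type=1
check_latency=1.997
}
service {
host_name=master
service_description=Ping
check_type=0
check_latency=7943.743
}
"""

worst = slowest_active(SAMPLE)
for host, service, latency in worst:
    print(f"{host};{service}: {latency:.3f} s")
```

On a real system the text would be read from the retention file instead of SAMPLE, for example with open(path).read(), where path is whatever the local configuration points to.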
Thanks a lot!
Simon
_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
::: Messages without supporting info will risk being sent to /dev/null