High check latency in a machine with low load
Javier Vela Diago
jvela at s2grupo.es
Tue Oct 11 16:16:55 CEST 2011
I have a lot of custom checks, written mostly in Perl and Bash, plus some in
Python, and some of them take a lot of time.
Never mind, I think I found the solution, or at least part of it. I set
use_large_installation_tweaks to 1. Six months ago this option almost
crashed my system, so I discarded it. Now, with bigger problems, it was the
last thing I wanted to test, but this afternoon I finally tried it.
After I restarted Nagios, the load grew to 6-8 and the latency problems
disappeared. I was sceptical about the usefulness of this option, but a load
that goes from 2.5 to 6 means the machine is now doing a lot of work that it
wasn't doing before.
Now the problem is that NDOUtils is causing some latency because of
MySQL, but at least I know what to optimize. Any tips would be
appreciated :)
Thank you and sorry for your time.
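
For anyone comparing their own setup, here is a minimal Python sketch that lists the tuning directives discussed in this thread from a main configuration file. It assumes the usual key=value layout of nagios.cfg with # comment lines; the function name read_directives and the SAMPLE text are made up for illustration, and the sample values are the ones quoted in this thread. In Nagios 3 the large-installation option is spelled use_large_installation_tweaks.

```python
# Directives mentioned in this thread; illustrative, not an exhaustive tuning list.
TUNING_DIRECTIVES = (
    "use_large_installation_tweaks",
    "max_concurrent_checks",
    "service_reaper_frequency",
    "max_check_result_reaper_time",
    "cached_host_check_horizon",
)


def read_directives(cfg_text, names=TUNING_DIRECTIVES):
    """Return {name: value} for the requested directives found in cfg_text."""
    found = {}
    for raw in cfg_text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip()
        if key in names:
            found[key] = value.strip()  # if repeated, the last occurrence wins
    return found


# Hypothetical stand-in for the contents of a real nagios.cfg.
SAMPLE = """\
# values quoted in this thread
max_concurrent_checks=0
service_reaper_frequency=5
max_check_result_reaper_time=15
cached_host_check_horizon=15
use_large_installation_tweaks=1
"""

if __name__ == "__main__":
    settings = read_directives(SAMPLE)
    for name in TUNING_DIRECTIVES:
        print(f"{name} = {settings.get(name, '(not set)')}")
```

To run it against a real installation, replace SAMPLE with the text of your own nagios.cfg; the path depends on how Nagios was installed.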
From: Daniel Wittenberg <daniel.wittenberg.r0ko at statefarm.com>
To: Nagios Users List <nagios-users at lists.sourceforge.net>
Date: 11/10/2011 16:02
Subject: Re: [Nagios-users] High check latency in a machine with low load
I think you have the enable_high_latency option enabled :) j/k
Do you have any particular checks that are taking a long time? i.e. can
you watch top and see checks taking a while?
Dan
From: Javier Vela Diago [mailto:jvela at s2grupo.es]
Sent: Tuesday, October 11, 2011 6:23 AM
To: nagios-users at lists.sourceforge.net
Subject: [Nagios-users] High check latency in a machine with low load
Hi,
I have a Nagios 3.2.3 deployment with 1000+ hosts and 3000+ services. This
Nagios runs together with NDO and PNP (in bulk mode) on a server with 4 GB
of RAM and 4 CPUs.
One day I realized that the check latency shown in the performance CGI was
very high (300-400 seconds). That was very strange, so I took the tuning guide
from Nagios (http://nagios.sourceforge.net/docs/3_0/tuning.html) and
applied every point I could. In particular, I set
max_concurrent_checks to zero (no limit):
max_concurrent_checks=0
the check result reaper settings:
service_reaper_frequency=5
max_check_result_reaper_time=15
and checked that the host checks were not forced. In addition, I
configured a 15-second host check cache:
cached_host_check_horizon=15
But the problem remains, and the load on the server is not very high: a load
of 2.5, 2 GB of free memory, and an average disk utilization of 7%. I
disabled NDO and PNP, but that did not help. After the first round of checks,
the delay returns, while the load on the server doesn't grow.
I have searched on Google, but all the problems I found were caused by high
load on the server, which is not the main problem here. So my question is:
what can I do now? Is there some variable that shows me where to look? I'm a
bit lost right now and I don't know how to find the problem.
Or maybe the only way is to configure a master-slave Nagios setup in order to
maximize server utilization?
In addition, I have pretty big timeouts (60 seconds) because of the high
latency on the network. All your help is appreciated. Thank you in
advance.
nagiostats
Nagios Stats 3.2.3
Copyright (c) 2003-2008 Ethan Galstad (www.nagios.org)
Last Modified: 10-03-2010
License: GPL
CURRENT STATUS DATA
------------------------------------------------------
Status File: /usr/local/argos/aplicaciones/nagios/var/status.dat
Status File Age: 0d 0h 0m 11s
Status File Version: 3.2.3
Program Running Time: 0d 20h 56m 7s
Nagios PID: 21834
Used/High/Total Command Buffers: 0 / 0 / 4096
Total Services: 4032
Services Checked: 4032
Services Scheduled: 4030
Services Actively Checked: 4032
Services Passively Checked: 0
Total Service State Change: 0.000 / 37.300 / 0.163 %
Active Service Latency: 32.876 / 442.138 / 415.816 sec
Active Service Execution Time: 0.051 / 60.097 / 1.545 sec
Active Service State Change: 0.000 / 37.300 / 0.163 %
Active Services Last 1/5/15/60 min: 237 / 1530 / 4020 / 4020
Passive Service Latency: 0.000 / 0.000 / 0.000 sec
Passive Service State Change: 0.000 / 0.000 / 0.000 %
Passive Services Last 1/5/15/60 min: 0 / 0 / 0 / 0
Services Ok/Warn/Unk/Crit: 3766 / 38 / 44 / 184
Services Flapping: 0
Services In Downtime: 0
Total Hosts: 931
Hosts Checked: 931
Hosts Scheduled: 931
Hosts Actively Checked: 931
Host Passively Checked: 0
Total Host State Change: 0.000 / 12.370 / 0.077 %
Active Host Latency: 0.000 / 441.308 / 416.063 sec
Active Host Execution Time: 0.062 / 10.113 / 0.395 sec
Active Host State Change: 0.000 / 12.370 / 0.077 %
Active Hosts Last 1/5/15/60 min: 74 / 423 / 931 / 931
Passive Host Latency: 0.000 / 0.000 / 0.000 sec
Passive Host State Change: 0.000 / 0.000 / 0.000 %
Passive Hosts Last 1/5/15/60 min: 0 / 0 / 0 / 0
Hosts Up/Down/Unreach: 897 / 24 / 10
Hosts Flapping: 0
Hosts In Downtime: 1
Active Host Checks Last 1/5/15 min: 109 / 535 / 1583
Scheduled: 87 / 433 / 1300
On-demand: 22 / 102 / 283
Parallel: 87 / 438 / 1323
Serial: 0 / 0 / 0
Cached: 22 / 97 / 260
Passive Host Checks Last 1/5/15 min: 0 / 0 / 0
Active Service Checks Last 1/5/15 min: 304 / 1605 / 4924
Scheduled: 304 / 1605 / 4923
On-demand: 0 / 0 / 1
Cached: 0 / 0 / 0
Passive Service Checks Last 1/5/15 min: 0 / 0 / 0
External Commands Last 1/5/15 min: 0 / 0 / 0
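
The latency figures in the nagiostats output above can be pulled out with a short script, which makes it easier to watch them over time. This is only a sketch: it assumes the "label: min / max / avg unit" layout shown in this message, the names parse_triples and latency_report are made up for illustration, and SAMPLE holds lines copied from the output above.

```python
import re

# Matches lines of the form "Label:   min / max / avg sec" (or "%").
TRIPLE = re.compile(
    r"^(?P<label>[A-Za-z ]+):\s+"
    r"(?P<min>\d+\.\d+) / (?P<max>\d+\.\d+) / (?P<avg>\d+\.\d+) (?P<unit>sec|%)$"
)


def parse_triples(text):
    """Return {label: (min, max, avg, unit)} for every min/max/avg line."""
    stats = {}
    for raw in text.splitlines():
        m = TRIPLE.match(raw.strip())
        if m:
            stats[m.group("label")] = (
                float(m.group("min")),
                float(m.group("max")),
                float(m.group("avg")),
                m.group("unit"),
            )
    return stats


def latency_report(stats):
    """Put average latency next to average execution time for services and hosts."""
    lines = []
    for kind in ("Service", "Host"):
        latency = stats[f"Active {kind} Latency"][2]
        execution = stats[f"Active {kind} Execution Time"][2]
        lines.append(
            f"{kind}: avg latency {latency:.1f} s, avg execution {execution:.3f} s"
        )
    return lines


# Lines copied from the nagiostats output in this message.
SAMPLE = """\
Active Service Latency: 32.876 / 442.138 / 415.816 sec
Active Service Execution Time: 0.051 / 60.097 / 1.545 sec
Active Host Latency: 0.000 / 441.308 / 416.063 sec
Active Host Execution Time: 0.062 / 10.113 / 0.395 sec
Services Ok/Warn/Unk/Crit: 3766 / 38 / 44 / 184
"""

if __name__ == "__main__":
    for line in latency_report(parse_triples(SAMPLE)):
        print(line)
```

With these figures the average latency is hundreds of times the average execution time, which suggests that checks spend most of their time waiting to be started rather than running.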
nagios -s
Nagios Core 3.2.3
Copyright (c) 2009-2010 Nagios Core Development Team and Community
Contributors
Copyright (c) 1999-2009 Ethan Galstad
Last Modified: 10-03-2010
License: GPL
Website: http://www.nagios.org
Warning: aggregate_status_updates directive ignored. All status file
updates are now aggregated.
Warning: downtime_file variable ignored. Downtime entries are now stored
in the status and retention files.
Warning: comment_file variable ignored. Comments are now stored in the
status and retention files.
Timing information on object configuration processing is listed
below. You can use this information to see if precaching your
object configuration would be useful.
Object Config Source: Config files (uncached)
OBJECT CONFIG PROCESSING TIMES (* = Potential for precache savings
with -u option)
----------------------------------
Read: 0.080036 sec
Resolve: 0.010660 sec *
Recomb Contactgroups: 0.002666 sec *
Recomb Hostgroups: 0.004086 sec *
Dup Services: 0.034632 sec *
Recomb Servicegroups: 0.001277 sec *
Duplicate: 0.010939 sec *
Inherit: 0.005594 sec *
Recomb Contacts: 0.000001 sec *
Sort: 0.000000 sec *
Register: 0.074413 sec
Free: 0.008730 sec
============
TOTAL: 0.234920 sec * = 0.071741 sec (30.54%) estimated
savings
RETENTION DATA TIMES
----------------------------------
Read and Process: 0.495480 sec
============
TOTAL: 0.495480 sec
Timing information on configuration verification is listed below.
CONFIG VERIFICATION TIMES (* = Potential for speedup with -x
option)
----------------------------------
Object Relationships: 0.060039 sec
Circular Paths: 0.026557 sec *
Misc: 0.005999 sec
============
TOTAL: 0.092595 sec * = 0.026557 sec (28.7%) estimated
savings
EVENT SCHEDULING TIMES
-------------------------------------
Get service info: 0.014509 sec
Get host info info: 0.002853 sec
Get service params: 0.000078 sec
Schedule service times: 0.039947 sec
Schedule service events: 0.034656 sec
Get host params: 0.000001 sec
Schedule host times: 0.007519 sec
Schedule host events: 0.029519 sec
============
TOTAL: 0.129082 sec
Projected scheduling information for host and service checks
is listed below. This information assumes that you are going
to start running Nagios with your current config files.
HOST SCHEDULING INFORMATION
---------------------------
Total hosts: 931
Total scheduled hosts: 931
Host inter-check delay method: SMART
Average host check interval: 259.01 sec
Host inter-check delay: 0.28 sec
Max host check spread: 30 min
First scheduled check: Tue Oct 11 13:14:08 2011
Last scheduled check: Tue Oct 11 13:18:26 2011
SERVICE SCHEDULING INFORMATION
-------------------------------
Total services: 4032
Total scheduled services: 4030
Service inter-check delay method: SMART
Average service check interval: 299.55 sec
Inter-check delay: 0.07 sec
Interleave factor method: SMART
Average services per host: 4.33
Service interleave factor: 5
Max service check spread: 30 min
First scheduled check: Tue Oct 11 13:15:07 2011
Last scheduled check: Tue Oct 11 13:20:07 2011
CHECK PROCESSING INFORMATION
----------------------------
Check result reaper interval: 5 sec
Max concurrent service checks: Unlimited
PERFORMANCE SUGGESTIONS
-----------------------
I have no suggestions - things look okay.
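
One way to read the two outputs together is to compare the check rate the schedule needs with the rate Nagios actually achieved. The sketch below does that arithmetic with numbers copied from this message: scheduled objects and average check intervals from the nagios -s output, and checks run in the last 15 minutes from nagiostats. The function names are made up for illustration, and treating the 15-minute counters as a steady rate is an assumption.

```python
def checks_per_second(count, seconds):
    """Average number of checks per second over a period."""
    return count / seconds


def compare(scheduled, avg_interval_sec, ran_in_window, window_sec=15 * 60):
    """Return (needed, observed) check rates in checks per second."""
    needed = checks_per_second(scheduled, avg_interval_sec)
    observed = checks_per_second(ran_in_window, window_sec)
    return needed, observed


# Figures copied from the output in this message.
SERVICES = dict(scheduled=4030, avg_interval_sec=299.55, ran_in_window=4924)
HOSTS = dict(scheduled=931, avg_interval_sec=259.01, ran_in_window=1300)

if __name__ == "__main__":
    for label, figures in (("services", SERVICES), ("hosts", HOSTS)):
        needed, observed = compare(**figures)
        print(
            f"{label}: needed {needed:.2f}/s, observed {observed:.2f}/s "
            f"({observed / needed:.0%} of target)"
        )
```

With these figures the schedule needs roughly 13.5 service checks per second, while only about 5.5 per second were run, around 40% of the target, and the scheduled host checks show a similar ratio. That fits a server that is not short of CPU or memory but is not starting checks fast enough, which is why the latency keeps building up while the load stays low.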
--
Javier Vela Diago
S2 GRUPO
Ramiro de Maeztu, 7 bajo. 46022 Valencia
Tel: 963.110.300 Fax: 963.106.086
e-mail : jvela at s2grupo dot es
http://www.s2grupo.es
------------------------------------------------------------------------------
All the data continuously generated in your IT infrastructure contains a
definitive record of customers, application performance, security
threats, fraudulent activity and more. Splunk takes this data and makes
sense of it. Business sense. IT sense. Common sense.
http://p.sf.net/sfu/splunk-d2d-oct
_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when
reporting any issue.
::: Messages without supporting info will risk being sent to /dev/null