High check latency in a machine with low load
Mike Guthrie
mguthrie at nagios.com
Tue Oct 11 16:25:48 CEST 2011
If ndoutils starts to put a heavy burden on the system, you can also
offload ndoutils/MySQL to a second machine. We wrote the document below
for Nagios XI, but it has the info you'd need to make it work for
Nagios Core as well.
http://library.nagios.com/library/products/nagiosxi/documentation/462-offloading-mysql-to-remote-server
Javier Vela Diago wrote:
> I have a lot of custom checks, written mostly in Perl and bash, and some
> in Python. Some of them take a lot of time.
>
> Never mind, I think I found the solution, or at least part of it. I
> set use_large_installation_tweaks to 1. Six months ago this option
> almost crashed my system, so I discarded it. Now, with bigger
> problems, it was the last thing I wanted to test, but this afternoon
> I finally tried it.
>
> When I restarted Nagios, the load started to grow to 6-8, and
> the latency problems disappeared. I was sceptical about the usefulness of
> this option, but when the load changes from 2.5 to 6, it means that
> the machine is doing a lot of work that it wasn't doing before.
>
> Now the problem is that NDOUtils is causing some latency because of
> MySQL, but at least I know what to optimize. Any tips would be
> appreciated :)
>
> Thank you and sorry for your time.
>
>
> From: Daniel Wittenberg <daniel.wittenberg.r0ko at statefarm.com>
> To: Nagios Users List <nagios-users at lists.sourceforge.net>
> Date: 11 Oct 2011 16:02
> Subject: Re: [Nagios-users] High check latency in a machine with low load
> ------------------------------------------------------------------------
>
>
>
> I think you have the enable_high_latency option enabled :) j/k
>
> Do you have any particular checks that are taking a long time? i.e.
> can you watch top and see checks taking a while?
>
> Dan
>
>
> From: Javier Vela Diago [mailto:jvela at s2grupo.es]
> Sent: Tuesday, October 11, 2011 6:23 AM
> To: nagios-users at lists.sourceforge.net
> Subject: [Nagios-users] High check latency in a machine with low load
>
> Hi,
>
> I have a Nagios 3.2.3 deployment with 1000+ Hosts and 3000+ services.
> This Nagios runs together with NDO and PNP (in bulk mode) on a server
> with 4 GB of RAM and 4 CPUs.
>
> One day I noticed that the check latency shown in the performance CGI was
> very high (300-400 seconds). That seemed strange, so I took the tuning
> guide from Nagios
> (http://nagios.sourceforge.net/docs/3_0/tuning.html) and applied all
> the points I could. In particular, I set max_concurrent_checks
> to zero (no limit):
>
> max_concurrent_checks=0
>
> The reaper event:
>
> service_reaper_frequency=5
> max_check_result_reaper_time=15
>
> and checked that the host checks were not forced. In addition, I
> configured a 15-second host check cache:
>
> cached_host_check_horizon=15
>
> But the problem remains, and the load on the server is not very high:
> a load of 2.5, 2 GB of free memory and an average disk utilization of
> 7%. I disabled NDO and PNP but it made no difference. After the first round
> of checks the delay returns, while the load on the server doesn't grow.
>
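The directives quoted above can be read back from the main config file to confirm what Nagios will actually load. The following is only a minimal sketch in Python: the function names and the SAMPLE text are illustrative, and it assumes plain key=value lines with '#' comments.

```python
# Sketch: report the tuning directives quoted in this message from
# nagios.cfg-style text. Names here are illustrative, not part of Nagios.
TUNING_KEYS = (
    "max_concurrent_checks",
    "service_reaper_frequency",
    "max_check_result_reaper_time",
    "cached_host_check_horizon",
)


def read_directives(text):
    """Return a dict of key -> value for simple key=value lines."""
    found = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        found[key.strip()] = value.strip()  # a repeated key keeps its last value
    return found


def tuning_report(text):
    """Map each tuning key to its configured value, or None if it is absent."""
    directives = read_directives(text)
    return {key: directives.get(key) for key in TUNING_KEYS}


# The values quoted in this message, used here as sample input.
SAMPLE = """\
# tuning values from the message
max_concurrent_checks=0
service_reaper_frequency=5
max_check_result_reaper_time=15
cached_host_check_horizon=15
"""

if __name__ == "__main__":
    for key, value in tuning_report(SAMPLE).items():
        shown = value if value is not None else "(not set)"
        print(f"{key:32} {shown}")
```

To use it on a real installation, pass the contents of your own nagios.cfg to tuning_report() instead of SAMPLE.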
> I have searched on Google, but all the problems reported there are caused
> by load on the server, which is not the main problem here. So my
> question is: what can I do now? Is there some variable that shows me
> where to look? I'm a bit lost right now and I don't know how to find
> the problem.
>
> Or maybe the only way is to configure a master-slave Nagios setup in order
> to maximize the server utilization?
>
> In addition, I have pretty big timeouts (60 seconds) because of the
> high latency on the network. All your help is appreciated. Thank you
> in advance.
> nagiostats
> Nagios Stats 3.2.3
> Copyright (c) 2003-2008 Ethan Galstad (www.nagios.org)
> Last Modified: 10-03-2010
> License: GPL
>
> CURRENT STATUS DATA
> ------------------------------------------------------
> Status File: /usr/local/argos/aplicaciones/nagios/var/status.dat
> Status File Age: 0d 0h 0m 11s
> Status File Version: 3.2.3
>
> Program Running Time: 0d 20h 56m 7s
> Nagios PID: 21834
> Used/High/Total Command Buffers: 0 / 0 / 4096
>
> Total Services: 4032
> Services Checked: 4032
> Services Scheduled: 4030
> Services Actively Checked: 4032
> Services Passively Checked: 0
> Total Service State Change: 0.000 / 37.300 / 0.163 %
> Active Service Latency: 32.876 / 442.138 / 415.816 sec
> Active Service Execution Time: 0.051 / 60.097 / 1.545 sec
> Active Service State Change: 0.000 / 37.300 / 0.163 %
> Active Services Last 1/5/15/60 min: 237 / 1530 / 4020 / 4020
> Passive Service Latency: 0.000 / 0.000 / 0.000 sec
> Passive Service State Change: 0.000 / 0.000 / 0.000 %
> Passive Services Last 1/5/15/60 min: 0 / 0 / 0 / 0
> Services Ok/Warn/Unk/Crit: 3766 / 38 / 44 / 184
> Services Flapping: 0
> Services In Downtime: 0
>
> Total Hosts: 931
> Hosts Checked: 931
> Hosts Scheduled: 931
> Hosts Actively Checked: 931
> Host Passively Checked: 0
> Total Host State Change: 0.000 / 12.370 / 0.077 %
> Active Host Latency: 0.000 / 441.308 / 416.063 sec
> Active Host Execution Time: 0.062 / 10.113 / 0.395 sec
> Active Host State Change: 0.000 / 12.370 / 0.077 %
> Active Hosts Last 1/5/15/60 min: 74 / 423 / 931 / 931
> Passive Host Latency: 0.000 / 0.000 / 0.000 sec
> Passive Host State Change: 0.000 / 0.000 / 0.000 %
> Passive Hosts Last 1/5/15/60 min: 0 / 0 / 0 / 0
> Hosts Up/Down/Unreach: 897 / 24 / 10
> Hosts Flapping: 0
> Hosts In Downtime: 1
>
> Active Host Checks Last 1/5/15 min: 109 / 535 / 1583
> Scheduled: 87 / 433 / 1300
> On-demand: 22 / 102 / 283
> Parallel: 87 / 438 / 1323
> Serial: 0 / 0 / 0
> Cached: 22 / 97 / 260
> Passive Host Checks Last 1/5/15 min: 0 / 0 / 0
> Active Service Checks Last 1/5/15 min: 304 / 1605 / 4924
> Scheduled: 304 / 1605 / 4923
> On-demand: 0 / 0 / 1
> Cached: 0 / 0 / 0
> Passive Service Checks Last 1/5/15 min: 0 / 0 / 0
>
> External Commands Last 1/5/15 min: 0 / 0 / 0
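
The latency and execution-time lines above are easier to compare once they are parsed into numbers. This is a rough Python sketch that assumes the three values on each "sec" line are min / max / avg, as in the output shown. The helper names are made up for illustration.

```python
import re

# Matches lines such as "Active Service Latency: 32.876 / 442.138 / 415.816 sec".
# Lines ending in "%" or carrying four values do not match and are skipped.
TRIPLET = re.compile(
    r"^(?P<label>[^:]+):\s+"
    r"(?P<min>\d+(?:\.\d+)?)\s*/\s*"
    r"(?P<max>\d+(?:\.\d+)?)\s*/\s*"
    r"(?P<avg>\d+(?:\.\d+)?)\s+sec$"
)


def parse_triplets(text):
    """Return {label: (min, max, avg)} for every 'a / b / c sec' line."""
    stats = {}
    for raw in text.splitlines():
        line = raw.lstrip("> ").rstrip()  # drop mail quoting and padding
        match = TRIPLET.match(line)
        if match:
            stats[match.group("label").strip()] = tuple(
                float(match.group(name)) for name in ("min", "max", "avg")
            )
    return stats


def latency_to_execution_ratio(stats, kind):
    """Average latency divided by average execution time for 'Service' or 'Host'."""
    latency_avg = stats[f"Active {kind} Latency"][2]
    execution_avg = stats[f"Active {kind} Execution Time"][2]
    return latency_avg / execution_avg


# Lines copied from the nagiostats output in this message.
SAMPLE = """\
> Total Service State Change: 0.000 / 37.300 / 0.163 %
> Active Service Latency: 32.876 / 442.138 / 415.816 sec
> Active Service Execution Time: 0.051 / 60.097 / 1.545 sec
> Active Host Latency: 0.000 / 441.308 / 416.063 sec
> Active Host Execution Time: 0.062 / 10.113 / 0.395 sec
"""

if __name__ == "__main__":
    stats = parse_triplets(SAMPLE)
    for kind in ("Service", "Host"):
        ratio = latency_to_execution_ratio(stats, kind)
        print(f"{kind}: average latency is {ratio:.0f}x the average execution time")
```

A ratio in the hundreds means checks wait far longer to start than they take to run. That suggests the delay is in scheduling and result processing, not in the plugins themselves.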
> nagios -s
>
> Nagios Core 3.2.3
> Copyright (c) 2009-2010 Nagios Core Development Team and Community
> Contributors
> Copyright (c) 1999-2009 Ethan Galstad
> Last Modified: 10-03-2010
> License: GPL
>
> Website: http://www.nagios.org
> Warning: aggregate_status_updates directive ignored. All status file
> updates are now aggregated.
> Warning: downtime_file variable ignored. Downtime entries are now
> stored in the status and retention files.
> Warning: comment_file variable ignored. Comments are now stored in
> the status and retention files.
> Timing information on object configuration processing is listed
> below. You can use this information to see if precaching your
> object configuration would be useful.
>
> Object Config Source: Config files (uncached)
>
> OBJECT CONFIG PROCESSING TIMES (* = Potential for precache
> savings with -u option)
> ----------------------------------
> Read: 0.080036 sec
> Resolve: 0.010660 sec *
> Recomb Contactgroups: 0.002666 sec *
> Recomb Hostgroups: 0.004086 sec *
> Dup Services: 0.034632 sec *
> Recomb Servicegroups: 0.001277 sec *
> Duplicate: 0.010939 sec *
> Inherit: 0.005594 sec *
> Recomb Contacts: 0.000001 sec *
> Sort: 0.000000 sec *
> Register: 0.074413 sec
> Free: 0.008730 sec
> ============
> TOTAL: 0.234920 sec * = 0.071741 sec (30.54%)
> estimated savings
>
>
> RETENTION DATA TIMES
> ----------------------------------
> Read and Process: 0.495480 sec
> ============
> TOTAL: 0.495480 sec
>
>
> Timing information on configuration verification is listed below.
>
> CONFIG VERIFICATION TIMES (* = Potential for speedup with -x
> option)
> ----------------------------------
> Object Relationships: 0.060039 sec
> Circular Paths: 0.026557 sec *
> Misc: 0.005999 sec
> ============
> TOTAL: 0.092595 sec * = 0.026557 sec (28.7%) estimated
> savings
>
>
> EVENT SCHEDULING TIMES
> -------------------------------------
> Get service info: 0.014509 sec
> Get host info info: 0.002853 sec
> Get service params: 0.000078 sec
> Schedule service times: 0.039947 sec
> Schedule service events: 0.034656 sec
> Get host params: 0.000001 sec
> Schedule host times: 0.007519 sec
> Schedule host events: 0.029519 sec
> ============
> TOTAL: 0.129082 sec
>
>
> Projected scheduling information for host and service checks
> is listed below. This information assumes that you are going
> to start running Nagios with your current config files.
>
> HOST SCHEDULING INFORMATION
> ---------------------------
> Total hosts: 931
> Total scheduled hosts: 931
> Host inter-check delay method: SMART
> Average host check interval: 259.01 sec
> Host inter-check delay: 0.28 sec
> Max host check spread: 30 min
> First scheduled check: Tue Oct 11 13:14:08 2011
> Last scheduled check: Tue Oct 11 13:18:26 2011
>
>
> SERVICE SCHEDULING INFORMATION
> -------------------------------
> Total services: 4032
> Total scheduled services: 4030
> Service inter-check delay method: SMART
> Average service check interval: 299.55 sec
> Inter-check delay: 0.07 sec
> Interleave factor method: SMART
> Average services per host: 4.33
> Service interleave factor: 5
> Max service check spread: 30 min
> First scheduled check: Tue Oct 11 13:15:07 2011
> Last scheduled check: Tue Oct 11 13:20:07 2011
>
>
> CHECK PROCESSING INFORMATION
> ----------------------------
> Check result reaper interval: 5 sec
> Max concurrent service checks: Unlimited
>
>
> PERFORMANCE SUGGESTIONS
> -----------------------
> I have no suggestions - things look okay.
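
One way to see where the latency comes from is to compare the check rate the schedule requires with the rate nagiostats actually reports. The sketch below uses only figures from the two outputs above. It assumes the "Average ... check interval" values describe the steady-state schedule and it ignores retries and on-demand checks. The function names are illustrative.

```python
def expected_checks(scheduled, avg_interval_sec, window_sec):
    """Checks the schedule calls for within a window of window_sec seconds."""
    return scheduled * window_sec / avg_interval_sec


def achieved_fraction(observed, scheduled, avg_interval_sec, window_sec=300):
    """Observed checks divided by the checks the schedule requires."""
    return observed / expected_checks(scheduled, avg_interval_sec, window_sec)


# Figures taken from the nagios -s and nagiostats output in this message.
SERVICES_SCHEDULED = 4030      # Total scheduled services
SERVICE_INTERVAL = 299.55      # Average service check interval (sec)
SERVICE_CHECKS_5MIN = 1605     # Active service checks, scheduled, last 5 min

HOSTS_SCHEDULED = 931          # Total scheduled hosts
HOST_INTERVAL = 259.01         # Average host check interval (sec)
HOST_CHECKS_5MIN = 433         # Active host checks, scheduled, last 5 min

if __name__ == "__main__":
    svc = achieved_fraction(SERVICE_CHECKS_5MIN, SERVICES_SCHEDULED, SERVICE_INTERVAL)
    hst = achieved_fraction(HOST_CHECKS_5MIN, HOSTS_SCHEDULED, HOST_INTERVAL)
    print(f"services: {svc:.0%} of the scheduled rate")
    print(f"hosts:    {hst:.0%} of the scheduled rate")
```

With these numbers both fractions come out at roughly 40%, so Nagios is running well under half of the checks its own schedule asks for even though the machine is mostly idle. That fits the latency growing after the first round of checks.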
> --
> Javier Vela Diago
> S2 GRUPO
> Ramiro de Maeztu, 7 bajo. 46022 Valencia
> Tel: 963.110.300 Fax: 963.106.086
> e-mail: jvela at s2grupo dot es
> http://www.s2grupo.es
> ------------------------------------------------------------------------------
> All the data continuously generated in your IT infrastructure contains a
> definitive record of customers, application performance, security
> threats, fraudulent activity and more. Splunk takes this data and makes
> sense of it. Business sense. IT sense. Common sense.
> http://p.sf.net/sfu/splunk-d2d-oct
> _______________________________________________
> Nagios-users mailing list
> Nagios-users at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-users
> ::: Please include Nagios version, plugin version (-v) and OS when
> reporting any issue.
> ::: Messages without supporting info will risk being sent to /dev/null
> ------------------------------------------------------------------------
--
Mike Guthrie
Technical Team
___
Nagios Enterprises, LLC
Email: mguthrie at nagios.com
Web: www.nagios.com