Nagios and Gearman - huge environment performance problem
Daniel Wittenberg
daniel.wittenberg.r0ko at statefarm.com
Wed Aug 24 16:37:02 CEST 2011
I noticed from the output that you have a large number of unknown and critical services. Are those taking a long time to time out? One thing you might try, which I know isn't ideal, is removing the checks that might be failing: start with just the host checks, and when those look good, add a few more services, then a few more, and so on, until you notice the time going through the roof again. That might help you figure out where your threshold is and whether certain checks are causing the issues. Is this a physical or a virtual server?
Dan
From: Rodney Ramos [mailto:rodneyra at gmail.com]
Sent: Wednesday, August 24, 2011 9:26 AM
To: Nagios Developers List
Subject: Re: [Nagios-devel] Nagios and Gearman - huge environment performance problem
Hi Sven. Thank you again. I'm pretty sure that my check interval is 15 min for both hosts and services. I've set this in the templates.cfg file (see below). I am also sending the nagiostats output. I agree with you that 100k checks / 15 min ~ 111 checks/sec, but the problem is that Nagios does not spread these checks smoothly over time. That's the problem.
==========
templates.cfg
==========
define host{
        name                    generic-host
        ...
        check_interval          15
        ....
        }
define service{
        name                    generic-service
        ...
        normal_check_interval   15
        ....
        }
==============
nagiostats output
==============
Nagios Stats 3.2.3
Copyright (c) 2003-2008 Ethan Galstad (www.nagios.org)
Last Modified: 10-03-2010
License: GPL
CURRENT STATUS DATA
------------------------------------------------------
Status File: /usr/local/nagios/var/status.dat
Status File Age: 0d 0h 0m 17s
Status File Version: 3.2.3
Program Running Time: 0d 17h 43m 2s
Nagios PID: 18854
Used/High/Total Command Buffers: 0 / 0 / 4096
Total Services: 68206
Services Checked: 68206
Services Scheduled: 68206
Services Actively Checked: 68206
Services Passively Checked: 0
Total Service State Change: 0.000 / 43.880 / 2.774 %
Active Service Latency: 40.671 / 503.137 / 234.919 sec
Active Service Execution Time: 0.003 / 24.737 / 2.527 sec
Active Service State Change: 0.000 / 43.880 / 2.774 %
Active Services Last 1/5/15/60 min: 0 / 2897 / 35932 / 68206
Passive Service Latency: 0.000 / 0.000 / 0.000 sec
Passive Service State Change: 0.000 / 0.000 / 0.000 %
Passive Services Last 1/5/15/60 min: 0 / 0 / 0 / 0
Services Ok/Warn/Unk/Crit: 46943 / 56 / 7660 / 13547
Services Flapping: 980
Services In Downtime: 0
Total Hosts: 34103
Hosts Checked: 34103
Hosts Scheduled: 34103
Hosts Actively Checked: 34103
Host Passively Checked: 0
Total Host State Change: 0.000 / 63.820 / 2.598 %
Active Host Latency: 0.000 / 474.337 / 247.944 sec
Active Host Execution Time: 0.000 / 20.354 / 2.033 sec
Active Host State Change: 0.000 / 63.820 / 2.598 %
Active Hosts Last 1/5/15/60 min: 0 / 5936 / 29437 / 34103
Passive Host Latency: 0.000 / 0.000 / 0.000 sec
Passive Host State Change: 0.000 / 0.000 / 0.000 %
Passive Hosts Last 1/5/15/60 min: 0 / 0 / 0 / 0
Hosts Up/Down/Unreach: 23591 / 10512 / 0
Hosts Flapping: 597
Hosts In Downtime: 0
Active Host Checks Last 1/5/15 min: 3 / 89 / 209
Scheduled: 0 / 0 / 0
On-demand: 3 / 89 / 209
Parallel: 0 / 0 / 0
Serial: 0 / 0 / 0
Cached: 3 / 89 / 209
Passive Host Checks Last 1/5/15 min: 0 / 0 / 0
Active Service Checks Last 1/5/15 min: 0 / 0 / 0
Scheduled: 0 / 0 / 0
On-demand: 0 / 0 / 0
Cached: 0 / 0 / 0
Passive Service Checks Last 1/5/15 min: 0 / 0 / 0
External Commands Last 1/5/15 min: 0 / 0 / 0
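As a rough cross-check of the figures above, the following Python sketch compares the check rate the 15-minute interval requires with what the "Last 1/5/15 min" counters report. It is only an illustration: the function names are made up, the constants are copied by hand from the nagiostats output, and it assumes those counters are the number of objects checked within each window.

```python
# Figures copied from the nagiostats output above.
CHECK_INTERVAL_MIN = 15   # check_interval / normal_check_interval in templates.cfg
TOTAL_SERVICES = 68206
TOTAL_HOSTS = 34103
# "Active Services/Hosts Last 1/5/15 min" (window in minutes -> objects checked)
SERVICES_CHECKED_LAST = {1: 0, 5: 2897, 15: 35932}
HOSTS_CHECKED_LAST = {1: 0, 5: 5936, 15: 29437}


def required_rate(total, interval_min):
    """Checks per second needed to run every check once per interval."""
    return total / (interval_min * 60)


def expected_in_window(total, window_min, interval_min):
    """Checks expected in a window if they were spread evenly over the interval."""
    return total * min(window_min, interval_min) / interval_min


def coverage(checked, total, interval_min):
    """Fraction of the evenly-spread expectation that was actually checked."""
    return {
        window: count / expected_in_window(total, window, interval_min)
        for window, count in checked.items()
    }


if __name__ == "__main__":
    rate = required_rate(TOTAL_SERVICES + TOTAL_HOSTS, CHECK_INTERVAL_MIN)
    print(f"required rate: {rate:.1f} checks/sec")
    for label, checked, total in (
        ("services", SERVICES_CHECKED_LAST, TOTAL_SERVICES),
        ("hosts", HOSTS_CHECKED_LAST, TOTAL_HOSTS),
    ):
        for window, frac in coverage(checked, total, CHECK_INTERVAL_MIN).items():
            print(f"{label}, last {window:>2} min: {frac:.0%} of an even schedule")
```

With these numbers the required rate is about 114 checks/sec, while only about 53% of the services and 86% of the hosts were checked within the last 15 minutes. That would fit with the high latency figures, although a single nagiostats snapshot cannot show how the checks are distributed over time.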
On Tue, Aug 23, 2011 at 6:14 PM, Sven Nierlein <Sven.Nierlein at consol.de> wrote:
On 8/23/11 22:21, Rodney Ramos wrote:
> When I changed max_concurrent_checks from "0" to "200", the nagios process fell to 30/50%. However, the latency increased a lot, going to more than 1000 sec!!
Which means you usually have more than 200 concurrent checks, maybe 400-500. When I compare that to your initial mail, which mentioned 60k services + 30k hosts on a 15 min interval, I get only 100 checks/second. Are you sure about the 15 min interval? How many checks do you have per second? Did you change your interval_length?
Sven
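The concurrency estimate can be sanity-checked with Little's law (average number in flight = arrival rate x average time in the system). The sketch below is only an approximation under stated assumptions: the helper name is hypothetical, the totals and average execution times are taken from the nagiostats output earlier in the thread, and it uses averages only, so bursts and slow timeouts would push the real peak higher.

```python
INTERVAL_SEC = 15 * 60  # 15-minute check interval for hosts and services


def avg_concurrent_checks(total_checks, interval_sec, avg_exec_sec):
    """Little's law: checks in flight = (checks per second) * (seconds per check)."""
    return total_checks / interval_sec * avg_exec_sec


# Totals and average execution times from the nagiostats output.
services_in_flight = avg_concurrent_checks(68206, INTERVAL_SEC, 2.527)
hosts_in_flight = avg_concurrent_checks(34103, INTERVAL_SEC, 2.033)
total_in_flight = services_in_flight + hosts_in_flight

if __name__ == "__main__":
    print(f"services in flight (avg): {services_in_flight:.0f}")
    print(f"hosts in flight (avg):    {hosts_in_flight:.0f}")
    print(f"total in flight (avg):    {total_in_flight:.0f}")
```

On these averages alone the total comes out at roughly 270 checks in flight, which is already above a max_concurrent_checks of 200 and would be consistent with latency growing once that limit is set.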
_______________________________________________
Nagios-devel mailing list
Nagios-devel at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-devel