Hi Daniel. In my environment, a lot of hosts have been down for a long time, and I can't do anything about that. To be clear, I'm using Gearman and mod_gearman to run the checks. I have 9 workers (virtual machines) doing the job. The central server, running Nagios 3.2.3, does not execute any plugins. It is a physical machine with 8 CPUs and 4 GB of RAM, running 64-bit RHEL 5.4. Thanks.<br>
<br><div class="gmail_quote">On Wed, Aug 24, 2011 at 11:37 AM, Daniel Wittenberg <span dir="ltr"><<a href="mailto:daniel.wittenberg.r0ko@statefarm.com">daniel.wittenberg.r0ko@statefarm.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
<div link="blue" vlink="purple" lang="EN-US">
<div>
<p class="MsoNormal"><span style="font-size: 11pt; color: rgb(31, 73, 125);">I noticed from the output that you have a high number of unknown and critical services. Are those taking a long time to time out? One thing you might try, which I know
isn’t ideal, is removing the checks that might be failing: start with just the host checks, and when those look good, add a few more services, then a few more, and so on, until you notice the time going through the roof again. That might help figure out where your
threshold is, and whether certain checks are causing issues. Is this a physical or virtual server?<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size: 11pt; color: rgb(31, 73, 125);"><br>
Dan<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size: 11pt; color: rgb(31, 73, 125);"><u></u> <u></u></span></p>
<p class="MsoNormal"><b><span style="font-size: 10pt;">From:</span></b><span style="font-size: 10pt;"> Rodney Ramos [mailto:<a href="mailto:rodneyra@gmail.com" target="_blank">rodneyra@gmail.com</a>]
<br>
<b>Sent:</b> Wednesday, August 24, 2011 9:26 AM<div class="im"><br>
<b>To:</b> Nagios Developers List<br>
<b>Subject:</b> Re: [Nagios-devel] Nagios and Gearman - huge environment performance problem<u></u><u></u></div></span></p>
<p class="MsoNormal"><u></u> <u></u></p>
<p class="MsoNormal" style="margin-bottom: 12pt;">Hi Sven. Thank you again. I'm pretty sure that my check interval is 15 min for both hosts and services. I've set this in the templates.cfg file (see below). I'm also sending the nagiostats output. I agree with
you that 100k checks / 15 min ~ 111 checks/sec, but the problem is that Nagios does not spread these checks evenly over the interval. That's the problem.</p><div><div></div><div class="h5"><br>
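As a rough check of that arithmetic, the sketch below compares the rate needed to run every check once per 15 minutes with the rate implied by the "Last 15 min" counters in the nagiostats output in this message. It assumes the default interval_length of 60 seconds, so that a check interval of 15 means 15 minutes; the function and constant names are illustrative only.

```python
# Totals copied from the nagiostats output in this message.
TOTAL_SERVICES = 68206
TOTAL_HOSTS = 34103
SERVICES_CHECKED_LAST_15_MIN = 35932
HOSTS_CHECKED_LAST_15_MIN = 29437

# check_interval / normal_check_interval from templates.cfg.
# Assumption: interval_length is the default 60 seconds, so 15 means 15 minutes.
CHECK_INTERVAL_MIN = 15


def checks_per_second(check_count: int, window_min: float) -> float:
    """Average rate if check_count checks are spread evenly over window_min minutes."""
    return check_count / (window_min * 60)


# Rate needed to run every host and service check once per interval.
required = checks_per_second(TOTAL_SERVICES + TOTAL_HOSTS, CHECK_INTERVAL_MIN)

# Rate implied by the checks that actually completed in the last 15 minutes.
achieved = checks_per_second(
    SERVICES_CHECKED_LAST_15_MIN + HOSTS_CHECKED_LAST_15_MIN, CHECK_INTERVAL_MIN
)

print(f"required: {required:.1f} checks/sec")
print(f"achieved: {achieved:.1f} checks/sec ({achieved / required:.0%} of required)")
```

With these numbers the required rate comes out near 114 checks/sec, while the counters suggest noticeably fewer checks completed in the last 15 minutes, which is consistent with the high latency reported.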
<br>
==========<br>
templates.cfg<br>
==========<br>
define host{<br>
name generic-host<br>
...<br>
check_interval 15<br>
....<br>
}<br>
<br>
define service{<br>
name generic-service<br>
...<br>
normal_check_interval 15<br>
....<br>
}<br>
<br>
==============<br>
nagiostats output<br>
==============<br>
Nagios Stats 3.2.3<br>
Copyright (c) 2003-2008 Ethan Galstad (<a href="http://www.nagios.org" target="_blank">www.nagios.org</a>)<br>
Last Modified: 10-03-2010<br>
License: GPL<br>
<br>
CURRENT STATUS DATA<br>
------------------------------------------------------<br>
Status File: /usr/local/nagios/var/status.dat<br>
Status File Age: 0d 0h 0m 17s<br>
Status File Version: 3.2.3<br>
<br>
Program Running Time: 0d 17h 43m 2s<br>
Nagios PID: 18854<br>
Used/High/Total Command Buffers: 0 / 0 / 4096<br>
<br>
Total Services: 68206<br>
Services Checked: 68206<br>
Services Scheduled: 68206<br>
Services Actively Checked: 68206<br>
Services Passively Checked: 0<br>
Total Service State Change: 0.000 / 43.880 / 2.774 %<br>
Active Service Latency: 40.671 / 503.137 / 234.919 sec<br>
Active Service Execution Time: 0.003 / 24.737 / 2.527 sec<br>
Active Service State Change: 0.000 / 43.880 / 2.774 %<br>
Active Services Last 1/5/15/60 min: 0 / 2897 / 35932 / 68206<br>
Passive Service Latency: 0.000 / 0.000 / 0.000 sec<br>
Passive Service State Change: 0.000 / 0.000 / 0.000 %<br>
Passive Services Last 1/5/15/60 min: 0 / 0 / 0 / 0<br>
Services Ok/Warn/Unk/Crit: 46943 / 56 / 7660 / 13547<br>
Services Flapping: 980<br>
Services In Downtime: 0<br>
<br>
Total Hosts: 34103<br>
Hosts Checked: 34103<br>
Hosts Scheduled: 34103<br>
Hosts Actively Checked: 34103<br>
Host Passively Checked: 0<br>
Total Host State Change: 0.000 / 63.820 / 2.598 %<br>
Active Host Latency: 0.000 / 474.337 / 247.944 sec<br>
Active Host Execution Time: 0.000 / 20.354 / 2.033 sec<br>
Active Host State Change: 0.000 / 63.820 / 2.598 %<br>
Active Hosts Last 1/5/15/60 min: 0 / 5936 / 29437 / 34103<br>
Passive Host Latency: 0.000 / 0.000 / 0.000 sec<br>
Passive Host State Change: 0.000 / 0.000 / 0.000 %<br>
Passive Hosts Last 1/5/15/60 min: 0 / 0 / 0 / 0<br>
Hosts Up/Down/Unreach: 23591 / 10512 / 0<br>
Hosts Flapping: 597<br>
Hosts In Downtime: 0<br>
<br>
Active Host Checks Last 1/5/15 min: 3 / 89 / 209<br>
Scheduled: 0 / 0 / 0<br>
On-demand: 3 / 89 / 209<br>
Parallel: 0 / 0 / 0<br>
Serial: 0 / 0 / 0<br>
Cached: 3 / 89 / 209<br>
Passive Host Checks Last 1/5/15 min: 0 / 0 / 0<br>
Active Service Checks Last 1/5/15 min: 0 / 0 / 0<br>
Scheduled: 0 / 0 / 0<br>
On-demand: 0 / 0 / 0<br>
Cached: 0 / 0 / 0<br>
Passive Service Checks Last 1/5/15 min: 0 / 0 / 0<br>
<br>
External Commands Last 1/5/15 min: 0 / 0 / 0<u></u><u></u></div></div><div><div></div><div class="h5">
<div>
<p class="MsoNormal">On Tue, Aug 23, 2011 at 6:14 PM, Sven Nierlein <<a href="mailto:Sven.Nierlein@consol.de" target="_blank">Sven.Nierlein@consol.de</a>> wrote:<u></u><u></u></p>
<div>
<p class="MsoNormal" style="margin-bottom: 12pt;">On 8/23/11 22:21, Rodney Ramos wrote:<br>
> When I changed max_concurrent_checks from "0" to "200", the nagios process fell to 30/50%. However, the latency increased a lot, going to more than 1000 sec!!<u></u><u></u></p>
</div>
<p class="MsoNormal">Which means you usually have more than 200 concurrent checks. Maybe 400-500. When I compare that to your initial mail, which mentioned about 60k services + 30k hosts in a 15 min interval, I get only 100 checks/second. Are you sure about the 15 min
interval? How many checks do you have per second? Did you change your interval_length?<br>
<br>
Sven<u></u><u></u></p>
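One way to sanity-check the concurrency estimate is Little's law: the average number of checks in flight is roughly the check rate multiplied by the mean time each check takes. The sketch below is only an approximation under steady-state assumptions. It uses the host and service totals and the average execution times from the nagiostats output posted elsewhere in this thread, assumes interval_length=60, and the function name is made up for illustration.

```python
def mean_concurrency(check_count: int, interval_sec: float, mean_exec_sec: float) -> float:
    """Little's law: average checks in flight = arrival rate * mean time per check."""
    arrival_rate = check_count / interval_sec  # checks started per second
    return arrival_rate * mean_exec_sec


# 15-minute check interval, assuming interval_length=60.
INTERVAL_SEC = 15 * 60

# Counts and average execution times taken from the nagiostats output in this thread.
services_in_flight = mean_concurrency(68206, INTERVAL_SEC, 2.527)
hosts_in_flight = mean_concurrency(34103, INTERVAL_SEC, 2.033)
total_in_flight = services_in_flight + hosts_in_flight

print(f"estimated checks in flight: {total_in_flight:.0f}")
print(f"exceeds max_concurrent_checks=200: {total_in_flight > 200}")
```

Under these assumptions the average alone is above 200, so a cap of 200 would leave checks queued behind the limit, and bursts would push the real peak higher than the average.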
<div>
<p class="MsoNormal"><br>
------------------------------------------------------------------------------<br>
EMC VNX: the world's simplest storage, starting under $10K<br>
The only unified storage solution that offers unified management<u></u><u></u></p>
</div>
<p class="MsoNormal">Up to 160% more powerful than alternatives and 25% more efficient.<br>
Guaranteed. <a href="http://p.sf.net/sfu/emc-vnx-dev2dev" target="_blank">http://p.sf.net/sfu/emc-vnx-dev2dev</a><u></u><u></u></p>
<div>
<div>
<p class="MsoNormal">_______________________________________________<br>
Nagios-devel mailing list<br>
<a href="mailto:Nagios-devel@lists.sourceforge.net" target="_blank">Nagios-devel@lists.sourceforge.net</a><br>
<a href="https://lists.sourceforge.net/lists/listinfo/nagios-devel" target="_blank">https://lists.sourceforge.net/lists/listinfo/nagios-devel</a><u></u><u></u></p>
</div>
</div>
</div>
<p class="MsoNormal"><u></u> <u></u></p>
</div></div></div>
</div>
<br></blockquote></div><br>