<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta name="Generator" content="Microsoft Word 14 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
{font-family:Tahoma;
panose-1:2 11 6 4 3 5 4 4 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
margin-bottom:.0001pt;
font-size:12.0pt;
font-family:"Times New Roman","serif";}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:blue;
text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
{mso-style-priority:99;
color:purple;
text-decoration:underline;}
p.MsoAcetate, li.MsoAcetate, div.MsoAcetate
{mso-style-priority:99;
mso-style-link:"Balloon Text Char";
margin:0in;
margin-bottom:.0001pt;
font-size:8.0pt;
font-family:"Tahoma","sans-serif";}
span.BalloonTextChar
{mso-style-name:"Balloon Text Char";
mso-style-priority:99;
mso-style-link:"Balloon Text";
font-family:"Tahoma","sans-serif";}
span.EmailStyle19
{mso-style-type:personal-reply;
font-family:"Calibri","sans-serif";
color:#1F497D;}
.MsoChpDefault
{mso-style-type:export-only;
font-family:"Calibri","sans-serif";}
@page WordSection1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
{page:WordSection1;}
--></style>
</head>
<body lang="EN-US" link="blue" vlink="purple">
<div class="WordSection1">
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D">I don’t see anything obvious here, but that’s not always the case. This kind of debugging isn’t always cut and dried: each environment is different, and
while the scheduling is complicated, that complexity is what gives Nagios its power. I don’t see a lot of macro processing, which I have noticed can hurt a lot in big environments (I stripped out all but the absolutely necessary ones). Another thing I have seen before: if you have
a large number of service checks with long timeouts and they are actually timing out, that throws off the scheduler, because it has to deal with those long delays. Maybe you could post the output of nagiostats and see if that lends any info? It sounds like
the core daemon is busy doing something and schedules are getting pushed out, so it’s a matter of finding what it’s busy doing. I’ve also used strace in those cases to watch what it’s doing, but that can produce a lot of data very fast.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D">Dan<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><b><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">From:</span></b><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif""> Rodney Ramos [mailto:rodneyra@gmail.com]
<br>
<b>Sent:</b> Tuesday, August 23, 2011 3:22 PM<br>
<b>To:</b> Nagios Developers List<br>
<b>Subject:</b> Re: [Nagios-devel] Nagios and Gearman - huge environment performance problem<o:p></o:p></span></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal" style="margin-bottom:12.0pt">Hi, everybody. Sorry for taking so long to reply, but I was testing what was suggested.<br>
<br>
Well, I put all the files (status.dat, checkresults, nagios.tmp, nagios.log, etc.) on a RAM disk (/dev/shm). I also disabled all broker modules except mod_gearman, of course. I disabled flap detection, performance data processing, everything.
<br>
<br>
The result: absolutely nothing. No improvement. Nagios still sits at 100% CPU. Latency is still high, between 250 and 500 seconds.<br>
<br>
I've also tested the parameters "max_concurrent_checks", "check_result_reaper_frequency" and "max_check_result_reaper_time".<br>
<br>
When I changed max_concurrent_checks from "0" to "200", the CPU usage of the nagios process fell to 30-50%. However, the latency increased a lot, to more than 1000 seconds!<br>
<br>
I changed "check_result_reaper_frequency" from 10 to 5 seconds and "max_check_result_reaper_time" from 30 to 15 seconds. No big difference.<br>
<br>
I've enabled Nagios debugging too. I had to increase the debug file size because it fills up very fast. You can see some lines below.<br>
<br>
The conclusion: I think Nagios is not able to run active checks against so many hosts and services. It is a limitation of the tool. It has to do so much processing, such as scheduling and rescheduling, that all the active checks get delayed. And it is not Gearman's
fault. On the contrary, Gearman and mod_gearman do their jobs very well.<br>
<br>
But, as Daniel said, there is one thing I can't understand. Why is my CPU 87% idle? It's very weird. Is there something that would improve performance? A Nagios or operating system parameter?<br>
<br>
Thank you very much.<br>
<br>
===================<br>
Debug output:<br>
===================<br>
[1314129294.322456] [032.0] [pid=31793] ** Service Notification Attempt ** Host: '139874', Service: 'Memoria', Type: 0, Options: 0, Current State: 2, Last Notification: Wed Dec 31 21:00:00 1969<br>
[1314129294.322461] [001.0] [pid=31793] check_service_notification_viability()<br>
[1314129294.322464] [001.0] [pid=31793] check_time_against_period()<br>
[1314129294.322469] [032.1] [pid=31793] Notifications are temporarily disabled for this service, so we won't send one out.<br>
[1314129294.322473] [032.0] [pid=31793] Notification viability test failed. No notification will be sent out.<br>
[1314129294.322477] [016.1] [pid=31793] Rescheduling next check of service at Tue Aug 23 17:07:56 2011<br>
[1314129294.322481] [001.0] [pid=31793] get_next_valid_time()<br>
[1314129294.322484] [001.0] [pid=31793] check_time_against_period()<br>
[1314129294.322493] [001.0] [pid=31793] schedule_service_check()<br>
[1314129294.322498] [016.0] [pid=31793] Scheduling a non-forced, active check of service 'Memoria' on host 'mi139874' @ Tue Aug 23 17:07:56 2011<br>
[1314129294.337171] [001.0] [pid=31793] reschedule_event()<br>
[1314129294.337193] [001.0] [pid=31793] add_event()<br>
[1314129294.337590] [064.1] [pid=31793] Making callbacks (type 8)...<br>
[1314129294.337598] [064.1] [pid=31793] Making callbacks (type 20)...<br>
[1314129294.337605] [064.1] [pid=31793] Making callbacks (type 13)...<br>
[1314129294.337610] [064.1] [pid=31793] Making callbacks (type 20)...<br>
[1314129294.337630] [016.1] [pid=31793] Deleted check result file '(null)'<br>
[1314129294.337652] [016.1] [pid=31793] Handling check result for service 'Memoria' on host '167077'...<br>
[1314129294.337656] [001.0] [pid=31793] handle_async_service_check_result()<br>
[1314129294.337659] [016.0] [pid=31793] ** Handling check result for service 'Memoria' on host 'mi167077'...<br>
[1314129294.337662] [016.1] [pid=31793] HOST: mi167077, SERVICE: Memoria, CHECK TYPE: Active, OPTIONS: 0, SCHEDULED: Yes, RESCHEDULE: Yes, EXITED OK: Yes, RETURN CODE: 0, OUTPUT: OK: physical memory: Total: 3.49G - Used: 914M (25%) - Free: 2.6G (75%)|'physical
memory'=25%;90;95; \n<br>
[1314129294.337693] [016.1] [pid=31793] Service is OK.<br>
[1314129294.337697] [016.1] [pid=31793] Service did not change state.<br>
[1314129294.337707] [016.1] [pid=31793] Rescheduling next check of service at Tue Aug 23 17:08:06 2011<br>
[1314129294.337710] [001.0] [pid=31793] get_next_valid_time()<br>
[1314129294.337714] [001.0] [pid=31793] check_time_against_period()<br>
[1314129294.337724] [001.0] [pid=31793] schedule_service_check()<br>
[1314129294.337728] [016.0] [pid=31793] Scheduling a non-forced, active check of service 'Memoria' on host '167077' @ Tue Aug 23 17:08:06 2011<br>
[1314129294.352397] [001.0] [pid=31793] reschedule_event()<br>
[1314129294.352418] [001.0] [pid=31793] add_event()<br>
[1314129294.352603] [064.1] [pid=31793] Making callbacks (type 8)...<br>
[1314129294.352610] [064.1] [pid=31793] Making callbacks (type 20)...<br>
[1314129294.352616] [064.1] [pid=31793] Making callbacks (type 13)...<br>
[1314129294.352622] [064.1] [pid=31793] Making callbacks (type 20)...<br>
[1314129294.352625] [001.0] [pid=31793] check_for_service_flapping()<br>
[1314129294.352629] [016.1] [pid=31793] Checking service 'Memoria' on host '167077' for flapping...<br>
[1314129294.352633] [001.0] [pid=31793] check_for_host_flapping()<br>
[1314129294.352637] [016.1] [pid=31793] Checking host '167077' for flapping...<br>
[1314129294.352658] [016.1] [pid=31793] Deleted check result file '(null)'<br>
[1314129294.352679] [016.1] [pid=31793] Handling check result for service 'CPU' on host 'mi139447'...<br>
[1314129294.352683] [001.0] [pid=31793] handle_async_service_check_result()<br>
[1314129294.352686] [016.0] [pid=31793] ** Handling check result for service 'CPU' on host '139447'...<br>
[1314129294.352689] [016.1] [pid=31793] HOST: 139447, SERVICE: CPU, CHECK TYPE: Active, OPTIONS: 0, SCHEDULED: Yes, RESCHEDULE: Yes, EXITED OK: Yes, RETURN CODE: 2, OUTPUT: CHECK_NRPE: Socket timeout after 10 seconds.\n<br>
[1314129294.352702] [016.1] [pid=31793] Service is in a non-OK state!<br>
[1314129294.352706] [016.1] [pid=31793] Host is currently DOWN/UNREACHABLE.<br>
[1314129294.352709] [016.1] [pid=31793] Assuming host is in same state as before...<br>
[1314129294.352720] [032.0] [pid=31793] ** Host Notification Attempt ** Host: '139447', Type: 0, Options: 0, Current State: 1, Last Notification: Wed Dec 31 21:00:00 1969<br>
[1314129294.352725] [001.0] [pid=31793] check_host_notification_viability()<br>
[1314129294.352728] [001.0] [pid=31793] check_time_against_period()<br>
[1314129294.352738] [032.1] [pid=31793] Notifications are temporarily disabled for this host, so we won't send one out.<br>
[1314129294.352742] [032.0] [pid=31793] Notification viability test failed. No notification will be sent out.<br>
[1314129294.352745] [016.1] [pid=31793] Current/Max Attempt(s): 1/4<br>
[1314129294.352748] [016.1] [pid=31793] Host isn't UP, so we won't retry the service check...<br>
[1314129294.352762] [001.0] [pid=31793] process_macros()<br>
[1314129294.352766] [2048.1] [pid=31793] **** BEGIN MACRO PROCESSING ***********<br>
[1314129294.352769] [2048.1] [pid=31793] Processing: 'SERVICE ALERT: mi139447;CPU;$SERVICESTATE$;$SERVICESTATETYPE$;$SERVICEATTEMPT$;CHECK_NRPE: Socket timeout after 10 seconds.<br>
'<br>
[1314129294.352781] [2048.1] [pid=31793] Done. Final output: 'SERVICE ALERT: mi139447;CPU;CRITICAL;HARD;1;CHECK_NRPE: Socket timeout after 10 seconds.<br>
'<br>
[1314129294.352785] [2048.1] [pid=31793] **** END MACRO PROCESSING *************<br>
[1314129294.352831] [064.1] [pid=31793] Making callbacks (type 9)...<br>
[1314129294.352838] [001.0] [pid=31793] handle_service_event()<br>
[1314129294.352841] [064.1] [pid=31793] Making callbacks (type 30)...<br>
[1314129294.352848] [001.0] [pid=31793] run_global_service_event_handler()<br>
[1314129294.352852] [001.0] [pid=31793] check_for_external_commands()<br>
[1314129294.352858] [016.1] [pid=31793] Rescheduling next check of service at Tue Aug 23 17:07:56 2011<br>
[1314129294.352862] [001.0] [pid=31793] get_next_valid_time()<br>
[1314129294.352865] [001.0] [pid=31793] check_time_against_period()<br>
[1314129294.352871] [001.0] [pid=31793] schedule_service_check()<br>
[1314129294.352876] [016.0] [pid=31793] Scheduling a non-forced, active check of service 'CPU' on host '139447' @ Tue Aug 23 17:07:56 2011<br>
[1314129294.367552] [001.0] [pid=31793] reschedule_event()<br>
[1314129294.367576] [001.0] [pid=31793] add_event()<br>
[1314129294.367972] [064.1] [pid=31793] Making callbacks (type 8)...<br>
[1314129294.367979] [064.1] [pid=31793] Making callbacks (type 20)...<br>
[1314129294.367984] [064.1] [pid=31793] Making callbacks (type 13)...<br>
[1314129294.367990] [064.1] [pid=31793] Making callbacks (type 20)...<br>
[1314129294.367993] [001.0] [pid=31793] check_for_service_flapping()<br>
[1314129294.367997] [016.1] [pid=31793] Checking service 'CPU' on host '139447' for flapping...<br>
[1314129294.368001] [001.0] [pid=31793] check_for_host_flapping()<br>
[1314129294.368005] [016.1] [pid=31793] Checking host '139447' for flapping...<br>
[1314129294.368027] [016.1] [pid=31793] Deleted check result file '(null)'<br>
[1314129294.368049] [016.1] [pid=31793] Handling check result for service 'CPU' on host '139496'...<br>
[1314129294.368053] [001.0] [pid=31793] handle_async_service_check_result()<br>
[1314129294.368057] [016.0] [pid=31793] ** Handling check result for service 'CPU' on host '139496'...<br>
[1314129294.368060] [016.1] [pid=31793] HOST: 139496, SERVICE: CPU, CHECK TYPE: Active, OPTIONS: 0, SCHEDULED: Yes, RESCHEDULE: Yes, EXITED OK: Yes, RETURN CODE: 2, OUTPUT: CHECK_NRPE: Socket timeout after 10 seconds.\n<br>
[1314129294.368075] [016.1] [pid=31793] Service is in a non-OK state!<br>
[1314129294.368079] [016.1] [pid=31793] Host is currently DOWN/UNREACHABLE.<br>
[1314129294.368082] [016.1] [pid=31793] Assuming host is in same state as before...<br>
[1314129294.368094] [032.0] [pid=31793] ** Host Notification Attempt ** Host: 'mi139496', Type: 0, Options: 0, Current State: 1, Last Notification: Wed Dec 31 21:00:00 1969<br>
[1314129294.368098] [001.0] [pid=31793] check_host_notification_viability()<br>
[1314129294.368101] [001.0] [pid=31793] check_time_against_period()<br>
[1314129294.368111] [032.1] [pid=31793] Notifications are temporarily disabled for this host, so we won't send one out.<br>
[1314129294.368115] [032.0] [pid=31793] Notification viability test failed. No notification will be sent out.<br>
[1314129294.368118] [016.1] [pid=31793] Current/Max Attempt(s): 4/4<br>
[1314129294.368122] [016.1] [pid=31793] Service has reached max number of rechecks, so we'll handle the error...<br>
[1314129294.368125] [001.0] [pid=31793] check_for_service_flapping()<br>
[1314129294.368128] [016.1] [pid=31793] Checking service 'CPU' on host '139496' for flapping...<br>
[1314129294.368132] [001.0] [pid=31793] check_for_host_flapping()<br>
[1314129294.368135] [016.1] [pid=31793] Checking host '139496' for flapping...<br>
[1314129294.368138] [001.0] [pid=31793] service_notification()<br>
[1314129294.368144] [032.0] [pid=31793] ** Service Notification Attempt ** Host: '139496', Service: 'CPU', Type: 0, Options: 0, Current State: 2, Last Notification: Wed Dec 31 21:00:00 1969<br>
[1314129294.368148] [001.0] [pid=31793] check_service_notification_viability()<br>
[1314129294.368151] [001.0] [pid=31793] check_time_against_period()<br>
[1314129294.368156] [032.1] [pid=31793] Notifications are temporarily disabled for this service, so we won't send one out.<br>
[1314129294.368160] [032.0] [pid=31793] Notification viability test failed. No notification will be sent out.<br>
[1314129294.368165] [016.1] [pid=31793] Rescheduling next check of service at Tue Aug 23 17:07:56 2011<br>
[1314129294.368168] [001.0] [pid=31793] get_next_valid_time()<br>
[1314129294.368171] [001.0] [pid=31793] check_time_against_period()<br>
[1314129294.368176] [001.0] [pid=31793] schedule_service_check()<br>
[1314129294.368181] [016.0] [pid=31793] Scheduling a non-forced, active check of service 'CPU' on host 'mi139496' @ Tue Aug 23 17:07:56 2011<br>
[1314129294.382852] [001.0] [pid=31793] reschedule_event()<br>
[1314129294.382875] [001.0] [pid=31793] add_event()<br>
[1314129294.383268] [064.1] [pid=31793] Making callbacks (type 8)...<br>
[1314129294.383275] [064.1] [pid=31793] Making callbacks (type 20)...<br>
[1314129294.383281] [064.1] [pid=31793] Making callbacks (type 13)...<br>
[1314129294.383286] [064.1] [pid=31793] Making callbacks (type 20)...<br>
[1314129294.383320] [016.1] [pid=31793] Deleted check result file '(null)'<br>
[1314129294.383339] [016.1] [pid=31793] Handling check result for service 'Memoria' on host '167028'...<br>
[1314129294.383343] [001.0] [pid=31793] handle_async_service_check_result()<br>
[1314129294.383346] [016.0] [pid=31793] ** Handling check result for service 'Memoria' on host '167028'...<br>
[1314129294.383350] [016.1] [pid=31793] HOST: 167028, SERVICE: Memoria, CHECK TYPE: Active, OPTIONS: 0, SCHEDULED: Yes, RESCHEDULE: Yes, EXITED OK: Yes, RETURN CODE: 0, OUTPUT: OK: physical memory: Total: 3.49G - Used: 856M (23%) - Free: 2.65G (77%)|'physical
memory'=23%;90;95; \n<br>
[1314129294.383366] [016.1] [pid=31793] Service is OK.<br>
[1314129294.383370] [016.1] [pid=31793] Service did not change state.<br>
[1314129294.383380] [016.1] [pid=31793] Rescheduling next check of service at Tue Aug 23 17:08:06 2011<br>
[1314129294.383383] [001.0] [pid=31793] get_next_valid_time()<br>
[1314129294.383386] [001.0] [pid=31793] check_time_against_period()<br>
[1314129294.383396] [001.0] [pid=31793] schedule_service_check()<br>
[1314129294.383401] [016.0] [pid=31793] Scheduling a non-forced, active check of service 'Memoria' on host 'mi167028' @ Tue Aug 23 17:08:06 2011<br>
[1314129294.398073] [001.0] [pid=31793] reschedule_event()<br>
[1314129294.398096] [001.0] [pid=31793] add_event()<br>
[1314129294.398268] [064.1] [pid=31793] Making callbacks (type 8)...<br>
[1314129294.398275] [064.1] [pid=31793] Making callbacks (type 20)...<br>
[1314129294.398281] [064.1] [pid=31793] Making callbacks (type 13)...<br>
[1314129294.398287] [064.1] [pid=31793] Making callbacks (type 20)...<br>
[1314129294.398290] [001.0] [pid=31793] check_for_service_flapping()<br>
[1314129294.398293] [016.1] [pid=31793] Checking service 'Memoria' on host 'mi167028' for flapping...<br>
[1314129294.398298] [001.0] [pid=31793] check_for_host_flapping()<br>
[1314129294.398301] [016.1] [pid=31793] Checking host '167028' for flapping...<br>
[1314129294.398322] [016.1] [pid=31793] Deleted check result file '(null)'<br>
[1314129294.398337] [016.1] [pid=31793] Handling check result for service 'CPU' on host '166384'...<br>
[1314129294.398341] [001.0] [pid=31793] handle_async_service_check_result()<br>
[1314129294.398345] [016.0] [pid=31793] ** Handling check result for service 'CPU' on host '166384'...<br>
[1314129294.398348] [016.1] [pid=31793] HOST: 166384, SERVICE: CPU, CHECK TYPE: Active, OPTIONS: 0, SCHEDULED: Yes, RESCHEDULE: Yes, EXITED OK: Yes, RETURN CODE: 0, OUTPUT: OK: 15m: average load 2%|'15m'=2%;90;95; \n<br>
[1314129294.398363] [016.1] [pid=31793] Service is OK.<br>
[1314129294.398366] [016.1] [pid=31793] Service did not change state.<br>
[1314129294.398376] [016.1] [pid=31793] Rescheduling next check of service at Tue Aug 23 17:08:06 2011<br>
[1314129294.398379] [001.0] [pid=31793] get_next_valid_time()<br>
[1314129294.398383] [001.0] [pid=31793] check_time_against_period()<br>
[1314129294.398393] [001.0] [pid=31793] schedule_service_check()<br>
[1314129294.398398] [016.0] [pid=31793] Scheduling a non-forced, active check of service 'CPU' on host '166384' @ Tue Aug 23 17:08:06 2011<br>
[1314129294.413177] [001.0] [pid=31793] reschedule_event()<br>
[1314129294.413202] [001.0] [pid=31793] add_event()<br>
[1314129294.413373] [064.1] [pid=31793] Making callbacks (type 8)...<br>
[1314129294.413380] [064.1] [pid=31793] Making callbacks (type 20)...<br>
[1314129294.413387] [064.1] [pid=31793] Making callbacks (type 13)...<br>
[1314129294.413394] [064.1] [pid=31793] Making callbacks (type 20)...<br>
[1314129294.413397] [001.0] [pid=31793] check_for_service_flapping()<br>
[1314129294.413400] [016.1] [pid=31793] Checking service 'CPU' on host '166384' for flapping...<br>
[1314129294.413405] [001.0] [pid=31793] check_for_host_flapping()<br>
[1314129294.413409] [016.1] [pid=31793] Checking host '166384' for flapping...<br>
[1314129294.413432] [016.1] [pid=31793] Deleted check result file '(null)'<br>
[1314129294.413452] [016.1] [pid=31793] Handling check result for service 'CPU' on host '167022'...<br>
[1314129294.413455] [001.0] [pid=31793] handle_async_service_check_result()<br>
[1314129294.413459] [016.0] [pid=31793] ** Handling check result for service 'CPU' on host '167022'...<br>
[1314129294.413476] [016.1] [pid=31793] HOST: 167022, SERVICE: CPU, CHECK TYPE: Active, OPTIONS: 0, SCHEDULED: Yes, RESCHEDULE: Yes, EXITED OK: Yes, RETURN CODE: 0, OUTPUT: OK: 15m: average load 1%|'15m'=1%;90;95; \n<br>
[1314129294.413493] [016.1] [pid=31793] Service is OK.<br>
[1314129294.413497] [016.1] [pid=31793] Service did not change state.<br>
[1314129294.413506] [016.1] [pid=31793] Rescheduling next check of service at Tue Aug 23 17:08:06 2011<br>
[1314129294.413510] [001.0] [pid=31793] get_next_valid_time()<br>
[1314129294.413514] [001.0] [pid=31793] check_time_against_period()<br>
[1314129294.413523] [001.0] [pid=31793] schedule_service_check()<br>
[1314129294.413528] [016.0] [pid=31793] Scheduling a non-forced, active check of service 'CPU' on host i167022' @ Tue Aug 23 17:08:06 2011<br>
=================================================<o:p></o:p></p>
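<p class="MsoNormal">One way to find where the daemon spends its time is to look for the largest gaps between consecutive timestamps in a debug log like the one above. The following is a minimal Python sketch, not part of Nagios: the names parse_line and largest_gaps are made up for illustration, the line format is assumed from the excerpt above, and the sample lines are copied from it.</p>

```python
import re

# Assumed line format, taken from the excerpt: [epoch.usec] [level] [pid=N] message
LINE_RE = re.compile(r"^\[(\d+)\.(\d{6})\] \[[\d.]+\] \[pid=\d+\] (.*)$")


def parse_line(line):
    """Return (timestamp in microseconds, message), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    sec, usec, msg = m.groups()
    # Integer microseconds avoid float rounding on large epoch values.
    return int(sec) * 1_000_000 + int(usec), msg


def largest_gaps(lines, top=3):
    """Return the `top` largest gaps as (gap_us, message_before, message_after)."""
    entries = [e for e in map(parse_line, lines) if e is not None]
    gaps = [(t1 - t0, m0, m1) for (t0, m0), (t1, m1) in zip(entries, entries[1:])]
    gaps.sort(key=lambda g: g[0], reverse=True)
    return gaps[:top]


# Five consecutive lines copied from the debug output above.
SAMPLE = """
[1314129294.322493] [001.0] [pid=31793] schedule_service_check()
[1314129294.322498] [016.0] [pid=31793] Scheduling a non-forced, active check of service 'Memoria' on host 'mi139874' @ Tue Aug 23 17:07:56 2011
[1314129294.337171] [001.0] [pid=31793] reschedule_event()
[1314129294.337193] [001.0] [pid=31793] add_event()
[1314129294.337590] [064.1] [pid=31793] Making callbacks (type 8)...
"""

top_gap = largest_gaps(SAMPLE.strip().splitlines(), top=1)[0]
print(f"{top_gap[0]} us before: {top_gap[2]}")
```

<p class="MsoNormal">On these five lines the largest gap, about 14.7 ms, falls between the "Scheduling a non-forced, active check" entry and reschedule_event(), and a gap of the same size follows each rescheduled check in the excerpt. If that cost applied to every check, a single thread could reschedule only about 68 checks per second. That may be worth investigating, although a short excerpt cannot prove it.</p>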
<div>
<p class="MsoNormal">On Mon, Aug 22, 2011 at 7:23 PM, Daniel Wittenberg <<a href="mailto:daniel.wittenberg.r0ko@statefarm.com">daniel.wittenberg.r0ko@statefarm.com</a>> wrote:<o:p></o:p></p>
<div>
<div>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"><span style="font-size:11.0pt;color:#1F497D">What is interesting is that your CPU is 87% idle, which indicates to me that it’s waiting for something, or not scheduling the checks correctly.
Have you tried running in debug mode to see if that indicates anything? Running just about any of the plugins in debug mode can cause this too, so check whether you have logging turned up on things like nsca, nrpe, pnp4nagios, etc.</span><o:p></o:p></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"><span style="font-size:11.0pt;color:#1F497D"> </span><o:p></o:p></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"><span style="font-size:11.0pt;color:#1F497D">Dan</span><o:p></o:p></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"><span style="font-size:11.0pt;color:#1F497D"> </span><o:p></o:p></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"><span style="font-size:11.0pt;color:#1F497D"> </span><o:p></o:p></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"><b><span style="font-size:10.0pt">From:</span></b><span style="font-size:10.0pt"> Rodney Ramos [mailto:<a href="mailto:rodneyra@gmail.com" target="_blank">rodneyra@gmail.com</a>]
<br>
<b>Sent:</b> Friday, August 19, 2011 4:44 PM<o:p></o:p></span></p>
<div>
<div>
<p class="MsoNormal"><span style="font-size:10.0pt"><br>
<b>To:</b> Nagios Developers List<br>
<b>Subject:</b> Re: [Nagios-devel] Nagios and Gearman - huge environment performance problem<o:p></o:p></span></p>
</div>
</div>
<div>
<div>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;margin-bottom:12.0pt">Thanks, Daniel, but I don't think my problem is a hardware one. I created the ramdisk and the problem is the same:
<br>
- nagios eats 100% of CPU all the time;<br>
- nagios does not distribute the active checks smoothly. It waits a long time and then runs the active checks in a burst. I can see this with gearman_top: the gearmand waiting queue is empty almost all the time, but sometimes there is a burst
of jobs in the queue. I can't understand this behavior.<br>
<br>
Any help would be great. Thanks everybody.<br>
<br>
=========<br>
Top result<br>
=========<br>
<br>
top - 18:40:59 up 106 days, 16:56, 4 users, load average: 8.52, 6.09, 5.42<br>
Tasks: 215 total, 2 running, 213 sleeping, 0 stopped, 0 zombie<br>
Cpu(s): 12.5%us, 0.1%sy, 0.0%ni, 87.1%id, 0.3%wa, 0.0%hi, 0.0%si, 0.0%st<br>
Mem: 4916356k total, 1974976k used, 2941380k free, 163240k buffers<br>
Swap: 4194296k total, 22092k used, 4172204k free, 745100k cached<br>
<br>
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND<br>
2189 nagios 25 0 492m 255m 1668 R 100.1 5.3 66:54.59 nagios<br>
24658 nagios 15 0 561m 116m 676 S 0.7 2.4 62:00.96 gearmand<br>
<br>
<br>
<o:p></o:p></p>
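<p class="MsoNormal">The top output above may already explain the high idle figure. By default top reports a process's %CPU relative to a single CPU, so one process pinned at 100% on the 8-CPU server described in the original post accounts for only one eighth of the machine. Here is a small Python sketch of that arithmetic; the function name is hypothetical, and it assumes the nagios process is busy on one core at a time.</p>

```python
def system_wide_share(process_cpu_pct, n_cpus):
    """Convert a per-CPU percentage (as top shows per process) into a share of the whole machine."""
    return process_cpu_pct / n_cpus


# 100% of one CPU on an 8-CPU machine (both figures are from this thread).
busy = system_wide_share(100.0, 8)
idle = 100.0 - busy
print(f"busy: {busy:.1f}%  idle: {idle:.1f}%")
```

<p class="MsoNormal">This agrees with the 12.5%us and roughly 87% idle in the Cpu(s) line, so the idle time does not necessarily mean that Nagios is waiting on something. It may simply be that one core is saturated while the other seven have nothing to do.</p>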
<div>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">On Fri, Aug 19, 2011 at 1:31 PM, Daniel Wittenberg <<a href="mailto:daniel.wittenberg.r0ko@statefarm.com" target="_blank">daniel.wittenberg.r0ko@statefarm.com</a>> wrote:<o:p></o:p></p>
<div>
<div>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"><span style="font-size:11.0pt;color:#1F497D">Well, but look at your bi and bo, and then the wa column. It looks like you have some I/O wait, which probably means the system is waiting on disk
activity to get things done, with lots of writing to disk. Have you looked at adding a ramdisk for your checkresults, status.dat, and temp_file? That should help eliminate most of the heavy disk I/O from the Nagios perspective. Since it doesn’t look like
you are swapping memory, you should be able to throw some at a ramdisk. You can probably start with 64MB and watch it; you might have to go higher depending on your workload.</span><o:p></o:p></p>
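<p class="MsoNormal">To put numbers on the bi, bo, and wa columns, the vmstat rows can be averaged with a short script. This is a minimal Python sketch: summarize_vmstat is a hypothetical helper, and the three sample rows are copied from the vmstat output quoted further down in this thread.</p>

```python
def summarize_vmstat(text):
    """Average the bi, bo, id and wa columns of vmstat output, locating them by header name."""
    lines = [ln.split() for ln in text.strip().splitlines()]
    # The header row is the one that carries the column names.
    header = next(ln for ln in lines if "wa" in ln and "bo" in ln)
    # Data rows have one numeric token per header column.
    rows = [ln for ln in lines
            if len(ln) == len(header) and all(tok.isdigit() for tok in ln)]
    col = {name: i for i, name in enumerate(header)}

    def mean(name):
        return sum(int(r[col[name]]) for r in rows) / len(rows)

    return {name: mean(name) for name in ("bi", "bo", "id", "wa")}


# Rows 2 to 4 of the quoted output. The first row is left out because
# vmstat's first report shows averages since boot, not the last interval.
SAMPLE = """
 r  b   swpd    free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  2  22092 3032992 189664 904600    0    0  2733  7550 3498 7477 12  1 69 18  0
 1  2  22092 3018240 189668 918632    0    0  2720  4070 2484 5114 13  1 72 15  0
 1  0  22092 3008312 189668 930336    0    0  2332  1534 1932 3825 13  1 73 14  0
"""

stats = summarize_vmstat(SAMPLE)
print({name: round(value, 1) for name, value in stats.items()})
```

<p class="MsoNormal">Feeding it the full capture instead of three rows would show how often wa stays in the double digits and how bursty bo is.</p>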
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"><span style="font-size:11.0pt;color:#1F497D"> </span><o:p></o:p></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"><span style="font-size:11.0pt;color:#1F497D">Dan</span><o:p></o:p></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"><span style="font-size:11.0pt;color:#1F497D"> </span><o:p></o:p></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"><b><span style="font-size:10.0pt">From:</span></b><span style="font-size:10.0pt"> Rodney Ramos [mailto:<a href="mailto:rodneyra@gmail.com" target="_blank">rodneyra@gmail.com</a>]
<br>
<b>Sent:</b> Friday, August 19, 2011 11:27 AM<br>
<b>To:</b> Nagios Developers List<br>
<b>Subject:</b> Re: [Nagios-devel] Nagios and Gearman - huge environment performance problem</span><o:p></o:p></p>
<div>
<div>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;margin-bottom:12.0pt">Hi, Daniel,<br>
<br>
As we can see below, I think it is not a hardware problem. The idle CPU is between 60 and 80%, very good.<br>
<br>
Thank you very much.<br>
<br>
<br>
<span style="font-family:"Courier New"">$ vmstat 5<br>
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------<br>
r b swpd free buff cache si so bi bo in cs us sy id wa st<br>
1 2 22092 3046788 189640 890940 0 0 295 1053 0 0 4 3 83 10 0<br>
1 2 22092 3032992 189664 904600 0 0 2733 7550 3498 7477 12 1 69 18 0<br>
1 2 22092 3018240 189668 918632 0 0 2720 4070 2484 5114 13 1 72 15 0<br>
1 0 22092 3008312 189668 930336 0 0 2332 1534 1932 3825 13 1 73 14 0<br>
1 18 22092 2979292 189724 945780 0 0 1486 13974 2460 8446 16 2 72 10 0<br>
1 2 22092 2965244 189736 959228 0 0 2570 9094 3290 7204 13 1 67 19 0<br>
1 2 22092 2949064 189748 973100 0 0 2820 3040 2798 6639 13 2 68 17 0<br>
1 6 22092 2936060 189768 987788 0 0 2894 3620 2474 5443 13 1 70 16 0<br>
1 1 22092 2923320 189780 999708 0 0 2377 2618 2285 4794 13 1 70 16 0<br>
1 0 22092 2923428 189780 999964 0 0 0 4575 1732 2317 12 1 86 1 0<br>
1 9 22092 2912192 189784 1005260 0 0 402 4544 1541 3889 14 1 82 3 0<br>
1 7 22092 2891692 189808 1023020 0 0 2534 13969 3232 9421 14 2 66 17 0<br>
3 2 22092 2868908 189836 1037064 0 0 2797 4115 3002 7055 30 2 54 14 0<br>
2 2 22092 2860712 189860 1050376 0 0 2646 3352 2448 5416 16 1 67 17 0<br>
1 8 22092 2847052 189872 1064036 0 0 2748 3970 2616 5487 13 1 69 17 0<br>
1 0 22092 3469576 189876 462624 0 0 825 1245 1379 2098 12 1 83 5 0<br>
1 0 22092 3469248 189884 462720 0 0 4 2631 1552 2599 13 0 86 0 0<br>
1 20 22092 3449816 189904 482192 0 0 2404 8454 2293 7764 15 2 70 12 0<br>
1 17 22092 3434856 189912 495636 0 0 2694 8955 3542 8039 13 2 65 19 0<br>
2 7 22092 3422204 189932 509376 0 0 2742 4059 2685 5826 13 1 68 19 0<br>
1 13 22092 3407532 189948 522508 0 0 2661 3613 6447 49867 12 4 66 17 0<br>
0 0 22092 3404484 189968 525964 0 0 669 3338 5317 43602 10 4 81 6 0<br>
1 0 22092 3402004 189984 525956 0 0 0 14 3637 12700 13 1 85 0 0<br>
1 0 22092 3398172 190012 526036 0 0 0 3318 3972 12401 14 1 85 0 0<br>
2 0 22092 3392628 190028 526048 0 0 0 9331 5347 16423 15 3 81 1 0<br>
4 0 22092 3391704 190048 526060 0 0 0 4270 5785 18736 16 2 80 1 0<br>
1 1 22092 3391652 190064 526056 0 0 0 4091 4746 14669 16 2 82 1 0<br>
1 0 22092 3392104 190068 526056 0 0 0 1562 4037 11849 16 1 83 0 0<br>
3 0 22092 3392304 190084 526168 0 0 1 2532 4618 16418 15 2 83 0 0<br>
1 7 22092 3386028 190112 531488 0 0 967 363 4194 14941 15 2 77 6 0<br>
</span>On Fri, Aug 19, 2011 at 11:32 AM, Daniel Wittenberg <<a href="mailto:daniel.wittenberg.r0ko@statefarm.com" target="_blank">daniel.wittenberg.r0ko@statefarm.com</a>> wrote:<br>
><br>
> One simple thing that might help is just run vmstat for a couple minutes:<br>
><br>
> <br>
><br>
> vmstat 5<br>
><br>
> <br>
><br>
> That can help show if you are hitting some bottlenecks. Are you using a lot of macros in your configs?<br>
><br>
> <br>
><br>
> Dan<br>
><br>
> <br>
><br>
> From: Rodney Ramos [mailto:<a href="mailto:rodneyra@gmail.com" target="_blank">rodneyra@gmail.com</a>]<br>
> Sent: Friday, August 19, 2011 9:30 AM<br>
> To: Nagios Developers List<br>
> Subject: [Nagios-devel] Nagios and Gearman - huge environment performance problem<br>
><br>
> <br>
><br>
> Hi everybody,<br>
><br>
> I'm testing Nagios and Gearman / Mod_Gearman. I'd like to replace NSCA with this new approach, as it seems easier to configure and has a lot of advantages. Besides, NSCA and the Nagios freshness mechanism have some problems.<br>
><br>
> Gearman and mod_gearman are working well. I have 30000 hosts and 60000 services, and it is increasing!<br>
><br>
> Now I'm having a problem with Nagios performance: it eats 100% of CPU, and the host and service latency is very high, around 300 seconds. I think this is a Nagios problem, as gearman_top shows the Job Waiting queue empty almost all the time. It seems
that Nagios does not send the active checks continuously; once in a while it sends a burst of active checks.<br>
><br>
> I have a physical central server, running RHEL, with 4 GB of RAM, Intel(R) Xeon(R) CPU E5504 @ 2.00GHz (8 CPUs). For the workers I have 9 virtual servers, also running RHEL.<br>
><br>
> I've already set the Nagios parameters for a large environment, as recommended in the documentation, but it made no difference. Thanks.<br>
><br>
> Nagios parameters for a large environment:<br>
><br>
> - use_large_installation_tweaks=1<br>
><br>
> - enable_environment_macros=0<br>
><br>
> - max_concurrent_checks=0<br>
><br>
> - check_result_reaper_frequency=10<br>
><br>
> Could someone help me? How can I improve Nagios performance to make active checks faster?<br>
><br>
> Thank you very much.<br>
><br>
><br>
> _______________________________________________<br>
> Nagios-devel mailing list<br>
> <a href="mailto:Nagios-devel@lists.sourceforge.net" target="_blank">Nagios-devel@lists.sourceforge.net</a><br>
> <a href="https://lists.sourceforge.net/lists/listinfo/nagios-devel" target="_blank">
https://lists.sourceforge.net/lists/listinfo/nagios-devel</a><br>
><o:p></o:p></p>
</div>
</div>
</div>
</div>
</div>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
</div>
</div>
</div>
</div>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
</body>
</html>