I had a similar problem and thought I had fixed it.<br><br>I currently have 922 services to check and need to ramp up to over 2,500, but the latency problem is a show-stopper. I'm running Nagios 2.0 on a very low-spec Dell with Solaris 10.
With the default settings, average check latency started out at around 5-6 seconds, which was fine, but after a day or so without a Nagios restart it would rocket to 100 seconds and stay there. Most services were never re-checked, and their re-scheduled check times stayed in the past until I did a Nagios reload.
<br><br>One directive solved the stale re-check times:-<br>check_for_orphaned_services=1<br><br>I also reduced the timeout values so that Nagios stopped wasting time on checks which were bound to fail:-
<br>service_check_timeout=30<br>host_check_timeout=30<br>event_handler_timeout=30<br>notification_timeout=30<br><br>Given that the load on the machine doesn't appear to go over 0.50, I've now removed the limit on concurrent service checks (it was 400), but that appears to make no difference at all. I left the reaper frequency at 10 seconds. With these changes, checks are being re-scheduled for times in the future, and the latencies have stopped running away quite so dramatically.
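<br><br>As a rough sketch of how these settings could be double-checked, the Python below pulls the relevant directives out of a nagios.cfg-style text. The first five names are the ones quoted above; `max_concurrent_checks` and `service_reaper_frequency` are my assumption of the names behind "concurrent service checks" and "reaper frequency", and the sample text is illustrative rather than a copy of the real file.

```python
# Directives quoted in the post, plus two names that are assumptions.
WATCHED = (
    "check_for_orphaned_services",
    "service_check_timeout",
    "host_check_timeout",
    "event_handler_timeout",
    "notification_timeout",
    "max_concurrent_checks",     # assumed name for the concurrency limit
    "service_reaper_frequency",  # assumed name for the reaper frequency
)


def read_directives(cfg_text, names=WATCHED):
    """Return {directive: value} for the watched directives found in cfg_text."""
    found = {}
    for raw in cfg_text.splitlines():
        line = raw.strip()
        # Skip blank lines, comment lines, and anything that is not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip()
        if key in names:
            found[key] = value.strip()
    return found


# Illustrative sample only; values mirror the ones described in the post.
SAMPLE_CFG = """
# main config (sample)
log_file=/var/log/nagios.log
check_for_orphaned_services=1
service_check_timeout=30
host_check_timeout=30
event_handler_timeout=30
notification_timeout=30
max_concurrent_checks=0
service_reaper_frequency=10
"""

settings = read_directives(SAMPLE_CFG)
for name in WATCHED:
    print(name, "=", settings.get(name, "(not set)"))
```

To run it against a real file, read the file's contents into a string and pass that to `read_directives` in place of `SAMPLE_CFG`.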
<br clear="all"><br>This is the state of things at the moment:-<br><br>Active Service Checks:<br>
Time Frame: Checks Completed<br>
<= 1 minute: 107 (11.6%)<br>
<= 5 minutes: 593 (64.3%)<br>
<= 15 minutes: 922 (100.0%)<br>
<= 1 hour: 922 (100.0%)<br>
Since program start: 922 (100.0%)<br>
<br>
Metric: Min. Max. Average<br>
Check Execution Time: 0.06 sec 19.70 sec 0.139 sec<br>
Check Latency: 0.00 sec 17.19 sec 2.164 sec<br>
Percent State Change: 0.00% 0.00% 0.00%<br>
<br>
Passive Service Checks:<br>
Time Frame: Checks Completed<br>
<= 1 minute: 0 (0.0%)<br>
<= 5 minutes: 0 (0.0%)<br>
<= 15 minutes: 0 (0.0%)<br>
<= 1 hour: 0 (0.0%)<br>
Since program start: 0 (0.0%)<br>
<br>
Metric: Min. Max. Average<br>
Percent State Change: 0.00% 0.00% 0.00%<br>
<br>
Active Host Checks:<br>
Time Frame: Checks Completed<br>
<= 1 minute: 1 (0.9%)<br>
<= 5 minutes: 4 (3.6%)<br>
<= 15 minutes: 5 (4.5%)<br>
<= 1 hour: 5 (4.5%)<br>
Since program start: 11 (9.8%)<br>
<br>
Metric: Min. Max. Average<br>
Check Execution Time: 0.02 sec 13.52 sec 0.170 sec<br>
Check Latency: 0.00 sec 8.16 sec 0.073 sec<br>
Percent State Change: 0.00% 0.00% 0.00%<br>
<br>
Passive Host Checks:<br>
Time Frame: Checks Completed<br>
<= 1 minute: 0 (0.0%)<br>
<= 5 minutes: 0 (0.0%)<br>
<= 15 minutes: 0 (0.0%)<br>
<= 1 hour: 0 (0.0%)<br>
Since program start: 0 (0.0%)<br>
<br>
Metric: Min. Max. Average<br>
Percent State Change: 0.00% 0.00% 0.00%<br><br>However, the latencies are creeping upwards again, albeit very slowly, and at some point I think I'll have to do a reload just to get the checking back on track.
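<br><br>For a sense of scale, here is a back-of-envelope calculation on the figures above, written as a small Python sketch. It assumes the extra services would have the same average execution time as the current ones, and the 5-minute window is a hypothetical comparison point, not a setting taken from my config.

```python
# Figures taken from the active service check statistics above.
SERVICES_NOW = 922
SERVICES_TARGET = 2500   # "over 2,500" in the post; used here as a round figure
AVG_EXEC_SEC = 0.139     # average check execution time

# Hypothetical window to compare against (an assumption, not from the config).
CHECK_WINDOW_SEC = 300


def serial_seconds(n_services, avg_exec_sec):
    """Time to run every check once, one after another, at the average cost."""
    return n_services * avg_exec_sec


now = serial_seconds(SERVICES_NOW, AVG_EXEC_SEC)
target = serial_seconds(SERVICES_TARGET, AVG_EXEC_SEC)

print(f"{SERVICES_NOW} services, run serially: {now:.1f} s")
print(f"{SERVICES_TARGET} services, run serially: {target:.1f} s")
print(f"fits in {CHECK_WINDOW_SEC} s window now: {now <= CHECK_WINDOW_SEC}")
print(f"fits in {CHECK_WINDOW_SEC} s window at target: {target <= CHECK_WINDOW_SEC}")
```

If the arithmetic holds, the plugins' own run time for 922 services adds up to only a couple of minutes even with no parallelism at all, which suggests the creeping latency comes from scheduling or result processing rather than from the checks themselves.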
<br><br>Has anyone got any ideas on where I should be looking to make this better?<br><br>K<br>-- <br>Kate Harris<br><a href="http://www.totkat.org/">http://www.totkat.org/</a>