Latency problem - was - Nagios Performance Data shows checks aren't being completed
Serveur-Faucon Surveillance
SrvFaucon at cslaval.qc.ca
Tue Mar 7 16:52:06 CET 2006
And the weirdest part is that only the NetWare and Windows clients are showing this kind of performance problem.
NetWare hosts are checked with the client program mrtgext.nlm
and Windows hosts are checked with the nsclient program.
Everything else is working fine: switches, Linux (nrpe), UPS, etc.
---------------------------------------------------
Alexandre Racine - Gardien Virtuel - Sécurité Informatique www.gardienvirtuel.com
Montréal, Québec, Canada
>>> "Serveur-Faucon Surveillance" <SrvFaucon at cslaval.qc.ca> 2006-03-07 10:01:11 >>>
Ditto here.
I thought that there were too many processes running at once, so I changed this line in nagios.cfg:
max_service_check_spread=50
But nothing changed. I have 1300 service checks, so there were about 900 processes running simultaneously at first. But even changing max_service_check_spread did not change a thing. I'll go on with some tests and tweaks.
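For reference, a minimal sketch of how the two related directives differ, going by the Nagios 2.x main configuration documentation: max_service_check_spread is a time window in minutes over which the initial checks are scheduled after startup, while max_concurrent_checks is the directive that caps how many service checks run in parallel. The value 200 below is purely illustrative and does not come from this thread.

```
# nagios.cfg (illustrative values only)
# Spread the initial service checks over at most 50 minutes after startup.
max_service_check_spread=50
# Cap on service checks running in parallel; 0 means no limit.
# 200 is a hypothetical value chosen for illustration.
max_concurrent_checks=200
```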
---------------------------------------------------
Alexandre Racine - Gardien Virtuel - Sécurité Informatique www.gardienvirtuel.com
Montréal, Québec, Canada
>>> kate.harris at gmail.com 2006-03-07 06:32:48 >>>
I had a similar problem and thought I had fixed it.
My situation is that I have 922 services to check (at the moment, I need to
ramp up to over 2,500 but the latency problem is a show-stopper at the
moment). I'm using a very low-spec Dell running Solaris 10 with Nagios 2.0
to do it. Using default settings, I was initially getting average check
latencies of the order of 5-6 seconds, which was fine, but after a day or so
of no Nagios restarts, that figure would rocket to 100 seconds and stay
there, not ever re-checking the majority of the services, with re-scheduled
check times staying in the past, until I did a nagios reload.
There was one directive which solved the stale re-check times:-
check_for_orphaned_services=1
Also, I reduced a couple of timeout values so that Nagios stopped wasting
time on checks which were bound to fail:-
service_check_timeout=30
host_check_timeout=30
event_handler_timeout=30
notification_timeout=30
Given that the load on the machine doesn't appear to go over 0.50, I've
allowed unlimited concurrent service checks now, increased from 400, but
that appears to be making no difference at all. And I left the reaper
frequency at 10 seconds. With these changes, the checks were being
re-scheduled for times in the future, and the latencies stopped running
away quite so dramatically.
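Put together, the settings described above would look roughly like this in nagios.cfg. The directive names max_concurrent_checks and service_reaper_frequency are assumed from the Nagios 2.x main configuration documentation, since the message describes those two settings only in words; the values are the ones stated in the message.

```
# nagios.cfg fragment (sketch of the settings described in this message)
# Re-schedule checks whose results never came back.
check_for_orphaned_services=1
# Timeouts in seconds, reduced so failing checks give up sooner.
service_check_timeout=30
host_check_timeout=30
event_handler_timeout=30
notification_timeout=30
# 0 = no limit on parallel service checks (previously 400).
max_concurrent_checks=0
# Seconds between reaper runs that collect finished check results.
service_reaper_frequency=10
```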
This is the state of things at the moment:-
Active Service Checks:

  Time Frame              Checks Completed
  <= 1 minute:            107 (11.6%)
  <= 5 minutes:           593 (64.3%)
  <= 15 minutes:          922 (100.0%)
  <= 1 hour:              922 (100.0%)
  Since program start:    922 (100.0%)

  Metric                  Min.        Max.        Average
  Check Execution Time:   0.06 sec    19.70 sec   0.139 sec
  Check Latency:          0.00 sec    17.19 sec   2.164 sec
  Percent State Change:   0.00%       0.00%       0.00%

Passive Service Checks:

  Time Frame              Checks Completed
  <= 1 minute:            0 (0.0%)
  <= 5 minutes:           0 (0.0%)
  <= 15 minutes:          0 (0.0%)
  <= 1 hour:              0 (0.0%)
  Since program start:    0 (0.0%)

  Metric                  Min.        Max.        Average
  Percent State Change:   0.00%       0.00%       0.00%

Active Host Checks:

  Time Frame              Checks Completed
  <= 1 minute:            1 (0.9%)
  <= 5 minutes:           4 (3.6%)
  <= 15 minutes:          5 (4.5%)
  <= 1 hour:              5 (4.5%)
  Since program start:    11 (9.8%)

  Metric                  Min.        Max.        Average
  Check Execution Time:   0.02 sec    13.52 sec   0.170 sec
  Check Latency:          0.00 sec    8.16 sec    0.073 sec
  Percent State Change:   0.00%       0.00%       0.00%

Passive Host Checks:

  Time Frame              Checks Completed
  <= 1 minute:            0 (0.0%)
  <= 5 minutes:           0 (0.0%)
  <= 15 minutes:          0 (0.0%)
  <= 1 hour:              0 (0.0%)
  Since program start:    0 (0.0%)

  Metric                  Min.        Max.        Average
  Percent State Change:   0.00%       0.00%       0.00%
However, the latencies are creeping upwards again, albeit very very slowly
and at some point I think I'll have to do a reload just to get the checking
back on track again.
Has anyone got any ideas on where I should be looking to make this better?
K
--
Kate Harris
http://www.totkat.org/
_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
::: Messages without supporting info will risk being sent to /dev/null