Escalations and latency
andy at droidmcse.com
Wed Nov 6 19:44:51 CET 2002
Hey Gang!
Ethan - Awesome program. Great work!
I have a dual 1.4 GHz Compaq server with 2 GB of memory. The sole purpose
of this box is to run Nagios.
Here is my observation and I'm hoping that someone can offer up a solution
or at least an explanation of why this is the way it is:
I have about 450 hosts and a little over 1300 service checks occurring. 75%
of those are a standard set of NT checks - cpu load, memory, disk space,
etc. Outside of those checks, I'm doing specific checks for things like
web servers. Notice I have contact_groups set to nt-admins-tier1.
define service{
use generic-service ; Name of service template to use
host_name usmnli05,usmnli06
service_description Promo Planning
max_check_attempts 2
retry_check_interval 1
contact_groups nt-admins-tier1
notification_options w,u,c,r
check_command "$USER1$/check_http -H $HOSTADDRESS$ -u /promoplanning/asp/home.asp"
}
define serviceescalation{
host_name usmnli05,usmnli06
service_description Promo Planning
contact_groups nt-admins-tier1
first_notification 1
last_notification 4
notification_interval 30
}
define serviceescalation{
host_name usmnli05,usmnli06
service_description Promo Planning
contact_groups nt-admins-tier2
first_notification 2
last_notification 4
notification_interval 30
}
define serviceescalation{
host_name usmnli05,usmnli06
service_description Promo Planning
contact_groups nt-admins-tier3
first_notification 3
last_notification 3
notification_interval 30
}
define serviceescalation{
host_name usmnli05,usmnli06
service_description Promo Planning
contact_groups nt-admins-tier1
first_notification 4
last_notification 4
notification_interval 30
}
My intention is to notify the NT tier structure 4 times at 30 minute
intervals. On the 3rd notification, I generate an email message that logs a
help desk ticket. This is a cover my a** attempt. If sendpage has died, at
least I am pushing off the job on our help desk to get the ticket logged
and resolved.
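To sanity-check the four escalation entries above, here is a rough Python sketch (not Nagios code) that lists which contact groups each notification number would reach. It assumes that every escalation whose first_notification/last_notification range contains the notification number applies, and that the service's own contact_groups are used when none match. The names ESCALATIONS, DEFAULT_GROUPS and groups_for are made up for the illustration.

```python
# Escalations copied from the config above: (contact_group, first, last).
ESCALATIONS = [
    ("nt-admins-tier1", 1, 4),
    ("nt-admins-tier2", 2, 4),
    ("nt-admins-tier3", 3, 3),
    ("nt-admins-tier1", 4, 4),
]

# contact_groups from the service definition, assumed to be the fallback
# when no escalation range matches.
DEFAULT_GROUPS = ["nt-admins-tier1"]


def groups_for(notification_number):
    """Return the contact groups reached by the given notification number."""
    matched = []
    for group, first, last in ESCALATIONS:
        in_range = first <= notification_number <= last
        if in_range and group not in matched:
            matched.append(group)
    # Assumption: with no matching escalation, the service's groups apply.
    return matched or list(DEFAULT_GROUPS)


if __name__ == "__main__":
    for n in range(1, 6):
        print(n, ", ".join(groups_for(n)))
```

If that reading of the escalation rules is right, tier1 is already covered on notification 4 by the first entry, so the fourth serviceescalation block looks redundant.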
Now that I think I've painted a semi-clear picture of my intentions, here
is my problem:
[root@mnmslx11 etc]# ../bin/nagios -s ./nagios.cfg
SERVICE SCHEDULING INFORMATION
-------------------------------
Total services: 1314
Total hosts: 452
Rough guidelines for max_concurrent_checks value:
-------------------------------------------------
Absolute minimum value: 12
Recommend value: 36
According to this information, I only need to execute 36 checks
simultaneously to get all of my checks done.
Immediately after I start nagios, and it starts running the checks, my
latency increases to a specific point that appears to hold steady. It
works its way steadily up to Min: 120, Max: 355, Avg: 250. And it holds
steady there.
I just checked the service_check_timeout. I changed it from 60 down to 20
and it has helped dramatically. I have also changed the
service_reaper_frequency from 10 to 5. However, my box is still pushing
the max number of checks, which I have set at 400.
What this is telling me is that the service checks aren't getting dumped
when they finish. They are taking an average of 0.8 seconds to complete,
but they are not going away to make room for the next check. If I'm
utilizing my very poor math skills correctly, 0.8 seconds and 400 checks at
a time - I should be able to complete 400 checks (approx) in 1 minute.
With the vast majority of my checks being on 5 minute intervals, there
should be plenty of breathing room to complete the 1300 checks in a 5
minute window.
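To put numbers on that estimate, here is a back-of-the-envelope Python sketch. It assumes all 1314 checks run on a 5 minute interval (only "the vast majority" actually do) and that each check holds a slot for the 0.8 second average, and it ignores reaper delays and timeouts. The function names are invented for the illustration.

```python
def checks_per_second(total_checks, interval_seconds):
    """Average rate at which checks must start to finish one full cycle."""
    return total_checks / interval_seconds


def average_concurrency(total_checks, interval_seconds, avg_check_seconds):
    """Average number of checks in flight (arrival rate times duration)."""
    rate = checks_per_second(total_checks, interval_seconds)
    return rate * avg_check_seconds


def capacity_per_second(slots, avg_check_seconds):
    """Checks per second the slots could complete if each frees on time."""
    return slots / avg_check_seconds


if __name__ == "__main__":
    total = 1314        # "Total services" from the nagios -s output
    interval = 5 * 60   # assumed 5 minute check interval, in seconds
    duration = 0.8      # reported average check execution time, in seconds
    slots = 400         # configured max number of concurrent checks

    print("required rate:", checks_per_second(total, interval))
    print("avg in flight:", average_concurrency(total, interval, duration))
    print("capacity/sec: ", capacity_per_second(slots, duration))
```

Under those assumptions the load works out to roughly 4.4 checks per second with about 3.5 in flight on average, against a theoretical capacity of about 500 checks per second. So "400 checks in 1 minute" is a very conservative figure, and sitting at the 400 limit suggests slots are being held much longer than 0.8 seconds.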
Someone - please chime in and offer up some advice. Any suggestions would
be greatly appreciated.
Thanks!
Andy