Confused on Nagios check queue processing of down host
Frater, Greg J
gjfrater at bechtel.com
Mon Nov 25 17:39:39 CET 2002
Hello All,
I'm running Nagios 1.0b6 on RH 7.3 (kernel 2.4.18) on a Compaq server (dual
processor, 1.4 GHz, 512 MB RAM, RAID 5), with the expectation of checking 650
hosts using 1500-2000 service checks. When I put the initial set of checks
on the server I noticed very large check latency.
After digging into the problem I found that it is caused by one (or more)
down hosts getting stuck in the scheduling queue. When a service check for a
down host reaches the top of the scheduling queue, it stays there for about
5 minutes, give or take 30 seconds, and the queue backs up behind it. With a
5 minute late start, Nagios may or may not ever catch up, depending on the
number of checks being done. Even though Nagios is running parallelized
checks, it stops all of them (according to the CGI) until the 5 minutes are
up; then the down host's service check clears the queue and Nagios continues
processing the other checks.

From the mailing list archive it looks like others are having similar
problems, showing up as high check latency. The way I read the
documentation, this should be prevented by service_check_timeout and
host_check_timeout. Surely this hang is not by design. Would this be
considered a bug, or could I have things misconfigured? Below are my
configs; let me know if I left out anything that could help figure this
out. I appreciate any help or suggestions for fixing this problem.
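As a rough illustration of why a 5 minute stall may never be recovered, here is a minimal Python sketch. It assumes every check runs on a 5 minute interval (normal_check_interval 5 with interval_length 60), and it invents a `max_rate` figure (checks per second the server can sustain) that is not measured anywhere in this report; `catch_up_seconds` is a hypothetical helper, not part of Nagios.

```python
def catch_up_seconds(num_checks, interval_s, stall_s, max_rate):
    """Estimate how long it takes to clear the backlog left by a stall.

    Returns None when the server has no spare capacity and never catches up.
    max_rate is a hypothetical sustained throughput in checks per second.
    """
    required_rate = num_checks / interval_s  # checks/s coming due in steady state
    backlog = required_rate * stall_s        # checks that came due during the stall
    spare = max_rate - required_rate         # capacity left over to work off the backlog
    if spare <= 0:
        return None
    return backlog / spare


# 2000 service checks, 300 s interval, 300 s stall, two made-up throughput figures
print(catch_up_seconds(2000, 300, 300, 10.0))
print(catch_up_seconds(2000, 300, 300, 6.0))
```

With these hypothetical rates, a server that can sustain 10 checks per second works off the backlog in roughly ten minutes, while one limited to 6 per second never does, because 2000 checks every 300 seconds already needs about 6.7 per second.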
At the moment the down host hits the queue, my vitals look like this:
Time Frame            Checks Completed
<= 1 minute:          41 (15.5%)
<= 5 minutes:         252 (95.5%)
<= 15 minutes:        264 (100.0%)
<= 1 hour:            264 (100.0%)
Since program start:  264 (100.0%)

Metric                 Min.     Max.   Average
Check Execution Time:  2 sec    6 sec  2.648 sec
Check Latency:         < 1 sec  1 sec  0.004 sec

Process Status: OK
Check Command Output: Nagios ok: located 5 processes, status log updated 9 seconds ago
At about the 5 minute mark it looks like this:
Time Frame            Checks Completed
<= 1 minute:          26 (9.8%)
<= 5 minutes:         26 (9.8%)
<= 15 minutes:        264 (100.0%)
<= 1 hour:            264 (100.0%)
Since program start:  264 (100.0%)

Metric                 Min.    Max.     Average
Check Execution Time:  2 sec   10 sec   2.678 sec
Check Latency:         4 sec   301 sec  152.087 sec
Percent State Change:  0.00%   6.12%    0.02%

Process Status: WARNING
Check Command Output: Nagios problem: located 4 processes, status log updated 309 seconds ago
During this entire time there are no changes in the scheduling queue.
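The latency figures at the 5 minute mark look like what one would expect if every check were held until the stall ended. A small sketch, assuming (an assumption, not something the CGI reports) that the waiting checks were scheduled evenly across the stall window, so their latencies are spread uniformly between the observed minimum and maximum:

```python
def uniform_mean(low, high):
    """Mean of values spread evenly between low and high."""
    return (low + high) / 2


# Check Latency row from the 5 minute stats: min, max, and reported average
observed_min, observed_max, observed_avg = 4, 301, 152.087

expected_avg = uniform_mean(observed_min, observed_max)
print(f"expected {expected_avg:.1f} sec, observed {observed_avg} sec")
```

The two averages agree to within half a second, which fits the picture of the whole queue waiting behind one check rather than a few slow checks skewing the numbers.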
nagios.cfg:
#global_service_event_handler=somecommand
# n = None - don't use any delay between checks
# d = Use a "dumb" delay of 1 second between checks
# s = Use "smart" inter-check delay calculation
# x.xx = Use an inter-check delay of x.xx seconds
inter_check_delay_method=s
# s = Use "smart" interleave factor calculation
# x = Use an interleave factor of x, where x is a
# number greater than or equal to 1.
service_interleave_factor=s
max_concurrent_checks=0
service_reaper_frequency=10
sleep_time=1
service_check_timeout=20
host_check_timeout=30
event_handler_timeout=30
notification_timeout=30
ocsp_timeout=5
perfdata_timeout=5
retain_state_information=0
state_retention_file=/usr/local/nagios/var/status.sav
retention_update_interval=60
use_retained_program_state=1
interval_length=60
use_agressive_host_checking=0
checkcommands.cfg:
# 'check_ping' command definition
define command{
command_name check_ping
command_line $USER1$/check_ping -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -p $ARG3$
}
# 'check-nt-alive' command definition
define command{
command_name check-nt-alive
command_line $USER1$/check_tcp -H $HOSTADDRESS$ -p 135
}
# 'check-cisco-alive' command definition
define command{
command_name check-cisco-alive
command_line $USER1$/check_tcp -H $HOSTADDRESS$ -p 23
}
services.cfg:
define service{
name ping-templ
service_description PING
is_volatile 0
check_command check_ping!100.0,60%!500.0,100%!3
max_check_attempts 3
normal_check_interval 5
retry_check_interval 1
active_checks_enabled 1
passive_checks_enabled 0
check_period 24x7
obsess_over_service 1
check_freshness 0
flap_detection_enabled 1
process_perf_data 1
retain_status_information 1
retain_nonstatus_information 1
notification_interval 120
notification_period 24x7
notification_options w,u,c,r
notifications_enabled 1
stalking_options w
register 0
}
# Ping Servers definition
define service{
use ping-templ ; Name of service template to use
host_name SRV0001,SRV0002,SRV0003,SRV0004,SRV0005,SRV0006,SRV0007,SRV0009,SRV0010,SRV0011,SRV0012,SRV0013,SRV0014,SRV0015,SRV0016,SRV0017,SRV0018,SRV0019,SRV0020,SRV0021,SRV0022,SRV0023,SRV0024,SRV0025,SRV0026,SRV0027,SRV0028,SRV0029,SRV0030,SRV0031,SRV0032,SRV0033,SRV0034,SRV0035,SRV0036,SRV0037,SRV0038,SRV0039,SRV0040,SRV0041,SRV0042,SRV0043,SRV0044,SRV0045,SRV0046,SRV0047,SRV0048,SRV0049,SRV0050,SRV0051,SRV0052,SRV0053,SRV0054,SRV0055,SRV0056,SRV0057,SRV0058,SRV0059,SRV0060,SRV0061,SRV0062,SRV0063,SRV0064,SRV0065,SRV0066,SRV0068,SRV0069,SRV0070,SRV0071,SRV0072,SRV0073,SRV0074,SRV0075,SRV0076,SRV0077,SRV0078,SRV0079,SRV0080,SRV0081,SRV0082,SRV0083,SRV0084,SRV0085,SRV0086,SRV0087,SRV0088,SRV0089,SRV0090,SRV0091,SRV0092,SRV0093,SRV0094,SRV0095,SRV0096,SRV0098,SRV0099,SRV0100,SRV0102,SRV0103,SRV0104,SRV0105,SRV0106,WTPS16193
contact_groups nt-admins
}
Thanks,
Greg Frater
WTP IT dept.
509 371 3537
gjfrater at bechtel.com