host check strangeness - odd behavior in Nagios scheduling queue
Frater, Greg J
GJFRATER at bechtel.com
Tue Jul 7 16:25:59 CEST 2009
Greetings All,
I'm seeing a problem with our host check scheduling. There are two
major issues, and I can't tell whether they are symptoms of the same
problem or two separate issues. I've provided the configs and information
that I know to be applicable. If there's other pertinent information,
please let me know; I'm more than happy to provide it.
First, here's my Nagios setup:
Single Nagios box (no distributed setup)
64-bit RHEL 5.3
Nagios 3.1.2 (I upgraded from 3.0.6 to see if that would fix the issues)
Problem 1. Some host checks are getting *stuck* in the scheduling queue.
When I look at the scheduling queue, these hosts are always listed with
a 'last check' time identical to their 'next check' time. See attached
screen shot (problem 1). They typically stay at the top of the queue
for an hour or two.
Host configuration for one of them:
define host {
host_name hostxxx
alias Oracle
use srvhost-os-2000,srvhost-physical,srvhost-oracle,srvhost-non-production,srvhost-all
notification_period aperture
register 1
}
Applicable Templates:
define host {
name generic-host
check_period 24x7
event_handler_enabled 1
flap_detection_enabled 1
process_perf_data 1
retain_status_information 1
retain_nonstatus_information 1
notifications_enabled 1
register 0
}
define host {
name generic-pnp
action_url /pnp/index.php?host=$HOSTNAME$' onmouseover="get_g('$HOSTNAME$','_HOST_')" onmouseout="clear_g()"
register 0
}
define host {
name srvhost-all
alias All Servers
check_command check-nt-alive
use generic-pnp,generic-host
max_check_attempts 3
check_interval 60
retry_interval 1
active_checks_enabled 1
passive_checks_enabled 1
flap_detection_enabled 1
process_perf_data 1
retain_status_information 1
retain_nonstatus_information 1
contact_groups +servers
notification_interval 240
notification_period 24x7
notification_options d,u,r
notifications_enabled 1
register 0
}
define host {
name srvhost-non-production
alias Non production servers
hostgroups +SRV_Cls-non-production
check_interval 120
retry_interval 20
passive_checks_enabled 1
contact_groups +servers
notification_interval 480
notification_period workhours
notification_options d,u,r
notifications_enabled 1
register 0
}
define host {
name srvhost-oracle
alias Oracle servers
hostgroups +SRV_app-oracle
contact_groups +oracle
register 0
}
define host {
name srvhost-physical
alias Servers that are running on physical hardware
hostgroups +SRV_platform-physical
register 0
}
define host {
name srvhost-os-2000
alias Servers running Windows 2000 Server
hostgroups +SRV_os-win2000
check_command check-nt-alive
register 0
}
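As a reading aid for the templates above, here is a minimal sketch of how the effective timing directives for hostxxx might resolve. It assumes that a directive set on the host itself wins, and that otherwise the value comes from the first template in the `use` list (searched depth-first) that defines it. The dictionaries are hand-copied from the definitions above and cover only a few scalar directives; additive `+` directives such as hostgroups are not modeled, and `resolve` is a hypothetical helper, not part of Nagios.

```python
# Hand-copied subset of the templates above (scalar directives only).
TEMPLATES = {
    "generic-host": {"check_period": "24x7"},
    "generic-pnp": {},
    "srvhost-all": {
        "use": ["generic-pnp", "generic-host"],
        "check_command": "check-nt-alive",
        "max_check_attempts": "3",
        "check_interval": "60",
        "retry_interval": "1",
        "notification_period": "24x7",
    },
    "srvhost-non-production": {
        "check_interval": "120",
        "retry_interval": "20",
        "notification_period": "workhours",
    },
    "srvhost-oracle": {},
    "srvhost-physical": {},
    "srvhost-os-2000": {"check_command": "check-nt-alive"},
}

HOST = {
    "host_name": "hostxxx",
    "use": ["srvhost-os-2000", "srvhost-physical", "srvhost-oracle",
            "srvhost-non-production", "srvhost-all"],
    "notification_period": "aperture",
}


def resolve(obj, directive, templates):
    """Return the local value if set, else the first value found by
    walking the 'use' list in order, depth-first (assumed precedence)."""
    if directive in obj:
        return obj[directive]
    for name in obj.get("use", []):
        value = resolve(templates[name], directive, templates)
        if value is not None:
            return value
    return None


for d in ("check_interval", "retry_interval",
          "check_period", "notification_period"):
    print(d, "=", resolve(HOST, d, TEMPLATES))
```

Under that assumed precedence, srvhost-non-production is listed before srvhost-all, so hostxxx would end up with check_interval 120 and retry_interval 20 rather than the 60 and 1 set in srvhost-all.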
Problem 2. Many of our hosts are not running host checks. They are in
the scheduling queue but don't execute. Looking at the scheduling queue,
I can see many hosts whose 'last check' times are from several weeks
ago. They show up in the queue but never run their host checks (or
don't seem to). These same hosts run service checks on time without
issue. Screen shot attached (problem 2).
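Both symptoms could also be listed without the web interface by reading the status file (the file named by the `status_file` directive in nagios.cfg). The following is a minimal sketch, assuming that file contains `hoststatus { ... }` blocks with `host_name=`, `last_check=` and `next_check=` lines holding Unix timestamps. The sample text, its timestamps, the host name `hostok`, the age threshold, and the helper names are all made up for illustration.

```python
import re

# Made-up excerpt in the assumed status file layout.
SAMPLE = """\
hoststatus {
    host_name=hostxxx
    last_check=1246976700
    next_check=1246976700
    }

hoststatus {
    host_name=hostxxxx
    last_check=1245000000
    next_check=1246976800
    }

hoststatus {
    host_name=hostok
    last_check=1246975000
    next_check=1246978600
    }
"""


def parse_hoststatus(text):
    """Return one dict of key=value fields per hoststatus block."""
    hosts = []
    for block in re.findall(r"hoststatus\s*\{(.*?)\n\s*\}", text, flags=re.S):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.strip().partition("=")
            if sep:
                fields[key] = value
        hosts.append(fields)
    return hosts


def suspicious(hosts, now, max_age):
    """Split hosts into 'stuck' (last_check == next_check) and
    'stale' (last_check older than max_age seconds)."""
    stuck, stale = [], []
    for h in hosts:
        last, nxt = int(h["last_check"]), int(h["next_check"])
        if last != 0 and last == nxt:
            stuck.append(h["host_name"])
        elif now - last > max_age:
            stale.append(h["host_name"])
    return stuck, stale


NOW = 1246976759          # illustrative "current time"
MAX_AGE = 4 * 3600        # illustrative threshold: four hours

stuck, stale = suspicious(parse_hoststatus(SAMPLE), NOW, MAX_AGE)
print("stuck:", stuck)
print("stale:", stale)
```

Against a real status file, `SAMPLE` would be replaced by the file contents and `NOW` by `int(time.time())`; the threshold should be chosen relative to the longest configured check interval.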
Host config for one of the hosts not running host checks:
define host {
host_name hostxxxx
alias media server
use srvhost-production,srvhost-physical,srvhost-os-2003,srvhost-all
register 1
}
define host {
name generic-host
check_period 24x7
event_handler_enabled 1
flap_detection_enabled 1
process_perf_data 1
retain_status_information 1
retain_nonstatus_information 1
notifications_enabled 1
register 0
}
define host {
name generic-pnp
action_url /pnp/index.php?host=$HOSTNAME$' onmouseover="get_g('$HOSTNAME$','_HOST_')" onmouseout="clear_g()"
register 0
}
define host {
name srvhost-all
alias All Servers
check_command check-nt-alive
use generic-pnp,generic-host
max_check_attempts 3
check_interval 60
retry_interval 1
active_checks_enabled 1
passive_checks_enabled 1
flap_detection_enabled 1
process_perf_data 1
retain_status_information 1
retain_nonstatus_information 1
contact_groups +servers
notification_interval 240
notification_period 24x7
notification_options d,u,r
notifications_enabled 1
register 0
}
define host {
name srvhost-os-2003
alias Servers running Windows 2003
hostgroups +SRV_os-win2003
check_command check-nt-alive
register 0
}
define host {
name srvhost-physical
alias Servers that are running on physical hardware
hostgroups +SRV_platform-physical
register 0
}
define host {
name srvhost-production
alias All servers in production mode
hostgroups +SRV_Cls-production
contact_groups +helpdesk,servers,servers-off-hours,thesolver
register 0
}
define command {
command_name check-nt-alive
command_line $USER1$/check_tcp -H $HOSTADDRESS$ -p 135 -t 30
}
Any ideas or help in tracking this down would be appreciated. I'm pretty
sure it's a bug in the code, but I suppose it's possible my configuration
is off somehow. :-)
Thanks Again,
-greg
_______________________________________________
Nagios-devel mailing list
Nagios-devel at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-devel