Nagios v3.5.0 transitioning immediately to a HARD state upon host problem
C. Bensend
benny at bennyvision.com
Thu May 23 17:43:49 CEST 2013
Hey folks,
I recently made two major changes to my Nagios environment:
1) I upgraded to v3.5.0.
2) I moved from a single server to two pollers sending passive
results to one central console server (a rough sketch of that
plumbing follows below).
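For context, the pollers push their host and service results to
the console in the usual obsess-over / NSCA fashion. A rough
sketch of the poller-side plumbing (the paths, wrapper script
name, and macros here are an approximation for illustration, not
my literal config):
# nagios.cfg on each poller (approximate)
obsess_over_hosts=1
ochp_command=submit_host_check_result
# (services go the same way via obsess_over_services and
# ocsp_command)
# and the command that hands the result to send_nsca,
# tab-delimited:
define command {
command_name submit_host_check_result
command_line /usr/local/nagios/libexec/eventhandlers/submit_host_check_result $HOSTNAME$ $HOSTSTATEID$ '$HOSTOUTPUT$'
}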
Now, this new distributed system was in place for several months
while I tested, and it worked fine. HOWEVER, since this was running
in parallel with my production system, notifications were disabled.
Hence, I didn't see this problem until I cut over for real and
enabled notifications.
(please excuse any cut-n-paste ugliness, had to send this info from
my work account via Outlook and then try to cleanse and reformat
via Squirrelmail)
As a test and to capture information, I rebooted 'hostname'. This
log is from the nagios-console host, which is the host that accepts
the passive check results and sends notifications. Here is the
console host receiving a service check failure when the host is
restarting:
May 22 15:57:10 nagios-console nagios: SERVICE ALERT: hostname;/var disk queue;CRITICAL;SOFT;1;Connection refused by host
So, the distributed poller system checks the host and sends its
results to the console server:
May 22 15:57:30 nagios-console nagios: HOST ALERT: hostname;DOWN;SOFT;1;CRITICAL - Host Unreachable (a.b.c.d)
And then the centralized server IMMEDIATELY goes into a hard state,
which triggers a notification:
May 22 15:57:30 nagios-console nagios: HOST ALERT: hostname;DOWN;HARD;1;CRITICAL - Host Unreachable (a.b.c.d)
May 22 15:57:30 nagios-console nagios: HOST NOTIFICATION: cbensend;hostname;DOWN;host-notify-by-email-test;CRITICAL - Host Unreachable (a.b.c.d)
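(To be explicit about the mechanism: assuming the standard NSCA
setup, what the console is reacting to here is a passive host
check result dropped into its external command file, along the
lines of the following, with the epoch timestamp left out.)
[<epoch time>] PROCESS_HOST_CHECK_RESULT;hostname;1;CRITICAL - Host Unreachable (a.b.c.d)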
Um. Wat? Why would the console immediately trigger a hard
state? The config files don't support this decision. And this
IS a problem with the console server - the distributed monitors
keep retrying the host check up to 6 times, like they should. But
for some reason, the centralized console just immediately
calls it a hard state.
Definitions on the distributed monitoring host (the one running
the actual host and service checks for the host 'hostname'):
define host {
host_name hostname
alias Old production Nagios server
address a.b.c.d
action_url /pnp4nagios/graph?host=$HOSTNAME$
icon_image_alt Red Hat Linux
icon_image redhat.png
statusmap_image redhat.gd2
check_command check-host-alive
check_period 24x7
notification_period 24x7
contact_groups linux-infrastructure-admins
use linux-host-template
}
The linux-host-template on that same system:
define host {
name linux-host-template
register 0
max_check_attempts 6
check_interval 5
retry_interval 1
notification_interval 360
notification_options d,r
active_checks_enabled 1
passive_checks_enabled 1
notifications_enabled 1
check_freshness 0
check_period 24x7
notification_period 24x7
check_command check-host-alive
contact_groups linux-infrastructure-admins
}
And the check-host-alive command used to determine up or down:
define command {
command_name check-host-alive
command_line $USER1$/check_ping -H $HOSTADDRESS$ -w 5000.0,80% -c 10000.0,100% -p 5
}
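For reference, assuming the stock $USER1$ of
/usr/local/nagios/libexec, that command line expands to roughly:
/usr/local/nagios/libexec/check_ping -H a.b.c.d -w 5000.0,80% -c 10000.0,100% -p 5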
Definitions on the centralized console host (the one that notifies):
define host {
host_name hostname
alias Old production Nagios server
address a.b.c.d
action_url /pnp4nagios/graph?host=$HOSTNAME$
icon_image_alt Red Hat Linux
icon_image redhat.png
statusmap_image redhat.gd2
check_command check-host-alive
check_period 24x7
notification_period 24x7
contact_groups linux-infrastructure-admins
use linux-host-template,Default_monitor_server
}
The "Default monitor server" template on the centralized server:
define host {
name Default_monitor_server
register 0
active_checks_enabled 0
passive_checks_enabled 1
notifications_enabled 1
check_freshness 0
freshness_threshold 86400
}
And the linux-host-template template on that same centralized host:
define host {
name linux-host-template
register 0
max_check_attempts 6
check_interval 5
retry_interval 1
notification_interval 360
notification_options d,r
active_checks_enabled 1
passive_checks_enabled 1
notifications_enabled 1
check_freshness 0
check_period 24x7
notification_period 24x7
check_command check-host-alive
contact_groups linux-infrastructure-admins
}
This is causing some real problems:
1) If a single host polling cycle has a blip, it notifies
IMMEDIATELY.
2) Because it notifies immediately, it ignores host dependencies.
So when a WAN link goes down, for example, it fires off
notifications for *all* hosts at that site as fast as it can,
when it should be retrying and then walking the dependency
tree (a rough example of those dependency definitions follows
this list).
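For reference, the dependencies I'm talking about are plain
hostdependency objects, roughly like this (host names invented
for the example):
define hostdependency {
# don't notify about the dependent host while the "master" host
# is DOWN or UNREACHABLE
host_name site-wan-router
dependent_host_name host-behind-wan-router
notification_failure_criteria d,u
}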
I do have translate_passive_host_checks=1 on the centralized
monitor, but the way I understand it, that shouldn't affect a
state going from SOFT to HARD. Am I misinterpreting this?
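For completeness, here is my reading of the two nagios.cfg
directives that look relevant on the console (paraphrased from
memory, so please correct me if I've misread either):
# translate_passive_host_checks - whether DOWN/UNREACHABLE in
# passive host check results get translated according to this
# server's view of the host topology (I have it set to 1, as
# mentioned above)
#
# passive_host_checks_are_soft - whether passive host check
# results are treated as SOFT or HARD state changes (I believe
# the default treats them as HARD)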
Another variable - I'm using NConf for the configuration management,
and it does some templating tricks to help with the distributed
monitoring setup. But all it does is generate config files, and I
don't see any evidence in the configs as to why this would be
happening.
Any help would be greatly appreciated!
Benny
--
"The very existence of flamethrowers proves that sometime, somewhere,
someone said to themselves, 'You know, I want to set those people
over there on fire, but I'm just not close enough to get the job
done.'" -- George Carlin