Nagios v3.5.0 transitioning immediately to a HARD state upon host problem
Andreas Ericsson
ae at op5.se
Fri May 24 09:42:37 CEST 2013
On 2013-05-23 17:43, C. Bensend wrote:
>
> Hey folks,
>
> I recently made two major changes to my Nagios environment:
>
> 1) I upgraded to v3.5.0.
> 2) I moved from a single server to two pollers sending passive
> results to one central console server.
>
> Now, this new distributed system was in place for several months
> while I tested, and it worked fine. HOWEVER, since this was running
> in parallel with my production system, notifications were disabled.
> Hence, I didn't see this problem until I cut over for real and
> enabled notifications.
>
> (please excuse any cut-n-paste ugliness, had to send this info from
> my work account via Outlook and then try to cleanse and reformat
> via Squirrelmail)
>
> As a test and to capture information, I rebooted 'hostname'. This
> log is from the nagios-console host, which is the host that accepts
> the passive check results and sends notifications. Here is the
> console host receiving a service check failure when the host is
> restarting:
>
> May 22 15:57:10 nagios-console nagios: SERVICE ALERT: hostname;/var disk queue;CRITICAL;SOFT;1;Connection refused by host
>
>
> So, the distributed poller system checks the host and sends its
> results to the console server:
>
> May 22 15:57:30 nagios-console nagios: HOST ALERT: hostname;DOWN;SOFT;1;CRITICAL - Host Unreachable (a.b.c.d)
>
>
> And then the centralized server IMMEDIATELY goes into a hard state,
> which triggers a notification:
>
> May 22 15:57:30 nagios-console nagios: HOST ALERT: hostname;DOWN;HARD;1;CRITICAL - Host Unreachable (a.b.c.d)
> May 22 15:57:30 nagios-console nagios: HOST NOTIFICATION: cbensend;hostname;DOWN;host-notify-by-email-test;CRITICAL - Host Unreachable (a.b.c.d)
>
>
> Um. Wat? Why would the console immediately trigger a hard
> state? The config files don't support this decision. And this
> IS a problem with the console server - the distributed monitors
> continue checking the host 6 times, as they should. But
> for some reason, the centralized console just immediately
> calls it a hard state.
>
> Definitions on the distributed monitoring host (the one running
> the actual host and service checks for this host 'hostname'):
>
> define host {
>     host_name            hostname
>     alias                Old production Nagios server
>     address              a.b.c.d
>     action_url           /pnp4nagios/graph?host=$HOSTNAME$
>     icon_image_alt       Red Hat Linux
>     icon_image           redhat.png
>     statusmap_image      redhat.gd2
>     check_command        check-host-alive
>     check_period         24x7
>     notification_period  24x7
>     contact_groups       linux-infrastructure-admins
>     use                  linux-host-template
> }
>
> The linux-host-template on that same system:
>
> define host {
>     name                   linux-host-template
>     register               0
>     max_check_attempts     6
>     check_interval         5
>     retry_interval         1
>     notification_interval  360
>     notification_options   d,r
>     active_checks_enabled  1
>     passive_checks_enabled 1
>     notifications_enabled  1
>     check_freshness        0
>     check_period           24x7
>     notification_period    24x7
>     check_command          check-host-alive
>     contact_groups         linux-infrastructure-admins
> }
>
> And said command to determine up or down:
>
> define command {
>     command_name  check-host-alive
>     command_line  $USER1$/check_ping -H $HOSTADDRESS$ -w 5000.0,80% -c 10000.0,100% -p 5
> }
>
>
> Definitions on the centralized console host (the one that notifies):
>
> define host {
>     host_name            hostname
>     alias                Old production Nagios server
>     address              a.b.c.d
>     action_url           /pnp4nagios/graph?host=$HOSTNAME$
>     icon_image_alt       Red Hat Linux
>     icon_image           redhat.png
>     statusmap_image      redhat.gd2
>     check_command        check-host-alive
>     check_period         24x7
>     notification_period  24x7
>     contact_groups       linux-infrastructure-admins
>     use                  linux-host-template,Default_monitor_server
> }
>
> The "Default monitor server" template on the centralized server:
>
> define host {
>     name                   Default_monitor_server
>     register               0
>     active_checks_enabled  0
>     passive_checks_enabled 1
>     notifications_enabled  1
>     check_freshness        0
>     freshness_threshold    86400
> }
>
> And the linux-host-template template on that same centralized host:
>
> define host {
>     name                   linux-host-template
>     register               0
>     max_check_attempts     6
>     check_interval         5
>     retry_interval         1
>     notification_interval  360
>     notification_options   d,r
>     active_checks_enabled  1
>     passive_checks_enabled 1
>     notifications_enabled  1
>     check_freshness        0
>     check_period           24x7
>     notification_period    24x7
>     check_command          check-host-alive
>     contact_groups         linux-infrastructure-admins
> }
>
>
> This is causing some real problems:
>
> 1) If a single host polling cycle has a blip, it notifies
> IMMEDIATELY.
> 2) Because it notifies immediately, it ignores host dependencies.
> So, when a WAN link goes down for example, it fires off
> notifications for *all* hosts at that site as fast as it can,
> when it should be retrying, and then walking the dependency tree.
>
> I do have translate_passive_host_checks=1 on the centralized
> monitor, but the way I understand it, that shouldn't affect a
> state going from SOFT to HARD. Am I misinterpreting this?
>
> Another variable - I'm using NConf for the configuration management,
> and it does some templating tricks to help with the distributed
> monitoring setup. But, all it does is generate config files, and I
> don't see any evidence in the configs as to why this would be
> happening.
>
> Any help would be greatly appreciated!
>
Set passive_host_checks_are_soft=1 in nagios.cfg on your master
server and things should start working as intended.
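
For illustration only, the relevant lines in nagios.cfg on the central
console might look roughly like the sketch below. It assumes a stock
Nagios 3.x main config file; the translate_passive_host_checks value is
the one quoted above, and the comments are explanatory notes rather than
anything taken from the poster's actual file. By default (value 0),
Nagios treats passive host check results as HARD states, which would
match the immediate SOFT-to-HARD jump in the logs.

```
# nagios.cfg on the central (notifying) server -- illustrative sketch

# Already set by the original poster, left unchanged here.
translate_passive_host_checks=1

# Default is 0: passive host check results are treated as HARD.
# Setting it to 1 should make them SOFT, so that max_check_attempts
# (6 in the linux-host-template above) is honoured before notifying.
passive_host_checks_are_soft=1
```

Nagios would need a restart (or reload) on the central server for the
change to take effect.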
--
Andreas Ericsson andreas.ericsson at op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231
Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.
_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users