Nagios v3.5.0 transitioning immediately to a HARD state upon host problem
Doug Eubanks
admin at dougware.net
Thu May 23 21:04:52 CEST 2013
I ran into a similar problem, because my template set the service to
"is_volatile=1".
http://nagios.sourceforge.net/docs/3_0/volatileservices.html
Check to see if you have this flag enabled.
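For reference, here is roughly what that flag looks like in a service definition (a minimal sketch; the template and command names are made up):

```cfg
define service {
    use                  generic-service     ; hypothetical template
    host_name            hostname
    service_description  /var disk queue
    check_command        check_disk_queue    ; hypothetical command
    is_volatile          1   ; each problem result is handled as if the
                             ; service just entered a HARD state (logged,
                             ; notified, event handler run)
}
```

See the volatile-services documentation linked above for the exact semantics.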
Doug
Sincerely,
Doug Eubanks
admin at dougware.net
K1DUG
(919) 201-8750
On Thu, May 23, 2013 at 11:43 AM, C. Bensend <benny at bennyvision.com> wrote:
>
> Hey folks,
>
> I recently made two major changes to my Nagios environment:
>
> 1) I upgraded to v3.5.0.
> 2) I moved from a single server to two pollers sending passive
> results to one central console server.
>
> Now, this new distributed system was in place for several months
> while I tested, and it worked fine. HOWEVER, since this was running
> in parallel with my production system, notifications were disabled.
> Hence, I didn't see this problem until I cut over for real and
> enabled notifications.
>
> (please excuse any cut-n-paste ugliness, had to send this info from
> my work account via Outlook and then try to cleanse and reformat
> via Squirrelmail)
>
> As a test and to capture information, I rebooted 'hostname'. This
> log is from the nagios-console host, which is the host that accepts
> the passive check results and sends notifications. Here is the
> console host receiving a service check failure when the host is
> restarting:
>
> May 22 15:57:10 nagios-console nagios: SERVICE ALERT: hostname;/var disk
> queue;CRITICAL;SOFT;1;Connection refused by host
>
>
> So, the distributed poller system checks the host and sends its
> results to the console server:
>
> May 22 15:57:30 nagios-console nagios: HOST ALERT:
> hostname;DOWN;SOFT;1;CRITICAL - Host Unreachable (a.b.c.d)
>
>
> And then the centralized server IMMEDIATELY goes into a hard state,
> which triggers a notification:
>
> May 22 15:57:30 nagios-console nagios: HOST ALERT:
> hostname;DOWN;HARD;1;CRITICAL - Host Unreachable (a.b.c.d)
> May 22 15:57:30 nagios-console nagios: HOST NOTIFICATION:
> cbensend;hostname;DOWN;host-notify-by-email-test;CRITICAL -
> Host Unreachable (a.b.c.d)
>
>
> Um. Wat? Why would the console immediately trigger a hard
> state? The config files don't support this decision. And this
> IS a problem with the console server - the distributed monitors
> continue checking the host 6 times like they should. But
> for some reason, the centralized console just immediately
> calls it a hard state.
>
> Definitions on the distributed monitoring host (the one running
> the actual host and service checks for host 'hostname'):
>
> define host {
> host_name hostname
> alias Old production Nagios server
> address a.b.c.d
> action_url /pnp4nagios/graph?host=$HOSTNAME$
> icon_image_alt Red Hat Linux
> icon_image redhat.png
> statusmap_image redhat.gd2
> check_command check-host-alive
> check_period 24x7
> notification_period 24x7
> contact_groups linux-infrastructure-admins
> use linux-host-template
> }
>
> The linux-host-template on that same system:
>
> define host {
> name linux-host-template
> register 0
> max_check_attempts 6
> check_interval 5
> retry_interval 1
> notification_interval 360
> notification_options d,r
> active_checks_enabled 1
> passive_checks_enabled 1
> notifications_enabled 1
> check_freshness 0
> check_period 24x7
> notification_period 24x7
> check_command check-host-alive
> contact_groups linux-infrastructure-admins
> }
>
> And said command to determine up or down:
>
> define command {
> command_name check-host-alive
> command_line $USER1$/check_ping -H $HOSTADDRESS$ -w 5000.0,80% -c 10000.0,100% -p 5
> }
>
>
> Definitions on the centralized console host (the one that notifies):
>
> define host {
> host_name hostname
> alias Old production Nagios server
> address a.b.c.d
> action_url /pnp4nagios/graph?host=$HOSTNAME$
> icon_image_alt Red Hat Linux
> icon_image redhat.png
> statusmap_image redhat.gd2
> check_command check-host-alive
> check_period 24x7
> notification_period 24x7
> contact_groups linux-infrastructure-admins
> use linux-host-template,Default_monitor_server
> }
>
> The "Default_monitor_server" template on the centralized server:
>
> define host {
> name Default_monitor_server
> register 0
> active_checks_enabled 0
> passive_checks_enabled 1
> notifications_enabled 1
> check_freshness 0
> freshness_threshold 86400
> }
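An aside on the template just above: it sets freshness_threshold but leaves check_freshness at 0, so the threshold is never evaluated. A common pattern for passive-only hosts in a distributed setup (a sketch, with an illustrative stale-check command name) is:

```cfg
define host {
    name                    Default_monitor_server
    register                0
    active_checks_enabled   0
    passive_checks_enabled  1
    notifications_enabled   1
    check_freshness         1        ; actually enforce the threshold below
    freshness_threshold     86400    ; seconds allowed between passive results
    check_command           check-stale-host  ; illustrative: run locally
                                              ; when results go stale
}
```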
>
> And the linux-host-template template on that same centralized host:
>
> define host {
> name linux-host-template
> register 0
> max_check_attempts 6
> check_interval 5
> retry_interval 1
> notification_interval 360
> notification_options d,r
> active_checks_enabled 1
> passive_checks_enabled 1
> notifications_enabled 1
> check_freshness 0
> check_period 24x7
> notification_period 24x7
> check_command check-host-alive
> contact_groups linux-infrastructure-admins
> }
>
>
> This is causing some real problems:
>
> 1) If a single host polling cycle has a blip, it notifies
> IMMEDIATELY.
> 2) Because it notifies immediately, it ignores host dependencies.
> So, when a WAN link goes down for example, it fires off
> notifications for *all* hosts at that site as fast as it can,
> when it should be retrying, and then walking the dependency tree.
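For context on point 2, the dependency suppression described here comes from hostdependency objects like the sketch below (host names are illustrative). Nagios consults these at notification time, but only a parent that has reached a HARD DOWN/UNREACHABLE state through its normal retry cycle suppresses the children:

```cfg
define hostdependency {
    host_name                     wan-router      ; illustrative parent
    dependent_host_name           branch-host-1   ; host behind the WAN link
    notification_failure_criteria d,u   ; suppress child notifications while
                                        ; the parent is DOWN or UNREACHABLE
}
```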
>
> I do have translate_passive_host_checks=1 on the centralized
> monitor, but the way I understand it, that shouldn't affect a
> state going from SOFT to HARD. Am I misinterpreting this?
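For what it's worth, translate_passive_host_checks only controls how DOWN/UNREACHABLE results from the pollers are rewritten on the central server; the nagios.cfg directive that governs SOFT vs. HARD for passive host results is passive_host_checks_are_soft. A sketch of the two settings together:

```cfg
# Rewrite DOWN/UNREACHABLE results from pollers relative to the
# console's own view of the network topology.
translate_passive_host_checks=1

# Default is 0: passive host check results are treated as HARD states,
# bypassing the max_check_attempts retry cycle. Set to 1 to have the
# console treat them as SOFT and run the normal retry logic.
passive_host_checks_are_soft=1
```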
>
> Another variable - I'm using NConf for the configuration management,
> and it does some templating tricks to help with the distributed
> monitoring setup. But, all it does is generate config files, and I
> don't see any evidence in the configs as to why this would be
> happening.
>
> Any help would be greatly appreciated!
>
> Benny
>
>
> --
> "The very existence of flamethrowers proves that sometime, somewhere,
> someone said to themselves, 'You know, I want to set those people
> over there on fire, but I'm just not close enough to get the job
> done.'" -- George Carlin
>
>
>
>
>
>
> ------------------------------------------------------------------------------
> Try New Relic Now & We'll Send You this Cool Shirt
> New Relic is the only SaaS-based application performance monitoring service
> that delivers powerful full stack analytics. Optimize and monitor your
> browser, app, & servers with just a few lines of code. Try New Relic
> and get this awesome Nerd Life shirt! http://p.sf.net/sfu/newrelic_d2d_may
> _______________________________________________
> Nagios-users mailing list
> Nagios-users at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-users
> ::: Please include Nagios version, plugin version (-v) and OS when
> reporting any issue.
> ::: Messages without supporting info will risk being sent to /dev/null
>