<div dir="ltr">I ran into a similar problem, because my template set the service to "<i style="color:rgb(0,0,0);font-family:verdana,arial,serif;font-size:11px">is_volatile=1</i><span style="color:rgb(0,0,0);font-family:verdana,arial,serif;font-size:11px">".</span><div>
<div><br></div><div><a href="http://nagios.sourceforge.net/docs/3_0/volatileservices.html">http://nagios.sourceforge.net/docs/3_0/volatileservices.html</a><br></div></div><div><br></div><div style>Check to see if you have this flag enabled.</div>
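For example, a service inheriting a template along these lines (names are
made up, just to show the shape) will log, notify, and run event handlers
on every non-OK result instead of waiting out the normal soft-state
retries:

define service {
    name                 generic-service    ; hypothetical template name
    register             0
    max_check_attempts   3
    is_volatile          1                  ; <-- this was the culprit in my case
}

Setting is_volatile back to 0 restored the usual SOFT/HARD retry behavior
for me.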

Doug

Sincerely,
Doug Eubanks
admin@dougware.net
K1DUG
(919) 201-8750
<br><br><div class="gmail_quote">On Thu, May 23, 2013 at 11:43 AM, C. Bensend <span dir="ltr"><<a href="mailto:benny@bennyvision.com" target="_blank">benny@bennyvision.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Hey folks,

I recently made two major changes to my Nagios environment:

1) I upgraded to v3.5.0.
2) I moved from a single server to two pollers sending passive
   results to one central console server.

Now, this new distributed system was in place for several months
while I tested, and it worked fine. HOWEVER, since this was running
in parallel with my production system, notifications were disabled.
Hence, I didn't see this problem until I cut over for real and
enabled notifications.

(Please excuse any cut-n-paste ugliness; I had to send this info from
my work account via Outlook and then cleanse and reformat it via
Squirrelmail.)

As a test and to capture information, I rebooted 'hostname'. The
log below is from the nagios-console host, which is the host that
accepts the passive check results and sends notifications. Here is
the console host receiving a service check failure while the host
is restarting:

May 22 15:57:10 nagios-console nagios: SERVICE ALERT: hostname;/var disk queue;CRITICAL;SOFT;1;Connection refused by host

So, the distributed poller system checks the host and sends its
results to the console server:

May 22 15:57:30 nagios-console nagios: HOST ALERT: hostname;DOWN;SOFT;1;CRITICAL - Host Unreachable (a.b.c.d)

And then the centralized server IMMEDIATELY goes into a hard state,
which triggers a notification:

May 22 15:57:30 nagios-console nagios: HOST ALERT: hostname;DOWN;HARD;1;CRITICAL - Host Unreachable (a.b.c.d)
May 22 15:57:30 nagios-console nagios: HOST NOTIFICATION: cbensend;hostname;DOWN;host-notify-by-email-test;CRITICAL - Host Unreachable (a.b.c.d)

Um. Wat? Why would the console immediately trigger a hard
state? The config files don't support this decision. And this
IS a problem with the console server - the distributed monitors
continue checking the host 6 times like they should. But for
some reason, the centralized console just immediately calls it
a hard state.

Definitions on the distributed monitoring host (the one running
the actual host and service checks for 'hostname'):

define host {
    host_name             hostname
    alias                 Old production Nagios server
    address               a.b.c.d
    action_url            /pnp4nagios/graph?host=$HOSTNAME$
    icon_image_alt        Red Hat Linux
    icon_image            redhat.png
    statusmap_image       redhat.gd2
    check_command         check-host-alive
    check_period          24x7
    notification_period   24x7
    contact_groups        linux-infrastructure-admins
    use                   linux-host-template
}

The linux-host-template on that same system:

define host {
    name                     linux-host-template
    register                 0
    max_check_attempts       6
    check_interval           5
    retry_interval           1
    notification_interval    360
    notification_options     d,r
    active_checks_enabled    1
    passive_checks_enabled   1
    notifications_enabled    1
    check_freshness          0
    check_period             24x7
    notification_period      24x7
    check_command            check-host-alive
    contact_groups           linux-infrastructure-admins
}

And said command to determine up or down:

define command {
    command_name   check-host-alive
    command_line   $USER1$/check_ping -H $HOSTADDRESS$ -w 5000.0,80% -c 10000.0,100% -p 5
}
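
For what it's worth, with the macros expanded for this host, that works
out to roughly the following (assuming $USER1$ points at the usual plugin
directory; the path is a guess, not copied from my box):

/usr/local/nagios/libexec/check_ping -H a.b.c.d -w 5000.0,80% -c 10000.0,100% -p 5

So the poller only calls the host DOWN on 100% packet loss or 10-second
round trips - nothing exotic.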

Definitions on the centralized console host (the one that notifies):

define host {
    host_name             hostname
    alias                 Old production Nagios server
    address               a.b.c.d
    action_url            /pnp4nagios/graph?host=$HOSTNAME$
    icon_image_alt        Red Hat Linux
    icon_image            redhat.png
    statusmap_image       redhat.gd2
    check_command         check-host-alive
    check_period          24x7
    notification_period   24x7
    contact_groups        linux-infrastructure-admins
    use                   linux-host-template,Default_monitor_server
}

The "Default_monitor_server" template on the centralized server:

define host {
    name                     Default_monitor_server
    register                 0
    active_checks_enabled    0
    passive_checks_enabled   1
    notifications_enabled    1
    check_freshness          0
    freshness_threshold      86400
}

And the linux-host-template template on that same centralized host:

define host {
    name                     linux-host-template
    register                 0
    max_check_attempts       6
    check_interval           5
    retry_interval           1
    notification_interval    360
    notification_options     d,r
    active_checks_enabled    1
    passive_checks_enabled   1
    notifications_enabled    1
    check_freshness          0
    check_period             24x7
    notification_period      24x7
    check_command            check-host-alive
    contact_groups           linux-infrastructure-admins
}

This is causing some real problems:

1) If a single host polling cycle has a blip, it notifies
   IMMEDIATELY.
2) Because it notifies immediately, it ignores host dependencies.
   So when a WAN link goes down, for example, it fires off
   notifications for *all* hosts at that site as fast as it can,
   when it should be retrying and then walking the dependency tree.
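
For reference, the dependencies are plain hostdependency objects,
something like this (host names here are made up for illustration):

define hostdependency {
    host_name                       site-wan-router    ; hypothetical master
    dependent_host_name             site-host-01       ; hypothetical dependent
    notification_failure_criteria   d,u                ; mute while the router is DOWN/UNREACHABLE
}

Since dependencies are only consulted at notification time, a host that
jumps straight to a HARD DOWN before the router itself is marked DOWN
gets notified anyway.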

I do have translate_passive_host_checks=1 on the centralized
monitor, but the way I understand it, that shouldn't affect a
state going from SOFT to HARD. Am I misinterpreting this?
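
For completeness, these are the passive-check knobs I know of in
nagios.cfg on the console (the second value is the documented default;
I haven't verified whether it plays a role here):

# nagios.cfg on the central console (illustrative)
# rewrite passive DOWN/UNREACHABLE results from this server's point of view:
translate_passive_host_checks=1
# documented default: passive host check results are applied as HARD states:
passive_host_checks_are_soft=0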
<br>
Another variable - I'm using NConf for the configuration management,<br>
and it does some templating tricks to help with the distributed<br>
monitoring setup. But, all it does is generate config files, and I<br>
don't see any evidence in the configs as to why this would be<br>
happening.<br>
<br>
Any help would be greatly appreciated!<br>
<br>
Benny<br>

--
"The very existence of flamethrowers proves that sometime, somewhere,
someone said to themselves, 'You know, I want to set those people
over there on fire, but I'm just not close enough to get the job
done.'" -- George Carlin