Host down, still doing active checks, causing multiple unwanted service failures
Marc Powell
marc at ena.com
Mon Dec 8 19:31:12 CET 2008
On Dec 8, 2008, at 11:38 AM, Toussaint OTTAVI wrote:
> Hi list,
>
> I've been investigating this problem for a while, but I couldn't
> find a good solution.
>
> * Example situation :
> Assume I have one host with 20 service checks.
>
> * Problem :
> If the host becomes DOWN, Nagios still continues to do service
> checks on this host. So, after a while, all the services will go to
> a CRITICAL state. Then, in my console, I will see :
> - 1 Host down,
> - 20 Services down
> This information is not useful. The only information I want to see
> in such a case is the "host down". The 20 "service down" entries
> are obvious, and they generate "visual pollution" that can make it
> hard to identify the real problem.
Nagios is first and foremost a service monitor, not a host monitor.
Host monitoring is only necessary, as far as Nagios is concerned, for
two reasons:
- notification suppression. If the host is down, don't notify about
the services. They're still down, so show them as down, but don't wake
anybody up over it if they're not also responsible for the host.
- parenting/unreachable logic.
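
To make the first point concrete, here is a toy model of the idea, not
Nagios's actual code. The function name and the plain-string states are
assumptions for illustration, and it ignores soft/hard states,
notification periods, and everything else the real notification filters
look at.

```python
def service_view(host_state, service_state):
    """Toy model: what is displayed vs. whether anyone gets notified.

    Hypothetical helper for illustration only; real Nagios applies many
    more notification filters than this.
    """
    # The displayed state is always the real result of the service check.
    shown_state = service_state
    # Notify only for a service problem on a host that is itself UP.
    notify = service_state != "OK" and host_state == "UP"
    return shown_state, notify
```

With the host DOWN, a CRITICAL service is still shown as CRITICAL, but
nobody is paged for it.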
Nagios is designed to show the current state of services as accurately
as possible. This helps explain the 'why' of the behavior you are
seeing and works very well to cover the edge cases that your goal
won't catch. For example, if your host check is a ping and something
borks ICMP on your network, you would have all the services on that
host disabled and set to unknown, even though they are working just
fine. Your understanding of exactly what is impacted on that host is
now completely wrong. By artificially changing the service state, you
also make your reporting unreliable. You may be fine with that, but
understand that your goal is the opposite of what Nagios is meant to do.
> * Expected behavior :
> When a host is down, I would like to :
> - See only one thing in red in the console : 1 HOST DOWN
> - Disable all the service checks (which at this point do not have
> any chance of success)
> - Put the services into "UNKNOWN" status
This kind of methodology is just about the opposite of what Nagios is
designed to do. While you may be able to do it with creative event
handlers and modifications to your notification scripts, it's a
square-peg-in-round-hole task. Instead of disabling the service checks,
you may be able to use adaptive monitoring to change the service
check_commands to something that always returns UNKNOWN (e.g.
check_dummy). This of course assumes that you are using regularly
scheduled host checks, since otherwise Nagios would never check your
host state again, and that you're able to glean what the current
check_command is for each service. When the host recovers, change the
check_command back to whatever it was before for each service.
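
A rough sketch of what a host event handler could do for the
check_command swap, using the CHANGE_SVC_CHECK_COMMAND external command.
Several things here are assumptions: the host and service names are made
up, "check_dummy_unknown" is a hypothetical command definition that you
would have to create yourself (wrapping check_dummy with return code 3),
the command file path depends on your install, and the original
check_commands are assumed to be known already and passed in by the
caller.

```python
import time


def change_svc_check_command(host, service, check_command, now=None):
    """Format one CHANGE_SVC_CHECK_COMMAND line for the external command file."""
    ts = int(time.time() if now is None else now)
    return f"[{ts}] CHANGE_SVC_CHECK_COMMAND;{host};{service};{check_command}\n"


def mask_services(host, services, dummy="check_dummy_unknown", now=None):
    """Lines that point every service at the (hypothetical) dummy command."""
    return [change_svc_check_command(host, s, dummy, now) for s in services]


def restore_services(host, original_commands, now=None):
    """Lines that put back the saved check_command of each service.

    original_commands maps service description -> original check_command;
    collecting that mapping is left to the caller.
    """
    return [
        change_svc_check_command(host, service, command, now)
        for service, command in original_commands.items()
    ]


def submit(lines, command_file):
    """Write the lines to the Nagios command file (a named pipe).

    The path is site-specific, so it is passed in rather than assumed.
    """
    with open(command_file, "w") as pipe:
        pipe.writelines(lines)
```

The handler would call mask_services() when the host goes into a hard
DOWN state and restore_services() on recovery, passing the result to
submit(). External commands have to be enabled in the Nagios
configuration for any of this to take effect.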
--
Marc