Host down, still doing active checks, causing multiple unwanted service failures

Marc Powell marc at ena.com
Mon Dec 8 19:31:12 CET 2008


On Dec 8, 2008, at 11:38 AM, Toussaint OTTAVI wrote:

> Hi list,
>
> I've been investigating this problem for a while, but I couldn't  
> find a good solution.
>
> * Example situation :
> Assume I have one host with 20 service checks.
>
> * Problem :
> If the host becomes DOWN, Nagios still continues to do service  
> checks on this host. So, after a while, all the services will go to  
> a CRITICAL state. Then, in my console, I will see :
>   - 1 Host down,
>   - 20 Services down
> This information is not pertinent. The only information I would like to
> see in such a case is the "host down". The 20 "service down"
> entries are obvious, and generate "visual pollution" that may
> make it harder to identify the problem.

Nagios is first and foremost a service monitor, not a host monitor.
Host monitoring is only necessary, as far as Nagios is concerned, for
two reasons:
	- notification suppression. If the host is down, don't notify about
the services. They're still down, so show them down, but don't wake
anybody up over it if they're not also responsible for the host.
	- parenting/unreachable logic.

Nagios is designed to show the current state of services as accurately
as possible. That is the 'why' of the behavior you are seeing, and it
covers the edge cases that your approach won't catch. For example, if
your host check is a ping and something borks ICMP on your network, you
would have all the services on that host disabled and set to unknown,
even though they are working just fine. Your understanding of exactly
what is impacted on that host is now completely wrong. By artificially
changing the service state, you also make your reporting unreliable.
You may be fine with that, but understand that your goal is the
opposite of what Nagios is meant to do.

> * Expected behavior :
> When a host is down, I would like to :
> - See only one thing in red in the console : 1 HOST DOWN
> - Disable all the service checks (which at this point do not have
> any chance of success)
> - Put the services into "UNKNOWN" status

This kind of methodology is just about the opposite of what Nagios is
designed to do. While you may be able to do it with creative event
handlers and modifications to your notification scripts, it's a
square-peg-in-round-hole task. Instead of disabling the service checks,
you may be able to use adaptive monitoring to change the service
check_commands to something that always returns UNKNOWN (e.g.
check_dummy). This of course assumes two things: that you are using
regularly scheduled host checks (otherwise Nagios would never check
your host state again), and that you're able to glean what the current
check_command is for each service. When the host recovers, change the
check_command back to whatever it was before for each service.
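
A rough sketch of the adaptive-monitoring idea above, in Python, meant
to be called from a host event handler. It only builds and writes
CHANGE_SVC_CHECK_COMMAND external commands. Several things here are
assumptions, not part of the original advice: external commands are
enabled, a command definition named check_dummy exists and takes the
return code and message as arguments, the saved service-to-command
mapping is obtained by some site-specific means, and the command file
path is whatever your nagios.cfg says. The function names are made up
for illustration.

```python
import time


def change_command_lines(host, commands, now=None):
    """Format one CHANGE_SVC_CHECK_COMMAND line per service.

    commands maps service_description -> check_command string.
    """
    ts = int(time.time() if now is None else now)
    return [
        f"[{ts}] CHANGE_SVC_CHECK_COMMAND;{host};{service};{command}\n"
        for service, command in commands.items()
    ]


def lines_for_host_state(host, state, original_commands,
                         dummy="check_dummy!3!Host is down", now=None):
    """Return the external commands to send for a host state change.

    original_commands is the saved service -> check_command mapping.
    How it is obtained is site-specific and not shown here.
    The dummy default assumes a check_dummy command definition exists.
    """
    if state == "DOWN":
        # Point every service at the placeholder that reports UNKNOWN (3).
        swapped = {service: dummy for service in original_commands}
        return change_command_lines(host, swapped, now)
    if state == "UP":
        # Host recovered: restore each service's real check_command.
        return change_command_lines(host, original_commands, now)
    return []  # e.g. UNREACHABLE: leave the services alone


def submit(lines, command_file):
    """Write the lines to the Nagios external command file (a named pipe)."""
    with open(command_file, "w") as pipe:
        pipe.writelines(lines)
```

An event handler would pass in the host name and $HOSTSTATE$, then hand
the result to submit() along with the command file path. Acting only on
HARD state changes is probably wise, so a single missed ping does not
swap every check.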

--
Marc

