Host down, still doing active checks, causing multiple unwanted service failures
Toussaint OTTAVI
t.ottavi at medi.fr
Tue Dec 9 12:35:15 CET 2008
Hi Marc, thank you for your answer,
Marc Powell wrote:
> Nagios is first and foremost a service monitor, not a host monitor.
> Host monitoring is only necessary, as far as nagios is concerned, for
> two reasons --
> - notification suppression. If the host is down, don't notify about
> the services. They're still down so show them down, but don't wake
> anybody up over it if they're not also responsible for the host.
> - parenting/unreachable logic.
>
I agree with you. Parenting / unreachable logic is a very good thing.
But I think it should be possible to declare a service as a child of its
host. This parent/child logic could suppress notifications, and I think it
could also suppress the display of inaccurate status in the web console.
We do not use email notifications, because there are only two of us and
they would generate too many messages. We periodically check the web
console, and on all our PCs we use small plugins for Firefox and Windows
that display the list of errors/warnings in a small popup. When a host is
down, we get pages of errors, one for each failed service, when we would
like to have just one. It would be useful for us if the parent/child
notification suppression mechanism could also suppress these unwanted
displays.
> Nagios is designed to show the current state of services as accurately
> as possible. This helps explain the 'why' of the behavior you are
> seeing and works very well to cover the edge cases that your goal
> won't catch. For example, if your host check is a ping and something
> borks ICMP on your network, you would have all the services on that
> host disabled and set to unknown, even though they are working just
> fine.
That's not what happens here. Most of the monitored hosts are located
across WAN links. These links, at least those from my office, are used
only for remote control and remote administration, so they are built with
cheap technologies that are not intended to be highly reliable. When a
host stops answering pings, it usually means the WAN link is down. The
fix is usually to reboot a router or reset a VPN tunnel. But in the
meantime, it makes no sense for me to send hundreds of checks through
this WAN, because they will fail. And it is not useful to be told that
the services are in a failed state. They may be working fine, but the
service checks have no chance of success because of the WAN failure.
So what I would expect as the service status is "UNKNOWN", the same way
a child host becomes "UNREACHABLE" when its parent is down.
> Your understanding of exactly what is impacted on that host is
> now completely wrong. By artificially changing the service state, your
> reporting is no longer reliable as well. You may be fine with that but
> understand that your goal is opposite of what nagios is meant to do.
>
In my configuration, WAN failures occur far more often than a general
crash of a host that takes many services down. I agree with you: when
the WAN is down, my understanding of exactly what is impacted on the
host is completely wrong. Nagios says all the services are down, when it
should say, in my opinion, that it could not determine the status of the
services.
Moreover, plugins from various sources behave differently when the host
is unreachable. Some plugins return UNKNOWN, which may be the most
accurate result in such a situation. But some plugins return CRITICAL,
and others return WARNING. This adds even more confusion to the console,
where it may not be easy to find the original problem.
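To make the expected behaviour concrete, here is a minimal sketch of the
mapping I have in mind. It only relies on the standard plugin exit codes
(0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function name
effective_state is hypothetical, and this is not something Nagios does by
itself, only an illustration of the idea.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def effective_state(host_reachable: bool, plugin_exit_code: int) -> int:
    """Return the state I would like to see for a service.

    Hypothetical helper: if the host cannot be reached, whatever the
    plugin returned is replaced by UNKNOWN.
    """
    if not host_reachable:
        return UNKNOWN
    if plugin_exit_code in (OK, WARNING, CRITICAL, UNKNOWN):
        return plugin_exit_code
    # Treating out-of-range codes as UNKNOWN is this sketch's own choice.
    return UNKNOWN
```

With such a mapping, a plugin that returns CRITICAL or WARNING over a dead
WAN link would show up the same way as one that returns UNKNOWN.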
> Instead of disabling the service checks, you
> may be able to use adaptive monitoring to change the service
> check_commands to something that always returns UNKNOWN (i.e.
> check_dummy).
I have already thought about that. But I would have to change the
check_command of every service and, more complicated, restore all the
original check commands when the host comes back. For disabling
services, there is an external command, "DISABLE ALL SERVICE CHECKS",
that applies to a particular host, so I can disable all its service
checks in one go. To change check_commands, however, I would have to
act on every service individually, which would be a lot of work and
quite difficult to maintain! Each remote server has approximately 20
service checks, a few hundred services in total, and this is only the
beginning: the full setup would require several thousand checks, all of
them located over poor WAN links...
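As an illustration of the "one go" approach, here is a small sketch that
builds the external command lines, assuming the commands in question are
DISABLE_HOST_SVC_CHECKS and ENABLE_HOST_SVC_CHECKS. The helper names, the
host name "remote-srv1" and the command file path are assumptions; the
path depends on the command_file setting of the installation.

```python
import time
from typing import Optional


def host_svc_checks_command(host_name: str, enable: bool,
                            now: Optional[int] = None) -> str:
    """Build one external command line for all services of a host."""
    timestamp = int(time.time()) if now is None else now
    verb = "ENABLE_HOST_SVC_CHECKS" if enable else "DISABLE_HOST_SVC_CHECKS"
    # External command format: [timestamp] COMMAND;argument
    return f"[{timestamp}] {verb};{host_name}\n"


def submit(line: str,
           command_file: str = "/usr/local/nagios/var/rw/nagios.cmd") -> None:
    """Write the line to the Nagios command file (path is an assumption)."""
    with open(command_file, "w") as pipe:
        pipe.write(line)
```

For example, host_svc_checks_command("remote-srv1", enable=False) could be
submitted when the host goes down, and the same call with enable=True
when it comes back, for instance from a host event handler.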
In fact, the parent/child mechanism seems to be the right way to handle
hosts located behind WANs or routers. In my opinion, it should be
possible to treat services as children of their host. This may be a
feature request for future versions...
Following this idea, I will investigate the following:
- Hosts are linked by parent/child relationships according to the
WAN topology (already working)
- For each host, I will create a "parent" service with only a
check_alive command
- Every other service will be a child of this parent service
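The plan above could be sketched with servicedependency objects, one per
child service. The generator below is only a rough illustration: the
function name, the host name and the service descriptions are
hypothetical, and the failure criteria (w,u,c) are one possible choice.
Whether this also cleans up the console display remains to be verified,
since an execution dependency only stops the dependent checks from
running.

```python
def service_dependencies(host_name, parent_service, child_services):
    """Generate one servicedependency definition per child service."""
    blocks = []
    for child in child_services:
        blocks.append(
            "define servicedependency {\n"
            f"    host_name                      {host_name}\n"
            f"    service_description            {parent_service}\n"
            f"    dependent_host_name            {host_name}\n"
            f"    dependent_service_description  {child}\n"
            # Skip checks and notifications while the parent service is
            # WARNING, UNKNOWN or CRITICAL (assumed criteria).
            "    execution_failure_criteria     w,u,c\n"
            "    notification_failure_criteria  w,u,c\n"
            "}\n"
        )
    return "\n".join(blocks)


if __name__ == "__main__":
    # Hypothetical host and service names.
    print(service_dependencies("remote-srv1", "ALIVE", ["HTTP", "DISK"]))
```

Generating the definitions avoids writing some hundreds of them by hand,
which matters with about 20 services per remote server.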
I'll try it right now. Comments and suggestions are welcome. Am I the
only one having this problem?
Kind regards
--
*Toussaint OTTAVI*
*MEDI INFORMATIQUE*
*Mail:* t.ottavi at medi.fr