Detecting partial outages
Andreas Ericsson
ae at op5.se
Mon Aug 27 14:14:31 CEST 2007
David Barrett wrote:
> Is there any way to configure Nagios to detect and ignore partial outages?
>
> Specifically, I have multiple datacenters for my production service, and
> then two separate locations from which I do monitoring. It's very rare that
> any of the production datacenters goes down, but it does happen on occasion
> where one of the datacenters becomes inaccessible from only *one* of the
> monitoring stations.
>
> (In other words, the datacenter is up and running fine, and appears
> accessible by real users, but looks down to one of my monitoring stations.)
>
> Is there any way to configure Nagios to detect this sort of "partial outage"
> condition and ignore it? I only want to be notified if it's reported down
> by *both* monitoring stations.
>
If each production datacenter hosts its own nagios server, there's no way
you can accomplish this, so I'll assume your two nagios servers can still
communicate even when either data-center is down.
The best solution would be to have a neb-module that communicates
check-results between the two nagios-servers. When a notification is about
to be sent, have the same neb-module check the status on the other
nagios-server and block the notification if either one reports the host as
up. This way you'll get an additional minor delay before receiving a
notification, but since you can force a check on either nagios from within
the module whenever the other server reports a failure, it should be a very
minimal one.
Hacking up such a module should take about a week, assuming whoever does
the work is well-versed in C and has a decent grasp of nagios' internals.
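
The decision such a module has to make can be sketched roughly as follows.
This is plain Python for illustration only, not a neb-module (a real one
would be written in C against nagios' internals). The function name, the
two callbacks and the use of None for "no fresh result from the peer" are
all hypothetical.

```python
def should_block_notification(local_state, get_peer_state, force_peer_check):
    """Return True when the notification should be suppressed.

    local_state      -- state seen by this server ("UP" or "DOWN")
    get_peer_state   -- hypothetical callback returning the peer's last
                        known state, or None if it has no fresh result
    force_peer_check -- hypothetical callback that makes the peer run the
                        check right now and returns the resulting state
    """
    # If this server sees the host as up, there is nothing to notify about.
    if local_state == "UP":
        return True

    peer_state = get_peer_state()
    if peer_state is None:
        # Assumption: a missing peer result is resolved by forcing a check,
        # which is what keeps the added delay small.
        peer_state = force_peer_check()

    # Block if either server reports the host as up.
    return peer_state == "UP"
```

With this rule a notification only goes out when both servers agree the
host is down.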
A second option is to let an event-handler report the check results to the
other server, add them to a list of some sort (database, flat-file,
whatever), and then modify your notification script to only actually
send notifications when both servers report something as down.
Assuming you use "notification_interval 0" for all your hosts and
services, only the server that does the last check of whatever
host/service it should report on will send a notification. This shouldn't
take much more than a day to hack up, but is less elegant. With a shared
network-capable database it shouldn't be too much trouble though.
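
A minimal sketch of that second approach, in Python, might look like the
following. Everything here is an assumption made for illustration: the
table layout, the server labels, the max_age cut-off and the choice of
sqlite (used only so the example is self-contained; a shared
network-capable database would take its place in practice).

```python
import sqlite3
import time

# Hypothetical schema and names; nothing here is defined by nagios itself.

def open_store(path=":memory:"):
    """Open the result store and create the table if it is missing."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        " server TEXT, object TEXT, state TEXT, seen REAL,"
        " PRIMARY KEY (server, object))"
    )
    return db

def record_result(db, server, obj, state, now=None):
    """Event-handler side: store the latest state `server` saw for `obj`."""
    now = time.time() if now is None else now
    db.execute(
        "INSERT OR REPLACE INTO results VALUES (?, ?, ?, ?)",
        (server, obj, state, now),
    )
    db.commit()

def should_notify(db, obj, servers, max_age=600, now=None):
    """Notification-script side: True only if every server listed has a
    fresh result for `obj` and none of them is UP/OK."""
    now = time.time() if now is None else now
    rows = {
        server: (state, seen)
        for server, state, seen in db.execute(
            "SELECT server, state, seen FROM results WHERE object = ?", (obj,)
        )
    }
    for server in servers:
        if server not in rows:
            return False  # no report from this server: do not notify
        state, seen = rows[server]
        if now - seen > max_age or state in ("UP", "OK"):
            return False  # stale result, or this server sees it as fine
    return True
```

The event-handler on each server would call record_result() with its own
label, and the notification script would call should_notify() before
sending anything. Treating a missing or stale result as "do not notify" is
a design choice of this sketch; it suppresses alerts when one monitoring
station stops reporting altogether, which may or may not be what you want.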
There are more options, but those are the two elegant ones I can think of.
--
Andreas Ericsson andreas.ericsson at op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231
_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users