Detecting partial outages

Andrew Cruse andrew at profitability.net
Fri Aug 24 15:39:12 CEST 2007


David Barrett wrote:
> Is there any way to configure Nagios to detect and ignore partial
> outages? 
> 
> Specifically, I have multiple datacenters for my production service,
> and then two separate locations from which I do monitoring.  It's
> very rare that any of the production datacenters goes down, but it
> does happen on occasion where one of the datacenters becomes
> inaccessible from only *one* of the monitoring stations.
> 
> (In other words, the datacenter is up and running fine, and appears
> accessible by real users, but looks down to one of my monitoring
> stations.) 
> 
> Is there any way to configure Nagios to detect this sort of "partial
> outage" condition and ignore it?  I only want to be notified if it's
> reported down by *both* monitoring stations.

This is a great question, and I was hoping others would respond with ideas.
I've been looking to do something similar myself, but haven't implemented
anything yet.  I have a few half-baked ideas I've been tossing around:

1.  Set up NRPE on each of your monitoring stations.  Then, on your "main"
Nagios installation, write a small wrapper for the check_nrpe plugin that
runs it once for each monitoring station, compiles the results, and then
alerts based on the number of CRITICALs returned by your monitoring
stations.  The problem here is that if your main server is down, nothing
gets checked or alerted on.
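
A rough sketch of what that wrapper might look like in Python.  The
check_nrpe path, the station names, and the choice to report a partial
outage as WARNING are all my assumptions, not something I have running;
only the plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and
check_nrpe's -H and -c options are standard.

```python
import subprocess

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Assumed install path; adjust for your system.
CHECK_NRPE = "/usr/local/nagios/libexec/check_nrpe"


def run_remote_check(station, command, timeout=30):
    """Ask one monitoring station (via NRPE) to run `command`.

    Returns the plugin exit code, or UNKNOWN if the station could
    not be asked at all.
    """
    try:
        proc = subprocess.run(
            [CHECK_NRPE, "-H", station, "-c", command],
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return UNKNOWN
    if proc.returncode in (OK, WARNING, CRITICAL):
        return proc.returncode
    return UNKNOWN


def combine(codes, min_critical=None):
    """Merge per-station results into one plugin exit code.

    CRITICAL only if at least `min_critical` stations say CRITICAL
    (default: all of them).  A partial outage is reported as WARNING
    here (my assumption); return OK instead to ignore it entirely.
    """
    codes = list(codes)
    if not codes:
        return UNKNOWN
    if min_critical is None:
        min_critical = len(codes)
    criticals = codes.count(CRITICAL)
    if criticals >= min_critical:
        return CRITICAL
    if criticals > 0:
        return WARNING
    if UNKNOWN in codes:
        return UNKNOWN
    return max(codes)


def main(stations, command):
    """Check `command` from every station and return the merged code."""
    return combine(run_remote_check(s, command) for s in stations)
```

To use it as a plugin you would end the file with something like
sys.exit(main(["mon-a", "mon-b"], "check_dc1_http")), where the station
names and the NRPE command name are made up for illustration.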

2.  Use event handlers.  Disable notifications for the relevant
hosts/services, but have an event handler that, when a service goes
critical, checks with your other Nagios installations on the status of the
same service.  There are a number of ways that could be accomplished.  Then,
if the service is seen as down from all your Nagios installations, alert.
The tricky part there is how to have only one Nagios installation send that
alert rather than all of them.  If you designate one of them as your only
alerting station, then you run into the same problem as in my first idea.
There would need to be some way for your various Nagios installations to
communicate with each other and decide who would send out the alert.  
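
One possible way to settle "who alerts" without a designated station is a
simple deterministic rule: every installation gathers the same set of
answers, and only the one whose name sorts first among those that answered
sends the alert.  This is just one idea I'm substituting in, not a Nagios
feature.  The station names are hypothetical, and how you collect the
peers' views (NRPE, a status page, etc.) is left out on purpose.

```python
def should_alert(me, views):
    """Decide whether this installation should send the alert.

    `me` is this installation's name.  `views` maps every installation
    name (including `me`) to the state it sees for the service:
    "OK", "WARNING", "CRITICAL", "UNKNOWN", or None if that peer could
    not be queried.
    """
    # Only peers that actually answered take part in the decision.
    answered = {name: state for name, state in views.items()
                if state is not None}

    # We must see the service as down ourselves.
    if answered.get(me) != "CRITICAL":
        return False

    # Anyone who answered and does not see it down makes this a
    # partial outage, so stay quiet.
    if any(state != "CRITICAL" for state in answered.values()):
        return False

    # Everyone who answered agrees.  The name that sorts first alerts.
    return me == min(answered)
```

Note the trade-off: if two installations cannot reach each other, each
one treats the other as "did not answer", so both may alert.  That gives
you a duplicate alert instead of a missed one, which seemed like the safer
failure to me.  The event handler would call this only on a hard CRITICAL,
which Nagios can pass in through the $SERVICESTATE$ and $SERVICESTATETYPE$
macros.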

3.  Some combination of passive checks and check_cluster.  I haven't thought
this method out very far yet, but I'm pretty sure you could rig something up
like that.  It would still have the problem of depending on one centralized
server.
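
For the passive-check half, the central server needs one service per
monitoring station (say "http-from-mon-a" and "http-from-mon-b" on host
"dc1"; those names are made up), each fed through the external command
file.  Here is a small sketch of building and submitting such a result.
The PROCESS_SERVICE_CHECK_RESULT line format is the standard one; the
command file path is an assumption, and getting the result from the
station to the central server (NSCA or similar) is not shown.
check_cluster would then watch those per-station services and go CRITICAL
only when enough of them are down.

```python
import time


def format_passive_result(host, service, return_code, output, now=None):
    """Build one PROCESS_SERVICE_CHECK_RESULT external command line."""
    timestamp = int(time.time() if now is None else now)
    # The command must fit on a single line.
    text = output.replace("\n", " ")
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n" % (
        timestamp, host, service, return_code, text)


def submit(line, command_file="/usr/local/nagios/var/rw/nagios.cmd"):
    """Write the command to Nagios' external command file.

    The default path is an assumption; use the command_file setting
    from your own nagios.cfg.
    """
    with open(command_file, "w") as handle:
        handle.write(line)
```

For example, format_passive_result("dc1", "http-from-mon-a", 2,
"CRITICAL - timeout") gives the line to hand to submit().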

I can't believe we're the only two people on this list wanting to do this.
Hopefully some others will chime in with their thoughts as well.

Andrew

_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users