False positives after the Parent host recovered
Wheeler, MG
MG at ev3.net
Thu May 20 20:36:48 CEST 2004
It was my finding (and I have been wrong before) that child "Services" won't be reported as down, but child "Hosts" will. That is why we don't send notifications for Hosts at all; we only use notifications for services on the various hosts. We make sure each host has at least one Service that validates that the host itself is working. We use the same check_ping that the Host Check uses, but we run it as a service, so when a blocking host goes down we don't get any notifications for its children.
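To illustrate the approach described above, here is a minimal sketch of one child host with host notifications turned off and a ping service carrying the notifications instead. It is not taken from the poster's actual setup: the service description, check intervals, ping thresholds, and the contact group name switch-admins are all assumptions, and it presumes a check_ping command defined with two arguments, as in the stock sample configuration.

```cfg
# Sketch only: thresholds, intervals and the contact group are assumed values.
define host{
        use                     generic-switch
        host_name               child-switch1
        address                 1.1.1.11
        parents                 parent-switch
        notifications_enabled   0       ; no host-level notifications for this host
}

define service{
        host_name               child-switch1
        service_description     PING
        check_command           check_ping!100.0,20%!500.0,60%  ; assumed warn/crit thresholds
        max_check_attempts      3
        normal_check_interval   5
        retry_check_interval    1
        check_period            24x7
        notification_interval   0
        notification_period     24x7
        notification_options    c,r     ; notify on CRITICAL and on recovery
        contact_groups          switch-admins   ; hypothetical contact group
}
```

The host check still runs, so Nagios can still work out whether the host is down or unreachable; only the notifications move to the service.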
I could be wrong, and if someone knows of a better way that doesn't require entering every single host in dependencies.cfg, I would be glad to hear about it.
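For reference, the per-host entries mentioned above would look roughly like the sketch below, repeated once for each parent/child pair. The host names come from the configuration quoted further down; the choice of failure criteria is an assumption, and whether this actually suppresses the burst described in the original message has not been verified.

```cfg
# Sketch only: one such definition is needed for every dependent host.
define hostdependency{
        host_name                       parent-switch   ; host being depended upon
        dependent_host_name             child-switch1   ; host whose notifications are held back
        notification_failure_criteria   d,u             ; suppress while the parent is DOWN or UNREACHABLE
}
```

This is exactly the bookkeeping the service-only approach tries to avoid, since the list grows with every host added below a parent.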
-----Original Message-----
From: nagios-users-admin at lists.sourceforge.net [mailto:nagios-users-admin at lists.sourceforge.net]On Behalf Of ling Zhang
Sent: Thursday, May 20, 2004 1:04 PM
To: nagios-users at lists.sourceforge.net
Cc: 'Gregory Bell'; harper.mann at comcast.com; CHui
Subject: [Nagios-users] False positives after the Parent host recovered
Hi,
I hope to get your input on a frustrating problem. Right after a "parent" host goes down and recovers, I receive a burst of notifications indicating that downstream "children" have gone down & recovered, even though that's not the case. Although this behavior doesn't happen every time a "parent" node goes down, my impression is that the odds are greater than 30%.
For example, suppose this is my network:
Nagios--------Bridge------Parent Switch--------Child switch1
                                |
                                |
                          Child Switch2--------Child switch3
The series of events goes like this:
1. Disconnect link between "Bridge" and "parent switch".
2. Nagios reports and only reports "parent switch" down. (good)
3. Re-connect link between "Bridge" and "parent switch".
4. Nagios reports "parent switch" recovered. (very good)
5. Nagios reports "child switch1" and "child switch2" down right after "parent switch" recovered. (what the?)
6. Nagios reports "child switch1" and "child switch2" recovered shortly. (????????)
Now, my Nagios host configuration for the test network looks like this:
define host{
        name                            generic-bridge
        notifications_enabled           1       ; Host notifications are enabled
        event_handler_enabled           1       ; Host event handler is enabled
        flap_detection_enabled          0       ; Flap detection is disabled
        process_perf_data               1       ; Process performance data
        retain_status_information       1
        retain_nonstatus_information    1
        check_command                   check-host-alive
        max_check_attempts              3
        notification_interval           0
        notification_period             24x7
        notification_options            d,r
        register                        0
}
define host{
        name                            generic-switch
        notifications_enabled           1       ; Host notifications are enabled
        event_handler_enabled           1       ; Host event handler is enabled
        flap_detection_enabled          0       ; Flap detection is disabled
        process_perf_data               1       ; Process performance data
        retain_status_information       1
        retain_nonstatus_information    1
        check_command                   check-host-alive
        max_check_attempts              3
        notification_interval           0
        notification_period             24x7
        notification_options            d,r
        register                        0
}
define host {
        use             generic-bridge
        host_name       Bridge
        address         1.1.1.1
}
define host {
        use             generic-switch
        host_name       parent-switch
        address         1.1.1.10
        parents         Bridge
}
define host {
        use             generic-switch
        host_name       child-switch1
        address         1.1.1.11
        parents         parent-switch
}
define host {
        use             generic-switch
        host_name       child-switch2
        address         1.1.1.12
        parents         parent-switch
}
define host {
        use             generic-switch
        host_name       child-switch3
        address         1.1.1.13
        parents         child-switch2
}
So, any idea on this?
Thanks.
Ling