False positives after the Parent host recovered
ling Zhang
lzhang at lbl.gov
Sat May 29 02:40:18 CEST 2004
Hi,
This is the follow up to my first post.
In summary, the problem went away after changing the
service_check_timeout and host_check_timeout parameters in the main
configuration file to a smaller value (in our case, 5 seconds). Those
values were originally set to 60 seconds. Without reading the source
code in detail, I can only guess at the cause of the problem:
Imagine a situation like this (using the same test network as in my
first post and keeping the default timeout values):
Nagios--------Bridge------Parent Switch--------Child switch1
                                |
                                |
                          Child Switch2--------Child switch3
First, disconnect the link between "bridge" and "parent switch".
Second, Nagios reports and only reports "parent switch down".
After a few minutes, assume the following events occur, with the first
one happening at time T1:
  T1: Nagios sends a check_host request to "child switch1".
  T1+35sec: The connection between Bridge and Parent Switch is
    restored.
  T1+45sec: Nagios sends a check_host request to "parent switch",
    which returns with "OK" status. Nagios in turn clears the
    "parent switch" down event.
  T1+60sec: The check_host request to "child switch1" finally times
    out. Since the parent of "child switch1" is now "UP", Nagios
    considers "child switch1" to be down and sends out an alert based
    on that.
  T1+240sec (assuming the service check interval is set to 3 mins):
    Nagios sends a check_host request to "child switch1", which
    returns with "OK" status. Nagios in turn clears the
    "child switch1" down event.
If my theory above is right, reducing the timeout value REDUCES but does
not eliminate the chance for the "recovery flapping" problem.
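For reference, the change amounts to roughly the following two lines in
the main configuration file (typically nagios.cfg). This is only a
sketch: 5 seconds is simply the value that worked on our test network,
not a general recommendation, and the rest of the file is assumed to be
unchanged.

```
# Both values were 60 seconds before the change.
# Assumption: 5 seconds is long enough for the check plugins in use
# to finish on a healthy network.
service_check_timeout=5
host_check_timeout=5
```

A shorter timeout narrows the window in which a check launched while the
link was down can still be pending after the parent has recovered, which
is why the problem becomes less likely rather than impossible.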
So, what do you think?
Thanks.
Ling
-----Original Message-----
From: Wheeler, MG [mailto:MG at ev3.net]
Sent: Thursday, May 20, 2004 11:37 AM
To: ling Zhang; nagios-users at lists.sourceforge.net
Cc: Gregory Bell; harper.mann at comcast.com; CHui
Subject: RE: [Nagios-users] False positives after the Parent host
recovered
It was my finding (and I have been wrong before) that the children
"Services" won't report being down but the children "Hosts" will. That
is why we don't do any notifications on just Hosts at all. We are
only using notifications for services on the various hosts. We make sure
we have at least one Service per host to validate that the host itself
is working. We use the same check_ping that the Host Check does but we
do it as a service so when a blocking host goes down we don't get any
children notifications.
I could be wrong, and if someone knows of a better way that doesn't
require entering every single host in dependencies.cfg, I would be glad
to hear about it.
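A minimal sketch of what such a per-host ping service might look like.
The template name generic-service, the host name child-switch1, and the
check_ping thresholds are assumptions for illustration (the thresholds
follow the style of the sample configuration), not values from our
setup:

```
define service{
        use                     generic-service   ; hypothetical service template
        host_name               child-switch1     ; one such service per host
        service_description     PING
        check_command           check_ping!100.0,20%!500.0,60%
        }
```

Host notifications themselves would then be switched off by setting
notifications_enabled to 0 in the host definitions, so that only the
service notifications remain.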
-----Original Message-----
From: nagios-users-admin at lists.sourceforge.net
[mailto:nagios-users-admin at lists.sourceforge.net]On Behalf Of ling Zhang
Sent: Thursday, May 20, 2004 1:04 PM
To: nagios-users at lists.sourceforge.net
Cc: 'Gregory Bell'; harper.mann at comcast.com; CHui
Subject: [Nagios-users] False positives after the Parent host recovered
Hi,
I hope to get your input on a frustrating problem. Right after a
parent host goes down and recovers, I receive a burst of notifications
indicating that downstream children have gone down & recovered, even
though that's not the case. Although this behavior doesn't happen every
time a parent node goes down, my impression is that the odds are
greater than 30%.
For example, suppose this is my network:
Nagios--------Bridge------Parent Switch--------Child switch1
                                |
                                |
                          Child Switch2--------Child switch3
The series of events goes like this:
1. Disconnect link between "Bridge" and "parent switch".
2. Nagios reports and only reports "parent switch" down. (good)
3. Re-connect link between "Bridge" and "parent switch".
4. Nagios reports "parent switch" recovered. (very good)
5. Nagios reports "child switch1" and "child switch2" down right after
"parent switch" recovered. (what the?)
6. Nagios reports "child switch1" and "child switch2" recovered shortly.
(????????)
Now, my Nagios host configuration for the test network looks like
this:
define host{
        name                            generic-bridge
        notifications_enabled           1       ; Host notifications are enabled
        event_handler_enabled           1       ; Host event handler is enabled
        flap_detection_enabled          0       ; Flap detection is disabled
        process_perf_data               1       ; Process performance data
        retain_status_information       1
        retain_nonstatus_information    1
        check_command                   check-host-alive
        max_check_attempts              3
        notification_interval           0
        notification_period             24x7
        notification_options            d,r
        register                        0
}
define host{
        name                            generic-switch
        notifications_enabled           1       ; Host notifications are enabled
        event_handler_enabled           1       ; Host event handler is enabled
        flap_detection_enabled          0       ; Flap detection is disabled
        process_perf_data               1       ; Process performance data
        retain_status_information       1
        retain_nonstatus_information    1
        check_command                   check-host-alive
        max_check_attempts              3
        notification_interval           0
        notification_period             24x7
        notification_options            d,r
        register                        0
}
define host {
        use             generic-bridge
        host_name       Bridge
        address         1.1.1.1
}
define host {
        use             generic-switch
        host_name       parent-switch
        address         1.1.1.10
        parents         Bridge
}
define host {
        use             generic-switch
        host_name       child-switch1
        address         1.1.1.11
        parents         parent-switch
}
define host {
        use             generic-switch
        host_name       child-switch2
        address         1.1.1.12
        parents         parent-switch
}
define host {
        use             generic-switch
        host_name       child-switch3
        address         1.1.1.13
        parents         child-switch2
}
So, any idea on this?
Thanks.
Ling