False positives after the Parent host recovered

ling Zhang lzhang at lbl.gov
Sat May 29 02:40:18 CEST 2004
Hi, 

This is a follow-up to my first post.

In summary, the problem went away after I reduced the
service_check_timeout and host_check_timeout parameters in the main
configuration file from 60 seconds to 5 seconds. Without reading the
source code in detail, I can only guess at the cause of the problem:

Imagine a situation like this (using the same test network as in my
first post and keeping the original 60-second timeout values):


Nagios--------Bridge------Parent Switch--------Child switch1
                                |
                                |
                          Child Switch2--------Child switch3


First, disconnect the link between "bridge" and "parent switch".
Second, Nagios reports and only reports "parent switch down". 

After a few minutes, assuming the first event in the following list
happened at time T1,
 
    T1: Nagios sends a check_host request to "child switch1".
    T1+35sec: The connection between "Bridge" and "Parent Switch" is
        restored.
    T1+45sec: Nagios sends a check_host request to "parent switch",
        which returns with "OK" status. Nagios in turn clears the
        "parent switch" down event.
    T1+60sec: The check_host request to "child switch1" finally times
        out. Now, since the parent device of "child switch1" is "UP",
        Nagios considers "child switch1" to be down and sends out an
        alert based on that.
    T1+240sec (assuming the service check interval is set to 3 mins):
        Nagios sends a check_host request to "child switch1", which
        returns with "OK" status. Nagios in turn clears the
        "child switch1" down event.

If my theory above is right, reducing the timeout value REDUCES but does
not eliminate the chance for the "recovery flapping" problem. 
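
For reference, the change described at the top of this message amounts
to something like the following in the main configuration file
(nagios.cfg). The file name is an assumption; the two parameter names
and the values are the ones mentioned above.

# Main configuration file (assumed to be nagios.cfg).
# Both values were 60 before the change.
service_check_timeout=5
host_check_timeout=5

One caveat with a value this small: the timeout presumably still has to
be longer than the worst-case run time of the check plugin itself,
otherwise healthy but slow checks would be cut off as well.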

So, what do you think?

Thanks.

Ling



-----Original Message-----
From: Wheeler, MG [mailto:MG at ev3.net] 
Sent: Thursday, May 20, 2004 11:37 AM
To: ling Zhang; nagios-users at lists.sourceforge.net
Cc: Gregory Bell; harper.mann at comcast.com; CHui
Subject: RE: [Nagios-users] False positives after the Parent host
recovered

It was my finding (and I have been wrong before) that the children
"Services" won't report being down but the children "Hosts" will. That
is why we don't send notifications for Hosts at all; we only use
notifications for services on the various hosts. We make sure we have
at least one Service per host to validate that the host itself is
working. We use the same check_ping that the Host Check does, but we
run it as a service, so when a blocking host goes down we don't get any
children notifications.
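
As a rough sketch only (not our actual configuration), such a ping
service might be defined along these lines. The host name is taken from
the original post; the thresholds, intervals and the "admins" contact
group are placeholders, and check_ping is assumed to be defined as a
command that takes warning and critical thresholds as its two
arguments.

define service{
        host_name               child-switch1
        service_description     PING
        check_command           check_ping!100.0,20%!500.0,60%  ; placeholder thresholds
        max_check_attempts      3
        normal_check_interval   5       ; placeholder, in time units
        retry_check_interval    1       ; placeholder, in time units
        check_period            24x7
        notification_interval   0
        notification_period     24x7
        notification_options    c,r
        contact_groups          admins  ; hypothetical contact group
        }

Host notifications would then be switched off separately, for example
by setting notifications_enabled to 0 in the host template.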
 
I could be wrong, and if someone knows of a better way that avoids
entering every single host in dependencies.cfg, it would be great to
hear about it.
 
 
-----Original Message-----
From: nagios-users-admin at lists.sourceforge.net
[mailto:nagios-users-admin at lists.sourceforge.net]On Behalf Of ling Zhang
Sent: Thursday, May 20, 2004 1:04 PM
To: nagios-users at lists.sourceforge.net
Cc: 'Gregory Bell'; harper.mann at comcast.com; CHui
Subject: [Nagios-users] False positives after the Parent host recovered
Hi,

I hope to get your input on a frustrating problem.  Right after a
“parent” host goes down and recovers, I receive a burst of notifications
indicating that downstream “children” have gone down & recovered, even
though that’s not the case.  Although this behavior doesn’t happen every
time a “parent” node goes down, my impression is that the odds are
greater than 30%. 

For example, suppose this is my network:


Nagios--------Bridge------Parent Switch--------Child switch1
                                |
                                |
                          Child Switch2--------Child switch3

The series of events goes like this:

1. Disconnect link between "Bridge" and "parent switch". 
2. Nagios reports and only reports "parent switch" down. (good) 
3. Re-connect link between "Bridge" and "parent switch". 
4. Nagios reports "parent switch" recovered. (very good)
5. Nagios reports "child switch1" and "child switch2" down right after
"parent switch" recovered. (what the?) 
6. Nagios reports "child switch1" and "child switch2" recovered shortly.
(????????)


Now, my Nagios host configuration for the test network looks like
this:

define host{
        name                            generic-bridge
        notifications_enabled           1       ; Host notifications are enabled
        event_handler_enabled           1       ; Host event handler is enabled
        flap_detection_enabled          0       ; Flap detection is disabled
        process_perf_data               1       ; Process performance data
        retain_status_information       1
        retain_nonstatus_information    1
        check_command                   check-host-alive
        max_check_attempts              3
        notification_interval           0
        notification_period             24x7
        notification_options            d,r
        register                        0
        }


define host{
        name                            generic-switch
        notifications_enabled           1       ; Host notifications are enabled
        event_handler_enabled           1       ; Host event handler is enabled
        flap_detection_enabled          0       ; Flap detection is disabled
        process_perf_data               1       ; Process performance data
        retain_status_information       1
        retain_nonstatus_information    1
        check_command                   check-host-alive
        max_check_attempts              3
        notification_interval           0
        notification_period             24x7
        notification_options            d,r
        register                        0
        }


define host {
        use                     generic-bridge
        host_name               Bridge
        address                 1.1.1.1
}


define host {
        use                     generic-switch
        host_name               parent-switch
        address                 1.1.1.10
        parents                 Bridge
}

define host {
        use                     generic-switch
        host_name               child-switch1
        address                 1.1.1.11
        parents                 parent-switch
}


define host {
        use                     generic-switch
        host_name               child-switch2
        address                 1.1.1.12
        parents                 parent-switch
}


define host {
        use                     generic-switch
        host_name               child-switch3
        address                 1.1.1.13
        parents                 child-switch2
}


So, any idea on this?

Thanks.

Ling



