Filtering out false alarms in an unreliable network

Tuomas Toropainen tuomas.toropainen at lanwan.fi
Wed Oct 1 17:44:45 CEST 2008


The problem: how to filter out false alarms caused by short-time breaks in
an unreliable network.

Think about a simple monitoring scenario in which you only want to ping
various devices to see if they are up or not. So you have 200 hosts with
only one service (PING) for each.

For one reason or another, short breaks occur in the network. That is,
a particular host does not reply to PINGs for e.g. 30 seconds. These
breaks should not cause a notification to be sent.

As for services, the filtering is easy with max_check_attempts and
retry_check_interval. But the host check becomes a problem: after the
first PING failure (soft state) the host is checked, and there is no
retry_check_interval for hosts. So the host is declared to be down
(almost) immediately.
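
To illustrate the service-side filtering, here is a minimal sketch in
Python. It is a simplified model, not Nagios code: the function name and
the list-of-booleans input are made up for illustration, and it assumes a
problem only becomes "hard" after max_check_attempts consecutive failed
checks.

```python
def first_hard_failure(results, max_check_attempts):
    """Return the index of the check that reaches a hard problem state.

    results: list of booleans, True = check OK, False = check failed.
    Returns None if the failures never last long enough.
    (Hypothetical helper; a simplified model of soft/hard states.)
    """
    consecutive_failures = 0
    for i, ok in enumerate(results):
        if ok:
            # Any successful check resets the retry counter.
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= max_check_attempts:
                return i
    return None


# A short break: two failed checks in a row, then recovery.
short_break = [True, False, False, True, True]

print(first_hard_failure(short_break, max_check_attempts=3))  # None
print(first_hard_failure(short_break, max_check_attempts=1))  # 1
```

With three attempts the short break is absorbed in the soft state. With
a single attempt, which is roughly what happens to the host check here,
the first failure already counts as a hard problem.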

The notifications about hosts can be delayed using
first_notification_delay. This seems to work fine except for one thing:
flap detection. Even if the notification is not sent, the host (and
service) is logged to have changed state, and when enough such state
changes occur, the host (and service) is placed in flapping state.
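
A rough sketch of why short outages add up, again in Python. This is a
simplification under stated assumptions: it computes a plain, unweighted
percentage of state changes over a recorded history, whereas Nagios, as
far as I understand the documentation, weights recent changes more
heavily. The function name and the example history are hypothetical.

```python
def percent_state_change(states):
    """Unweighted share of adjacent state pairs that differ, in percent.

    (Hypothetical helper; real flap detection applies a weighting
    that this sketch leaves out.)
    """
    if len(states) < 2:
        return 0.0
    changes = sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)
    return 100.0 * changes / (len(states) - 1)


# 21 recorded states: mostly UP, with three brief DOWN blips.
history = ["UP"] * 21
for i in (4, 10, 16):
    history[i] = "DOWN"

print(percent_state_change(history))  # 30.0
```

Each blip contributes two state changes (UP to DOWN and back), so three
short outages already give 6 changes out of 20 possible. Whether that is
enough to trigger flapping depends on the configured thresholds.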

I do not want to disable flap detection (or flapping notifications)
completely, because they can be useful in many cases. What I would like
to achieve is to leave those short outages out when the percent state
change for flap detection is computed. How can I accomplish that?

Should I go ahead and disable host checks completely? If only there were
a retry_check_interval for hosts, it would solve all these problems.

I think it is quite common for short outages to be network-related,
i.e. the whole host is unresponsive rather than a single service. Taking
this into account, it seems odd that there is a retry_check_interval for
services but not for hosts. Or would adding one ruin the scheduling
logic?

Thanks for any help.

-tuomas

