Escalations and recovery notifications

Tedman Eng teng at dataway.com
Fri Aug 27 21:44:45 CEST 2004


Since management is squelched on all but the first 3 notifications, they
won't receive any notifications after that, whether they be down, ack, or up.
One way to fix this is to define one management contact with just down
state notifications, and with escalations. Then define a second management
contact with just recovery notifications, but no escalations.  Both use the
same email, both get included in the notification setting, but only one gets
notified for any condition, since they are complementary to each other.
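
A rough, untested sketch of what that could look like. The contact names
(mgmt-down, mgmt-recovery), the management-recovery group, the email address
and the notify-by-email / host-notify-by-email commands are placeholders and
not taken from the original setup. How the recovery-only group gets attached
is also an assumption of this sketch: it uses an open-ended entry
(last_notification 0) so that group is never squelched, because listing it
only in the service's contact_groups may not be enough when another
escalation already starts at notification 1.

```
# Hypothetical contact that only receives problem notifications
define contact{
        contact_name                    mgmt-down
        alias                           Management (problems only)
        service_notification_period     24x7
        host_notification_period        24x7
        service_notification_options    u,c
        host_notification_options       n
        service_notification_commands   notify-by-email
        host_notification_commands      host-notify-by-email
        email                           management@example.com
        }

# Hypothetical contact with the same address, recoveries only
define contact{
        contact_name                    mgmt-recovery
        alias                           Management (recoveries only)
        service_notification_period     24x7
        host_notification_period        24x7
        service_notification_options    r
        host_notification_options       n
        service_notification_commands   notify-by-email
        host_notification_commands      host-notify-by-email
        email                           management@example.com
        }

define contactgroup{
        contactgroup_name       management
        alias                   Management problem notifications
        members                 mgmt-down
        }

define contactgroup{
        contactgroup_name       management-recovery
        alias                   Management recovery notifications
        members                 mgmt-recovery
        }

# Problem notifications to management: first three only
define serviceescalation{
        first_notification      1
        last_notification       3
        contact_groups          management
        notification_interval   5
        hostgroup_name          webservers
        service_description     Apache
        }

# Recovery-only group: open-ended, so it still matches after notification 3
define serviceescalation{
        first_notification      1
        last_notification       0
        contact_groups          management-recovery
        notification_interval   5
        hostgroup_name          webservers
        service_description     Apache
        }
```

With this layout the recovery-only contact is matched on every notification
but filters out everything except recoveries through its own
service_notification_options, so management should still see at most three
problem mails plus one recovery mail.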


-----Original Message-----
From: Stefan Giesen [mailto:Stefan.Giesen at firstgate.de]
Sent: Friday, August 27, 2004 10:37 AM
To: Nagios User Maillist
Subject: [Nagios-users] Escalations and recovery notifications


Hi,

I've run into a problem with Nagios escalations and notifications
regarding recovery alerts (maybe I'm just plain stupid, but I can't find
anything about this in the documentation or the FAQ database):

Given the following example setup:
1. The management wants to be informed when a specific service (in this
example an Apache web server) has gone down (notify three times,
at five-minute intervals - don't ask why exactly three times, it's
management, so no logic involved here ;-)
2. The management wants to be informed when this service gets back
online.
3. The sysadmin of the day should be informed when this service breaks
down. This notification should be repeated every 5 minutes until the
problem is acknowledged/solved.
4. The "backup" sysadmin (who isn't really on duty, but kind of - I
think you know what I mean) should be informed only after the first
sysadmin didn't respond after 4 notifications (so about 25 minutes after
the problem arose).
5. Both administrators should be informed again and again at 5-minute
intervals until one of them acknowledges (or solves) the problem.
6. Each formerly informed sysadmin should be informed if the service
gets back online.

Parts 3, 4, 5 and 6 are not a big problem, here's my escalation config
for them:

The service (with standard templates) itself:

#-----------------------------------------------------------------------------
# TEMPLATE: Generic service definition
define service{
        name                            generic-service
        is_volatile                     0
        active_checks_enabled           1
        passive_checks_enabled          1
        parallelize_check               1
        obsess_over_service             1
        check_freshness                 0
        notifications_enabled           1
        event_handler_enabled           1
        flap_detection_enabled          1
        process_perf_data               0
        retain_status_information       1
        retain_nonstatus_information    1
        register                        0
        }

#-----------------------------------------------------------------------------
# TEMPLATE: Standard 2 Minutes Active Check, Max Soft Checks 4
# (eventhandler restart after 3 soft criticals possible if event_handler
# defined)
define service{
        use                             generic-service
        name                            2min-service-ev
        check_period                    24x7
        max_check_attempts              4
        normal_check_interval           2
        retry_check_interval            1
        notification_period             24x7
        notification_interval           5
        # not really necessary, because the escalation config starts at
        # notification 1
        contact_groups                  monitor-email
        notification_options            u,c,r
        register                        0
        }

#-----------------------------------------------------------------------------
# Apache HTTP Web Server
define service{
        use                             2min-service-ev
        hostgroup_name                  webservers
        service_description             Apache
        check_command                   check_http
        }

#-----------------------------------------------------------------------------
# Template: standard first level escalation
define serviceescalation{
        name                    std-escalation-1
        first_notification      1
        last_notification       0
        contact_groups          sysadmin1
        notification_interval   5
        register                0
        }

#-----------------------------------------------------------------------------
# Template: standard second level escalation
define serviceescalation{
        name                    std-escalation-2
        first_notification      5
        last_notification       0
        contact_groups          sysadmin2
        notification_interval   5
        register                0
        }

#-----------------------------------------------------------------------------
# Apache Webservers
define serviceescalation{
        use                     std-escalation-1
        hostgroup_name          webservers
        service_description     Apache
        }

define serviceescalation{
        use                     std-escalation-2
        hostgroup_name          webservers
        service_description     Apache
        }

This works like a charm.

But now the tricky part:

define serviceescalation{
        first_notification      1
        last_notification       3
        contact_groups          management
        notification_interval   5
        hostgroup_name          webservers
        service_description     Apache
        }

First it works as expected: The service goes down, management gets
informed three times (every five minutes). OK for that, but now:

The service goes back online after - let's say - 9 notifications have
been sent out (it was just after work, and both admins were in a big
traffic jam ;)

Now the following happens:
- Both administrators get recovery alerts (as expected) BUT
- the management gets no recovery alert at all.

Maybe I'm blind and dumb, but I can't figure out how to configure Nagios
to work as expected. As I said, I've read the documentation and the FAQ,
but I didn't find anything related to this in the escalation
examples/documentation.

My system: Nagios 1.2 on Debian stable. Everything else works as
expected (eventhandlers and so on), only that damn "only three
notifications and the recovery alert afterwards" won't work.

Does anybody know where I made the mistake? Or isn't Nagios capable of
doing what I need?

Thanks in advance,
Stefan

-- 
Stefan Giesen, Systemadministration Frankfurt
FIRSTGATE Internet AG, Im MediaPark 5, 50670 Koeln
Telefon: +49 (0) 2 21 / 45 45-745, Telefax: +49 (0) 2 21 / 45 45-710
Internet: www.firstgate.de         eMail: Stefan.Giesen at firstgate.de



-------------------------------------------------------
This SF.Net email is sponsored by BEA Weblogic Workshop
FREE Java Enterprise J2EE developer tools!
Get your free copy of BEA WebLogic Workshop 8.1 today.
http://ads.osdn.com/?ad_id=5047&alloc_id=10808&op=click
_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting
any issue. 
::: Messages without supporting info will risk being sent to /dev/null






