Escalations and recovery notifications
Stefan Giesen
Stefan.Giesen at firstgate.de
Fri Aug 27 19:36:37 CEST 2004
Hi,
I've run into a problem with Nagios escalations and notifications
regarding recovery alerts (maybe I'm just plain stupid, but I can't find
anything about this in the documentation or the FAQ database):
Given the following example setup:
1. The management wants to be informed when a specific service (in this
example an Apache web server) has gone down (notify three times at
five-minute intervals - don't ask why exactly three times, it's
management, so no logic involved here ;-)
2. The management wants to be informed when this service gets back
online.
3. The sysadmin of the day should be informed when this service breaks
down. This notification should be repeated every 5 minutes until the
problem is acknowledged/solved.
4. The "backup" sysadmin (who isn't really on duty, but kind of - I
think you know what I mean) should be informed only after the first
sysadmin didn't respond after 4 notifications (so about 25 minutes after
the problem arose).
5. Both administrators should be informed again and again at 5-minute
intervals until one of them acknowledges (or solves) the problem.
6. Each formerly informed sysadmin should be informed if the service
gets back online.
Parts 3, 4, 5 and 6 are not a big problem, here's my escalation config
for them:
The service (with standard templates) itself:
#-----------------------------------------------------------------------------
# TEMPLATE: Generic service definition
define service{
name generic-service
is_volatile 0
active_checks_enabled 1
passive_checks_enabled 1
parallelize_check 1
obsess_over_service 1
check_freshness 0
notifications_enabled 1
event_handler_enabled 1
flap_detection_enabled 1
process_perf_data 0
retain_status_information 1
retain_nonstatus_information 1
register 0
}
#-----------------------------------------------------------------------------
# TEMPLATE: Standard 2 Minutes Active Check, Max Soft Checks 4
# (eventhandler restart after 3 soft criticals possible if event_handler
# defined)
define service{
use generic-service
name 2min-service-ev
check_period 24x7
max_check_attempts 4
normal_check_interval 2
retry_check_interval 1
notification_period 24x7
notification_interval 5
# contact_groups is not really necessary because the escalation config
# starts at notification 1
contact_groups monitor-email
notification_options u,c,r
register 0
}
#-----------------------------------------------------------------------------
# Apache HTTP Web Server
define service{
use 2min-service-ev
hostgroup_name webservers
service_description Apache
check_command check_http
}
#-----------------------------------------------------------------------------
# Template: standard first level escalation
define serviceescalation{
name std-escalation-1
first_notification 1
last_notification 0
contact_groups sysadmin1
notification_interval 5
register 0
}
#-----------------------------------------------------------------------------
# Template: standard second level escalation
define serviceescalation{
name std-escalation-2
first_notification 5
last_notification 0
contact_groups sysadmin2
notification_interval 5
register 0
}
#-----------------------------------------------------------------------------
# Apache Webservers
define serviceescalation{
use std-escalation-1
hostgroup_name webservers
service_description Apache
}
define serviceescalation{
use std-escalation-2
hostgroup_name webservers
service_description Apache
}
This works like a charm.
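
The definitions above rely on Nagios object inheritance: a definition pulls in the directives of the template named by `use`, and its own directives take precedence. A minimal Python sketch of that merge, assuming that `name`, `use` and `register` are not passed down to the child; the `resolve` helper and the dict layout are hypothetical illustrations, not Nagios code:

```python
# Hypothetical model of "use" inheritance; not taken from Nagios itself.
# Assumption: these bookkeeping directives are not inherited by children.
NOT_INHERITED = {"name", "use", "register"}

# The two escalation templates from the config above, as plain dicts.
TEMPLATES = {
    "std-escalation-1": {
        "name": "std-escalation-1",
        "first_notification": 1,
        "last_notification": 0,
        "contact_groups": "sysadmin1",
        "notification_interval": 5,
        "register": 0,
    },
    "std-escalation-2": {
        "name": "std-escalation-2",
        "first_notification": 5,
        "last_notification": 0,
        "contact_groups": "sysadmin2",
        "notification_interval": 5,
        "register": 0,
    },
}


def resolve(definition, templates):
    """Merge the template chain into a definition; local directives win."""
    merged = {}
    parent = definition.get("use")
    if parent is not None:
        # Inherit from the parent first (which may itself use a template).
        merged.update(resolve(templates[parent], templates))
    # Directives written in the definition itself override inherited ones.
    merged.update(definition)
    return {k: v for k, v in merged.items() if k not in NOT_INHERITED}


apache_level_1 = resolve(
    {
        "use": "std-escalation-1",
        "hostgroup_name": "webservers",
        "service_description": "Apache",
    },
    TEMPLATES,
)
print(apache_level_1)
```

Under these assumptions the first Apache escalation ends up with `contact_groups sysadmin1`, `first_notification 1` and `last_notification 0`, plus its own `hostgroup_name` and `service_description`.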
But now the tricky part:
define serviceescalation{
first_notification 1
last_notification 3
contact_groups management
notification_interval 5
hostgroup_name webservers
service_description Apache
}
First it works as expected: The service goes down, management gets
informed three times (every five minutes). OK for that, but now:
The service goes back online after - let's say - 9 notifications have
been sent out (it was just after work, and both admins were in a big
traffic jam ;)
Now the following happens:
- Both administrators get recovery alerts (as expected) BUT
- the management gets no recovery alert at all.
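
The behaviour described here is consistent with a simple range check, sketched in Python. It assumes that a recovery alert is routed using the number of the last problem notification (9 in this example) and that `last_notification 0` means no upper limit; the dict layout and the helper names `matches` and `groups_for` are hypothetical, not Nagios internals:

```python
# Hypothetical model of the three Apache escalations from this post.
ESCALATIONS = [
    {"contact_groups": "sysadmin1", "first": 1, "last": 0},
    {"contact_groups": "sysadmin2", "first": 5, "last": 0},
    {"contact_groups": "management", "first": 1, "last": 3},
]


def matches(escalation, notification_number):
    """True if the notification number falls inside the escalation range."""
    if notification_number < escalation["first"]:
        return False
    # Assumption: last == 0 means the escalation never expires.
    return escalation["last"] == 0 or notification_number <= escalation["last"]


def groups_for(notification_number):
    """Contact groups whose escalation range covers this notification."""
    return [
        e["contact_groups"]
        for e in ESCALATIONS
        if matches(e, notification_number)
    ]


# Problem notifications 1, 3, 4, 5 and the assumed recovery at number 9.
for number in (1, 3, 4, 5, 9):
    print(number, groups_for(number))
```

Under these assumptions `groups_for(3)` still includes management, while `groups_for(9)` returns only the two sysadmin groups, which matches the missing recovery alert.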
Maybe I'm blind and dumb, but I can't figure out how to configure Nagios
to work as expected. As I said, I've read the documentation and the FAQ,
but I didn't find anything related to this in the escalation
examples/documentation.
My system: Nagios 1.2 on Debian stable. Everything else works as
expected (eventhandlers and so on), only that damn "only three
notifications and the recovery alert afterwards" won't work.
Does anybody know where I made the mistake? Or isn't Nagios capable of
doing what I need?
Thanks in advance,
Stefan
--
Stefan Giesen, Systemadministration Frankfurt
FIRSTGATE Internet AG, Im MediaPark 5, 50670 Koeln
Telefon: +49 (0) 2 21 / 45 45-745, Telefax: +49 (0) 2 21 / 45 45-710
Internet: www.firstgate.de eMail: Stefan.Giesen at firstgate.de