Service checks not resuming after emergence from scheduled downtime
Eric Loyd
loyd at cyber.kodak.com
Mon Jan 10 17:32:23 CET 2005
We run Nagios v1.2 (February 02, 2004) to monitor things. We've been
running for a while now with stellar success (after a few idiosyncracies
were dealth with). Now, this past weekend, we ran into the following:
On Saturdays from 5pm-10pm, we do not monitor a particular service (DMS)
because the back-end is down for weekly maintenance. So we have a
check_period defined that allows monitoring all day M-F, all day Sunday,
and from 00:00-17:00,22:00-24:00 on Saturday. This works great. But
here's what happened. I've included the Nagios logs for anyone that wants
to take a look.
At 16:27, DMS was not running. Nagios turns SOFT CRITICAL and fires off an
event handler (which, internally, does nothing becuase this is not a HARD
CRITICAL). At 17:28, it does this again - now we're on SOFT CRITICAL
#2. At 16:29, it does it again, but because we've gone HARD CRITICAL, the
EKrestart event handler actually does something and starts the
service. Notification is sent out now as well. 10 minutes later (the
normal_check_interval) DMS is seen to be running and we go back into a HARD
OK. This cycle happens again at 16:49, and the process repeats up to that
last check.
At this point, at 16:51 is when the service went HARD CRITICAL and the
event handler does the restart. At 17:00 we go into a downtime window, and
the service is in a 3/3 HARD CRTICIAL state. Note that the next check
would have been approximately 17:01, but since it's in a downtime, it skips
this check. You'd think that it would pick it up after the donwtime, which
starts up at 22:00. So the next check should be 22:01 (give or take).
But Nagios never does this check. The logs show that it wasn't until 23:10
that someone forced a check via the command interface. In the meantime,
after the downtime window is complete, it does do notifications, but it
never does another check. And, to make it weirder, when the check is
forced, it doesn't immediately report back a status, but it's as if it
restarts the normal_check_interval timer, because it's 10 minutes later, at
23:21, that the service has a result (HARD OK), showing that someone had
gone ahead and restarted DMS outside the context of Nagios.
To sum up, it looks as if there is a bug when a service enters an outage
window in a HARD CRITICAL state, it does not start to check the service
when the outage window expires. Even when a forced check was performed,
all it seems to have done is restart the normal_check_interval timer.
Anyone else see anything like this?
The timeline major events:
16:49 - critical 1/3 SOFT
16:50 - critical 2/3 SOFT
16:51 - critical 3/3 HARD, event handler restarts DMS
17:00 - check_period is no longer valid (downtime)
22:00 - check_period is now valid (no more downtime)
22:00 - notification sent of service down
22:20 - notification sent of service down
22:40 - notification sent of service down
22:50 - notification sent of service down
23:00 - notification sent of service down
23:10 - service check forced by hand
23:21 - service check returns HARD OK state
23:21 - recovery sent via email
Raw logs:
Sat Jan 8 16:27:31 2005 - SERVICE ALERT: 908;DMS;CRITICAL;SOFT;1;CRITICAL:
DMS is not currently running.
Sat Jan 8 16:27:31 2005 - SERVICE EVENT HANDLER:
908;DMS;CRITICAL;SOFT;1;EKrestart
Sat Jan 8 16:28:32 2005 - SERVICE ALERT: 908;DMS;CRITICAL;SOFT;2;CRITICAL:
DMS is not currently running.
Sat Jan 8 16:28:32 2005 - SERVICE EVENT HANDLER:
908;DMS;CRITICAL;SOFT;2;EKrestart
Sat Jan 8 16:29:31 2005 - SERVICE ALERT: 908;DMS;CRITICAL;HARD;3;CRITICAL:
DMS is not currently running.
Sat Jan 8 16:29:31 2005 - SERVICE NOTIFICATION:
dmaster;908;DMS;CRITICAL;notify-by-email;CRITICAL: DMS is not currently
running.
Sat Jan 8 16:29:32 2005 - SERVICE EVENT HANDLER:
908;DMS;CRITICAL;HARD;3;EKrestart
Sat Jan 8 16:39:31 2005 - SERVICE ALERT: 908;DMS;OK;HARD;3;OK: DMS is
running on PID 1214
Sat Jan 8 16:39:31 2005 - SERVICE NOTIFICATION:
dmaster;908;DMS;OK;notify-by-email;OK: DMS is running on PID 1214
Sat Jan 8 16:39:31 2005 - SERVICE EVENT HANDLER: 908;DMS;OK;HARD;3;EKrestart
Sat Jan 8 16:49:31 2005 - SERVICE ALERT: 908;DMS;CRITICAL;SOFT;1;CRITICAL:
DMS is not currently running.
Sat Jan 8 16:49:31 2005 - SERVICE EVENT HANDLER:
908;DMS;CRITICAL;SOFT;1;EKrestart
Sat Jan 8 16:50:31 2005 - SERVICE ALERT: 908;DMS;CRITICAL;SOFT;2;CRITICAL:
DMS is not currently running.
Sat Jan 8 16:50:31 2005 - SERVICE EVENT HANDLER:
908;DMS;CRITICAL;SOFT;2;EKrestart
Sat Jan 8 16:51:31 2005 - SERVICE ALERT: 908;DMS;CRITICAL;HARD;3;CRITICAL:
DMS is not currently running.
Sat Jan 8 16:51:31 2005 - SERVICE NOTIFICATION:
dmaster;908;DMS;CRITICAL;notify-by-email;CRITICAL: DMS is not currently
running.
Sat Jan 8 16:51:31 2005 - SERVICE EVENT HANDLER:
908;DMS;CRITICAL;HARD;3;EKrestart
Sat Jan 8 22:00:11 2005 - SERVICE NOTIFICATION:
dmaster-page;908;DMS;CRITICAL;notify-by-epager;CRITICAL: DMS is not
currently running.
Sat Jan 8 22:00:11 2005 - SERVICE NOTIFICATION:
dmaster;908;DMS;CRITICAL;notify-by-email;CRITICAL: DMS is not currently
running.
Sat Jan 8 22:20:11 2005 - SERVICE NOTIFICATION:
dmaster-page;908;DMS;CRITICAL;notify-by-epager;CRITICAL: DMS is not
currently running.
Sat Jan 8 22:20:11 2005 - SERVICE NOTIFICATION:
dmaster;908;DMS;CRITICAL;notify-by-email;CRITICAL: DMS is not currently
running.
Sat Jan 8 22:40:01 2005 - SERVICE NOTIFICATION:
dmaster-page;908;DMS;CRITICAL;notify-by-epager;CRITICAL: DMS is not
currently running.
Sat Jan 8 22:40:01 2005 - SERVICE NOTIFICATION:
dmaster;908;DMS;CRITICAL;notify-by-email;CRITICAL: DMS is not currently
running.
Sat Jan 8 22:50:01 2005 - SERVICE NOTIFICATION:
dmaster-page;908;DMS;CRITICAL;notify-by-epager;CRITICAL: DMS is not
currently running.
Sat Jan 8 22:50:01 2005 - SERVICE NOTIFICATION:
dmaster;908;DMS;CRITICAL;notify-by-email;CRITICAL: DMS is not currently
running.
Sat Jan 8 23:00:01 2005 - SERVICE NOTIFICATION:
dmaster;908;DMS;CRITICAL;notify-by-email;CRITICAL: DMS is not currently
running.
Sat Jan 8 23:10:56 2005 - EXTERNAL COMMAND:
SCHEDULE_FORCED_SVC_CHECK;908;DMS;1105243853
Sat Jan 8 23:21:01 2005 - SERVICE ALERT: 908;DMS;OK;HARD;3;OK: DMS is
running on PID 26970
Sat Jan 8 23:21:01 2005 - SERVICE NOTIFICATION:
dmaster;908;DMS;OK;notify-by-email;OK: DMS is running on PID 26970
Sat Jan 8 23:21:02 2005 - SERVICE EVENT HANDLER: 908;DMS;OK;HARD;3;EKrestart
Lastly, some keywords for people searching the mailing lists for this
topic: service check checks not being performed done after scheduled
downtime downtime ends terminates finishes
--
Eric Loyd
Drive defensively. Buy a tank.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://www.monitoring-lists.org/archive/users/attachments/20050110/b258a5a2/attachment.html>
More information about the Users
mailing list