Recovery not getting sent during downtime?

srunschke at abit.de srunschke at abit.de
Mon Jul 31 10:48:56 CEST 2006


(repost from nagios-devel)

Hi folks,

I'm currently using Nagios 2.0b3 (never change a running system ;)) and 
ran into the following problem:

A service went critical.
SMS and emails got dispatched.
We found the problem and decided to reboot the machine to fix it.
Scheduled downtime for the host.
Rebooted the host.
Everything went OK again.
No SMS/email was dispatched to say the service had recovered, though!
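
To make the timing in the sequence above concrete, here is a minimal
Python sketch that checks whether a given timestamp falls inside the
scheduled downtime. It uses the SCHEDULE_HOST_DOWNTIME external command
and the NOTES recovery timestamp from the log excerpt in this post. The
names HostDowntime, parse_schedule_host_downtime and in_fixed_window are
made up for illustration and are not part of Nagios. It only shows that
the recovery happened inside the downtime window; it does not explain
why Nagios stayed silent.

```python
from dataclasses import dataclass


@dataclass
class HostDowntime:
    """Hypothetical container for one SCHEDULE_HOST_DOWNTIME command."""
    host: str
    start: int       # Unix timestamp
    end: int         # Unix timestamp
    fixed: bool
    trigger_id: int
    duration: int    # seconds
    author: str
    comment: str


def parse_schedule_host_downtime(command: str) -> HostDowntime:
    # Field order: name;host;start;end;fixed;trigger_id;duration;author;comment
    parts = command.split(";", 8)
    if len(parts) != 9 or parts[0] != "SCHEDULE_HOST_DOWNTIME":
        raise ValueError("not a SCHEDULE_HOST_DOWNTIME command: %r" % command)
    _, host, start, end, fixed, trigger, duration, author, comment = parts
    return HostDowntime(host, int(start), int(end), fixed == "1",
                        int(trigger), int(duration), author, comment)


def in_fixed_window(downtime: HostDowntime, timestamp: int) -> bool:
    """True if the timestamp lies inside a fixed downtime window."""
    return downtime.fixed and downtime.start <= timestamp <= downtime.end


# Command and timestamps copied from the log excerpt in this post.
CMD = ("SCHEDULE_HOST_DOWNTIME;NSEXT01;1153980509;1153981829;"
       "1;0;7200;technik;Neustart MAr")
downtime = parse_schedule_host_downtime(CMD)

NOTES_CRITICAL_NOTIFIED_AT = 1153954660  # first CRITICAL notification
NOTES_OK_AT = 1153980828                 # NOTES service reported OK again

print(in_fixed_window(downtime, NOTES_CRITICAL_NOTIFIED_AT))
print(in_fixed_window(downtime, NOTES_OK_AT))
```

So the problem notification went out long before the downtime started,
while the service came back while the host was still in downtime.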

I'm unsure whether this problem has already been fixed; I didn't find
any real evidence on Google or in the changelogs. Fixes to the recovery
logic and to the notification system itself are documented there, but
not in much detail.

Question: is this a bug or a feature? If it is a bug, has it been fixed
in a newer release that I can update to?

It poses a problem for us because admins who are offsite at the time
don't get a message saying the problem is already resolved. So we get
quite a few unnecessary phone calls asking us to check on a problem that
has already been solved.

Here's an excerpt showing how it looked in the Nagios log:

[1153954542] SERVICE ALERT: NSEXT01;NOTES;CRITICAL;SOFT;1;Connection refused
[1153954600] SERVICE ALERT: NSEXT01;NOTES;CRITICAL;SOFT;2;Connection refused
[1153954660] SERVICE ALERT: NSEXT01;NOTES;CRITICAL;HARD;3;Connection refused
[1153954660] SERVICE NOTIFICATION: RGingter;NSEXT01;NOTES;CRITICAL;notify-by-email;Connection refused
[1153954660] SERVICE NOTIFICATION: MArslan;NSEXT01;NOTES;CRITICAL;notify-by-email;Connection refused
[1153954660] SERVICE NOTIFICATION: IT_Service;NSEXT01;NOTES;CRITICAL;notify-by-email;Connection refused
[1153955260] SERVICE NOTIFICATION: RGingter_SMS;NSEXT01;NOTES;CRITICAL;notify-by-sms;Connection refused
[1153955260] SERVICE NOTIFICATION: MArslan_SMS;NSEXT01;NOTES;CRITICAL;notify-by-sms;Connection refused
...rest of alerts snipped out...
[1153980519] EXTERNAL COMMAND: SCHEDULE_HOST_DOWNTIME;NSEXT01;1153980509;1153981829;1;0;7200;technik;Neustart MAr
[1153980519] HOST DOWNTIME ALERT: NSEXT01;STARTED; Host has entered a period of scheduled downtime
[1153980595] HOST ALERT: NSEXT01;DOWN;SOFT;1;CRITICAL - 10.150.1.2: rta nan, lost 100%
[1153980605] HOST ALERT: NSEXT01;DOWN;SOFT;2;CRITICAL - 10.150.1.2: rta nan, lost 100%
[1153980615] HOST ALERT: NSEXT01;DOWN;HARD;3;CRITICAL - 10.150.1.2: rta nan, lost 100%
[1153980615] SERVICE ALERT: NSEXT01;PING;CRITICAL;HARD;1;CRITICAL - 10.150.1.2: rta nan, lost 100%
[1153980687] SERVICE ALERT: NSEXT01;CPU;CRITICAL;HARD;1;CRITICAL - Socket timeout after 10 seconds
[1153980687] SERVICE ALERT: NSEXT01;UPTIME;CRITICAL;HARD;1;CRITICAL - Socket timeout after 10 seconds
[1153980687] SERVICE ALERT: NSEXT01;DISK_C;CRITICAL;HARD;1;CRITICAL - Socket timeout after 10 seconds
[1153980707] HOST ALERT: NSEXT01;UP;HARD;1;OK - 10.150.1.2: rta 1.382ms, lost 0%
[1153980707] SERVICE ALERT: NSEXT01;PING;OK;HARD;1;OK - 10.150.1.2: rta 3.307ms, lost 0%
[1153980767] SERVICE ALERT: NSEXT01;NOTES;CRITICAL;SOFT;1;Connection refused
[1153980805] SERVICE ALERT: NSEXT01;MEMUSE;CRITICAL;SOFT;1;Connection refused
[1153980805] SERVICE ALERT: NSEXT01;DISK_D;CRITICAL;SOFT;1;Connection refused
[1153980805] SERVICE ALERT: NSEXT01;DISK_E;CRITICAL;SOFT;1;Connection refused
[1153980828] SERVICE ALERT: NSEXT01;NOTES;OK;SOFT;2;TCP OK - 0.070 second response time on port 1352
[1153980976] SERVICE ALERT: NSEXT01;CPU;OK;HARD;1;CPU Load 37% (10 min average)
[1153980976] SERVICE ALERT: NSEXT01;UPTIME;OK;HARD;1;System Uptime - 0 day(s) 0 hour(s) 5 minute(s)
[1153980976] SERVICE ALERT: NSEXT01;DISK_C;OK;HARD;1;C:\ - total: 3.00 Gb - used: 2.05 Gb (68%) - free 0.95 Gb (32%)
[1153981105] SERVICE ALERT: NSEXT01;MEMUSE;OK;SOFT;2;Memory usage: total:1951.26 Mb - used: 434.44 Mb (22%) - free: 1516.82 Mb (78%)
[1153981105] SERVICE ALERT: NSEXT01;DISK_D;OK;SOFT;2;D:\ - total: 5.43 Gb - used: 2.46 Gb (45%) - free 2.97 Gb (55%)
[1153981105] SERVICE ALERT: NSEXT01;DISK_E;OK;SOFT;2;E:\ - total: 67.83 Gb - used: 14.92 Gb (22%) - free 52.91 Gb (78%)
[1153981832] HOST DOWNTIME ALERT: NSEXT01;STOPPED; Host has exited from a period of scheduled downtime
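
For anyone who wants to check their own logs for the same pattern, here
is a rough Python sketch. It looks for services that sent a problem
notification, later logged an OK alert, and never logged an OK
notification. The helper names (unwrap, unnotified_recoveries) are made
up for illustration. It assumes a recovery notification would show OK
in the state field of a SERVICE NOTIFICATION line, and it tolerates
entries that a mail client has wrapped onto a second line. The sample
lines are taken from the excerpt above.

```python
import re

# "[timestamp] ENTRY TYPE: semicolon-separated fields"
LINE_RE = re.compile(r"^\[(\d+)\] ([A-Z ]+): (.*)$")


def unwrap(raw):
    """Join wrapped continuation lines onto the entry starting with '['."""
    entries = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("[") or not entries:
            entries.append(line)
        else:
            entries[-1] += " " + line
    return entries


def unnotified_recoveries(entries):
    """Map (host, service) to the time of its first unannounced OK alert."""
    pending = set()   # services with a problem notification outstanding
    silent = {}       # services that went OK without an OK notification
    for entry in entries:
        match = LINE_RE.match(entry)
        if not match:
            continue
        timestamp = int(match.group(1))
        kind = match.group(2)
        fields = match.group(3).split(";")
        if kind == "SERVICE NOTIFICATION" and len(fields) >= 4:
            _contact, host, service, state = fields[:4]
            key = (host, service)
            if state == "OK":      # assumed marker of a recovery notification
                pending.discard(key)
                silent.pop(key, None)
            else:
                pending.add(key)
        elif kind == "SERVICE ALERT" and len(fields) >= 3:
            host, service, state = fields[:3]
            key = (host, service)
            if state == "OK" and key in pending:
                silent.setdefault(key, timestamp)
    return silent


SAMPLE = """\
[1153954660] SERVICE ALERT: NSEXT01;NOTES;CRITICAL;HARD;3;Connection
refused
[1153954660] SERVICE NOTIFICATION:
RGingter;NSEXT01;NOTES;CRITICAL;notify-by-email;Connection refused
[1153980828] SERVICE ALERT: NSEXT01;NOTES;OK;SOFT;2;TCP OK - 0.070 second
response time on port 1352
"""

result = unnotified_recoveries(unwrap(SAMPLE))
print(result)
```

On the sample this reports NSEXT01/NOTES, i.e. the service that went
back to OK without anybody being told.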

Any insight into this would be appreciated.

sincerely
        Sascha

--
Sascha Runschke
Netzwerk Management
IT-Services

ABIT AG
Robert-Bosch-Str. 1
40668 Meerbusch

Tel.:+49 (0) 2150.9153.226
Mobil:+49 (0) 173.5419665
mailto:SRunschke at abit.de

http://www.abit.net
http://www.abit-epos.net
---------------------------------
Sicherheitshinweis zur E-Mail Kommunikation /
  Security note regarding email communication:
http://www.abit.net/sicherheitshinweis.html

_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users