Nagios occasionally does not send notifications when a service goes down
Toby Kraft
Toby_Kraft at KSAinc.com
Mon Feb 21 18:53:00 CET 2005
Hi all,
I've been using Nagios 1.2 (and Netsaint before) with some clients for a
while. One installation (on Fedora Core 2) has an issue where a service
will go down, but Nagios does not send any notification.
The service check is a simple TCP port check, the host_alive_check is
the default (ping), and the host can be pinged. This host has one and only one
service. It's a pretty vanilla install and everything works fine most of
the time.
This past weekend, a host went down. No notifications were sent. Monday
morning the staff came in, saw the host was down and restarted it. After
they restarted the target host, Nagios then sent out a bunch of Host Down
alerts followed by a Host Up alert. Notifications for this server or host
were NOT disabled (nagios.log archives show they were enabled on 2/9/05).
Okay, now you're saying it's your mail server. But Nagios did not log
any notifications at the time of the problem!
The Host Alert History shows:
Sun Feb 20 00:00:00 CST 2005 to Mon Feb 21 00:00:00 CST 2005
[02-20-2005 18:08:43] SERVICE ALERT: ucisvr5.champlabs.com;Sandbox - DB;CRITICAL;HARD;1;Connection refused or timed out
[02-20-2005 18:08:43] HOST ALERT: ucisvr5.champlabs.com;DOWN;HARD;3;/bin/ping -n -U -c 1 ucisvr5.champlabs.com
[02-20-2005 18:08:40] HOST ALERT: ucisvr5.champlabs.com;DOWN;SOFT;2;/bin/ping -n -U -c 1 ucisvr5.champlabs.com
[02-20-2005 18:08:37] HOST ALERT: ucisvr5.champlabs.com;DOWN;SOFT;1;/bin/ping -n -U -c 1 ucisvr5.champlabs.com
The Host Notification History shows:
Sun Feb 20 00:00:00 CST 2005 to Mon Feb 21 00:00:00 CST 2005
No notifications have been recorded for this host in this archived log
file
The Service Alert History shows:
Sun Feb 20 00:00:00 CST 2005 to Mon Feb 21 00:00:00 CST 2005
[02-20-2005 18:08:43] SERVICE ALERT: ucisvr5.champlabs.com;Sandbox - DB;CRITICAL;HARD;1;Connection refused or timed out
The Service Notification History shows:
Sun Feb 20 00:00:00 CST 2005 to Mon Feb 21 00:00:00 CST 2005
No notifications have been recorded for this service in this archived log
file
This seems to occur after Nagios has been up and running for a
while. The system and Nagios have been up for 11 days, which doesn't seem
like a long time.
Mainly just fishing for any ideas on what could cause this or how to
troubleshoot the problem. It would be nice if Nagios logged some info
when it processes an event and then decides NOT to send a notification,
like "Notification for event xxxx suppressed because yyyyy" or some such.
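
In the meantime, the cross-check between the alert and notification
histories above could be scripted rather than done by eye. What follows is
only a rough Python sketch, not anything that ships with Nagios. The
function name unnotified_hard_alerts is hypothetical, and the log line
layouts listed in the comments are assumptions based on the alert lines
quoted above; they should be verified against the real nagios.log before
trusting the result.

```python
import re

# Assumed line layouts (verify against your own nagios.log):
#   [stamp] HOST ALERT: host;state;state_type;attempt;output
#   [stamp] SERVICE ALERT: host;service;state;state_type;attempt;output
#   [stamp] HOST NOTIFICATION: contact;host;state;command;output
#   [stamp] SERVICE NOTIFICATION: contact;host;service;state;command;output
LINE_RE = re.compile(r"^\[([^\]]+)\] (HOST|SERVICE) (ALERT|NOTIFICATION): (.*)$")


def unnotified_hard_alerts(lines):
    """Return HARD alerts whose host (or host/service pair) never
    appears in any notification line within the same set of lines."""
    alerts = []
    notified = set()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not an alert or notification line
        stamp, kind, event, rest = m.groups()
        fields = rest.split(";")
        if event == "NOTIFICATION":
            # The first field is assumed to be the contact name.
            if kind == "HOST" and len(fields) >= 2:
                notified.add(("HOST", fields[1], None))
            elif kind == "SERVICE" and len(fields) >= 3:
                notified.add(("SERVICE", fields[1], fields[2]))
        elif kind == "HOST" and len(fields) >= 3 and fields[2] == "HARD":
            alerts.append((stamp, "HOST", fields[0], None, fields[1]))
        elif kind == "SERVICE" and len(fields) >= 4 and fields[3] == "HARD":
            alerts.append((stamp, "SERVICE", fields[0], fields[1], fields[2]))
    # Keep only alerts with no matching notification anywhere in the input.
    return [a for a in alerts if (a[1], a[2], a[3]) not in notified]
```

Feeding it the lines of one archived log file (for example via
open(path) on whichever file covers the day in question) would list the
HARD alerts that have no notification at all in that file. The match is
deliberately coarse: it ignores timing, so a notification logged much
later in the same file would hide an earlier missed one.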
Thanks for listening. I'll check into any debug and/or logging options.
Toby