false host down alerts
Martin, Jeremy
jmartin at gsi-kc.com
Wed Jun 16 13:34:48 CEST 2004
Hi,
We have several Nagios servers running a total of about 1300 service
checks and 300 host checks, using Nagios 1.2 and Nagios plugins 1.3.1.
Unfortunately, something annoying, not to mention strange, keeps
happening:
Nagios keeps sending HOST DOWN alerts when our hosts are not down. For
example we do a ping check and HTTP-QA check for a website. Nagios will
send a HOST DOWN alert, but at the same time, the ping check and HTTP
check will both be just fine. Nagios will think the host is down for
quite some time, but it keeps doing the ping and HTTP-QA checks anyway
despite thinking the host is down. The only way I can make it think the
host is back up is to restart Nagios completely; then it forgets that it
thought the host was down (even with retain_state_information=1).
At first this happened to a couple of load-balanced websites and mail
servers we had. Now it is happening to several other sites and mail
servers that are not load balanced. Every time it says a host is
down like this, I can SSH into the Nagios server, and ping the exact
hostname Nagios is using (either the FQDN or the IP depending on what
Nagios is using in hosts.cfg for the given site), and the ping has no
problems at all.
Just to give an example - we often get HOST DOWN warnings for
"mail.ikea-usa.net" even though our SMTP and ping checks continue to be
OK long after the "HOST DOWN" alert. We also have this problem with
https://www.verepay.cc - but I think that's because we have ping turned
off in our firewall for that site at the moment. Our load-balanced
anti-spam/virus mail servers located at scrubber.gsi-kc.com also suffer
from this problem, but I've never had any trouble pinging them. I'm just
throwing out those examples in case anyone notices anything particularly
wrong with them, since these are the sites where Nagios most often shows
this odd "HOST DOWN" behavior.
Here's what I'll see in the nagios.log file:
[1087381552] HOST ALERT: scrubber.gsi-kc.com;DOWN;SOFT;1;Socket timeout after 10 seconds
[1087381562] HOST ALERT: scrubber.gsi-kc.com;DOWN;SOFT;2;Socket timeout after 10 seconds
[1087381572] HOST ALERT: scrubber.gsi-kc.com;DOWN;SOFT;3;Socket timeout after 10 seconds
[1087381582] HOST ALERT: scrubber.gsi-kc.com;DOWN;SOFT;4;Socket timeout after 10 seconds
[1087381592] HOST ALERT: scrubber.gsi-kc.com;DOWN;SOFT;5;Socket timeout after 10 seconds
[1087381602] HOST ALERT: scrubber.gsi-kc.com;DOWN;SOFT;6;Socket timeout after 10 seconds
[1087381612] HOST ALERT: scrubber.gsi-kc.com;DOWN;SOFT;7;Socket timeout after 10 seconds
[1087381622] HOST ALERT: scrubber.gsi-kc.com;DOWN;SOFT;8;Socket timeout after 10 seconds
[1087381632] HOST ALERT: scrubber.gsi-kc.com;DOWN;SOFT;9;Socket timeout after 10 seconds
[1087381642] HOST ALERT: scrubber.gsi-kc.com;DOWN;HARD;10;Socket timeout after 10 seconds
How can that be when I can do this at the same time?
[root@kgsinm05 var]# ping scrubber.gsi-kc.com
PING scrubber.gsi-kc.com (205.247.222.244) 56(84) bytes of data.
64 bytes from scrubber.gsi-kc.com (205.247.222.244): icmp_seq=1 ttl=240 time=28.3 ms
64 bytes from scrubber.gsi-kc.com (205.247.222.244): icmp_seq=2 ttl=240 time=26.8 ms
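
To line up the HOST ALERT entries with the time of a manual ping, it may
help to convert the epoch timestamps in nagios.log to readable times. The
following is only a rough sketch in Python: the regular expression and the
parse_host_alerts helper are my own illustration, not part of Nagios, and
assume the field order shown in the log excerpt above
(host;state;state type;attempt;plugin output).

```python
import re
from datetime import datetime, timezone

# Log excerpt copied from the message above, with wrapped lines rejoined.
LOG = """\
[1087381552] HOST ALERT: scrubber.gsi-kc.com;DOWN;SOFT;1;Socket timeout after 10 seconds
[1087381562] HOST ALERT: scrubber.gsi-kc.com;DOWN;SOFT;2;Socket timeout after 10 seconds
[1087381572] HOST ALERT: scrubber.gsi-kc.com;DOWN;SOFT;3;Socket timeout after 10 seconds
[1087381582] HOST ALERT: scrubber.gsi-kc.com;DOWN;SOFT;4;Socket timeout after 10 seconds
[1087381592] HOST ALERT: scrubber.gsi-kc.com;DOWN;SOFT;5;Socket timeout after 10 seconds
[1087381602] HOST ALERT: scrubber.gsi-kc.com;DOWN;SOFT;6;Socket timeout after 10 seconds
[1087381612] HOST ALERT: scrubber.gsi-kc.com;DOWN;SOFT;7;Socket timeout after 10 seconds
[1087381622] HOST ALERT: scrubber.gsi-kc.com;DOWN;SOFT;8;Socket timeout after 10 seconds
[1087381632] HOST ALERT: scrubber.gsi-kc.com;DOWN;SOFT;9;Socket timeout after 10 seconds
[1087381642] HOST ALERT: scrubber.gsi-kc.com;DOWN;HARD;10;Socket timeout after 10 seconds
"""

# Assumed layout: [epoch] HOST ALERT: host;state;state_type;attempt;output
ALERT_RE = re.compile(
    r"^\[(\d+)\] HOST ALERT: ([^;]+);([^;]+);([^;]+);(\d+);(.*)$"
)


def parse_host_alerts(text):
    """Return one dict per HOST ALERT line; other lines are skipped."""
    alerts = []
    for line in text.splitlines():
        match = ALERT_RE.match(line.strip())
        if match is None:
            continue
        epoch, host, state, state_type, attempt, output = match.groups()
        alerts.append({
            "time": datetime.fromtimestamp(int(epoch), tz=timezone.utc),
            "host": host,
            "state": state,
            "state_type": state_type,
            "attempt": int(attempt),
            "output": output,
        })
    return alerts


alerts = parse_host_alerts(LOG)
for alert in alerts:
    # Print the UTC time next to the attempt number and state type.
    print(alert["time"].isoformat(), alert["host"], alert["state"],
          alert["state_type"], alert["attempt"])
```

Read this way, each attempt in the excerpt is logged 10 seconds after the
previous one, the same length as the socket timeout in the plugin output,
and the state goes HARD on the tenth attempt.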
Thanks!!
Jeremy