HOST DOWN notification not getting resent
Quanah Gibson-Mount
quanah at stanford.edu
Thu Aug 26 00:06:50 CEST 2004
--On Wednesday, August 25, 2004 10:47 AM +0200 Andreas Ericsson <ae at op5.se>
wrote:
> Ok. To the bottom of this then.
> 1. Services will have nothing to do with it (except for failing, but you
> knew that already), so cut the line-noise.
Err, I was trying to cut the line noise by pointing out that services
worked fine. Should I instead remove all mention of services, and assume
that everyone on the list will magically realize that my services are
working just fine, and it is only the host down notifications that are a
problem?
> 2. Is the host down or unreachable?
It is down. Poweroff is a very nice command.
> 3. Are you positive the host hasn't gone to flapping state? Nagios 1.x
> doesn't notify for this. Nagios 2.0 has an option to do so.
Yes, absolutely positive. I can run a ping from another window that
consistently shows the host never returning anything.
> 4. You're sure you haven't set notification_interval to 0 in the host
> object definition (or anywhere else, for that matter)?
Yes. Especially since it is quite happy to send the *first* host down
alert, just not any following alerts.
> 5. You're sure nothing is wrong with the way notifications are sent?
Yes, because all service notifications are sent correctly, for hours on
end, if a host is up and its services have problems.
> 6. Have you tried running Nagios as a foreground process while producing
> errors like this in the configuration?
I'm not quite sure what you mean here. We always check Nagios through "-v"
before we apply our configuration, and our script that applies our
configuration won't let you install a bad configuration. So I'm not sure
what "errors like this in the configuration" you are referring to.
> 7. Have you tried increasing the notification interval? I'm not sure what
> happens if Nagios 'misses' a scheduled notification, but it might just
> happily skip it and move on.
Our normal notification interval is 30 minutes for hosts. It doesn't work
at that setting either.
> 8. What's the normal load on the machine you're running Nagios at?
A load average of 3-4 on Solaris. Note again that all service checks work just
fine at this load level.
> 9. Are you using the default notification commands, or have you written
> your own ones? If so, do they adhere to the NOTIFICATIONNUMBER macro?
I'm using the default notification command that came with Nagios.
> 10. Do you have a spamfilter in place? If so, remove it.
No, I do not.
> 11. Add an extra notification command that looks like so:
> define command{
> command_name notification_stamp
> command_line date "+%Y:%m:%d %H:%M:%S" >>
> /home/quanah/Notifications.Timestamp
> }
I'll be happy to do that (to a different directory though).
> (mind the new-line) and make this the notification of choice for a
> lab-host you're trying. Watch the file grow if the host is down.
I'll watch and see 'if' the file grows when the host is down. ;)
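For reference, the timestamp command from item 11 could be written roughly as below. This is a sketch, not a tested configuration: the output path /var/tmp/nagios-host-notify.timestamp is a hypothetical location (the thread only says it will be a different directory), and command_line is kept on a single line in the object file. The command would then be named in the host_notification_commands directive of the contact used for the lab host.

```
# Sketch only: the output path below is an assumed, hypothetical location.
define command{
	command_name	notification_stamp
	command_line	date "+%Y:%m:%d %H:%M:%S" >> /var/tmp/nagios-host-notify.timestamp
	}
```

If the file gains a line at each notification interval while the host stays down, Nagios is issuing the repeat notifications and the problem lies further along the delivery path. If it stops at one line, Nagios itself is not rescheduling them.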
> 12. If all of the above fails, try it again.
Um yeah, we've been dealing with this for about 8 months now.
> 13. If you're still out of luck then set up the simplest possible
> configuration (one host that you can bring up and down at wish), and make
> sure several notifications go out before you move to more advanced
> configuration. Make a host-template that you KNOW works with this, and
> use it for all hosts you want to resend notifications with.
I'll do that as soon as I have a secondary host to fiddle with the
configurations on. I can't just take out our production monitoring
service. ;)
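A minimal host template along the lines of item 13 might look like the following. It is only a sketch under stated assumptions: the names lab-host-template and lab-host-1, the address 192.0.2.10, and the max_check_attempts value are all hypothetical, and check-host-alive and the 24x7 time period are assumed to exist as in the sample configuration shipped with Nagios. The 30-minute interval is the one mentioned above for hosts.

```
# Sketch only: template name, host name, and address are hypothetical.
define host{
	name			lab-host-template
	check_command		check-host-alive
	max_check_attempts	3
	notifications_enabled	1
	notification_interval	30	; minutes between repeat notifications; 0 would disable repeats
	notification_period	24x7
	notification_options	d,u,r	; down, unreachable, recovery
	register		0	; template only, not a real host
	}

define host{
	use			lab-host-template
	host_name		lab-host-1
	alias			Lab test host
	address			192.0.2.10
	}
```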
> 14. Use the default nagios.cfg-file, just to be on the safe side.
I'll combine that with 13.
> 15. If problems persist, debug your mail-spooler.
My mail spooler is just fine. It sends out hundreds of messages from
Nagios every day.
> 16. If problems still persist, debug any relayhosts the mail passes
> through.
That would assume that all alerts were problematic. They aren't. There is
only one type of alert that is problematic. And I always get exactly *one*
of those alerts, no more and no less.
> 17. If the problems still persist, buy 3 hours of support from someone,
> and send them your configuration in a gzipped tarball.
No thanks.
Before even implementing Nagios here at Stanford, I read through the
configuration files & played with the setup for a few weeks. Then we
implemented it, and pushed it out. The configuration pieces are rather
simple, and the documentation was quite thorough. I'm not some 2-bit hack
who has problems understanding command prompts, etc. I've been
administering UNIX-based systems & applications for over 10 years. I've
yet to see anyone be able to find anything in our configuration that
explains Nagios' behavior. Personally, I think it is a bug in Nagios
running under Solaris, and I've yet to see anything that contradicts that
assumption at all. We will be moving our Nagios service onto Debian soon,
and I'm most curious to see if the problem disappears at that time. If it
does, then at least I'll be able to point at the root cause.
--Quanah
--
Quanah Gibson-Mount
Principal Software Developer
ITSS/Shared Services
Stanford University
GnuPG Public Key: http://www.stanford.edu/~quanah/pgp.html
_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users