service alert aggregation?
Roy Sigurd Karlsbakk
roy at karlsbakk.net
Tue Sep 30 10:37:21 CEST 2003
Change misccommands.conf to pipe to a script, and you're in :)
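
A minimal sketch of what such a script might do, assuming the notification command writes one tab-separated line per alert (epoch timestamp, host, service, state) to the script. That line format, the 120-second window, and the function names are all hypothetical, not anything Nagios defines. It covers only the batching step described in the quoted message below, not rescheduling checks or sending mail:

```python
#!/usr/bin/env python3
"""Hypothetical alert aggregator: batch service notifications per host."""


def parse_line(line):
    """Parse one assumed 'timestamp<TAB>host<TAB>service<TAB>state' line."""
    ts, host, service, state = line.rstrip("\n").split("\t")
    return int(ts), host, service, state


def aggregate(alerts, window=120):
    """Group (timestamp, host, service, state) tuples into per-host batches.

    An alert joins its host's current batch if it arrives within `window`
    seconds of the first alert in that batch; otherwise it starts a new one.
    """
    batches = []      # all batches, in order of their first alert
    current = {}      # host -> index of that host's current batch
    for ts, host, service, state in sorted(alerts):
        idx = current.get(host)
        if idx is None or ts - batches[idx]["start"] > window:
            batches.append({"host": host, "start": ts, "items": []})
            idx = current[host] = len(batches) - 1
        batches[idx]["items"].append((service, state))
    return batches


def format_batch(batch):
    """Render one batch as the body of a single message."""
    lines = ["%s: %d service alert(s)" % (batch["host"], len(batch["items"]))]
    lines += ["  %s is %s" % item for item in batch["items"]]
    return "\n".join(lines)
```

With this grouping, two CRITICAL alerts for the same host one minute apart end up in one message, while a recovery several minutes later starts a new one. A real script would also need to hold alerts somewhere (a spool file, say) until the window closes.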
On Tue, 2003-09-30 at 04:43, Joshua Barratt wrote:
> I just spent a very interesting afternoon reading through the last few
> months of list archives, but was unable to come up with an answer to my
> question. I apologize if this has been dealt with to death.
>
> Basically, quite often if there is a problem with a host, many of its
> services will be down, but it will still be pingable. (The TCP/IP stack
> is a hardy beast.) Possible causes: disk filling up, ram+swap filling
> up, very heavy load, etc (even some kernel panics!) -- all of these can
> cause more than one service to become unreachable, and in many cases,
> *all* services unreachable -- but still the host check will not fail.
> This causes the admins to get a flurry of service down alerts, and, when
> the problem is corrected, a flurry of service up alerts.
>
> I tried doing the service dependency route, but the basic problem is
> still that because of the Nagios scheduler, it may decide that the SMTP
> server is critical, say, 2 minutes before deciding that the service that
> SMTP depends on is critical, and thus you get paged for both.
>
> Is it possible to configure things so you don't have that problem? I
> understand escalations, but that still doesn't really solve things,
> unless I'm missing something. I'll still get individual pages for every
> individual service that is experiencing a problem.
>
> My idea (if simple configuration is not the solution) is to do something
> like this:
> When a service alert is generated, instead of being emailed directly, it
> is emailed (or piped) to a script. That script then communicates with
> the Nagios daemon and schedules immediate checks for all the services on
> the affected server. It waits some suitable time period, and then
> packages all the alerts received within that window into a single
> message which it then sends to the admins. (The same process would
> happen with the service up alerts.)
>
> This might not be foolproof, but I think it would cut down on a lot of
> spurious paging.
>
> Has anyone else solved this problem?
>
> Thanks for any input,
>
> Joshua Barratt
>
> -------------------------------------------------------
> This sf.net email is sponsored by:ThinkGeek
> Welcome to geek heaven.
> http://thinkgeek.com/sf
> _______________________________________________
> Nagios-users mailing list
> Nagios-users at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-users
> ::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
> ::: Messages without supporting info will risk being sent to /dev/null