Host checks under Nagios 1.x
Aaron Devey
adevey at omniture.com
Tue Apr 22 02:30:48 CEST 2008
I had a similar problem to this. I only wanted to know if a
not-so-important device had been down for an hour or more.
Here's what I ended up doing:
I disabled the host check by having it call an "always-ok" check command
that always returns 0. I then added a 'PING' service to the host with
a max_check_attempts of 7 and a retry_check_interval of 10 minutes, so
the service only goes into a hard state after six retries, roughly an
hour after the first failed check.
The pitfall is that I no longer receive 'HOST DOWN' alerts for that
host. I instead receive alerts for a failing 'PING' service.
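
A minimal sketch of that setup in object-config form, for anyone who wants
to try it. The names always_ok, somedevice, generic-host and generic-service
are made up for illustration, and it assumes the stock check_dummy plugin and
a check_ping command like the one in the sample configs are available:

```
# Hypothetical "always-ok" command: check_dummy with state 0 always returns OK.
define command{
        command_name    always_ok
        command_line    $USER1$/check_dummy 0
        }

# Host check effectively disabled, so no HOST DOWN alerts for this host.
define host{
        use                     generic-host
        host_name               somedevice
        alias                   Not-so-important device
        address                 192.0.2.50
        check_command           always_ok
        }

# PING service: first failure plus 6 retries, 10 minutes apart,
# before the service goes hard and notifications are sent.
define service{
        use                     generic-service
        host_name               somedevice
        service_description     PING
        check_command           check_ping!100.0,20%!500.0,60%
        max_check_attempts      7
        normal_check_interval   5
        retry_check_interval    10
        }
```

The intervals are in minutes only if interval_length is left at its default
of 60 in the main config file.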
-Aaron
Andrew Cruse wrote:
> I've got an interesting problem with a particular setup. I'm monitoring a
> number of servers that the main Nagios installation doesn't have direct
> network access to, so I pass all of the host and service checks through an
> NRPE installation that can communicate with both Nagios and the servers
> being monitored. A little tweaking with check timeouts and whatnot and this
> setup works pretty nicely. I've run into a problem where for some reason,
> the NRPE server periodically stops responding to NRPE requests. Haven't
> gotten to the bottom of that (Connection refused) yet. Service checks are
> able to handle the problem fine as the duration of the NRPE outage is much
> shorter than the time it takes for the services to go into a hard critical
> state. The problem is, once the first service check goes through and goes
> into a soft critical state, that triggers the host checks which also fail
> (host checks go through NRPE as well) and immediately generate a
> notification. I'd like to find a way to make the host checks a little more
> forgiving as well.
>
> A few things I've thought of or tried:
>
> 1. I tried bumping up the host check retries to 30, but since the checks
> immediately fail with "connection refused" it runs through all 30 tries
> within just a few seconds. I also worry about this leading to unneeded load
> on the Nagios server since this is generally going to cause check_nrpe to be
> run 30 times, for each of the ~20 servers in this setup.
>
> 2. Extending the timeout on the check_nrpe commands doesn't help because
> "connection refused" is returned immediately.
>
> 3. Switching to a passive setup is probably the way to go, but for now I am
> trying to avoid all the reconfiguration needed to move in that direction.
>
>
> Ideally what I'd like to be able to do is have the host checks retry on a
> particular interval (e.g. once per second) rather than instantly after the
> previous one executed. Is there a way to do this?
>
> Incidentally, while typing up this email I was actually able to find the
> root problem with the NRPE setup. NRPE was being called via Xinetd which
> wasn't configured to allow enough simultaneous connections for a single
> service. Thus when it started getting hammered with NRPE requests as a
> result of the host check configuration it would stop allowing NRPE
> connections for 30 seconds. A quick change to the Xinetd config file seems
> to have solved the problem.
>
> I'm still interested to know how anyone handles the situation where a host
> may be unresponsive to host checks for a period of time yet you only wish to
> fire off a notification after a specific period of time. Would a wrapper
> around the host check be the only way to handle it?
>
> Andrew
>
>
> -------------------------------------------------------------------------
> This SF.net email is sponsored by the 2008 JavaOne(SM) Conference
> Don't miss this year's exciting event. There's still time to save $100.
> Use priority code J8TL2D2.
> http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
> _______________________________________________
> Nagios-users mailing list
> Nagios-users at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-users
> ::: Please include Nagios version, plugin version (-v) and OS when
> reporting any issue.
> ::: Messages without supporting info will risk being sent to /dev/null
>
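
For reference on the Xinetd fix Andrew mentions in the quoted message, these
are the xinetd attributes that limit connections to a service. This is only a
sketch: the quoted message does not say which setting was changed, and the
paths, address and values below are assumptions, not his actual config:

```
service nrpe
{
        flags           = REUSE
        socket_type     = stream
        port            = 5666
        wait            = no
        user            = nagios
        group           = nagios
        server          = /usr/local/nagios/bin/nrpe
        server_args     = -c /usr/local/nagios/etc/nrpe.cfg --inetd
        # Placeholder address for the Nagios server allowed to connect.
        only_from       = 192.0.2.10
        # Maximum number of simultaneously running nrpe servers.
        instances       = 100
        # Maximum simultaneous connections from one source address.
        per_source      = UNLIMITED
        # Incoming connections per second, then seconds the service stays
        # disabled once that rate is exceeded.
        cps             = 100 5
        disable         = no
}
```

xinetd has to be reloaded or restarted for a change like this to take effect.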