[naemon-users] Host checks occurring too fast
Robert Brockway
robert at timetraveller.org
Tue Jul 13 08:41:40 CEST 2021
Hi all. I looked at this for a while. Naturally I solved it soon after
mailing the list. Apparently I didn't understand the host check logic as
well as I thought I did. It's right there in the doco.
Hosts are checked by the Naemon daemon:
* At regular intervals, as defined by the check_interval and retry_interval options in your host definitions.
* On-demand when a service associated with the host changes state.
* On-demand as needed as part of the host reachability logic.
* On-demand as needed for predictive host dependency checks.
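These hosts have a lot of service checks, and each service state change
can trigger an on-demand host check, so a host can use up its five check
attempts much faster than retry_interval alone would suggest. Moving to a
hard down state after five checks makes sense now.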
https://www.naemon.org/documentation/usersguide/hostchecks.html
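For anyone who wants to see the numbers, here is a quick Python sketch that compares the alert timestamps from my original message (quoted below) with what I had expected from the scheduled retries alone. The function names are made up for illustration, and the "scheduled" figure assumes one retry_interval between each pair of attempts, which was my mistaken mental model rather than anything Naemon promises.

```python
from datetime import datetime

# Timestamps copied from the HOST ALERT lines in the quoted message,
# ordered by attempt number (1 to 5).
ALERTS = [
    ("2021-07-11 18:50:12", "SOFT", 1),
    ("2021-07-11 18:50:43", "SOFT", 2),
    ("2021-07-11 18:50:59", "SOFT", 3),
    ("2021-07-11 18:51:14", "SOFT", 4),
    ("2021-07-11 18:52:44", "HARD", 5),
]

# Values from the host template; interval_length is the default of 60 seconds.
RETRY_INTERVAL = 3
MAX_CHECK_ATTEMPTS = 5
INTERVAL_LENGTH = 60

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def seconds_between(first, last):
    """Whole seconds elapsed between two timestamp strings."""
    delta = datetime.strptime(last, TIME_FORMAT) - datetime.strptime(first, TIME_FORMAT)
    return int(delta.total_seconds())


def observed_seconds(alerts):
    """Time from the first SOFT alert to the final (HARD) alert."""
    return seconds_between(alerts[0][0], alerts[-1][0])


def scheduled_seconds(retry_interval, max_check_attempts, interval_length):
    """Time to reach HARD if only scheduled retries ran (an assumption)."""
    return (max_check_attempts - 1) * retry_interval * interval_length


if __name__ == "__main__":
    print("observed :", observed_seconds(ALERTS), "seconds")
    print("scheduled:", scheduled_seconds(RETRY_INTERVAL, MAX_CHECK_ATTEMPTS, INTERVAL_LENGTH), "seconds")
```

The host went from the first SOFT alert to HARD in 152 seconds, against the 720 seconds (12 minutes) I was expecting. That gap is consistent with the on-demand checks in the list above counting toward max_check_attempts.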
I've used Nagios/Icinga a lot over the years and now I'm using Naemon.
In fact when I first used Nagios it was called Netsaint[1]. I don't
remember running into this problem before. Perhaps the host check logic
has changed over the years. Either that or I ran into this a decade or
two ago and just forgot.
So the solution is first_notification_delay.
Cheers,
Rob
[1] Before Netsaint I used Big Brother. Let us never speak of Big Brother
again.
On Tue, 13 Jul 2021, Robert Brockway wrote:
> Hi all. I have the following settings specified in the prod-linux-server
> host template:
>
> check_interval 3
> max_check_attempts 5
> retry_interval 3
>
> In addition interval_length is at the default value of 60 so both intervals
> above are measured in minutes.
>
> Despite this, host checks are occurring too fast. An example is below but
> this has happened many times. The problem is that Naemon is waking up staff
> during transient network failures. Our infrastructure has redundancy and
> hosts are configured to reboot as a result of various failure modes. The
> application is robust and copes with all this fine.
>
> As a result I don't want anyone woken up until the host has been down for
> about 15 minutes.
>
> Example checks on a down host as presented by Thruk:
>
> [2021-07-11 18:52:44] HOST ALERT: bob;DOWN;HARD;5;CRITICAL - Socket timeout after 10 seconds
> [2021-07-11 18:51:14] HOST ALERT: bob;DOWN;SOFT;4;CRITICAL - Socket timeout after 10 seconds
> [2021-07-11 18:50:59] HOST ALERT: bob;DOWN;SOFT;3;CRITICAL - Socket timeout after 10 seconds
> [2021-07-11 18:50:43] HOST ALERT: bob;DOWN;SOFT;2;CRITICAL - Socket timeout after 10 seconds
> [2021-07-11 18:50:12] HOST ALERT: bob;DOWN;SOFT;1;CRITICAL - Socket timeout after 10 seconds
>
> NB: The hostname isn't really called 'bob'.
>
> I thought perhaps that host freshness was the problem so I turned that off
> but it hasn't made a difference. We don't currently have any passive checks
> so I think it is safe to turn off host freshness.
>
> I'm going to set first_notification_delay to 10 minutes as a work-around.
> Even a 10 minute delay will be a lot better than what is happening now.
>
> Any help greatly appreciated.
>
> Cheers,
>
> Rob
>