Service checks and retry check interval
Tom Valdes
Tom.Valdes at flamenconetworks.com
Thu Jun 17 15:35:07 CEST 2004
Ok, I understand that when the service check fails it moves on to the host check.
As you see, my max_check_attempts is set to 5 for the host check. Shouldn't this delay sending out a notification until it checks it 5 times? And once it's down, is there a way to speed up a check to determine recovery?
The problem I'm having is that if Nagios misses a ping due to network congestion or whatever, it takes 5 minutes to realize that nothing is really wrong, when all that happened was a missed ping that might have been caught if it had simply done another check before sending out a notification.
-----Original Message-----
From: Marc Powell [mailto:marc at ena.com]
Sent: Wednesday, June 16, 2004 7:38 PM
To: Tom Valdes; nagios-users at lists.sourceforge.net
Subject: RE: [Nagios-users] Service checks and retry check interval
That is correct and by design. Nagios must determine the status of a questionable host before it does anything else. If it didn't, the dependency and network reachability logic could be flawed, and Nagios would send out spurious alerts for services on a host that is down when they really should be suppressed (see http://nagios.sourceforge.net/docs/1_0/networkreachability.html and the Host Checks section of http://nagios.sourceforge.net/docs/1_0/checkscheduling.html).
--
Marc
p.s. Please post to the list in plain text format. It makes it much, much easier to reply with proper quoting and you're going to reach a much larger audience who can help you.
________________________________________
From: Tom Valdes [mailto:Tom.Valdes at flamenconetworks.com]
Sent: Wednesday, June 16, 2004 6:25 PM
To: Marc Powell; nagios-users at lists.sourceforge.net
Subject: RE: [Nagios-users] Service checks and retry check interval
I had changed the 10 retries to 5 after I grabbed the copy of the status. I did reload Nagios so that's just an old capture.
I think I understand what you mean about performing the host check and bypassing a service check, but it seems a retry_check_interval value is not allowed in hosts.cfg.
---------------services.cfg------------------
--------------------------------------------------
define service{
        use                     generic-service ; Name of service template to use
        host_name               Test-Server
        service_description     PING
        is_volatile             0
        check_period            workhours
        max_check_attempts      5
        normal_check_interval   5
        retry_check_interval    1
        contact_groups          test-contact
        notification_interval   960
        notification_period     workhours
        notification_options    c,r
        check_command           check_fping!50%!100%
        }
--------------------------------------------------
------------------hosts.cfg-------------------
define host{
        use                     generic-host ; Name of host template to use
        parents                 switch1
        host_name               Test-Server
        alias                   TestServer
        address                 10.0.0.21
        check_command           check-host-alive
        max_check_attempts      5
        notification_interval   30
        notification_period     24x7
        notification_options    d,u,r
        }
----------------------------------------------------
________________________________________
From: nagios-users-admin at lists.sourceforge.net on behalf of Marc Powell
Sent: Wed 6/16/2004 5:42 PM
To: Tom Valdes; nagios-users at lists.sourceforge.net
Subject: RE: [Nagios-users] Service checks and retry check interval
________________________________
>From: Tom Valdes [mailto:Tom.Valdes at flamenconetworks.com]
>Sent: Wednesday, June 16, 2004 2:55 PM
>To: nagios-users at lists.sourceforge.net
>Subject: [Nagios-users] Service checks and retry check interval
> I currently have my normal_check_interval set to 5 minutes
> If a service check is missed, I'd like it to retry 5
> times before sending a notification and I'd like the
> retry interval to be 1 minute. (can it be less?
> Like 10 seconds?)
>I've tried adding the following to services.cfg
> max_check_attempts 5
> normal_check_interval 5
> retry_check_interval 1
I presume this is for the service definition. Can we see the complete
definition?
> Shouldn't this retry a failed check every minute
> for 5 tries before sending a notification?
For the service above under normal circumstances, yes. I use 5,5,3 to
delay notifications by ~15 minutes.
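As a rough check on that arithmetic: assuming interval_length in nagios.cfg is at its default of 60 seconds (an assumption, since nagios.cfg is not shown in this thread), the wait between the first failed check and the HARD state is roughly (max_check_attempts - 1) retries multiplied by retry_check_interval. A minimal Python sketch, using a hypothetical helper name:

```python
def minutes_until_hard_state(max_check_attempts, retry_check_interval, interval_length=60):
    """Minutes between the first failed check and the check that makes the state HARD.

    interval_length is the number of seconds per interval unit (assumed 60 here).
    """
    retries = max_check_attempts - 1  # attempt 1 is the failed check itself
    return retries * retry_check_interval * interval_length / 60


print(minutes_until_hard_state(5, 1))  # the service definition in this thread
print(minutes_until_hard_state(5, 3))  # the 5,5,3 setup mentioned above
```

With 5,5,3 this gives 12 minutes after the first failure is seen. The failure itself can go unnoticed for up to one normal_check_interval (5 minutes) before that, which is presumably where the ~15 minute figure comes from.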
> Using a test server, I pull the plug and Nagios
> catches the 100% ping loss but if I plug it back
> in as soon as it notices, Nagios emails me right
> away and doesn't return an Up state for another
> 5 minutes?
For the service or the host? See below.
> The following is what I receive on the status
> screen.. It shows a State Type: HARD.. Shouldn't
> it be in a SOFT state until it completes the
> max_check_attempts?
> Current Status: CRITICAL
> Status Information:FPING CRITICAL - 192.168.100.21 (loss=100.000000% )
> Current Attempt:1/10
Why is max attempts showing 10 here if it's defined as 5 above? Did you
restart Nagios after making the change? Do you have multiple Nagios
processes running?
There is a special situation that results when you just 'pull the plug'
on a machine you're monitoring. The service check will of course fail on
the first attempt. Nagios will then attempt to check the status of the
host using the host check_command. It will do this exclusively until
max_check_attempts defined for the host is reached and will not attempt
to recheck the status of the service if the host is determined to be
down or unreachable. At that point nagios will attempt to send a HOST
down notification which may be what you are seeing. Because of this
special situation, your retry_check_interval for the service has no
meaning. AFAIK, nagios just falls back to normal_check_interval until
one or more services on the host recovers (and the host by inference).
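The special situation described above can be sketched as a toy model in Python. This is only an illustration of the flow as described in this message, not Nagios source code; the function name, the check_host callable, and the returned strings are all hypothetical.

```python
def after_failed_service_check(check_host, host_max_check_attempts):
    """Toy model: what happens once a service check has failed.

    check_host is a callable returning True when the host check passes.
    Returns (decision, number_of_host_checks_run).
    """
    for attempt in range(1, host_max_check_attempts + 1):
        if check_host():
            # Host answers, so the service's own retry settings apply.
            return ("host up: service retry logic applies", attempt)
    # Host never answered within its max_check_attempts.
    return ("host down: send host notification, skip service retries",
            host_max_check_attempts)


# Pulled plug: every host check fails.
print(after_failed_service_check(lambda: False, 5))

# One dropped ping: the second host check succeeds.
replies = iter([False, True])
print(after_failed_service_check(lambda: next(replies), 5))
```

In this model the service's retry_check_interval only matters on the "host up" branch, which matches the point that it has no meaning once the host itself is judged down.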
--
Marc
-------------------------------------------------------
This SF.Net email is sponsored by The 2004 JavaOne(SM) Conference
Learn from the experts at JavaOne(SM), Sun's Worldwide Java Developer
Conference, June 28 - July 1 at the Moscone Center in San Francisco, CA
REGISTER AND SAVE! http://java.sun.com/javaone/sf Priority Code NWMGYKND
_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
::: Messages without supporting info will risk being sent to /dev/null