Nagios retries checks too soon.
Jochen Bern
Jochen.Bern at LINworks.de
Fri Jun 10 21:15:28 CEST 2011
On 06/10/2011 07:48 PM, Paul M Dubuc wrote:
> Jochen Bern wrote:
>> IIRC, the actual
>> code adds check_interval/retry_interval to the variable that holds the
>> (previous) scheduled check time - i.e., the time when the previous check
>> supposedly was *started* (assuming negligible check latency).
>
> I was under the impression that the retry interval
> was only counted from the time the previous check completes and the
> status (which is needed to determine if a retry is necessary) is known.
> Why is the retry time determined before it's known that one is needed?
Hmmmmmm. It seems that I misremembered ... partially.
> # egrep -n 'current_time.*(check|retry)_interval' nagios-3.2.3/base/checks.c
> 276: preferred_time=current_time+((svc->check_interval<=0)?300:(svc->check_interval*interval_length));
> 1825: preferred_time=current_time+check_interval;
> 1843: preferred_time=current_time+check_interval;
> 2814: preferred_time=current_time+((hst->check_interval<=0)?300:(hst->check_interval*interval_length));
> 3446: next_check=(unsigned long)(current_time+(hst->check_interval*interval_length));
> 3482: next_check=(unsigned long)(current_time+(hst->check_interval*interval_length));
> 3555: next_check=(unsigned long)(current_time+(hst->retry_interval*interval_length));
> 3559: next_check=(unsigned long)(current_time+(hst->check_interval*interval_length));
> 3585: next_check=(unsigned long)(current_time+(hst->check_interval*interval_length));
> 3603: next_check=(unsigned long)(current_time+(hst->check_interval*interval_length));
> 3705: next_check=(unsigned long)(current_time+(hst->retry_interval*interval_length));
> 3709: next_check=(unsigned long)(current_time+(hst->check_interval*interval_length));
> 3879: preferred_time=current_time+check_interval;
> 3893: preferred_time=current_time+check_interval;
> # egrep -n 'last_check.*(check|retry)_interval' nagios-3.2.3/base/checks.c
> 1304: next_service_check=(time_t)(temp_service->last_check+(temp_service->check_interval*interval_length));
> 1450: next_service_check=(time_t)(temp_service->last_check+(temp_service->check_interval*interval_length));
> 1478: next_service_check=(time_t)(temp_service->last_check+(temp_service->retry_interval*interval_length));
> 1545: next_service_check=(time_t)(temp_service->last_check+(temp_service->check_interval*interval_length));
Lemme have a closer look at the latter matches ...
They all fall within handle_async_service_check_result(). (Since there
also is a handle_async_host_check_result_3x() *elsewhere*, we clearly
have different behaviour between host and service checks.)
1304 is the catchall for STATE_OK results.
1450 is the special case for SOFT non-OK services on non-UP hosts.
1478 is its counterpart for UP hosts.
1545 covers HARD non-OK services.
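As a rough sketch of how I read those four matches (the function, enum and parameter names are made up for illustration and are not the identifiers used in checks.c; only the last_check arithmetic is taken from the grep output):

```c
#include <assert.h>
#include <time.h>

/* Hypothetical names for illustration; not the identifiers used in checks.c. */
enum svc_state_kind { SVC_OK, SVC_SOFT_PROBLEM, SVC_HARD_PROBLEM };

/*
 * Sketch of which interval each of the four last_check matches uses.
 * check_interval and retry_interval are in units of interval_length seconds.
 */
static time_t sketch_next_service_check(time_t last_check,
                                        enum svc_state_kind kind,
                                        int host_is_up,
                                        double check_interval,
                                        double retry_interval,
                                        int interval_length)
{
    /* Lines 1304, 1450 and 1545 all use check_interval. */
    double interval = check_interval;

    assert(interval_length > 0);

    /* Only a SOFT non-OK service on an UP host uses retry_interval (line 1478). */
    if (kind == SVC_SOFT_PROBLEM && host_is_up)
        interval = retry_interval;

    /* In every case the base is last_check, not current_time. */
    return (time_t)(last_check + interval * interval_length);
}
```

This deliberately ignores everything else the real function does; it only shows which interval is added to last_check in each case.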
Verification (looking at the *other* matches) ...
2814 through 3893 deal with *host* checks, 276 with *synchronous*
service checks (why is there no retry_interval??), 1825 and 1843 only
check viability, not results.
All in all, I'd say that async service checks, and *only* those, behave
the way I described: the next check is counted from last_check rather
than from current_time. Not sure whether there is a *reason* for the
difference ... anyone?
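To make the practical difference concrete, here is a minimal sketch, assuming last_check holds the time the previous check was started and current_time is the moment its result gets processed. Both helper names are hypothetical:

```c
#include <assert.h>
#include <time.h>

/* Hypothetical helper: schedule the next check relative to some base time.
 * interval is in units of interval_length seconds. */
static time_t sketch_schedule(time_t base, double interval, int interval_length)
{
    assert(interval_length > 0);
    return (time_t)(base + interval * interval_length);
}

/* Seconds between processing the result and the next check; 0 if already due. */
static long sketch_wait_after_result(time_t next_check, time_t result_time)
{
    return (next_check > result_time) ? (long)(next_check - result_time) : 0L;
}
```

Take a check started at t=1000 whose result is processed at t=1050, with retry_interval 1 and interval_length 60. Scheduling from last_check gives 1060, i.e. the retry comes only 10 seconds after the result is known. Scheduling from current_time gives 1110, a full 60 seconds later. That would match the "retries too soon" symptom for slow service checks.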
Kind regards,
J. Bern
--
Jochen Bern, Systemingenieur --- LINworks GmbH <http://www.LINworks.de/>
Postfach 100121, 64201 Darmstadt | Robert-Koch-Str. 9, 64331 Weiterstadt
PGP (1024D/4096g) FP = D18B 41B1 16C0 11BA 7F8C DCF7 E1D5 FAF4 444E 1C27
Tel. +49 6151 9067-231, Zentr. -0, Fax -299 - Amtsg. Darmstadt HRB 85202
Unternehmenssitz Weiterstadt, Geschäftsführer Metin Dogan, Oliver Michel