Reply: Re: Check becomes unplanned
Sascha.Runschke at gfkl.com
Wed Sep 10 10:59:24 CEST 2008
Hi Bernd,
hi Andreas,
> To alleviate your issue, you should be running an ntp daemon
> on the Nagios server which slews the clock into its right
> time rather than sets it (slew = make it go slightly faster
> or slower until it matches the correct time). Are you running
> ntpdate via a cronjob or something?
>
> I'm not sure how one would go about debugging this, as the
> time required to run a single test is prohibitive for rapid
> repeated testing.
I ran into this problem before and started debugging it, so I'll
share what I know so far. Sadly I haven't had the time yet to really
pinpoint a solution and produce a patch.
I'm not that big a fan of C ;)
How to reproduce it:
- define a check "freaky_check" with a limited check_period (let's
call it 7to11) and a check_interval of 3
- produce steady time shifts backwards (nagios running in a VM, anyone?)
What happens:
1. it's 11pm, nagios schedules freaky_check for 7am according to its
check_period
2. every X minutes the clock shifts by -1 sec (jittering time source)
3. nagios tries to compensate and adjusts _all_ checks by the timeshift
(next_check = next_check - timeshift)
4. time goes by from 11pm to 6am, and the shifts add up to - let's say -
8 minutes backwards
5. freaky_check is now scheduled for 6:52am because of the timeshifts
6. it's 6:52am and nagios tries to run the freaky_check according to the
schedule
7. sanity check says: ERROR: check outside check_period
8. nagios tries to compensate with a strange logic: next_check =
next_check + check_interval and just hopes it will fit
9. nagios reruns the sanity check: FATAL ERROR: check still outside
check_period - I have no clue what to do: rescheduling freaky_check:
next_check = next_check + 1year
10. the user is puzzled and nagios thinks it's all cool
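Steps 1 to 5 are plain arithmetic, so a toy model makes the drift visible. This is only a sketch, not Nagios code: SECS and drifted_next_check are made-up names, times are seconds since midnight, and it assumes 480 backward shifts of one second each (8 minutes in total) between 11pm and 6am.

```c
/* Toy model of steps 1-5; all names are hypothetical, not Nagios code. */

/* Seconds since midnight for a wall-clock time. */
#define SECS(h, m) ((long)(h) * 3600L + (long)(m) * 60L)

/* Apply the compensation from step 3 once per backward shift:
   next_check = next_check - timeshift */
long drifted_next_check(long next_check, int shifts, long timeshift)
{
    int i;

    for (i = 0; i < shifts; i++)
        next_check = next_check - timeshift;
    return next_check;
}
```

With a check planned for 7am, drifted_next_check(SECS(7, 0), 480, 1) returns SECS(6, 52), which is the 6:52am from step 5 and lies before the start of the check_period.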
Conclusion:
This behaviour turns up when the following criteria are met:
- the check has a reduced check_period
- time is shifting back
- the timeshift outside the check_period is greater than 2 times the
check_interval
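The three criteria can be written down as one predicate. This is a sketch under assumptions: check_can_get_lost and its parameter names are invented for illustration, the shift is given in seconds, and check_interval is converted to seconds with interval_length the same way the Nagios excerpt in this mail does it.

```c
/* Hypothetical helper, not part of Nagios: returns 1 when all three
   criteria from the list are met, 0 otherwise. */
int check_can_get_lost(int has_restricted_period,
                       long backward_shift_secs,
                       long check_interval,
                       long interval_length)
{
    /* check_interval is counted in units of interval_length seconds */
    long interval_secs = check_interval * interval_length;

    return has_restricted_period
        && backward_shift_secs > 0
        && backward_shift_secs > 2 * interval_secs;
}
```

For the example in this mail, assuming an interval_length of 60 seconds: 8 minutes of backward shift is 480 seconds, twice the check_interval of 3 is 360 seconds, so the predicate is true. With only 5 minutes of shift (300 seconds) it is false.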
You can look it up in base/checks.c, for example in the function
run_scheduled_service_check(service *svc, int check_options, double latency).
After some basic checks this will be run:
/* attempt to run the check */
result=run_async_service_check(svc,check_options,latency,TRUE,TRUE,&time_is_valid,&preferred_time);
which in turn ends up with:
/* is the service check viable at this time? */
if(check_service_check_viability(svc,check_options,time_is_valid,preferred_time)==ERROR)
return ERROR;
The answer is no: since nagios shifted the check outside its
check_period, the time is NOT valid.
Back in run_scheduled_service_check we now enter the if(result==ERROR)
branch:
/* get current time */
time(&current_time);
/* determine next time we should check the service if needed */
/* if service has no check interval, schedule it again for 5 minutes from now */
if(current_time>=preferred_time)
preferred_time=current_time+((svc->check_interval<=0)?300:(svc->check_interval*interval_length));
COMMENT: nagios set preferred_time to current_time plus the check_interval
/* make sure we rescheduled the next service check at a valid time */
get_next_valid_time(preferred_time,&next_valid_time,svc->check_period_ptr);
COMMENT: No, it didn't, as adding check_interval was not enough to
compensate for the backshift in time
/* the service could not be rescheduled properly - set the next check time for next year, but don't actually reschedule it */
if(time_is_valid==FALSE && next_valid_time==preferred_time){
COMMENT: nagios is bailing out here and just adding 1 year to
preferred_time to get the scheduler running again
svc->next_check=(time_t)(next_valid_time+(60*60*24*365));
svc->should_be_scheduled=FALSE;
log_debug_info(DEBUGL_CHECKS,1,"Unable to find any valid times to reschedule the next service check!\n");
}
/* this service could be rescheduled... */
else{
svc->next_check=next_valid_time;
svc->should_be_scheduled=TRUE;
log_debug_info(DEBUGL_CHECKS,1,"Rescheduled next service check for %s",ctime(&next_valid_time));
}
}
COMMENT: BANG - our check just got shoved to Mars, landing in 1 year, and
we don't even get a notification for it, and it does not show up as
orphaned or whatever...
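The bail-out branch can be exercised in isolation, without a running Nagios. The sketch below copies only the final if/else from the excerpt; toy_service, toy_reschedule and TOY_ONE_YEAR are hypothetical names, and the lookup result (next_valid_time) is passed in by the caller instead of being computed by get_next_valid_time.

```c
#include <time.h>

#define TOY_ONE_YEAR (60 * 60 * 24 * 365)

/* Hypothetical stand-in for the two service fields the branch touches. */
typedef struct {
    time_t next_check;
    int should_be_scheduled;
} toy_service;

/* Mirrors the final if/else of the excerpt: if the time was not valid and
   the lookup handed back preferred_time unchanged, park the check a year
   ahead and stop scheduling it; otherwise use the time that was found. */
void toy_reschedule(toy_service *svc, int time_is_valid,
                    time_t preferred_time, time_t next_valid_time)
{
    if (!time_is_valid && next_valid_time == preferred_time) {
        svc->next_check = (time_t)(next_valid_time + TOY_ONE_YEAR);
        svc->should_be_scheduled = 0;
    } else {
        svc->next_check = next_valid_time;
        svc->should_be_scheduled = 1;
    }
}
```

Calling toy_reschedule with time_is_valid set to 0 and both times equal leaves should_be_scheduled at 0 and next_check 31536000 seconds (365 days) in the future, which is the "landing in 1 year" case. Any other combination keeps the check scheduled.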
The question is now - what's the smartest way to handle this?
Basically I see 2 different approaches:
1. When compensating timeshifts, double-check that you do not move a check
outside its valid check_period
2. When scheduling a check that somehow ended up outside its
check_period, try to be smart and look for the next valid time inside
the check_period of that check, instead of just adding check_interval
and naively hoping it will be all right
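Both approaches can be sketched against a toy model of a check_period. Nothing here is Nagios code or a proposed patch: toy_period, toy_in_period, toy_next_valid and toy_compensate are invented names, the period is a single daily window given in seconds since midnight, and timestamps are non-negative seconds counted from some midnight.

```c
/* Hypothetical daily window [start, end), in seconds since midnight. */
typedef struct {
    long start;
    long end;
} toy_period;

#define TOY_DAY 86400L

/* 1 if timestamp t (seconds, t >= 0) falls inside the daily window. */
int toy_in_period(const toy_period *p, long t)
{
    long s = t % TOY_DAY;

    return s >= p->start && s < p->end;
}

/* Approach 2: earliest time >= t that lies inside the window. */
long toy_next_valid(const toy_period *p, long t)
{
    long s = t % TOY_DAY;
    long midnight = t - s;

    if (s < p->start)
        return midnight + p->start;           /* window opens later today */
    if (s < p->end)
        return t;                             /* already inside the window */
    return midnight + TOY_DAY + p->start;     /* window opens tomorrow */
}

/* Approach 1: compensate a backward timeshift, but never leave the window. */
long toy_compensate(const toy_period *p, long next_check, long timeshift)
{
    long shifted = next_check - timeshift;

    return toy_in_period(p, shifted) ? shifted : toy_next_valid(p, shifted);
}
```

Assuming 7to11 means 07:00 to 23:00, a check planned for 7:00 that gets shifted back by 480 seconds stays at 7:00 under toy_compensate, and a retry computed for 6:55 is moved to 7:00 by toy_next_valid instead of being pushed a year ahead.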
OK, that's it from me so far - /discuss :-)
S
--
Sascha Runschke
IT Infrastructure
GFKL Financial Services AG
Limbecker Platz 1
45127 Essen
Phone : +49 (201) 102-1879 Mobile : +49 (173) 5419665 Fax : +49 (201) 102-1102105
GFKL Financial Services AG
Executive Board: Dr. Peter Jänsch (Chairman), Jürgen Baltes, Dr. Till Ergenzinger, Dr. Tom Haverkamp
Chairman of the Supervisory Board: Dr. Georg F. Thoma
Registered office: Limbecker Platz 1, 45127 Essen, Amtsgericht Essen, HRB 13522