Nagios 3.1.1 eats cpu like mad

Ethan Galstad egalstad at nagios.org
Tue Jun 23 19:52:09 CEST 2009


Patch is in CVS now.  Can someone who was experiencing scheduling problems
with the 3.0.6 release test the latest 3.1.2 release?  If the problem
still persists, it's likely in one of the following functions in
base/utils.c:

check_time_against_period()
get_next_valid_time()

These functions are more complicated now with the new timeperiod
exceptions and date formats, so a bug could well exist here.

- Ethan Galstad


Andreas Ericsson wrote:
> There's a bug in Nagios 3.1.1, making it eat all available CPU even
> with a very small configuration (5 hosts, 12 service checks).
> 
> I sort of introduced it, as I didn't fully test the impact of a patch
> sent in before accepting it. Mea culpa, so I'll make sure to fix it.
> 
> For some reason, the patch shown inline below makes Nagios consume
> 100% CPU on my system. I don't know the reason for this, but I'll
> investigate it and see how it can be fixed. I *think* it happens
> because Nagios sees that "current_time" is valid and therefore
> returns precisely that from get_next_valid_time(), which means it
> pushes all the scheduled checks in front of it until enough time
> has passed since the check was last *run* before actually executing
> it. Obviously, that sucks major donkeyballs, so we really shouldn't
> do that. I'll need to check that up a bit more closely before I can
> say with 100% certainty that that's what's happening though.
> 
> -8<--8<--8<-
> commit 523e8c516df323a0bafe98ecb9222384fde62d6e
> Author: Andreas Ericsson <ae at op5.se>
> Date:   Fri May 22 01:38:28 2009 +0000
> 
>     Fix service rescheduling on clock skew/timeperiod change
>     
>     This patch ensures that services and hosts are never scheduled one
>     year into the future and set to never be rescheduled again.
>     
>     Previously, this could happen if the next preferred time happened
>     to already be valid, but stops being so because of clock skew or
>     someone changing the timeperiod definition between two Nagios
>     restarts while retaining scheduling info.
>     
>     Patch-sent-by: Ricardo Maraschini <ricardo.maraschini at opservices.com.br>
>     Signed-off-by: Andreas Ericsson <ae at op5.se>
> 
> diff --git a/base/checks.c b/base/checks.c
> index 9d5c497..ef50a20 100644
> --- a/base/checks.c
> +++ b/base/checks.c
> @@ -277,7 +277,7 @@ int run_scheduled_service_check(service *svc, int check_options, double latency)
>  				preferred_time=current_time+((svc->check_interval<=0)?300:(svc->check_interval*interval_length));
>  
>  			/* make sure we rescheduled the next service check at a valid time */
> -			get_next_valid_time(preferred_time,&next_valid_time,svc->check_period_ptr);
> +			get_next_valid_time(current_time,&next_valid_time,svc->check_period_ptr);
>  
>  			/* the service could not be rescheduled properly - set the next check time for next year, but don't actually reschedule it */
>  			if(time_is_valid==FALSE && next_valid_time==preferred_time){
> @@ -2792,7 +2792,7 @@ int run_scheduled_host_check_3x(host *hst, int check_options, double latency){
>  				preferred_time=current_time+((hst->check_interval<=0)?300:(hst->check_interval*interval_length));
>  
>  			/* make sure we rescheduled the next host check at a valid time */
> -			get_next_valid_time(preferred_time,&next_valid_time,hst->check_period_ptr);
> +			get_next_valid_time(current_time,&next_valid_time,hst->check_period_ptr);
>  
>  			/* the host could not be rescheduled properly - set the next check time for next year, but don't actually reschedule it */
>  			if(time_is_valid==FALSE && next_valid_time==preferred_time){
> -8<--8<--8<-
> 
> 
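
The suspected mechanism in the quoted message can be illustrated with a
small simulation. This is only a sketch of Andreas' unconfirmed hypothesis,
not Nagios code: sketch_next_valid_time() and sketch_count_passes() are
hypothetical names, the timeperiod is assumed to be 24x7 (every timestamp
valid), and each pass through the event loop is assumed to cost one second.

```c
#include <time.h>

#define SKETCH_CHECK_INTERVAL 300 /* seconds; same fallback value as in the diff */

/* Hypothetical stand-in for get_next_valid_time() with a 24x7 timeperiod:
 * every timestamp is valid, so the earliest valid time at or after
 * pref_time is pref_time itself. */
time_t sketch_next_valid_time(time_t pref_time)
{
	return pref_time;
}

/* Simulates a single check from time 0 up to (not including) horizon and
 * returns how many times its event had to be handled.
 * use_current_time == 0: ask for a valid time starting at preferred_time
 *                        (the call before the patch).
 * use_current_time != 0: ask for a valid time starting at the current time
 *                        (the call after the patch). */
long sketch_count_passes(int use_current_time, time_t horizon)
{
	time_t now = 0;
	time_t next_check = 0;
	time_t preferred_time;
	long passes = 0;

	while (now < horizon) {
		if (next_check > now) {
			/* nothing is due, so the scheduler can sleep until the event */
			now = next_check;
			continue;
		}

		/* the event is due: work out when it should come up next */
		preferred_time = now + SKETCH_CHECK_INTERVAL;
		next_check = sketch_next_valid_time(use_current_time ? now : preferred_time);
		passes++;

		/* assumption: handling one event costs one second */
		now += 1;
	}

	return passes;
}
```

Under these assumptions, sketch_count_passes(0, 600) returns 2, while
sketch_count_passes(1, 600) returns 600: with the patched call the event is
due again on every pass, so the loop never gets a chance to sleep.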
