Max concurrent checks - spreading the next_time
Ton Voon
ton.voon at opsera.com
Tue Jun 9 23:31:59 CEST 2009
Hi!
We've seen situations where this appears in the nagios.log:
Max concurrent service checks (50) has been reached. Delaying further
checks until previous checks are complete...
With debugging switched on, we noticed that the services are all
invoked at around the same time. I guess this happens when you select
a host and choose "force check all services on this host".
In the event code (base/events.c), it seems that when
max_concurrent_checks is reached, the service check is skipped and
rescheduled one regular check interval later (or one retry interval
later, if the service is in a soft non-OK state). But because services
that share the same interval are all pushed back by the same amount,
they will still be invoked at around the same time.
    /* reschedule the check if we can't run it now */
    if(run_event==FALSE){

        /* remove the service check from the event queue and reschedule it for a later time */
        /* 12/20/05 since event was not executed, it needs to be remove()'ed to maintain sync with event broker modules */
        temp_event=event_list_low;
        remove_event(temp_event,&event_list_low,&event_list_low_tail);

        if(temp_service->state_type==SOFT_STATE && temp_service->current_state!=STATE_OK)
            temp_service->next_check=(time_t)(temp_service->next_check+(temp_service->retry_interval*interval_length));
        else
            temp_service->next_check=(time_t)(temp_service->next_check+(temp_service->check_interval*interval_length));

        temp_event->run_time=temp_service->next_check;
        reschedule_event(temp_event,&event_list_low,&event_list_low_tail);
        update_service_status(temp_service,FALSE);

        run_event=FALSE;
    }
I propose that instead of setting next_time = next_time +
check_interval, a random factor is added, maybe something like:
    next_time = now + max(5, min(int(rand(15)),
                                 int(rand(retry_interval*interval_length))))
This moves the next check at least 5 seconds away from now (to ride
out the temporary load caused by the number of concurrent service
checks), and at most 15 seconds away (or less if the retry_interval is
lower).
Thoughts?
Ton