Max concurrent checks - spreading the next_time
Hiren Patel
hir3npatel at gmail.com
Sat Jun 13 11:29:40 CEST 2009
Ton Voon wrote:
> This is the test case:
> * set max_concurrent_checks=1 in nagios.cfg
> * create a host with 3 services with a check_interval of 1 minute
> * restart nagios
> * go to the host page and schedule a check for all services on the
> host (this makes all the services run at the same time)
> * tail nagios.log. Should see "Max concurrent service checks (1)
> has been reached"
> * on the host page, notice the last run time. Only one will be
> updated after 1 minute. All services get scheduled for the next time
> at the same time, and after the next minute, only one of those will
> have the last check time changed
>
Yep, that is exactly the behavior I see. I set up a standalone machine
running the default checks against itself, and the queue shows them all
scheduled for the same time in the next minute. The log entries also
appear as you describe.
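
To make the test case concrete, here is a toy Python model of it. This is
not Nagios code. It assumes that a check blocked by the limit is simply
pushed back one full check interval, that every check takes 5 seconds, and
that ties in the queue are broken by service id. The function name and its
parameters are made up for the sketch.

```python
import heapq


def simulate_fixed_reschedule(num_services=3, interval=60, check_duration=5,
                              max_concurrent=1, horizon=180):
    """Toy model (assumption, not Nagios source): a blocked check is
    rescheduled exactly one interval later, same as a check that ran."""
    # Event queue of (run_time, service_id); every service is due at t=0,
    # as after "schedule a check for all services on the host".
    queue = [(0, sid) for sid in range(num_services)]
    heapq.heapify(queue)
    running_until = []  # end times of checks still executing
    last_check = {sid: None for sid in range(num_services)}

    while queue and queue[0][0] <= horizon:
        now, sid = heapq.heappop(queue)
        # Forget checks that have already finished.
        running_until = [end for end in running_until if end > now]
        if len(running_until) < max_concurrent:
            running_until.append(now + check_duration)
            last_check[sid] = now
        # Ran or not, the next attempt is one full interval later, so all
        # services stay lined up on the same second.
        heapq.heappush(queue, (now + interval, sid))
    return last_check


print(simulate_fixed_reschedule())
```

In this model only service 0 ever gets a last check time; the other two
are blocked every minute because they always come due at the same second.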
> I've just committed a patch into CVS HEAD. This nudges the time ahead
> by 5 + random(10) seconds. I've also included a test case which
> ensures that the nudge factor is added in these cases.
>
> nagios.log will also have an entry which lists the affected service.
> If you get this message a lot on a regular system, then you need to
> consider increasing the max_concurrent_checks value.
>
> I'd be grateful if you could try this out.
>
With the patch, I now see the checks spread out in the queue, and all
the services get checked sooner than without the patch, at least as far
as I noticed. There is one odd behavior: with the default tests running,
one check kept getting nudged and as a result wasn't run for a while.
Attached is the nagios.log; the first two restarts are without the
patch, and the rest is with the patch. For the entire time I ran with
the patch, the "current users" check was not run. Am I doing something
wrong in testing this, though?
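
For reference, this is how I picture the nudge, as a toy Python model
rather than the patched Nagios code. It assumes "5 + random(10)" means 5
plus a random integer from 0 to 9, that each check takes 5 seconds, and
that a check that does run is next due one interval later. The function
name and parameters are hypothetical.

```python
import heapq
import random


def simulate_nudge(num_services=3, interval=60, check_duration=5,
                   max_concurrent=1, horizon=180, seed=0):
    """Toy model (assumption, not Nagios source): a blocked check is
    re-queued 5 + random(0..9) seconds ahead instead of a full interval."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    queue = [(0, sid) for sid in range(num_services)]
    heapq.heapify(queue)
    running_until = []  # end times of checks still executing
    last_check = {sid: None for sid in range(num_services)}
    nudges = {sid: 0 for sid in range(num_services)}

    while queue and queue[0][0] <= horizon:
        now, sid = heapq.heappop(queue)
        running_until = [end for end in running_until if end > now]
        if len(running_until) < max_concurrent:
            # Slot is free: run the check and schedule the regular next one.
            running_until.append(now + check_duration)
            last_check[sid] = now
            heapq.heappush(queue, (now + interval, sid))
        else:
            # Limit reached: nudge this check a few seconds ahead.
            nudges[sid] += 1
            heapq.heappush(queue, (now + 5 + rng.randrange(10), sid))
    return last_check, nudges


last_check, nudges = simulate_nudge()
print(last_check, nudges)
```

The per-service nudge count is the number to watch: a check whose count
keeps climbing while its last check time stays empty is being starved.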
> Thinking some more, setting the next check time ahead doesn't really
> make sense, because the latency value does not reflect the fact that
> this active service's check time was delayed. Maybe this should be
> implemented by removing the event from the queue and re-adding it
> with a nudged event run time but the old service->next_check time.
>
> Anyhow, this should be better than it was.
Agreed about the latency, although the incident is logged, so users
should be able to work out why their checks are running a little late.
I'm not sure about the event queue and how it works yet; I haven't
looked at that part of Nagios.
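
As I understand the latency point, the difference between the two
approaches can be sketched like this in Python. The classes and function
names are made up, and the sketch assumes latency is the time the check
actually starts minus the service's next_check, and that the check starts
exactly at the event's run time.

```python
from dataclasses import dataclass


@dataclass
class Service:
    next_check: int  # when the check was supposed to run


@dataclass
class Event:
    run_time: int  # when the scheduler will actually fire the event
    service: Service


def nudge_by_moving_next_check(event, now, nudge):
    """What the nudge amounts to in this sketch: next_check itself moves."""
    event.service.next_check = now + nudge
    event.run_time = event.service.next_check


def nudge_by_requeue(event, now, nudge):
    """Suggested alternative: move only the event, keep the old next_check."""
    event.run_time = now + nudge


def latency(event):
    # Assumed definition: actual start time minus scheduled time.
    return event.run_time - event.service.next_check


first = Event(run_time=100, service=Service(next_check=100))
nudge_by_moving_next_check(first, now=100, nudge=7)

second = Event(run_time=100, service=Service(next_check=100))
nudge_by_requeue(second, now=100, nudge=7)

print(latency(first), latency(second))
```

With next_check moved, the 7 second delay disappears from the latency;
with only the event re-queued, it shows up as 7 seconds of latency.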
-------------- next part --------------
A non-text attachment was scrubbed...
Name: nagios.log
Type: text/x-log
Size: 18322 bytes
Desc: not available
URL: <https://www.monitoring-lists.org/archive/developers/attachments/20090613/0620cf3b/attachment.bin>