How does nagios handle plugin exit not in [0, 1, 2, 3]?
Andreas Ericsson
ae at op5.se
Mon Apr 23 11:17:42 CEST 2007
John Rouillard wrote:
> In message <46290C22.4070605 at process-zero.de>,
> Hendrik Bäcker writes:
>> John Rouillard schrieb:
>>> In message <4628F199.8000502 at process-zero.de>, Hendrik Baecker writes:
>>>
>>> I agree there should be an increase in latency, but 24x the latency
>>> for 10 services out of 2200+ (on 130 or so hosts) is what is weird.
>>> The host check would return almost immediately since the host was up,
>>> so there wasn't a big delay there.
>>>
>>> Hmm, now that starts me thinking, but I think I am walking down the
>>> wrong path. The host check can occur in parallel with the outstanding
>>> service checks right? So if I have 12 outstanding checks, one of which
>>> fails, nagios doesn't wait for those 12 outstanding checks to finish
>>> (which could take up to a minute) before it does the host check, finds
>>> out the host is fine and starts the next cycle of checks?
>>>
>> Are we talking about Nagios 2.x or 3.x?
>
> Nagios 2.7.
>
>> In Nagios 2.x your 12 outstanding checks were scheduled for their
>> normal time.
>> If check 1 of 12 returns a non-OK state, the other 11 scheduled
>> checks are put on hold because Nagios has to execute a host check
>> for the first one immediately.
>> AFAIK Nagios doesn't touch the remaining eleven checks until the host
>> check reaches a HARD state (hitting max_check_attempts).
>
> Ok so it's what I was hoping for, the host check occurs at the same
> time as the other 11 checks.
>
That doesn't really matter, as the next batch of checks gets delayed
until the hostchecks complete. Read on below.
>> A little math:
>> Host check command based on the check_ping plugin, with a host check
>> timeout of 5 seconds and max_check_attempts set to 4. The host has no
>> parent!
>>
>> In that case the remaining 11 service checks are held for up to 20
>> seconds if the host is really down, because check_ping runs until its
>> timeout for an unreachable host (check_icmp is much faster in that
>> case).
>> In my opinion Nagios does nothing but wait out the 5 second timeout
>> max_check_attempts times.
>
> But the host is up with a 13 ms RTT, so this should be fast, no 5
> second timeout applies.
>
If you use bog standard check_ping, the answer "13 ms RTT" takes 5
seconds to arrive at. If you use check_icmp in its host-alive mode
(ln -s check_icmp check_host, execute check_host $HOSTADDRESS$), it
will instead take ~15ms to complete, depending on the load/link-time
of the binary.
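
To make the numbers in this exchange concrete, here is a minimal Python
sketch of the arithmetic. It only restates the figures quoted above (5 second
timeout, 4 attempts, roughly 5 s for check_ping versus roughly 15 ms for
check_host against a live host). The function name is made up for
illustration and is not part of Nagios.

```python
# Back-of-the-envelope figures from this thread; nothing here calls Nagios.

def worst_case_hold_seconds(timeout_s: float, max_attempts: int) -> float:
    """Upper bound on how long held service checks wait when every
    host check attempt runs into its timeout (host really down)."""
    return timeout_s * max_attempts

# Host down, check_ping with a 5 s timeout and 4 attempts, no parent.
down_hold = worst_case_hold_seconds(5, 4)

# Host up: a single attempt suffices, so the hold is one plugin run.
up_hold_check_ping = 5.0      # seconds, as stated for stock check_ping
up_hold_check_host = 0.015    # seconds, approximate figure for check_host

print(f"host down, check_ping: up to {down_hold:.0f} s")
print(f"host up, check_ping:   about {up_hold_check_ping:.0f} s")
print(f"host up, check_host:   about {up_hold_check_host * 1000:.0f} ms")
```

The point of the comparison is that even for a host that is up, the
choice of plugin decides whether the held checks wait seconds or
milliseconds.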
>> If you are using just a single parent host, the time for checking a
>> single host will be doubled for checking the parent too.
>
> I thought in 2.x the order was:
> do a hostcheck on the host with the failing service
> only if that fails does nagios check the parent.
>
True.
> In my case the host check never fails since the host is up and
> operating.
>
>>> When I first started I had fewer service checks (1900 or so) and the
>>> latency was larger, around 10-15 seconds, but not in the 2 minute
>>> range. Then I synced my test install with the current production
>>> nagios install and ran the 2200 checks. Then the latency jumped
>>> through the roof to 2 minutes which is 66% of the median polling
>>> interval.
>>>
>> Yes. There seems to be a magic borderline around 2000 of service checks
>> in Nagios 2.x.
>
> Anybody have an idea why? I would expect it to be linear in the total
> number of services, but this is jumping off a cliff.
>
Because of resource starvation. There are some recent changes to mitigate
this, although I'm not sure if they went into 2.x-maint or 3.x.
>>> Maybe it's an artifact of the scheduling process and how the service
>>> check interleaving occurs. I can't see nagios3's host polling changes
>>> making a difference though because in my scenario, it only took one fast
>>> ping to verify that the host was up, and all the nagios3 polling
>>> changes do is to run a number of host checks in parallel, so the delay
>>> would be the same.
>> Did you test this? So far I haven't had the chance to test the new
>> logic for real.
>> But the different handling of host checks, and of informing host
>> parents and children, should speed the whole thing up, I think.
>
> Only if there really is a downed host, which is not the case I am
> describing. If the host is up there is no difference, one successful
> ping occurs in either nagios 2.0 or 3.0 and nagios continues
> scheduling.
>
> If you have a real outage (which again I stress is NOT the case here)
> 3.0 wins because the parent checks occur in the same timespan as the
> hostcheck of the host with the failing service.
>
And at the same time as the following batch of service checks. This is
where the *real* benefit lies.
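
As a rough illustration of that difference, the toy model below compares
running the host check, the parent check and the next batch of service
checks one after another (the 2.x behaviour as described in this thread)
with letting them overlap (the 3.x behaviour as described here). The
function names and the sample durations are assumptions chosen for the
example, not measurements, and the model ignores scheduler overhead
entirely.

```python
# Toy model only: durations in seconds, all values are hypothetical.

def elapsed_serial(host_check_s: float, parent_check_s: float,
                   next_batch_s: float) -> float:
    """Everything waits for the previous step: times add up."""
    return host_check_s + parent_check_s + next_batch_s

def elapsed_parallel(host_check_s: float, parent_check_s: float,
                     next_batch_s: float) -> float:
    """All three overlap: the slowest one determines the total."""
    return max(host_check_s, parent_check_s, next_batch_s)

# Assumed outage: host and parent checks each time out after 20 s,
# and the next batch of service checks needs 60 s.
serial = elapsed_serial(20, 20, 60)
parallel = elapsed_parallel(20, 20, 60)

print(f"serial:   {serial:.0f} s")
print(f"parallel: {parallel:.0f} s")
```

With these assumed numbers the serial case takes 100 s and the
overlapping case 60 s. When the host is up and the host check takes
milliseconds, the two come out practically the same.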
--
Andreas Ericsson andreas.ericsson at op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231