How does nagios handle plugin exit not in [0, 1, 2, 3]?

John Rouillard rouilj+nagiosdev at cs.umb.edu
Sat Apr 21 05:38:04 CEST 2007


In message <46290C22.4070605 at process-zero.de>,
Hendrik Bäcker writes:
>John Rouillard schrieb:
>> In message <4628F199.8000502 at process-zero.de>, Hendrik Baecker writes:
>> 
>> I agree there should be an increase in latency, but 24x the latency
>> for 10 services out of 2200+ (on 130 or so hosts) is what is weird.
>> The host check would return almost immediately since the host was up,
>> so there wasn't a big delay there.
>> 
>> Hmm, now that starts me thinking, but I think I am walking down the
>> wrong path. The host check can occur in parallel with the outstanding
>> service checks right? So if I have 12 outstanding checks, one of which
>> fails, nagios doesn't wait for those 12 outstanding checks to finish
>> (which could take up to a minute) before it does the host check, finds
>> out the host is fine and starts the next cycle of checks?
>> 
>
>Are we talking about Nagios 2.x or 3.x?

Nagios 2.7.

>In Nagios 2.x your 12 outstanding checks were scheduled for their
>normal time.
>If check 1 of 12 returns a non-OK state, the other 11 scheduled
>checks are put on "hold" because nagios has to immediately execute a
>host check for the first.
>AFAIK nagios doesn't care about the remaining eleven checks until the
>host check reaches a HARD state (reaching max_check_attempts).

Ok so it's what I was hoping for, the host check occurs at the same
time as the other 11 checks.

>Some math:
>Host check command based on the check_ping plugin with a host check
>timeout of 5 seconds and max_check_attempts of 4. Host has no parent!
>
>In that case the remaining 11 service checks are held for up to 20
>seconds if the host is really down, because check_ping takes the full
>time until timeout for an unreachable host (check_icmp in that case is
>much faster).
>In my opinion nagios is not doing anything other than waiting for the 5
>second timeout max_check_attempts times.

But the host is up with a 13 ms RTT, so this should be fast, no 5
second timeout applies.
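
To put numbers on the two cases, here is a minimal sketch of the
arithmetic quoted above. The function name is made up for illustration
and is not part of nagios; it only multiplies the timeout by the number
of attempts, assuming every attempt against a down host runs to the full
timeout.

```python
def worst_case_hold_seconds(check_timeout_s: float, max_check_attempts: int) -> float:
    """Upper bound on how long the held service checks wait while an
    unreachable host is retried until it reaches a HARD state.

    Assumes each attempt runs for the full timeout (hypothetical helper,
    not a nagios function).
    """
    return check_timeout_s * max_check_attempts


# Host really down: 5 second timeout, 4 attempts.
down_case_s = worst_case_hold_seconds(5, 4)

# Host up: a single successful ping at 13 ms RTT, no timeout involved.
up_case_s = 13 / 1000

print(f"host down: up to {down_case_s} s, host up: about {up_case_s} s")
```

The up case is three orders of magnitude smaller, which is why the
timeout math does not explain the latency I am seeing.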

>If you are using just a single parent host, the time for checking a
>single host will be doubled, because the parent is checked too.

I thought in 2.x the order was:
    do a hostcheck on the host with the failing service
    only if that fails does nagios check the parent.

In my case the host check never fails since the host is up and
operating.
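
The ordering I have in mind for 2.x can be sketched as follows. This is
only an illustration of my understanding, not nagios code: the function
name, the `parents` mapping and the `is_up` callback are all assumptions
made for the example.

```python
from typing import Callable, Dict, List


def hosts_checked(host: str,
                  parents: Dict[str, str],
                  is_up: Callable[[str], bool]) -> List[str]:
    """Return the hosts that get checked after a service on `host` fails.

    Hypothetical model: the host itself is always checked first, and its
    parent is checked only if that host check fails.
    """
    checked = [host]
    if not is_up(host) and host in parents:
        checked.append(parents[host])
    return checked


parents = {"web1": "switch1"}  # made-up host names

# Host is up: only the host itself is checked.
print(hosts_checked("web1", parents, lambda h: True))

# Host is down: the parent is checked as well.
print(hosts_checked("web1", parents, lambda h: h != "web1"))
```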

>> When I first started I had fewer service checks (1900 or so) and the
>> latency was larger, around 10-15 seconds, but not in the 2 minute
>> range. Then I synced my test install with the current production
>> nagios install and ran the 2200 checks. Then the latency jumped
>> through the roof to 2 minutes which is 66% of the median polling
>> interval.
>> 
>
>Yes. There seems to be a magic borderline around 2000 service checks
>in Nagios 2.x.

Anybody have an idea why?  I would expect it to be linear in the total
number of services, but this is jumping off a cliff.
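
As a rough check of what "linear" would mean with the numbers from this
thread (10-15 seconds at about 1900 checks, 2 minutes at about 2200),
here is a small sketch. The helper name is made up, and the check counts
are the approximate figures quoted above, not exact measurements.

```python
def linear_estimate(latency_s: float, old_count: int, new_count: int) -> float:
    """Scale a measured latency in proportion to the number of service checks."""
    return latency_s * new_count / old_count


low_s = linear_estimate(10, 1900, 2200)
high_s = linear_estimate(15, 1900, 2200)
observed_s = 120  # the 2 minute latency seen with 2200 checks

print(f"linear estimate: {low_s:.1f}-{high_s:.1f} s, observed: {observed_s} s")
print(f"observed is {observed_s / high_s:.1f}x the upper linear estimate")
```

Linear growth would put the latency somewhere around 12 to 17 seconds,
so the observed 2 minutes is several times larger than simple scaling
predicts.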

>> Maybe it's an artifact of the scheduling process and how the service
>> check interleaving occurs. I can't see nagios3's host polling changes
>> making a difference though because in my scenario, it only took one fast
>> ping to verify that the host was up, and all the nagios3 polling
>> changes do is to run a number of host checks in parallel, so the delay
>> would be the same.
>
>Did you test this? Up to now I haven't had the chance to test the new
>logic in a realistic way.
>But the difference in handling host checks and informing host parents
>and children should speed the whole thing up, I think.

Only if there really is a downed host, which is not the case I am
describing. If the host is up there is no difference, one successful
ping occurs in either nagios 2.0 or 3.0 and nagios continues
scheduling.

If you have a real outage (which again I stress is NOT the case here)
3.0 wins because the parent checks occur in the same timespan as the
hostcheck of the host with the failing service.

I haven't tried the 3.0 code, but I can't see the parallelism as an
advantage as at least one ping occurs in either the 2.0 or 3.0 cases.

				-- rouilj
John Rouillard
===========================================================================
My employers don't acknowledge my existence much less my opinions.
