How does nagios handle plugin exit not in [0, 1, 2, 3]?

Hendrik Bäcker andurin at process-zero.de
Fri Apr 20 20:53:22 CEST 2007



John Rouillard wrote:
> In message <4628F199.8000502 at process-zero.de>, Hendrik Baecker writes:
> 
> I agree there should be an increase in latency, but 24x the latency
> for 10 services out of 2200+ (on 130 or so hosts) is what is weird.
> The host check would return almost immediately since the host was up,
> so there wasn't a big delay there.
> 
> Hmm, now that starts me thinking, but I think I am walking down the
> wrong path. The host check can occur in parallel with the outstanding
> service checks right? So if I have 12 outstanding checks, one of which
> fails, nagios doesn't wait for those 12 outstanding checks to finish
> (which could take up to a minute) before it does the host check, finds
> out the host is fine and starts the next cycle of checks?
> 

Are we talking about Nagios 2.x or 3.x?

In Nagios 2.x your 12 outstanding checks were scheduled for their
normal time.
If check 1 of 12 returns a non-OK state, the other 11 scheduled
checks are put on "hold" because Nagios has to execute a host check
immediately for the first one.
AFAIK Nagios doesn't touch the remaining eleven checks until the host
check reaches a HARD state (i.e. hits max_check_attempts).

Some math:
Host check command based on the check_ping plugin, with a host check
timeout of 5 seconds and max_check_attempts set to 4. The host has no
parent!

In that case your remaining 11 service checks are held for up to 20
seconds (4 attempts x 5 seconds) if the host is really down, because
check_ping runs until its timeout for an unreachable host (check_icmp
is much faster in that case).
In my opinion Nagios is doing nothing but waiting out the 5 second
timeout max_check_attempts times.

If you are using just a single parent host, the time for checking a
single host doubles, because the parent has to be checked too.
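
The arithmetic above can be written down as a rough sketch. This is not
Nagios code; the function name and the parent_levels parameter are made up
for illustration, and it assumes each parent is checked one after another
with the same timeout and the same number of attempts as the host itself.

```python
def worst_case_hold_seconds(timeout_s, max_attempts, parent_levels=0):
    """Rough upper bound on how long the held service checks wait.

    Assumption (not taken from Nagios source): every attempt runs into
    the full timeout, and each parent is checked serially with the same
    settings as the host.
    """
    hosts_checked = 1 + parent_levels  # the host itself plus its parents
    return timeout_s * max_attempts * hosts_checked


# 5 second timeout, 4 attempts, host without a parent
no_parent = worst_case_hold_seconds(5, 4)

# same settings with a single parent host
one_parent = worst_case_hold_seconds(5, 4, parent_levels=1)

print(no_parent, one_parent)
```

With the numbers from this mail that gives 20 seconds without a parent and
40 seconds with one parent.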

> When I first started I had fewer service checks (1900 or so) and the
> latency was larger, around 10-15 seconds, but not in the 2 minute
> range. Then I synced my test install with the current production
> nagios install and ran the 2200 checks. Then the latency jumped
> through the roof to 2 minutes which is 66% of the median polling
> interval.
> 

Yes. There seems to be a magic threshold at around 2000 service checks
in Nagios 2.x.

> Maybe it's an artifact of the scheduling process and how the service
> check interleaving occurs. I can't see nagios3's host polling changes
> making a difference though because in my scenario, it only took one fast
> ping to verify that the host was up, and all the nagios3 polling
> changes do is to run a number of host checks in parallel, so the delay
> would be the same.
> 

Did you test this? So far I haven't had the chance to test the new
logic under real conditions.
But the differences in how host checks are handled and how host parents
and children are informed should speed the whole thing up, I think.

Hendrik
