How does nagios handle plugin exit not in [0, 1, 2, 3]?
John Rouillard
rouilj+nagiosdev at cs.umb.edu
Fri Apr 20 20:10:35 CEST 2007
In message <4628F199.8000502 at process-zero.de>, Hendrik Baecker writes:
>John Rouillard schrieb:
>> How does nagios handle plugins that don't exit with an errorcode of 0,
>> 1, 2, or 3? If the plugin exits with say 127, is the host check logic
>> triggered?
>
>It's triggered by every service check that does not return with return
>code 0.
That's pretty much what I figured.
>> The reason I ask is that I just fixed ~10 services (that were run
>> every 3 minutes) that were failing (java was missing from the system),
>> and the average latency went from 2 minutes to less than 6 seconds
>> (usually in the .6 range), with the max at 25 seconds. This seems a
>> huge difference given the fix.
>>
>> I could almost buy it if each failure triggered a halt to polling and
>> forced it to do host checks, but even then the magnitude of the change
>> is a bit unbelievable.
>
>Host checks go into a high-priority scheduling queue and are run before
>other service checks, to determine whether nagios should write x service
>alerts or just a single host alert.
>Because of the high-priority host checks, your service checks may pick up
>latency.
I agree there should be an increase in latency, but 24x the latency
for 10 services out of 2200+ (on 130 or so hosts) is what is weird.
The host check would return almost immediately since the host was up,
so there wasn't a big delay there.
Hmm, now that starts me thinking, but I think I am walking down the
wrong path. The host check can occur in parallel with the outstanding
service checks, right? So if I have 12 outstanding checks and one of
them fails, nagios doesn't wait for the other outstanding checks to
finish (which could take up to a minute) before it does the host check,
finds out the host is fine, and starts the next cycle of checks?
When I first started I had fewer service checks (1900 or so) and the
latency was larger, around 10-15 seconds, but not in the 2 minute
range. Then I synced my test install with the current production
nagios install and ran the 2200 checks. Then the latency jumped
through the roof to 2 minutes which is 66% of the median polling
interval.
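
To put those numbers side by side, here is a small worked calculation. It
assumes a 180-second (3 minute) median polling interval, which is what the
66% figure implies; the function name is made up for illustration.

```python
def latency_fraction(latency_s: float, interval_s: float) -> float:
    """Return check latency as a fraction of the polling interval."""
    return latency_s / interval_s


# Assumption: 180 s median polling interval (3 minutes).
MEDIAN_INTERVAL_S = 180.0

before_fix = latency_fraction(120.0, MEDIAN_INTERVAL_S)  # ~2 minutes latency
after_fix = latency_fraction(6.0, MEDIAN_INTERVAL_S)     # under 6 seconds

print(f"before fix: {before_fix:.1%} of the interval")  # 66.7%
print(f"after fix:  {after_fix:.1%} of the interval")   # 3.3%
```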
Maybe it's an artifact of the scheduling process and how the service
check interleaving occurs. I can't see nagios3's host polling changes
making a difference though because in my scenario, it only took one fast
ping to verify that the host was up, and all the nagios3 polling
changes do is to run a number of host checks in parallel, so the delay
would be the same.
>> If the failure (w/ exit code 127) would trigger host checking, should
>> the logic change to do host checking only when the plugin exits with a
>> status in the [0-3] range, since 127 is an invalid exit code?
>
>Until now the nagios law is: A possible failure is a non-OK State.
>The exit codes are under control of each plugin.
Well yes, but the only valid exit codes for a plugin that have any
meaning to nagios are 0, 1, 2 and 3. Any plugin that returns a value
outside that range is broken.
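
One way to keep a broken plugin from handing nagios an out-of-range code is
to run it through a wrapper. The following is only a sketch, not anything
nagios ships: run_plugin is a hypothetical helper, and mapping every invalid
code to 3 (UNKNOWN) is my assumption about what the sensible fallback is.

```python
import subprocess

# Exit codes defined for nagios plugins.
STATE_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}
UNKNOWN = 3


def run_plugin(argv):
    """Run a plugin command line and return an exit code in 0..3.

    Anything outside the defined range (e.g. 127 from a shell that
    cannot find java, or a negative value when the process is killed
    by a signal) is reported as UNKNOWN. This fallback is an assumption.
    """
    try:
        code = subprocess.run(argv).returncode
    except OSError:
        # The plugin itself is missing or not executable.
        return UNKNOWN
    return code if code in STATE_NAMES else UNKNOWN
```

A wrapper script would call run_plugin on its own arguments and exit with the
returned value, so the plugin's output still reaches nagios unchanged.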
>As long as each plugin exits in the defined return code range all is ok.
>Why do you think there should be an exception for exit code 127?
Because 127 is well outside of the "defined return code range", and I
propose the host check logic be disabled not just for exit code 127
but for any exit code > 3. I am not wedded to the idea, though.
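
To make the proposal concrete, here are the two rules side by side. This is
an illustration only: the "current" rule is just Hendrik's description above
(any non-zero return code), not something I checked against the nagios
source, and both function names are made up.

```python
def triggers_host_check_current(exit_code: int) -> bool:
    """Rule as described in this thread: any non-zero exit code."""
    return exit_code != 0


def triggers_host_check_proposed(exit_code: int) -> bool:
    """Proposed rule: only the defined non-OK codes 1, 2 and 3."""
    return exit_code in (1, 2, 3)


for code in (0, 2, 127):
    print(code,
          triggers_host_check_current(code),
          triggers_host_check_proposed(code))
# 0 False False
# 2 True True
# 127 True False
```

The two rules only differ for codes outside 0-3, which is exactly the broken
plugin case.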
However, I think having my 10 processes fail with exit code 2 would
also throw the latency through the roof, which is worrisome. It seems
the average latency should be some predictable function of:

  total number of services
  number of services in non-ok state

This problem/example is making me wonder how predictable that function
is.
-- rouilj
John Rouillard
===========================================================================
My employers don't acknowledge my existence much less my opinions.