Growing number of orphaned service checks...
Charles Dee Rice
cdrice at pobox.com
Thu Mar 3 19:46:59 CET 2005
Andreas Ericsson wrote:
> What's your plugin_timeout value? It should take care of killing runaway
Do you mean service_check_timeout? I have mine set at 60 seconds. Each
plugin I call that accepts a timeout argument is also invoked with a
60-second timeout (e.g. "$USER1$/check_ssh -t 60 $HOSTADDRESS$").
I do not see any messages in my log files indicating that plugins are
timing out, nor do any service checks go to "unknown" states from timeouts.
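For reference, this is roughly how I've been double-checking that (the
paths below assume a typical source install, so adjust as needed):

  grep service_check_timeout /usr/local/nagios/etc/nagios.cfg
  grep -i "timed out" /usr/local/nagios/var/nagios.log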
> plugins. This might fail in case the plugin is running as a different
> user than the nagios process though. No +s bits anywhere in the path to
> or on your plugins?
Everything is running as nagios:nagios; all executables are owned by
nagios:nagios, and there are no suid/sgid bits set anywhere from / down to
the executable path, nor on the executables themselves.
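I checked that with something along these lines (the install prefix is a
guess at a standard layout here):

  find /usr/local/nagios \( -perm -4000 -o -perm -2000 \) -ls
  ls -ld / /usr /usr/local /usr/local/nagios /usr/local/nagios/libexec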
I have seen plugin timeouts happen before, in very specific scenarios
unrelated to this issue; those problems were explainable and were corrected
at the time. There appears to be something special or different about this
case that prevents nagios from detecting that the plugins are timing out.
> Definitely not good. This might be due to several master instances
> running simultaneously (an excess master might then reap the check
> results of the actual master process through the waitall() syscall,
> causing the real master never to see the result of the checks). What
> happens if you killall -9 nagios, clean up the garbage and then restart
> it properly from the init script?
Tried that just now. Same behaviour.
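What I ran was roughly the following (the lock and command-file paths are
from a stock source install, so yours may differ):

  killall -9 nagios
  rm -f /usr/local/nagios/var/nagios.lock /usr/local/nagios/var/rw/nagios.cmd
  /etc/init.d/nagios start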
> This isn't a load issue, so don't worry about it.
I didn't necessarily think it was related to load specifically, but perhaps
to some other system resource that is somehow preventing nagios from really
"starting" a service check, even though it thinks it has kicked one off and
an entry shows up in the process table -- or that is causing some race
condition which keeps the service check from ever completing.
I don't know the down-and-dirty details of how nagios manages its service
check calls, so perhaps the kinds of race-conditions I'm fearing might not
even be possible.
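For what it's worth, these are the sorts of limits I'm planning to look at
next; treat the exact commands as a rough sketch, and better suggestions
are welcome:

  su - nagios -c 'ulimit -u; ulimit -n'    # per-user process and open-file limits
  ps --no-headers -u nagios | wc -l        # how many nagios processes are live
  cat /proc/sys/kernel/threads-max         # kernel-wide task limit on 2.4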
> If you feel like it, you could put all of the config up for browsing.
> Make heavy use of sed to obscure sensitive data, like so;
> sed 's/\(address[\t ]*\).*/\1xxx.xxx.xxx.xxx/' object.cfg >
> object.cfg.stripped
I was considering that, but decided it would be a lot of work with all the
internal name and address replacements. :)  I'll see if I can find time to
sanitize the files so they don't reveal anything internal to our network or
configuration, and then put them somewhere for browsing. I would hope this
isn't really a configuration problem, though -- it "feels like" I shouldn't
be able to mis-configure nagios into this kind of state. This smells more
like a system resource issue or a process management bug. Maybe.
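If I do get the time, I'll probably just extend Andreas' sed one-liner into
a loop over the whole config directory, something like this rough sketch
(the host_name/alias masking is my guess at what else would need hiding):

  for f in /usr/local/nagios/etc/*.cfg; do
      sed -e 's/\(address[\t ]*\).*/\1xxx.xxx.xxx.xxx/' \
          -e 's/\(host_name[\t ]*\).*/\1hostXX/' \
          -e 's/\(alias[\t ]*\).*/\1aliasXX/' "$f" > "$f.stripped"
  done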
> What version of plugins are you using?
Sorry, I neglected to specify. I'm using plugins version 1.4.
> I hope you're aware that linux 2.4.12, 2.4.24 and 2.4.27 all had local
> root exploits, although 24 and 27 only with rather special configurations.
Long story, but the short version is "our group doesn't support the OS on
that machine." :)  We support the machines we are using that server to
monitor, but we are essentially "borrowing" time on it to run nagios and
watch our own systems. I've already expressed concerns about the somewhat
outdated kernel and distribution on that box, but there's nothing more I can
do about that. I do have root access, but I only use it for whatever is
absolutely necessary for our system-monitoring tasks. Aside from that, this
machine is a "black box" to me.
> I didn't even know the issue existed in 1.2, so I don't think a Nagios
> upgrade will be all that helpful, really. Of course, if you have a spare
> server available you could try running both in parallel for a while and
> see if 2.0 works better for you.
I do not. I might be able to schedule a downtime window where I could take
down my existing 1.2 management server and run a 2.0 build for a short
period, just to test it. That would take some time to arrange, since this
is currently a production system.
> gdb might be a good option. You should recompile your nagios with
> extended debugging symbols in case you decide to use it (-ggdb3) and
> keep the source-files untouched after running ./configure so you can
> follow execution more closely. Enabling the proper DEBUG preprocessor
> directive might also help (I believe there's a special one just for
> debugging checks).
I might have time to try that, but I don't expect it would be until
some time next week. I'll see how things shape up schedule-wise here.
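When I do get to it, I imagine the rebuild would look roughly like this;
the DEBUG3 define is my guess at the check-debugging directive Andreas
mentioned, so I'd verify the macro name against the source first, and the
lock file holding the master PID is also an assumption on my part:

  ./configure
  make all CFLAGS='-ggdb3 -O0 -DDEBUG3'
  make install
  /etc/init.d/nagios restart
  # attach to the running master; the lock file should hold its PID
  gdb /usr/local/nagios/bin/nagios "$(cat /usr/local/nagios/var/nagios.lock)"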
> Other than that I'd say a couple of scripts to examine the logfiles for
> missing checks is your best bet.
I'm not sure what you mean specifically there. You mean just generating a
list of what checks are being missed and verifying nagios is picking them
up as orphaned?
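If so, something as crude as this would probably do for a first pass (log
path assumed; I'd also have to confirm the exact wording of the orphan
warning):

  grep -ic orphaned /usr/local/nagios/var/nagios.log
  grep -i orphaned /usr/local/nagios/var/nagios.log | tail -20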
Thanks - Chuck