Problems with nrpe2 signals and plugin cleanup

Bill Moran wmoran at collaborativefusion.com
Tue Feb 26 16:12:40 CET 2008


In response to Thomas Guyot-Sionnest <dermoth at aei.ca>:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> On 25/02/08 04:17 PM, Bill Moran wrote:
> > I'm writing a custom plugin for our application that runs under nrpe2.
> > 
> > This bugger deals with a lot of data (potentially several GB), so nrpe2
> > is configured with a large timeout (300s). It's impractical to keep
> > all the data in RAM, so I'm using temp files.
> > 
> > My problem is that sometimes network problems cause the script to take
> > longer than 300 seconds to run.  In this case, I want to receive an
> > alert, so all is well.  The problem here is that nrpe2 terminates the
> > script so the temp files are left lying around.
> > 
> > In looking for a more elegant solution than having admins clean up
> > temp files manually, or having a cron job clean them up, I tried
> > installing a signal handler in the plugin to guarantee cleanup
> > of the temp files, but it didn't work, so I delved into nrpe2's
> > source a bit to figure out why.  I found that on timeout, nrpe2
> > issues a SIGTERM immediately followed by a SIGKILL.  Since SIGKILL
> > is not catchable, my theory is that the SIGKILL signal arrives
> > before my script has had a chance to run the signal handler for
> > the SIGTERM, thus the cleanup is never done.
> > 
> > So ... I've two questions:
> > 
> > First, does anyone have a suggestion on how to handle this better
> > in the script?
> 
> You should set an alarm and handle it yourself. You could for example
> have your script timeout by itself after 300 seconds, and NRPE
> terminating the script after 350 seconds (or more if it may take longer
> to clean up). See what Perl plugins do for example...

This makes no sense to me.  If I'm going to repeat timeout functionality
in every plugin I write, why would I use Nagios at all?  Might as well
write everything myself and have each script generate an HTML page as
output ...

That probably sounds ridiculously extreme, but my point is that plugin
timeout is something that every plugin needs.  It doesn't seem like
proper design to force every plugin to handle it with its own magic
when POSIX has a signalling methodology that allows it to be centralized
in the framework.

Seems like a lot of redundant code.  It'll get my problem solved for
the time being, but I find it a hack, not a solution.
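
For reference, here is roughly what the self-timeout workaround looks
like. This is a minimal sketch in Python for a Unix host, not code from
any existing plugin: the names run_with_timeout, CheckTimeout and work
are made up for illustration, and the single temp file stands in for
whatever the real plugin creates. The SIGTERM handler only helps if the
process gets time to run it, which is exactly what the immediate SIGKILL
prevents.

```python
import os
import signal
import tempfile

STATE_CRITICAL = 2  # Nagios plugin exit code for CRITICAL


class CheckTimeout(Exception):
    """Raised when the plugin's own alarm or a SIGTERM fires (hypothetical name)."""


def _on_signal(signum, frame):
    # Turn the signal into an exception so the finally block below runs.
    raise CheckTimeout(signal.Signals(signum).name)


def run_with_timeout(work, timeout_s):
    """Run work(tmp_path) under an alarm and always remove the temp file."""
    fd, tmp_path = tempfile.mkstemp(prefix="check_")
    os.close(fd)
    old_alrm = signal.signal(signal.SIGALRM, _on_signal)
    old_term = signal.signal(signal.SIGTERM, _on_signal)
    signal.alarm(timeout_s)  # whole seconds
    try:
        return work(tmp_path)
    except CheckTimeout as exc:
        print(f"CRITICAL - check interrupted by {exc}")
        return STATE_CRITICAL
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, old_alrm)
        signal.signal(signal.SIGTERM, old_term)
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
```

A plugin would end with something like sys.exit(run_with_timeout(check, 300)),
where check is the plugin's own function, with the NRPE timeout set somewhat
higher (350s in the suggestion above) so the alarm always fires first.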

> > Second, I'm curious about the rapid issuance of the TERM/KILL
> > signals.  Is there anything preventing nrpe2 from simply sleep()ing
> > a few seconds between the two signals?  I mean, if I'm willing to
> > wait 300s for success, I'm willing to wait 305s for a clean failure.
> 
> While I agree it doesn't make much sense to TERM and KILL right after,
> the only thing I'd do is remove the TERM. Nagios plugins by design must
> not run indefinitely, so NRPE isn't different. If you sleep between
> both, then how long should it be? This raises many issues, so it's better
> to stick with plugins doing their own timeouts.

How many issues does it raise?  The only one I'm seeing is the "how long
do you sleep between signals" issue.  If there's something I'm missing,
feel free to enlighten me.

The "how long do I sleep" issue is minor.  I can think of two happy
solutions:
1) Add another configuration option to both Nagios and NRPE.
2) (better) make the timeout some fraction of the overall timeout.
   How about t/20+1 (integer division) ... which means a standard 10s
   timeout results in a 1s wait between TERM and KILL, but a 300s timeout results
   in a 16s pause, hardly unreasonable for a plugin that's expected
   to take up to 300 seconds for success.

If there are other issues I'm missing, I'd love to be enlightened, but
the "what should the timeout be?" issue sounds more like an excuse than
an actual design challenge.
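
To make option 2 concrete, here is a sketch of the TERM, wait, KILL
sequence with the t/20+1 grace period. It is written in Python purely
for illustration and is not nrpe2's code (nrpe2 is written in C); the
function names grace_period and run_plugin are made up, and it assumes
a POSIX host where terminate() sends SIGTERM and kill() sends SIGKILL.

```python
import subprocess


def grace_period(timeout_s):
    """t/20 + 1 with integer division, as proposed above."""
    return timeout_s // 20 + 1


def run_plugin(argv, timeout_s):
    """Run argv; on timeout send SIGTERM, wait the grace period, then SIGKILL."""
    proc = subprocess.Popen(argv)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        proc.terminate()  # SIGTERM on POSIX: the plugin can still clean up
        try:
            return proc.wait(timeout=grace_period(timeout_s))
        except subprocess.TimeoutExpired:
            proc.kill()  # SIGKILL on POSIX: cannot be caught
            return proc.wait()
```

With this rule the worst case for a 300s check is 316s before the plugin
is forcibly killed, and a plugin that exits promptly on TERM costs no
extra time at all.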

Just my opinions, I suppose.  Thanks for the helpful feedback.

-- 
Bill Moran
Collaborative Fusion Inc.
http://people.collaborativefusion.com/~wmoran/

wmoran at collaborativefusion.com
Phone: 412-422-3463x4023





More information about the Developers mailing list