BUG/PATCH: Runaway processes under Linux (and others)
bruce
nagios-devel at vicious.dropbear.id.au
Thu Apr 27 11:49:10 CEST 2006
On Thu, 27 Apr 2006, Andreas Ericsson wrote:
> bruce wrote:
> Anyways, this:
>
>
> + /* exit with a dirty feeling */
> + static void signal_exit( void ){
> + _exit(1);
> + }
> +
>
> is wrong. The prototype for signal handlers must be
>
> void signal_exit(int signum);
>
> The static keyword is of course optional and valid.
>
> Otherwise it looks like a good patch.
Ah. I bow to your greater C-Fu ;). Duly edited and applied on my working
copy.
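For the archives, a minimal sketch of the handler with the corrected prototype. The install_exit_handler() helper and the use of sigaction() are my own illustration and an assumption, not necessarily how the patch or Nagios registers its handlers:

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <unistd.h>

/* exit with a dirty feeling */
static void signal_exit(int signum)
{
    (void)signum;   /* the signal number is not needed here */
    _exit(1);       /* async-signal-safe, unlike exit() */
}

/* Hypothetical helper (not from the patch): install signal_exit()
 * for one signal. Returns 0 on success, -1 on error. */
static int install_exit_handler(int signum)
{
    struct sigaction sa;

    sa.sa_handler = signal_exit;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(signum, &sa, NULL);
}
```

Using _exit() rather than exit() means atexit() handlers and stdio flushing inherited from the parent are skipped in the child.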
>> On some systems, a rarer problem shows itself, making the solution to the
>> Nagios issue somewhat harder. This problem is when a child process,
>> inheriting the parent's signal handlers, receives a signal (usually
>> SIGCHLD, sometimes SIGTERM) and then exits, taking out the parent's
>> lock/pid file. Thus, one no longer knows which process is the legitimate
>> parent process.
>
> If nagios' grandchildren (the ones that popen() commands) receive SIGCHLD
> from anything but the check they're running, something is very, very wrong with
> the system you're using. Are you perhaps using the old and deprecated
> NGPT-library?
The grandchild occurs in run_system_checks(), and I haven't caught child
processes created from that segment of code removing the lock file,
although this may be unwillingness on my part to fully match up the debug
output ;). ( For the record, the thread library used according to
'getconf GNU_LIBPTHREAD_VERSION', is 'NPTL 2.3.6' ).
The lock removal instead seems to be occurring with the child process
created in my_system(), which sometimes stalls at a point before the
signal handlers get reset (or they don't get reset, my debugging
statements weren't fine-grained enough). When the parent sends a TERM
signal to the child when it is in this state (due to timeout), the child
runs the signal handlers inherited from the parent, removing the lock
file.
>> With these patches on, the rate of stray process creation has dropped, but
>> I am still seeing occasional orphaned processes around;
Overnight, I had one machine fail due to the death-by-nibbles problem,
which due to its location and sudden lack of boot sector, will be a
two-banana fix. As an interim fix, the remaining machines are now
restarting Nagios every two hours from cron, although this smacks of
inelegance.
>> ie, I've fixed some
>> of the symptoms, but not the actual cause. That will take some more
>> rewrites.
>
> Yup. The choice of a FIFO pipe for passing check-results back to the master
> process was unfortunately a bad one which is now irrevocable without major
> code-surgery.
Yes. It has scaling issues which do not show themselves in small
installations (say, under 100 service checks).
--
Bruce Campbell