BUG/PATCH: Runaway processes under Linux (and others)
bruce
nagios-devel at vicious.dropbear.id.au
Wed Apr 26 16:46:43 CEST 2006
This relates to a number of issues that people have seen with Nagios and
Nsca running under Linux, having many copies of these daemons running, and
eventually running out of memory, frequently crashing the machine. This
post attempts to summarise the problems for those searching the
archives. If you have an OS/distribution/libraries that are susceptible
to this problem, here is a short summary:
You're screwed.
The problem at heart is that Nagios, and Nsca, use function calls after
forking that are either susceptible to race conditions with other
children, have the possibility of blocking, or cancel pending alarm()s.
Depending on your OS/distribution/libraries, usage of such functions
within a fork()ed child may well mean that the alarm timeouts set simply
do not arrive. The child process will sit in an unknown state for a very
long time.
In the case of Nagios, this has a high chance of occurring after it has
fork()ed twice in base/checks.c->run_service_checks(). The main Nagios
process does not know the PID of the grandchild, and has no checks in
place to kill it after a timeout has elapsed. Thus, if the (grand)child
process just sits around, it will not be cleaned up by Nagios.
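To illustrate the double-fork problem, here is a minimal sketch; it is not
the Nagios code. The function name run_detached_with_watchdog() and the pipe
used to observe the grandchild are assumptions made for the example. It shows
that the parent can only wait on the intermediate child, so the grandchild's
own alarm() is the only thing bounding its lifetime.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: double-fork, let the grandchild hang, and rely on the
 * grandchild's own alarm() to end it. Returns 0 once the grandchild is gone,
 * -1 on error. */
int run_detached_with_watchdog(unsigned int timeout_secs)
{
    int fd[2], status;
    pid_t child;
    ssize_t n;
    char buf;

    /* alarm(0) would cancel the watchdog rather than arm it */
    assert(timeout_secs > 0);

    if (pipe(fd) == -1)
        return -1;

    child = fork();
    if (child == -1) {
        close(fd[0]);
        close(fd[1]);
        return -1;
    }

    if (child == 0) {
        /* intermediate child: fork again and exit straight away */
        pid_t grandchild = fork();
        if (grandchild == 0) {
            /* grandchild: the top-level parent never learns this PID, so
             * arm a watchdog before doing anything that may block */
            close(fd[0]);
            signal(SIGALRM, SIG_DFL);
            alarm(timeout_secs);
            for (;;)
                pause();    /* stands in for a check that never returns */
        }
        _exit(grandchild == -1 ? 1 : 0);
    }

    /* parent: the only PID it can wait on is the intermediate child */
    close(fd[1]);
    if (waitpid(child, &status, 0) == -1 ||
        !WIFEXITED(status) || WEXITSTATUS(status) != 0) {
        close(fd[0]);
        return -1;
    }

    /* the grandchild is now the last holder of the write end, so read()
     * returns 0 (EOF) only after it has terminated */
    n = read(fd[0], &buf, 1);
    close(fd[0]);
    return n == 0 ? 0 : -1;
}
```

Calling run_detached_with_watchdog(1) should return after about a second.
Without the alarm() call in the grandchild, the read() in the parent would
never see EOF, which is the runaway case described above.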
In Nsca, there is no timeout set by default, and no reaping of child
processes. Thus, the child process can happily sit in an unknown state
for as long as the parent daemon exists. This happens more often when
Nsca is running but Nagios is not, as the contention for the opening
of the dump file, rather than the command pipe, more often results in
blocking.
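For reference, reaping children in a forking daemon usually looks something
like the sketch below. This is not what Nsca does; the names
install_child_reaper() and spawn_and_count_reaped() are made up for the
example, and the counter exists only so the effect can be observed.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static volatile sig_atomic_t reaped_children = 0;

/* SIGCHLD handler: collect every child that has exited so far. Several
 * exits can be folded into one signal, hence the loop. */
static void reap_children(int sig)
{
    int saved_errno = errno;

    (void)sig;
    while (waitpid(-1, NULL, WNOHANG) > 0)
        reaped_children++;
    errno = saved_errno;
}

/* Install the reaper; returns 0 on success, -1 on error. */
int install_child_reaper(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = reap_children;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART | SA_NOCLDSTOP;
    return sigaction(SIGCHLD, &sa, NULL);
}

/* Demo driver (hypothetical): fork n short-lived children and return how
 * many of them the handler has reaped within roughly five seconds. */
int spawn_and_count_reaped(int n)
{
    struct timespec nap = {0, 10 * 1000 * 1000};    /* 10 ms */
    int i;

    assert(n >= 0);
    reaped_children = 0;
    if (install_child_reaper() == -1)
        return -1;

    for (i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid == -1)
            return -1;
        if (pid == 0)
            _exit(0);
    }

    for (i = 0; i < 500 && reaped_children < n; i++)
        nanosleep(&nap, NULL);
    return reaped_children;
}
```

Note that reaping only clears out children that have already exited. A child
that is stuck in a blocking call never exits, so it still needs a timeout of
its own.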
In practical terms, these two cases manifest themselves as a high number
of Nagios and/or Nsca processes, which are created at a rate slightly
lower than the frequency of service checks being run or results being
submitted. Eventually, this will cause a crash, as very few memory
management schemes properly deal with the death-by-tiny-bites
situation.
Since my normal solution of installing a, shall we say, more
POSIX-compliant OS on the monitoring systems isn't valid in this
particular Fedora-loving Linux camp, some other solutions need to be
found.
In the short term, the Nsca issue can be avoided by invoking
'/etc/init.d/nsca restart' from cron every 5 minutes. A dropped result
every 5 minutes is a comparatively small price to pay. The attached nsca
patch sets up a timeout just after the fork for a new connection, which
solves some of the issues.
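The idea behind the nsca patch can be shown in isolation. The following is a
standalone sketch, assuming a hypothetical function
handle_connection_with_deadline() and a `hang` flag that stands in for a
connection handler that blocks forever; the real change is in the attached
diff.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* SIGALRM handler for the child: leave immediately. _exit() is
 * async-signal-safe, exit() is not. */
static void timeout_exit(int sig)
{
    (void)sig;
    _exit(1);
}

/* Hypothetical helper: fork a child that arms a deadline before doing its
 * work. Returns the child's exit status, or -1 on error. */
int handle_connection_with_deadline(unsigned int deadline_secs, int hang)
{
    int status;
    pid_t pid;

    /* alarm(0) would cancel the timer rather than arm it */
    assert(deadline_secs > 0);

    pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        /* child: set up the timeout before anything that may block */
        signal(SIGALRM, timeout_exit);
        alarm(deadline_secs);

        if (hang)
            for (;;)
                pause();    /* simulated stuck handler */
        _exit(0);           /* work finished in time */
    }

    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

With hang set, the child exits with status 1 once the deadline passes; with
hang clear, it exits with status 0 right away.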
On some systems, a rarer problem shows itself, making the solution to the
Nagios issue somewhat harder. This problem is when a child process,
inheriting the parent's signal handlers, receives a signal (usually
SIGCHLD, sometimes SIGTERM) and then exits, taking out the parent's
lock/pid file. Thus, one no longer knows which process is the legitimate
parent process.
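The lock file problem is easy to reproduce outside Nagios. The sketch below
is an illustration only: cleanup_and_exit() plays the part of the parent's
signal handler, lock_survives_child_signal() is a made-up test driver, and
the lock path is whatever the caller passes in.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static const char *lock_file_path;

/* Stand-in for the daemon's handler: remove the lock file and leave. */
static void cleanup_and_exit(int sig)
{
    (void)sig;
    unlink(lock_file_path);
    _exit(0);
}

/* Hypothetical driver: create a lock file, fork a child that receives
 * SIGTERM, and report whether the lock file is still there afterwards.
 * Returns 1 if it survived, 0 if the child removed it, -1 on error. */
int lock_survives_child_signal(const char *path, int reset_in_child)
{
    int fd, exists;
    pid_t pid;

    assert(path != NULL);
    lock_file_path = path;

    fd = open(path, O_CREAT | O_WRONLY, 0600);
    if (fd == -1)
        return -1;
    close(fd);

    signal(SIGTERM, cleanup_and_exit);

    pid = fork();
    if (pid == 0) {
        /* child: the parent's handler is inherited unless we reset it */
        if (reset_in_child)
            signal(SIGTERM, SIG_DFL);
        raise(SIGTERM);
        _exit(0);
    }

    /* parent: put things back the way they were for the demo */
    signal(SIGTERM, SIG_DFL);
    if (pid == -1 || waitpid(pid, NULL, 0) == -1) {
        unlink(path);
        return -1;
    }

    exists = (access(path, F_OK) == 0);
    unlink(path);
    return exists;
}
```

With reset_in_child set to 0 the child runs the inherited handler and the
lock file disappears while the parent is still alive. With it set to 1 the
child dies from the default action and the lock file is left alone.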
Tracking down this rare problem (which happens all too often to suit me)
led me to creating the attached Nagios patch, which turns off daemon_mode
right away after forking (so the lock file doesn't get deleted if a stray
signal comes in), resets the signal handlers a bit earlier in the children
(so the parent's signal handlers aren't triggered) and reinstates the
alarm before talking to the parent (rather than no timeout). Overall, I'd
much rather have missing test results (and Nagios trying the service check
again) than have my machines nibbled to death.
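To show why the alarm is re-armed rather than cancelled before reporting
back, here is a small sketch of a child writing to a pipe that nobody is
draining. The function name signal_that_ended_blocked_writer() is invented
for the example, and the pipe stands in for whatever channel leads back to
the parent.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that writes to a pipe the parent never
 * reads. Returns the number of the signal that ended the child, or -1 if it
 * was not killed by a signal or something else failed. */
int signal_that_ended_blocked_writer(unsigned int timeout_secs)
{
    int fd[2], status;
    pid_t pid;
    char chunk[4096];

    /* alarm(0) is exactly the "no timeout" case being avoided */
    assert(timeout_secs > 0);

    if (pipe(fd) == -1)
        return -1;

    pid = fork();
    if (pid == -1) {
        close(fd[0]);
        close(fd[1]);
        return -1;
    }

    if (pid == 0) {
        close(fd[0]);
        memset(chunk, 'x', sizeof(chunk));

        /* re-arm the timeout instead of cancelling it */
        signal(SIGALRM, SIG_DFL);
        alarm(timeout_secs);

        /* once the pipe buffer is full, write() blocks */
        for (;;)
            if (write(fd[1], chunk, sizeof(chunk)) == -1)
                _exit(2);
    }

    /* parent: keep the read end open but never read from it */
    close(fd[1]);
    if (waitpid(pid, &status, 0) == -1) {
        close(fd[0]);
        return -1;
    }
    close(fd[0]);
    return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

The child is terminated by SIGALRM once the timeout expires. Had it called
alarm(0) before writing, the waitpid() in the parent would block for as long
as the pipe stayed full.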
With these patches on, the rate of stray process creation has dropped, but
I am still seeing occasional orphaned processes around; i.e., I've fixed
some of the symptoms, but not the actual cause. That will take some more
rewrites.
--==--
Bruce.
-------------- next part --------------
*** src/nsca.c 2006/04/26 12:56:18
--- src/nsca.c 2006/04/26 13:00:50
***************
*** 254,259 ****
--- 254,264 ----
exit(return_code);
}
+ /* exit with a dirty feeling */
+ static void signal_exit( int sig ){
+ _exit(1);
+ }
+
/* read in the configuration file */
***************
*** 750,755 ****
--- 755,764 ----
return;
}
else{
+ /* Set up a timeout for our doom */
+ signal(SIGALRM,signal_exit);
+ alarm( 120 );
+
/* child does not need to listen for connections */
close(sock);
}
-------------- next part --------------
*** base/checks.c 2006/04/26 12:47:04
--- base/checks.c 2006/04/26 13:40:15
***************
*** 68,73 ****
--- 68,75 ----
extern int check_service_freshness;
extern int check_host_freshness;
+ extern int daemon_mode;
+
extern time_t program_start;
extern timed_event *event_list_low;
***************
*** 378,383 ****
--- 380,392 ----
/* if we are in the child process... */
else if(pid==0){
+ /* Turn off daemon_mode right away so the lock file is not
+ * deleted. */
+ daemon_mode=FALSE;
+
+ /* reset signal handling */
+ reset_sighandler();
+
/* set environment variables */
set_all_macro_environment_vars(TRUE);
***************
*** 448,454 ****
#endif
/* reset the alarm */
! alarm(0);
/* get the check finish time */
gettimeofday(&end_time,NULL);
--- 457,463 ----
#endif
/* reset the alarm */
! alarm(service_check_timeout);
/* get the check finish time */
gettimeofday(&end_time,NULL);
***************
*** 497,503 ****
pclose_result=pclose(fp);
/* reset the alarm */
! alarm(0);
/* get the check finish time */
gettimeofday(&end_time,NULL);
--- 506,512 ----
pclose_result=pclose(fp);
/* reset the alarm */
! alarm(service_check_timeout);
/* get the check finish time */
gettimeofday(&end_time,NULL);
*** base/utils.c 2006/04/26 12:48:26
--- base/utils.c 2006/04/26 13:16:31
***************
*** 2721,2726 ****
--- 2721,2732 ----
/* execute the command in the child process */
if (pid==0){
+ /* Turn off daemon_mode right away */
+ daemon_mode=FALSE;
+
+ /* reset signal handling */
+ reset_sighandler();
+
/* become process group leader */
setpgid(0,0);
***************
*** 2732,2740 ****
free_memory();
#endif
- /* reset signal handling */
- reset_sighandler();
-
/* close pipe for reading */
close(fd[0]);
--- 2738,2743 ----
***************
*** 2788,2796 ****
/* close pipe for writing */
close(fd[1]);
- /* reset the alarm */
- alarm(0);
-
_exit(status);
}
--- 2791,2796 ----
***************
*** 2842,2850 ****
/* close pipe for writing */
close(fd[1]);
- /* reset the alarm */
- alarm(0);
-
/* clear environment variables */
set_all_macro_environment_vars(FALSE);
--- 2842,2847 ----