FreeBSD and Nagios 2.0

Christophe Yayon lists at nbux.com
Tue Nov 15 19:24:31 CET 2005


Hi all,

I have already posted some questions about this threads implementation
problem. I have experienced problems with FreeBSD 5.3-STABLE and 5.4-STABLE
(latest). I didn't try the latest 6.0 release... I have switched to
Linux ;-)

Here are some posts from the freebsd-hackers mailing list, in case they help ...


---------------------------------
 >> Hi again,
 >>
 >> After some discussions on the freebsd-hackers mailing list, here is a
 >> summary:
 >>
 >> 1. There is a recommendation (or a suggestion) for what to do after a
 >> fork():
 >> http://www.opengroup.org/onlinepubs/009695399/functions/pthread_atfork.html
 >> In other words: "It is suggested that programs that use fork() call an
 >> exec function very soon afterwards in the child process, thus resetting
 >> all states. In the meantime, only a short list of async-signal-safe
 >> library routines are promised to be available."
 >> Note *suggested*. This is a recommendation to protect against a shoddy
 >> pthread-implementation. The thread specifications rule that only the
 >> thread calling fork() is duplicated, which initially leads to the
 >> recommendation (other threads holding locks aren't around to release
 >> them in the new execution context).
 >>
 >>
 >> 2. Here is what Nagios appears to do after a fork:
 >> in base/util.c:
 >>         (1) Become the process group leader by calling setpgid(0, 0);
 >>         (2) something called set_all_macro_environment_vars(TRUE).
 >>             This calls snprintf a bunch, as well as set variables
 >>             by saving them to malloced memory.  This save is done
 >>             with strcpy and strcat.  setenv is then called to try to
 >>             export them.  memory is then freed with free(3).
 >>         (3) All signal handlers are reset
 >>         (4) The right part of the pipe is closed
 >>         (5) A SIGALRM handler is installed and an alarm is set.
 >>         (6) Checks to see if it is executing an embedded perl script,
 >>             then tries to execute it if so.  This has the feel of
 >>             being too much after the fork.
 >>         (7) Calls popen on the command if not.
 >>         (8) Reads the output of the command using fgets.
 >>         (9) closes the other end of the pipe
 >>         (10) unsets all env vars.
 >>         (11) Calls _exit()
 >>
 >> in base/checks.c
 >>         (1) set_all_macro_environment_vars(TRUE)
 >>         (2) forks again
 >>         (3) grandchild:
 >>                 resets handler, setpgid, etc.
 >>                 if perl script, do embedded perl, otherwise popen.
 >>                 lots of read/write to pipe.
 >>
 >> Likewise, in base/commands.c fork is also called for similar things.
 >> There are other places that also call popen...
 >>
 >>
 >> 3. You can only execute async-signal-safe functions after a fork()
 >> from a threaded application.  free(), malloc(), popen(), fgets(),
 >> are not async-signal-safe.


In a proper implementation they are. Read malloc/malloc.c from
glibc-2.3.5 and you'll see. The first line of it reads

"/* Malloc implementation for multiple threads without lock contention"

fgets() must also be async-safe, since it's passed its storage-buffer
from the calling function. It can contain races if several threads (or
programs for that matter) try to read FIFOs at the same time or are
trying to store things to the same piece of memory, but that's neither
new, strange or in any way non-obvious. Obviously, fgets() relies on
lower-level IO code which must be thread-safe (read() in this case) on
account of them being syscalls inside multitasking kernels.

popen() forks and calls execve immediately. If this isn't thread-safe
then there's no way of executing external programs in multithreaded
applications short of implementing popen() directly (which isn't exactly
difficult, but still).
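
As a rough sketch of what "implementing popen() directly" could look like: the helper below (run_and_capture() is a hypothetical name, not something from the Nagios source) builds everything it needs before fork(), and the child then calls only dup2(), close(), execve() and _exit(), all of which are on the async-signal-safe list. It assumes /bin/sh exists and that capturing a bounded amount of stdout is enough.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Hypothetical helper: run `cmd` through /bin/sh -c and capture up to
 * buflen - 1 bytes of its stdout in buf (always NUL-terminated).
 * Returns the command's exit status, or -1 on error or abnormal exit. */
int run_and_capture(const char *cmd, char *buf, size_t buflen)
{
    /* Built before fork() so the child never has to allocate memory. */
    char *const argv[] = { "sh", "-c", (char *)cmd, NULL };
    int fds[2];
    int status;
    pid_t pid;
    size_t used = 0;
    ssize_t n;

    assert(cmd != NULL && buf != NULL && buflen > 0);

    if (pipe(fds) == -1)
        return -1;

    pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: only async-signal-safe calls from here to execve(). */
        close(fds[0]);
        if (fds[1] != STDOUT_FILENO) {
            if (dup2(fds[1], STDOUT_FILENO) == -1)
                _exit(127);
            close(fds[1]);
        }
        execve("/bin/sh", argv, environ);
        _exit(127);             /* reached only if execve() failed */
    }

    /* Parent: read the child's output until EOF or until buf is full. */
    close(fds[1]);
    while (used < buflen - 1 &&
           (n = read(fds[0], buf + used, buflen - 1 - used)) > 0)
        used += (size_t)n;
    buf[used] = '\0';
    close(fds[0]);

    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A caller would do something like run_and_capture("echo hello", out, sizeof out) and then inspect out and the return value. This sketch does not retry read() on EINTR and has no timeout, both of which a real check runner would need.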


 >>  The list of async-signal-safe functions
 >> are here: http://www.opengroup.org/onlinepubs/009695399/nframe.html
 >> The restriction on fork() is here (20th bullet down):
 >> http://www.opengroup.org/onlinepubs/009695399/nframe.html
 >>


Both of those links point to the same document, which is just the
frameset for the navigation-frames.

For async-safe functions, this is the proper url;
http://www.opengroup.org/onlinepubs/009695399/functions/xsh_chap02_09.html#tag_02_09_01

For the fork() specification, the doc is here;
http://www.opengroup.org/onlinepubs/009695399/functions/fork.html

The 20th bullet is this:
-----------
"A process shall be created with a single thread. If a multi-threaded
process calls fork(), the new process shall contain a replica of the
calling thread and its entire address space, possibly including the
states of mutexes and other resources. Consequently, to avoid errors,
the child process may only execute async-signal-safe operations until
such time as one of the exec functions is called. [THR] [Option Start]
Fork handlers may be established by means of the pthread_atfork()
function in order to maintain application invariants across fork()
calls. [Option End]

When the application calls fork() from a signal handler and any of the
fork handlers registered by pthread_atfork() calls a function that is
not asynch-signal-safe, the behavior is undefined."
-----------

Also note that "From the application's perspective, a fork() call should
appear atomic." which implicitly makes fork() an async-safe function,
although the code executed after it may not be. It also warns that improper
implementations make it less so.
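
The fork handlers mentioned in the quoted text can be sketched roughly like this. It is only an illustration under assumptions: state_lock stands in for some application lock, and the function names are hypothetical. The prepare handler takes the lock before fork(), so no other thread can be holding it at the moment the address space is copied, and both parent and child release it afterwards.

```c
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical application lock protecting some shared state. */
static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER;

/* Runs in the forking thread just before fork(): take the lock so that
 * no other thread owns it when the address space is duplicated. */
static void before_fork(void)
{
    pthread_mutex_lock(&state_lock);
}

/* Runs in the parent after fork(): give the lock back. */
static void after_fork_parent(void)
{
    pthread_mutex_unlock(&state_lock);
}

/* Runs in the child after fork(): the child's only thread is the copy
 * of the thread that locked in before_fork(), so it may unlock. */
static void after_fork_child(void)
{
    pthread_mutex_unlock(&state_lock);
}

/* Register the three handlers; returns 0 on success. */
int install_fork_handlers(void)
{
    return pthread_atfork(before_fork, after_fork_parent, after_fork_child);
}

/* Fork once and report whether the child could take state_lock:
 * 1 if yes, 0 if no, -1 on error. */
int child_can_take_lock(void)
{
    int status;
    pid_t pid = fork();

    if (pid == -1)
        return -1;
    if (pid == 0)
        _exit(pthread_mutex_trylock(&state_lock) == 0 ? 0 : 1);
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 0 ? 1 : 0;
}
```

This only covers locks the application owns. Locks hidden inside libc (the malloc arena, FILE locks and so on) are out of the application's reach, which is why the standard still restricts the child to async-signal-safe operations.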



 >>
 >> 4. Some FreeBSD developers say that fork() is handled differently in
 >> libpthread (and probably libthr) than it was in libc_r. They thought it
 >> better not to try to reinitialize libpthread (and to some extent libc)
 >> because it is messy, and better to expose non-portable applications.
 >>


This is funny, because nagios apparently runs properly on Linux, HPUX,
Solaris, Irix, AIX and Tru64. To me that seems to indicate that Nagios
is very portable indeed and that the BSD fellows somehow botched it. I
might be wrong, but...


 >>
 >>
 >> Possible solutions:
 >>
 >> a. (the best, I think) Modify the Nagios code to respect the
 >> recommendation (1.). We are talking about portability and not
 >> performance...
 >>


This would involve a fairly large change in the way things are done. I
for one am all for implementing a different parallelisation mechanism
but I'm fairly certain Ethan won't be too thrilled if I rewrite 40% of
the code that's currently the Nagios core.


 >> b. A possible workaround for Nagios on FreeBSD (and, I think, other Unix
 >> systems except Linux) is to use another threads library. On FreeBSD,
 >> using GNU/pth (which is in the ports) seems to completely
 >> resolve the problem (but I think it's ugly to have to use another,
 >> non-native threads lib...).
 >>
 >>
 >>
 >> What do you think about this?



In summary; Some thread-libraries work while others don't (the native
*BSD one being the only one that doesn't), I'd say it's time to fix that
thread-library, although I favor the rewrite-nagios approach as an
exercise in intellectual masturbation and would be quite willing to do
the actual work of it, provided I can be somewhat sure it isn't wasted.
-----------------------
This posting demonstrates a fundamental confusion between thread-safe 
and async-safe.  That is the root of the problem in the communication. 
Thread-safe functions are a dime a dozen and relatively easy to write. 
Async-safe functions are very rare and much harder to do useful things
with.  I've tried to explain the difference below using fgets() as an 
example of the difficulties.

 > > fgets() must also be async-safe, since it's passed its storage-buffer
 > > from the calling function. It can contain races if several threads (or
 > > programs for that matter) try to read FIFOs at the same time or are
 > > trying to store things to the same piece of memory, but that's neither
 > > new, strange or in any way non-obvious. Obviously, fgets() relies on
 > > lower-level IO code which must be thread-safe (read() in this case) on
 > > account of them being syscalls inside multitasking kernels.

fgets need not be async-safe, but it does need to be thread-safe.
When one forks after pthread_create, one may only call async-safe
functions.  A function that meets the weaker requirements of thread safety
is not necessarily async-safe.  If two different threads call fgets(),
mutexes will keep one thread from running if the other is in the
middle of changing the FILE * internal state.  However, if that thread
is interrupted by the scheduler with the mutex held, and fork() is
called, then the new copy of the address space will still have that
mutex held.  Any attempt by this new process, with its own address
space, to acquire the lock is doomed to failure.  Since the parent and
child execute in different address spaces, there is no way for a
thread that does not exist in the child to unlock the locked mutex.


Normally this happens like so:

	Thread A				Thread B

	fgets(b1, 10, fp);
		lock fp's mutex
		copy 5 available bytes into b1
<thread scheduler interrupts here>
						fgets(b2, 10, fp)
						try lock fp's mutex
<thread scheduler puts B on the pending list, maybe resuming A>
		unlock fp's mutex
	return
<thread scheduler wakes up B>
						attempt to lock finishes
						b2 can be updated
						unlock mutex.

However, in the fork case:

	Thread A				Thread B

	fgets(b1, 10, fp);
		lock fp's mutex
		copy 5 available bytes into b1
<thread scheduler interrupts here>
						fork()
	<thread A is now gone in child>
						fgets(b2, 10, fp)
						try lock fp's mutex
At this point B', the only thread in the child, will never be able to
grab this lock because A exists only in the parent and the
parent/child have independent address spaces.
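
The fork case above can be reproduced with a small sketch. The names here (demo_lock, holder(), child_sees_held_lock()) are hypothetical, and a plain pthread mutex stands in for the lock inside a FILE. The child calls pthread_mutex_trylock(), which is itself not async-signal-safe; it is used only because a blocking lock would hang the demonstration. The result assumes an implementation that copies mutex state into the child, which the standard permits but does not require.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;

struct holder_ctx {
    int locked_wr;   /* holder writes here once it owns demo_lock */
    int release_rd;  /* holder blocks here until told to unlock */
};

/* Plays "Thread A": takes the lock and keeps it across the fork(). */
static void *holder(void *arg)
{
    struct holder_ctx *ctx = arg;
    char byte = 'x';
    ssize_t n;
    int rc = pthread_mutex_lock(&demo_lock);

    assert(rc == 0);
    n = write(ctx->locked_wr, &byte, 1);      /* "I hold the lock" */
    if (n == 1)
        n = read(ctx->release_rd, &byte, 1);  /* wait for the go-ahead */
    (void)n;
    rc = pthread_mutex_unlock(&demo_lock);
    assert(rc == 0);
    return NULL;
}

/* Plays "Thread B": forks while A holds the lock.
 * Returns 1 if the child found the lock busy, 0 if free, -1 on error. */
int child_sees_held_lock(void)
{
    int locked[2], release[2], status, result = -1;
    struct holder_ctx ctx;
    pthread_t tid;
    pid_t pid;
    char byte = 'x';

    if (pipe(locked) == -1)
        return -1;
    if (pipe(release) == -1) {
        close(locked[0]);
        close(locked[1]);
        return -1;
    }
    ctx.locked_wr = locked[1];
    ctx.release_rd = release[0];

    if (pthread_create(&tid, NULL, holder, &ctx) == 0) {
        if (read(locked[0], &byte, 1) == 1) {   /* A now holds the lock */
            pid = fork();
            if (pid == 0)   /* child: thread A does not exist here */
                _exit(pthread_mutex_trylock(&demo_lock) == EBUSY ? 1 : 0);
            if (pid > 0 && waitpid(pid, &status, 0) == pid &&
                WIFEXITED(status))
                result = WEXITSTATUS(status);
        }
        if (write(release[1], &byte, 1) != 1)   /* let A unlock */
            result = -1;
        pthread_join(tid, NULL);
    }
    close(locked[0]);
    close(locked[1]);
    close(release[0]);
    close(release[1]);
    return result;
}
```

If child_sees_held_lock() returns 1, the child inherited a locked mutex with no thread left to unlock it; a pthread_mutex_lock() there would block forever. Build with the platform's thread flag (for example -pthread).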

While the above example is not what nagios is doing, it illustrates
the point.  There are some functions that necessarily touch global
state.  These functions need to coordinate that touching of state.  If
one of them is interrupted with locks held, then all bets are off if a
program forks and the threads holding those locks can never unlock
them.

 > >  >>  The list of async-signal-safe functions
 > >  >> are here: http://www.opengroup.org/onlinepubs/009695399/nframe.html
 > >  >> The restriction on fork() is here (20th bullet down):
 > >  >> http://www.opengroup.org/onlinepubs/009695399/nframe.html
 > >
 > > Both of those links point to the same document, which is just the
 > > frameset for the navigation-frames.
 > >
 > > For async-safe functions, this is the proper url;
 > > http://www.opengroup.org/onlinepubs/009695399/functions/xsh_chap02_09.html#tag_02_09_01

This reference is for thread-safe functions.  You are confusing
thread-safe and async-safe.  The correct url for async-safe is

http://www.opengroup.org/onlinepubs/009695399/functions/xsh_chap02_04.html#tag_02_04_03

 >> >> The following table defines a set of functions that shall be either
 >> >> reentrant or non-interruptible by signals and shall be
 >> >> async-signal-safe. Therefore applications may invoke them, without
 >> >> restriction, from signal-catching functions:
 >> >>	<list omitted, since it has been posted before>

Notice that this list is very short, and there are many functions that
one would think should be on here, but in fact aren't.
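
To give a feel for how restrictive the list is, here is a minimal sketch of reporting an error from a forked child without touching stdio. safe_report() is a hypothetical helper; it relies only on write(), which is on the list, and counts the string length by hand rather than assuming anything about the string library.

```c
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write a NUL-terminated message to fd using only
 * write(). No stdio, no malloc, no locks.
 * Returns 0 on success, -1 on error. */
int safe_report(int fd, const char *msg)
{
    size_t len = 0;
    size_t done = 0;
    ssize_t n;

    if (msg == NULL)
        return -1;

    /* Compute the length by hand. */
    while (msg[len] != '\0')
        len++;

    /* write() may accept fewer bytes than requested, so loop. */
    while (done < len) {
        n = write(fd, msg + done, len - done);
        if (n < 0)
            return -1;
        done += (size_t)n;
    }
    return 0;
}
```

In a child between fork() and exec, something like safe_report(STDERR_FILENO, "exec failed\n") followed by _exit(127) stays within the permitted set, whereas fprintf(stderr, ...) does not.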

 > > For the fork() specification, the doc is here;
 > > http://www.opengroup.org/onlinepubs/009695399/functions/fork.html
...
 > > "A process shall be created with a single thread. If a multi-threaded
 > > process calls fork(), the new process shall contain a replica of the
 > > calling thread and its entire address space, possibly including the
 > > states of mutexes and other resources. Consequently, to avoid errors,
 > > the child process may only execute async-signal-safe operations until
 > > such time as one of the exec functions is called.

Notice here it says specifically 'async-signal-safe operations' not
'thread-safe' operations.  The standard explicitly calls attention to
the difficulties and differences between these two types of functions.

 > > This is funny, because nagios apparently runs properly on Linux, HPUX,
 > > Solaris, Irix, AIX and Tru64. To me that seems to indicate that Nagios
 > > is very portable indeed and that the BSD fellows somehow botched it. I
 > > might be wrong, but...

Just because it works doesn't make it standards conforming.

Maybe there's some simple extension that can be implemented to help
the situation.  However, the inflammatory language gets very much in
the way of having a technical discussion.

Warner
----------------------------


