Nagios blocking/stalling: Thread issue? v 2.0b3 or 2.0rc2

Andreas Ericsson ae at op5.se
Thu Jan 12 01:52:00 CET 2006
Ben Miller wrote:
> Greetings,
> I am seeing a strange behavior with Nagios that appears to be a
> threading issue.  I have troubleshot this enough to determine that it
> may be over my head and have to do with how threads are handled in
> Nagios or the libraries it uses.  I believe this to be a code level
> issue so I am posting to the devel list vs the user list.  Please
> forgive if this is the wrong place.
> 

This is most certainly the right place, unless we find it to be a bug in 
a library you use, but in that case it's sort of the right place anyway 
since the program isn't behaving as per documentation.


> -Symptom
> When I run Nagios it takes about 30 - 60 seconds to load saved state
> information such as scheduled down times, etc. and it takes upwards of
> 60-120 seconds to process external commands.  In addition, the check
> queue stacks up because it is only processing one check at a time.  A ps
> shows ONLY the main Nagios process, a single child, and that child
> spawning the check command.  It appears as if nothing else (external
> commands, notifications, etc) is being processed while the one child
> task is working.
> 

I'm not sure, but it's most likely due to one of two reasons:

* A plugin that's being run is stuck in uninterruptible I/O. This can 
happen when you're checking a partition that resides on a 
network-mounted filesystem and the network connection is down for some 
reason. It can also happen, in rarer cases, when a process with higher 
priority is holding a lock on some resource that the plugin is trying to 
use.

* There's a bug in Nagios causing one of the parent's threads to hold a 
mutex that isn't released before the child is spawned, so the child 
inherits the locked mutex but has no way of releasing it. I know for a 
fact that Nagios does things considered illegal for multithreaded 
programs after fork()'ing, so this might be it. It should work well 
under Linux with a reasonably up-to-date kernel and libraries, but...


> During troubleshooting, I ran Nagios in an strace to determine what it
> was blocking on and I can clearly see that it is stopping during a
> "wait4(" on the pid of the checking or alerting child.
> 

What version of plugins are you running? Which check is running when it 
hangs?


> I ran an strace -f on nagios to see the full thread flow of what was
> happening and Nagios performed perfectly.  The problem went away and
> external checks were processed in a few seconds and ps shows a list of
> half a dozen or so check or alert child processes.
> 
> In addition, when I compiled with all debugging turned on and ran Nagios
> by itself, the bad behavior was back.  However when I run the debug
> executable through strace (with NO -f) the process starts up
> excruciatingly slowly, but then runs properly with multiple child
> processes and handling external commands properly.
> 

So in essence it always happens when you run Nagios, no matter how you 
compiled it, but never when you're running it from strace?


> The problem occurs consistently and is easy to replicate.  It occurs
> with versions 2.0b3 or rc2.  I have tested both.
> 

Have you tried this with 2.0rc1 or 2.0rc2?
Do you get any messages in nagios.log that look something like this?

service_result_worker_thread: poll(): (text representation of errno)


> -Background
> I have been running Nagios with the same version on a different box with
> the exact same compile options and config files for months and
> everything is working fine.  I am upgrading from an AMD 32 bit system
> (RedHat Enterprise v4) to a new box with Dual 64 bit Opterons running
> (RedHat Enterprise v4 64bit).
> 

Are you going to do this upgrade or have you already done it? Was the 
kernel compiled with a 64-bit compiler? Were glibc and the thread library 
compiled with a 64-bit compiler? What versions of the kernel, glibc and 
the thread library are you using? What flavour of thread library are you 
using (LinuxThreads or NPTL)?

> I compile with: ./configure --prefix=/home/nagios/nagios
> --with-cgiurl=/nagios/cgi-bin --with-nagios-user=nagios
> --with-nagios-group=nagios --with-htmurl=/nagios --with-perlcache
> --enable-embedded-perl
> 

Try disabling embedded perl. When embedded perl is enabled (particularly 
with caching), the routine Nagios goes through after the fork() call is 
quite frankly so thread-unsafe that it's a miracle that it works 
anywhere at all.

> It seems that there might be a thread/race/timing issue that is relieved
> when there is enough debugging or if strace is involved in the thread
> handling.
> 

A Heisenbug... Nasty stuff. Running things through strace unfortunately 
causes different rules to apply for signals and mutexes (strace reads 
the output of the child process directly, so there is less locking going 
on). And since everything runs a lot slower under strace, mutexes that 
would otherwise still be held have time to be released before the 
fork() call.


> I can provide more information if there is someone(s) who can help me
> resolve this issue.


Homework one is to come up with answers for all those questions I asked.

Fix-attempt one is to try the newest release of Nagios available. In 
particular I think you'll need the patch I submitted 2005-05-05 (after 
2.0b3 was released), which adds a couple of flag macros that are supposed 
to alter the behaviour of the C pre-processor somewhat.

Fix-attempt two is to try recompiling with both embedded perl and the 
perl cache disabled.

Keep us posted, will you?

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

