Nagios blocking/stalling: Thread issue? v 2.0b3 or 2.0rc2
Ben Miller
bgmiller at nframe.com
Thu Jan 12 05:25:03 CET 2006
My latest tests and findings:
a) disable embedded perl and perl-cache
I did this and the results were exactly the same as before
b) move /home to local volume
Again, the results are the same as before, no improvement.
c) get cvs version of nagios and try it (with the above two changes in
place) If it works, I will reverse b then a and see where/if it breaks.
I downloaded the snapshot and still the behavior is the same as
originally described.
During these tests I observed the following behavior.
The threading seems to start up OK and I see the proper number of checks
occurring. A lot of them are SNMP checks. When the first
check_ping process starts I see the following process tree, and slowly
the other checking threads die off until only one thread remains. The
remaining thread is the check_ping thread. When it finally completes,
only one check at a time is performed from then on. This seems to
support your thought that a child process is blocking the parent somehow.
29637 pts/1 Sl+ 0:00 \_ ../bin/nagios nagios.cfg
30056 pts/1 S 0:00     \_ ../bin/nagios nagios.cfg
30057 pts/1 S 0:00         \_ /home/nagios/nagios/libexec/check_ping -p 10 -H 192.168.10.10 -w 100:60% -c 600:100%
30058 pts/1 S 0:00             \_ /bin/ping -n -U -w 16 -c 10 192.168.10.10
I upgraded to the latest plugins and this behavior remains. Somehow
strace -f seems to handle the check_ping blockage and lets the app behave
properly.
I am out of ideas for what to test next. Does this evidence help? What
is the next step?
Thanks again,
Ben
-----Original Message-----
From: nagios-devel-admin at lists.sourceforge.net
[mailto:nagios-devel-admin at lists.sourceforge.net] On Behalf Of Ben Miller
Sent: Wednesday, January 11, 2006 9:35 PM
To: Andreas Ericsson
Cc: nagios-devel at lists.sourceforge.net
Subject: RE: [Nagios-devel] Nagios blocking/stalling: Thread issue? v 2.0b3 or 2.0rc2
Andreas,
Thank you for your insight!
> I'm not sure, but it's most likely due to one of two reasons:
>
> * A plugin that's being run is stuck in uninterruptible IO. This can
> happen when you're trying to check a partition residing on a network
> mounted media where the network connection for some reason is down. It
> can also happen under spurious circumstances where a process with higher
> priority is holding a lock on some resource that the plugin is trying to
> use.
>
> * There's a bug in Nagios causing it to hold a mutex in one of the
> parent's threads that isn't released before the child is spawned, so the
> child inherits the mutex but has no way of releasing it. I know for a
> fact that Nagios does things considered illegal for multithreaded
> programs after fork()'ing, so this might be it. It should work well
> under Linux with reasonably up-to-date libraries and kernel though, but...
>
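The second possibility above (a lock held by another thread at the moment of fork()) can be reproduced in isolation. The following is only a minimal sketch of that general hazard, written in Python rather than C for brevity; it is not Nagios code, and the function and variable names (child_can_take_lock, holder, and so on) are made up for illustration. A helper thread takes a lock, the main thread forks, and the child then tries to take the same lock with a timeout.

```python
import os
import threading


def child_can_take_lock(timeout=0.5):
    """Fork while another thread holds a lock; report whether the child got it."""
    lock = threading.Lock()
    held = threading.Event()
    done = threading.Event()

    def holder():
        with lock:
            held.set()   # tell the main thread the lock is now held
            done.wait()  # keep holding it until the parent says stop

    t = threading.Thread(target=holder)
    t.start()
    held.wait()

    pid = os.fork()
    if pid == 0:
        # Child: only the forking thread exists here, so the thread that
        # owns the lock is gone and nothing can ever release it.
        got = lock.acquire(timeout=timeout)
        os._exit(0 if got else 1)

    # Parent: collect the child's verdict, then let the holder thread finish.
    _, status = os.waitpid(pid, 0)
    done.set()
    t.join()
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0


if __name__ == "__main__":
    print("child acquired inherited lock:", child_can_take_lock())
```

Without the timeout the child would simply block forever, which is the kind of stall a forked checker process would show if it inherited a held mutex. Whether this is what actually happens inside Nagios is not established by the sketch.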
I did leave out a valuable bit of information. The /home directory
itself is nfs mounted on the box running nagios. The nagios binaries
reside on the mount itself. In light of your suggestion, my very next
test will be to copy /home locally and eliminate this variable.
However, I do not see any processes in the ps list that show as
uninterruptible or disk-wait.
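A single ps listing can miss a process that is only briefly in uninterruptible sleep (state D), so it may help to poll for it. The sketch below assumes a Linux /proc filesystem and reads the state field from each /proc/<pid>/stat file; the helper names are hypothetical and nothing here is specific to Nagios.

```python
import os


def state_from_stat(stat_line):
    """Return the one-letter state field from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the last ')' rather than on whitespace.
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def uninterruptible_pids(proc_root="/proc"):
    """List the pids of processes currently in state 'D'."""
    found = []
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return found  # no procfs here (not Linux, or not mounted)
    for name in entries:
        if not name.isdigit():
            continue
        stat_path = os.path.join(proc_root, name, "stat")
        try:
            with open(stat_path, errors="replace") as fh:
                state = state_from_stat(fh.read())
        except (OSError, IndexError):
            continue  # process exited while we were looking
        if state == "D":
            found.append(int(name))
    return found


if __name__ == "__main__":
    print(uninterruptible_pids())
```

Calling uninterruptible_pids() in a loop while the stall is in progress would show whether any plugin ever enters disk-wait, even if only for a moment.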
> What version of plugins are you running? Which check is running when it
> hangs?
Running plugins of: nagios-plugins-1.4
Typically the plugin that I see running is check_ping. However, due to
the high number of retries and packets I have check_ping set to make, it
takes a good 30 seconds or more of pinging before it returns failure.
The hosts I am trying to hit are behind a firewall that drops my pings,
so the host is seen as down. I have done the same tests from a system
that does have permission to ping the hosts, and the problem still
exists; it is just not as obvious. I wanted to work on a system that
showed the problem as obviously as possible when it was broken.
> So in essence it always happens when you run Nagios, no matter how you
> compiled it, but never when you're running it from strace?
The problem occurs no matter how I compile nagios, when running nagios
by itself.
The problem occurs when I run non-debugged nagios with "strace"
The problem is fixed when I run non-debugged nagios with "strace -f"
The problem is fixed when I run debugged nagios with "strace"
> Have you tried this with 2.0rc1 or 2.0rc2 ?
I have not tried these versions.
> Do you get any messages in the nagios.log saying something like:
> service_result_worker_thread: poll(): (text-rep of errno) ?
I see no messages like this at all in the nagios.log
> Are you going to do this upgrade or have you already done it?
I have the old system running the exact same configs still in place
> Was the kernel compiled with a 64-bit compiler?
I assume so. I am using standard 64-bit RedHat kernels
> Was glibc and the thread-library compiled with a 64-bit compiler?
I assume so. I am using stock libraries distributed with RH.
> What versions of kernel, glibc and thread-library are you using?
Kernel: 2.6.9-22.0.1.ELsmp #1 SMP Tue Oct 18 18:39:02 EDT 2005 x86_64 x86_64 x86_64 GNU/Linux
Glibc: glibc-2.3.4-2.13
This is the information I have about pthread on my system
/lib64/tls/libpthread-2.3.4.so
/lib64/tls/libpthread.so.0
/lib64/libpthread-0.10.so
/lib64/libpthread.so.0
> What flavour of thread-library are you using (linux-threads or nptl)?
I don't know the answer to this.
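One way to answer the thread-library question on a glibc system is to ask the C library itself through confstr(), which is also what `getconf GNU_LIBPTHREAD_VERSION` reports. The sketch below is a small Python wrapper around that query; the function name pthread_flavour is made up, and the key is assumed to be missing on non-glibc platforms, in which case the function returns None.

```python
import os


def pthread_flavour():
    """Return glibc's description of its thread library, or None if unavailable."""
    name = "CS_GNU_LIBPTHREAD_VERSION"
    if name not in getattr(os, "confstr_names", {}):
        return None  # not glibc, or a platform that does not expose this key
    try:
        return os.confstr(name)
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    print(pthread_flavour())
```

On glibc the returned string normally names the implementation together with a version, so it should distinguish NPTL from LinuxThreads for the library the calling process was actually linked against.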
> Try disabling embedded perl. When embedded perl is enabled (particularly
> with caching), the routine Nagios goes through after the fork() call is
> quite frankly so thread-unsafe that it's a miracle that it works
> anywhere at all.
Ok, I will put this on my list of trials.
> A Heisenbug... Nasty stuff. Running things through strace unfortunately
> causes different rules to apply for signals and mutexes (strace reads
> the output of the child-process directly, so there is less locking going
> on), and since it runs a lot slower mutexes that would possibly have
> been held if it weren't for strace have time to be released prior to the
> fork() call.
Sigh... yup, as ugly as it comes.
> Homework one is to come up with answers for all those questions I asked.
Done
> Fix-attempt one is to try the newest release of Nagios available. In
> particular I think you'll need the patch I submitted 2005-05-05 (after
> 2.0b3 was released), which adds a couple of flag-macros that's supposed
> to alter the behaviour of the C pre-processor somewhat.
>
> Fix-attempt two is to try re-compiling with embedded perl and the
> perl-cache disabled.
I think I will try this order:
a) disable embedded perl and perl-cache
b) move /home to local volume
c) get cvs version of nagios and try it (with the above two changes in
place) If it works, I will reverse b then a and see where/if it breaks.
> Keep us posted, will you?
Absolutely. Thank you for your suggestions,
Ben
_______________________________________________
Nagios-devel mailing list
Nagios-devel at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-devel