Nagios crash, apparently in service_result_worker_thread
Andreas Ericsson
ae at op5.se
Wed Apr 27 22:10:01 CEST 2005
Time for an update on this mess, I think.
The problem appears to happen only on systems with more than one CPU. It
happens more frequently the higher the stepping value of the CPUs. I've
written some seriously detailed debug-session output on the subject and
posted it at http://oss.op5.se/nagios/weird-crash.txt (it's in
"boardroom talk", so it's fairly high-strung).
I've also noticed that the return value of the poll(2) system call in
service_result_worker_thread is never checked for errors. The man-page
doesn't state specifically what should happen when things go wrong and
the kernel sources aren't exactly clear about it, but (according to the
kernel sources) it seems unlikely that anything but ENOMEM, EBADF, EFAULT
and EINTR could actually happen.
I haven't debugged or audited the glibc in place on the system, but it
is a fairly old one (2.1.3). libpthread is .so-version 0.8.
I've tried with 4 different kernels but that doesn't seem to help
mitigate the crashes.
It seems that increasing the service_reaper_frequency increases the
amount of time between crashes. I'm not sure why, but it's fairly
consistent on all the systems where it crashes.
I'll send in a patch in a while to let poll(2) properly check for errors
and at least log them.
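
For illustration only, here is a rough sketch of the kind of check I have
in mind. This is not the actual patch and not the code in utils.c; the
helper name wait_readable(), its return convention and the use of stderr
for logging are all assumptions made up for this example.

```c
#include <errno.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not taken from the Nagios sources: wait until fd
 * becomes readable, retrying on EINTR and logging any other failure.
 * Returns 1 if fd is readable, 0 on timeout, -1 on error. */
int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd;
    int ret;

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    for (;;) {
        ret = poll(&pfd, 1, timeout_ms);
        if (ret >= 0)
            break;
        if (errno == EINTR)
            continue;   /* interrupted by a signal, simply retry */
        /* ENOMEM, EFAULT and friends end up here: log and give up */
        fprintf(stderr, "poll() failed: %s\n", strerror(errno));
        return -1;
    }

    if (ret == 0)
        return 0;       /* timed out, nothing to read */

    /* A bad descriptor is typically reported through revents rather
     * than errno, so check that as well. */
    if (pfd.revents & (POLLERR | POLLNVAL)) {
        fprintf(stderr, "poll() reported an error on fd %d (revents=0x%x)\n",
                fd, (unsigned int)pfd.revents);
        return -1;
    }

    return (pfd.revents & (POLLIN | POLLHUP)) ? 1 : 0;
}
```

The worker thread would call something like this in place of the bare
poll() and skip the read when it returns anything other than 1.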
Andreas Ericsson wrote:
> Ethan Galstad wrote:
>
>> Andreas -
>>
>> Did you have any luck(?) in having this happen again, so as to be able
>> to track it down?
>
>
> It's been happening at irregular intervals ever since. Always with the
> same backtrace, but not always on Friday nights any more.
>
> Here's how it's compiled:
> CFLAGS="-pipe -march=i386 -mcpu=i686 -O2 -momit-leaf-frame-pointer
> -mpreferred-stack-boundary=3 -ggdb3 -g"
> export CFLAGS
> ./configure --prefix=/opt/monitor --disable-statuswrl
> --with-nagios-user=monitor --with-nagios-group=httpd --disable-event-broker
>
> Notable is that the customers to which this has happened are running
> lots of custom plugins and I'm not sure whether those kill themselves in
> a timely manner or not. A few other customers also do this, but they
> don't seem to be affected at all. All other software is identical on all
> systems.
>
> I'll send a patch to allow coredumps in a clean way.
>
> Cheers.
>
>>
>>
>> On 20 Dec 2004 at 11:40, Andreas Ericsson wrote:
>>
>>
>>> gdb nagios core
>>>
>>> (gdb) bt
>>> #0 0x00000000 in ?? ()
>>> #1 0x001c100b in __libc_malloc (bytes=512) at malloc.c:2695
>>> #2 0x08060971 in service_result_worker_thread(arg=0x0) at utils.c:4666
>>> #3 0x00162de2 in pthread_start_thread(arg=0xbf5ffe40) at manager.c:241
>>> #4 0x0020f70a in thread_start () from /lib/libc.so.6
>>> (gdb)
>>>
>>> What strikes me as weird is the fact that this crash happened after
>>> Nagios had been running for 4 days (and always seems to happen on
>>> Friday nights between 9 PM and 11:30 PM in this particular network). I
>>> would have expected service_result_worker_thread() to fail at
>>> start-time, if at all.
>>>
>>> Mind though, I've made some modifications to allow it to dump core
>>> (which should be either default, ./configure-able or a command
>>> argument since debugging without it is not nearly as efficient, and
>>> "ulimit -c 0" can be used to prevent it from doing so anyway), but
>>> only very minor ones that shouldn't affect stability at all.
>>>
>
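
On the core-dump change mentioned in the quoted text: a minimal sketch of
one way a daemon can allow core dumps at startup, assuming it is fine to
raise the soft RLIMIT_CORE limit to whatever the hard limit permits. The
function name enable_core_dumps() is made up for this example and is not
from the Nagios sources or from the modifications described above.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/resource.h>

/* Hypothetical helper: raise the soft core-file size limit to the hard
 * limit so the process is allowed to dump core.
 * Returns 0 on success, -1 on failure. */
int enable_core_dumps(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0) {
        fprintf(stderr, "getrlimit(RLIMIT_CORE) failed: %s\n",
                strerror(errno));
        return -1;
    }

    /* An unprivileged process may raise its soft limit up to, but not
     * beyond, the hard limit. */
    rl.rlim_cur = rl.rlim_max;

    if (setrlimit(RLIMIT_CORE, &rl) != 0) {
        fprintf(stderr, "setrlimit(RLIMIT_CORE) failed: %s\n",
                strerror(errno));
        return -1;
    }

    return 0;
}
```

If the hard limit is already 0 this changes nothing, and the working
directory still has to be writable for a core file to appear.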
--
Andreas Ericsson andreas.ericsson at op5.se
OP5 AB www.op5.se
Lead Developer