coredumps in wobbly networks
Andreas Ericsson
ae at op5.se
Thu Mar 24 12:32:11 CET 2005
Ahoy.
I've observed a series of most unfortunate SIGSEGVs in Nagios.
It appears to happen when a service check pops back to OK on the second
attempt, after which something goes wrong (see logs below).
Here are two separate sets of log entries leading up to the crash. They
are taken from two separate nagios instances on separate machines and, as
you can see from the timestamps, the two crashes happened at different
times (the naglog program used to get human-readable time is available at
http://oss.op5.se/nagios/naglog.c)
[ crash 1, on primary server ]
2005-03-20 22:11:57: Auto-save of retention data completed successfully.
2005-03-20 22:25:56: SERVICE ALERT: foo-host;PING;WARNING;SOFT;1;WARNING
- x.x.x.x: rta 107 ms, lost 0%
2005-03-20 22:26:56: SERVICE ALERT: foo-host;PING;OK;SOFT;2;OK -
x.x.x.x: rta 1.82 ms, lost 0%
[ crash 2, on secondary server ]
2005-03-21 06:19:41: Auto-save of retention data completed successfully.
2005-03-21 06:28:11: SERVICE ALERT: foo-host;PING;WARNING;SOFT;1;WARNING
- x.x.x.x: rta 234.926ms, lost 0%
2005-03-21 06:29:11: SERVICE ALERT: foo-host;PING;OK;SOFT;2;OK -
x.x.x.x: rta 0.150ms, lost 0%
Note the "PING;OK;SOFT;2" part. On both servers these are the last two
log entries before the crash (it's the same host both times, actually).
The host check command is standard and there are no problems with it.
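For anyone who wants to grep their own logs for the same pattern, here is a rough sketch of a matcher. It assumes the "SERVICE ALERT: host;service;state;type;attempt;output" layout seen in the entries above; the struct and function names are made up for illustration and are not part of Nagios.

```c
#include <stdio.h>
#include <string.h>

/* Fields of a "SERVICE ALERT" log entry, as they appear in the logs above.
 * (Hypothetical helper type, not a Nagios structure.) */
struct service_alert {
    char host[64];
    char service[64];
    char state[16];      /* OK, WARNING, ... */
    char state_type[8];  /* SOFT or HARD */
    int attempt;
};

/* Returns 1 and fills *out if line contains a SERVICE ALERT entry, else 0.
 * Anything before the "SERVICE ALERT: " tag (timestamp etc.) is skipped. */
int parse_service_alert(const char *line, struct service_alert *out)
{
    const char *tag = "SERVICE ALERT: ";
    const char *p = strstr(line, tag);

    if (p == NULL)
        return 0;
    p += strlen(tag);
    return sscanf(p, "%63[^;];%63[^;];%15[^;];%7[^;];%d",
                  out->host, out->service, out->state,
                  out->state_type, &out->attempt) == 5;
}

/* The pattern logged right before both crashes: OK, SOFT, attempt 2. */
int is_soft_recovery(const struct service_alert *a)
{
    return strcmp(a->state, "OK") == 0 &&
           strcmp(a->state_type, "SOFT") == 0 &&
           a->attempt == 2;
}
```

Feeding each log line through parse_service_alert() and then is_soft_recovery() should flag the second entry of each excerpt above and skip the first.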
It's worth pointing out that this isn't the latest CVS, but rather
whichever one was latest on Jan 19 2005. I haven't seen a checkin that
touches this code section though, so I believe the bug might still be
lurking in there somewhere.
The coredumps for these crashes are largely useless. The backtrace
points to __libc_malloc() called from service_result_worker_thread(),
the thread function started through pthread_create(). The thread function
is called with a NULL argument, and the crash actually takes place at
address 0x0.
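A crash at address 0x0 usually means the program jumped there, and one common way to do that is a call through a NULL (or overwritten) function pointer. Whether that is what happens inside malloc here is only a guess on my part. The sketch below just shows that such a call ends in SIGSEGV on a typical Linux box; the function name is made up, and the call itself is undefined behaviour in C, so don't rely on it elsewhere.

```c
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

typedef void (*callback_fn)(void);

/*
 * Forks a child that calls through a NULL function pointer and returns the
 * signal that killed the child (0 if it was not killed by a signal,
 * -1 if fork() or waitpid() failed).
 */
int signal_from_null_call(void)
{
    int status = 0;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* volatile keeps the compiler from optimizing the call away */
        callback_fn volatile fn = NULL;
        fn();       /* jumps to address 0 */
        _exit(0);   /* not reached on typical Linux systems */
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

A core from the child should show 0x00000000 as the top frame as well, which is why the caller's frame is the interesting one to look at.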
Here's some of the gdb output (I still have binaries and several
core-files in case anyone's interested in running more commands).
[ gdb session, core1 ]
Program terminated with signal 11, Segmentation fault.
Reading symbols from /lib/libm.so.6...done.
Loaded symbols for /lib/libm.so.6
Reading symbols from /lib/libnsl.so.1...done.
Loaded symbols for /lib/libnsl.so.1
Reading symbols from /lib/libpthread.so.0...done.
Loaded symbols for /lib/libpthread.so.0
Reading symbols from /lib/libc.so.6...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
Reading symbols from /lib/libnss_files.so.2...done.
Loaded symbols for /lib/libnss_files.so.2
#0 0x00000000 in ?? ()
(gdb) bt
#0 0x00000000 in ?? ()
#1 0x001c100b in __libc_malloc (bytes=512) at malloc.c:2695
#2 0x080612fe in service_result_worker_thread (arg=0x0) at utils.c:4692
#3 0x00162de2 in pthread_start_thread (arg=0xbf5ffe40) at manager.c:241
#4 0x0020f70a in thread_start () from /lib/libc.so.6
(gdb)
[ end gdb session, core1 ]
The gdb session for core2 is identical.
I'll investigate some more during the holidays and see if I can come up
with a patch for this or at least some means of debugging it a bit more
easily.
--
Andreas Ericsson andreas.ericsson at op5.se
OP5 AB www.op5.se
Lead Developer