coredumps in wobbly networks

Andreas Ericsson ae at op5.se
Fri Mar 25 00:54:05 CET 2005


Ethan Galstad wrote:
> Not sure where this is actually happening.  It looks like malloc() is 
> to blame - not sure why.  The only malloc() in the 
> service_result_worker_thread() routine occurs at line 4736 in 
> base/utils.c, which looks ok to me.  
> 
> Anyone else have any ideas as to what might be happening?
> 

malloc() might be called on a mutex-locked pthread object, or the object 
isn't locked and the pointer is changed (or freed) between the actual 
allocation and the setting of the protection flags. AFAIK gdb does a 
poor job of tracing into syscalls. I'll try to add some more verbose 
debugging for it so I can catch the crash "in flight", so to speak.
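
A minimal sketch of what such debugging could look like, assuming glibc's
backtrace() facility from <execinfo.h> is available. The names
crash_handler() and install_crash_handler() are hypothetical and not
part of the Nagios tree. The handler prints the faulting thread's stack
to stderr and then lets the default action run so a core file is still
written. backtrace() is not guaranteed to be safe inside a signal
handler, so this may itself fail if the heap is badly corrupted.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <execinfo.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

#define MAX_FRAMES 64

/* Hypothetical handler: dump the call stack of the faulting thread to
 * stderr, then re-raise the signal so the default action (core dump)
 * still takes place. */
static void crash_handler(int sig)
{
	static const char msg[] = "fatal signal caught, backtrace follows:\n";
	void *frames[MAX_FRAMES];
	ssize_t written;
	int count;

	written = write(STDERR_FILENO, msg, sizeof(msg) - 1);
	(void)written;

	/* backtrace_symbols_fd() writes straight to the descriptor and,
	 * unlike backtrace_symbols(), does not call malloc() itself. */
	count = backtrace(frames, MAX_FRAMES);
	backtrace_symbols_fd(frames, count, STDERR_FILENO);

	/* SA_RESETHAND already restored the default disposition, so the
	 * re-raised signal terminates the process and dumps core. */
	raise(sig);
}

/* Hypothetical helper: install the handler for SIGSEGV.
 * Returns 0 on success, -1 on failure (as sigaction() does). */
int install_crash_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = crash_handler;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = SA_RESETHAND;

	return sigaction(SIGSEGV, &sa, NULL);
}
```

Calling install_crash_handler() once early in startup, before any
worker threads are created, should be enough, since signal dispositions
are shared by all threads in the process.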

> 
> 
> On 24 Mar 2005 at 12:32, Andreas Ericsson wrote:
> 
> 
>>Ahoy.
>>
>>I've observed a series of most unfortunate SIGSEGVs in Nagios.
>>They appear to happen right after a service check pops back to OK on
>>the second attempt (see logs below).
>>
>>Here are two separate log excerpts leading up to the crash. They are
>>taken from two separate Nagios instances on separate machines and, as
>>you can see from the timestamps, the crashes occurred at different
>>times (the naglog program used to get human-readable time is
>>available at http://oss.op5.se/nagios/naglog.c).
>>
>>[ crash 1, on primary server ]
>>2005-03-20 22:11:57: Auto-save of retention data completed successfully.
>>2005-03-20 22:25:56: SERVICE ALERT: foo-host;PING;WARNING;SOFT;1;WARNING - x.x.x.x: rta 107 ms, lost 0%
>>2005-03-20 22:26:56: SERVICE ALERT: foo-host;PING;OK;SOFT;2;OK - x.x.x.x: rta 1.82 ms, lost 0%
>>
>>[ crash 2, on secondary server ]
>>2005-03-21 06:19:41: Auto-save of retention data completed successfully.
>>2005-03-21 06:28:11: SERVICE ALERT: foo-host;PING;WARNING;SOFT;1;WARNING - x.x.x.x: rta 234.926ms, lost 0%
>>2005-03-21 06:29:11: SERVICE ALERT: foo-host;PING;OK;SOFT;2;OK - x.x.x.x: rta 0.150ms, lost 0%
>>
>>
>>Note the "PING;OK;SOFT;2" part. These are the last two log entries
>>before the crash (it's the same host both times, actually) on both
>>servers. The host check command is standard and there are no problems
>>with it.
>>
>>It's worth pointing out that this isn't the latest CVS, but rather
>>whichever one was latest on Jan 19 2005. I haven't seen a checkin that
>>touches this code section though, so I believe the bug might still be
>>lurking in there somewhere.
>>
>>The coredumps for these crashes are largely useless. The backtrace
>>points to __libc_malloc() called from service_result_worker_thread(),
>>the thread started by pthread_create(). The thread is started with a
>>NULL argument, and the crash actually takes place at address 0x0.
>>
>>Here's some of the gdb output (I still have binaries and several
>>core-files in case anyone's interested in running more commands).
>>
>>[ gdb session, core1 ]
>>Program terminated with signal 11, Segmentation fault.
>>Reading symbols from /lib/libm.so.6...done.
>>Loaded symbols for /lib/libm.so.6
>>Reading symbols from /lib/libnsl.so.1...done.
>>Loaded symbols for /lib/libnsl.so.1
>>Reading symbols from /lib/libpthread.so.0...done.
>>Loaded symbols for /lib/libpthread.so.0
>>Reading symbols from /lib/libc.so.6...done.
>>Loaded symbols for /lib/libc.so.6
>>Reading symbols from /lib/ld-linux.so.2...done.
>>Loaded symbols for /lib/ld-linux.so.2
>>Reading symbols from /lib/libnss_files.so.2...done.
>>Loaded symbols for /lib/libnss_files.so.2
>>#0  0x00000000 in ?? ()
>>(gdb) bt
>>#0  0x00000000 in ?? ()
>>#1  0x001c100b in __libc_malloc (bytes=512) at malloc.c:2695
>>#2  0x080612fe in service_result_worker_thread (arg=0x0) at utils.c:4692
>>#3  0x00162de2 in pthread_start_thread (arg=0xbf5ffe40) at manager.c:241
>>#4  0x0020f70a in thread_start () from /lib/libc.so.6
>>(gdb)
>>[ end gdb session, core1 ]
>>
>>The gdb session for core2 is identical.
>>
>>I'll investigate some more during the holidays and see if I can come
>>up with a patch for this or at least some means of debugging it a bit
>>more easily.
>>
>>-- 
>>Andreas Ericsson                   andreas.ericsson at op5.se
>>OP5 AB                             www.op5.se
>>Lead Developer
> 
> Ethan Galstad,
> Nagios Developer
> ---
> Email: nagios at nagios.org
> Website: http://www.nagios.org

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Lead Developer

