Unexplained nagios crashes

Andreas Ericsson ae at op5.se
Mon Aug 27 13:19:31 CEST 2007


Duncan Ferguson wrote:
> On 21 Aug 2007, at 09:45, Andreas Ericsson wrote:
> 
>> What thread-library is the customer using (make, model, version,  
>> everything...)?
> 
> libpthread.so.0 => /lib/tls/i686/cmov/libpthread.so.0
> 
>> What's the uname -a output?
> 
> Linux pih-altinity01 2.6.18-4-686 #1 SMP Mon Mar 26 17:17:36 UTC 2007  
> i686 GNU/Linux
> 
> Linux version 2.6.18-4-686 (Debian 2.6.18.dfsg.1-12)  
> (waldi at debian.org) (gcc version 4.1.2 20061115 (prerelease) (Debian  
> 4.1.1-21)) #1 SMP Mon Mar 26 17:17:36 UTC 2007
> 
>> If Linux, which scheduler is being used in the kernel?
>>
> 
> CFQ
> 
> A bit more information on the problem - it's the first byte of the
> ->next pointer that is being corrupted. Within gdb, by guessing at what
> that byte might be, the rest of the list can be traversed.
> 


One passive_check_result struct is located at a bogus address, just
as you said.
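
For illustration only, here is a minimal sketch of how a list walk could
flag a link like that before following it. The struct and function names
are hypothetical stand-ins, not the real Nagios definitions, and the
alignment test is just a heuristic: it assumes a pointer-sized uintptr_t
and only catches damage that leaves the address misaligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a passive check result node (NOT the Nagios struct). */
struct demo_result {
    struct demo_result *next;
    int return_code;
};

/* Assumption: pointers and uintptr_t have the same size on this platform. */
_Static_assert(sizeof(uintptr_t) == sizeof(struct demo_result *),
               "sketch assumes pointer-sized uintptr_t");

/*
 * Walk the list, but inspect each ->next as raw bytes before following it.
 * Returns the number of nodes visited. If a node holds a link that is not
 * aligned for the struct, *bad is set to that node and the walk stops there.
 */
size_t walk_checked(struct demo_result *head, size_t max_nodes,
                    struct demo_result **bad)
{
    size_t visited = 0;
    struct demo_result *p = head;

    assert(bad != NULL);
    *bad = NULL;

    while (p != NULL && visited < max_nodes) {
        uintptr_t raw;

        visited++;
        /* Copy the link's bytes out instead of reading it as a pointer. */
        memcpy(&raw, &p->next, sizeof raw);
        if (raw % _Alignof(struct demo_result) != 0) {
            *bad = p;   /* this node's ->next looks bogus */
            break;
        }
        p = p->next;
    }
    return visited;
}
```

The max_nodes bound is there so that a link corrupted into a cycle cannot
make the walk spin forever.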

My guess would be that it's an off-by-one somewhere in the code that
only triggers under some very special circumstances. Since it only
happens at one customer site, something needs to be special about
that customer.
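
To show why an off-by-one fits the symptom (and only as an illustration,
not a claim about where the bug actually is), here is a small sketch in
which a loop with a "<=" bound writes one byte too many and clobbers
exactly the first byte of the link that follows the buffer. The layout
and names are hypothetical; the real struct looks different.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layout (NOT the Nagios struct): a buffer directly before the link. */
struct demo_node {
    char tag[sizeof(void *)];   /* sized so that no padding precedes next */
    struct demo_node *next;
};

/* Assumption checked at compile time: next starts right after tag. */
_Static_assert(offsetof(struct demo_node, next) == sizeof(void *),
               "sketch assumes next immediately follows tag");

/*
 * Buggy fill: the bound uses <= instead of <, so it writes sizeof(tag) + 1
 * bytes. The writes go through an unsigned char pointer to the whole struct,
 * which keeps the demonstration itself well-defined.
 */
void buggy_fill(struct demo_node *node, unsigned char value)
{
    unsigned char *bytes = (unsigned char *)node;

    assert(node != NULL);
    for (size_t i = 0; i <= sizeof node->tag; i++)   /* BUG: should be < */
        bytes[i] = value;
}

/* Count how many bytes of node->next differ from the expected link. */
size_t count_changed_link_bytes(const struct demo_node *node,
                                const struct demo_node *expected_next)
{
    unsigned char before[sizeof expected_next];
    unsigned char after[sizeof node->next];
    size_t changed = 0;

    memcpy(before, &expected_next, sizeof before);
    memcpy(after, &node->next, sizeof after);
    for (size_t i = 0; i < sizeof after; i++)
        changed += (before[i] != after[i]);
    return changed;
}
```

Only one byte of the link changes, which is why guessing that byte in gdb
is enough to keep walking the list.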

Judging by the backtrace of the core-dump, this particular crash
happened after the host "pih-cronhost2" has had a failing service
("MySQL process", by the looks of it) and had its route checked,
followed by an external command being read and processed.
As I can't run the program here (without config files), I can't
step through it and see what the passive message is.

It's not much to go on, but if the crash happens again, do check
if it's the same chain of events for the same host that triggers it.

If it is, we'll have something to work with. Until the crash can be
reliably reproduced, any attempts at fixing it will unfortunately
just be shots in the dark.

You could try upgrading to the very latest nagios-2-x-bugfixes
from CVS. It has quite a few fixes. Looking at the commit messages,
I can't really say if this particular bug has been fixed though.

Does your in-house NEB module in any way muck about with the passive
service check result lists?

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231




