Unexplained nagios crashes
Steffen Poulsen
step at tdc.dk
Tue Aug 21 15:54:31 CEST 2007
> -----Original Message-----
> From: nagios-devel-bounces at lists.sourceforge.net
> [mailto:nagios-devel-bounces at lists.sourceforge.net] On behalf
> of Andreas Ericsson
> Sent: 21 August 2007 10:45
> To: Nagios Developers List
> Subject: Re: [Nagios-devel] Unexplained nagios crashes
>
> What thread-library is the customer using (make, model,
> version, everything...)?
> What's the uname -a output?
> If Linux, which scheduler is being used in the kernel?
>
>
>
> Duncan Ferguson wrote:
> > Hiya Ethan, list.
> >
> > We are hoping someone may be able to help diagnose what is going on
> > with an obscure problem we have. After going cross-eyed from looking
> > at this over the last few weeks I thought it best to see if anyone
> > else has seen/experienced the same thing.
> >
> > We have a single customer that has been suffering sporadic nagios
> > daemon crashes since June. We have found nothing unique about their
> > setup, and other customers have the exact same binaries (and a
> > distributed setup with the same number of slaves) on the same OS and
> > have had no crashes in the same period of time.
> >
> > Salient points:
> > * this is using a patched nagios 2.8 binary, a patched 1.4b2
> >   ndoutils broker module and an in-house broker module
> > * the crashes are intermittent and irregular, at no fixed time of
> >   day. Might have three crashes one day, then nothing for two days,
> >   then one crash a day for four days
> > * Studying the core dump, the code bombs out in
> >   commands.c:process_passive_service_checks while traversing the
> >   passive_check_result_list linked list
> >
> > We have added in a bit of extra code to print out the entire
> > passive_check_result_list structure before the fork, and from what
> > we can see in the core dump the list is corrupted midway through:
> > the last readable record has a 'next' pointing to what looks like a
> > valid area of memory, but nothing is there. However,
> > passive_check_result_list_tail has a valid entry, which implies
> > everything was added into the list OK in the first place.
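
A consistency check along the lines described above could be sketched as
follows. This is only an illustration under stated assumptions: pcr_node
and check_list() are hypothetical names, and the struct is a simplified
stand-in, not the actual Nagios structure. It walks the list from the head
and confirms that the walk ends at the recorded tail within a bounded
number of steps.

```c
#include <stddef.h>

/* Simplified stand-in for a passive check result node (hypothetical;
 * the real Nagios structure carries more fields). */
typedef struct pcr_node {
    const char *host_name;
    struct pcr_node *next;
} pcr_node;

/* Walk at most max_nodes entries starting at head.
 * Returns 0 when the walk reaches NULL and the last node visited is tail
 * (an empty list with a NULL tail also passes); returns -1 when the list
 * is longer than max_nodes (possible cycle) or ends before reaching tail. */
static int check_list(const pcr_node *head, const pcr_node *tail,
                      size_t max_nodes)
{
    const pcr_node *last = NULL;
    size_t n = 0;

    for (const pcr_node *p = head; p != NULL; p = p->next) {
        if (++n > max_nodes)
            return -1;      /* cycle, or more entries than expected */
        last = p;
    }
    return (last == tail) ? 0 : -1;  /* tail must be the final node */
}
```

Called just before the list is traversed, a check like this would flag a
list that ends early or loops. It cannot safely detect a dangling 'next'
pointer such as the one seen in the core dump, because following such a
pointer is itself undefined behavior.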
> >
> > So between being added into the linked list and being read from the
> > linked list a record is removed. The list has well below the maximum
> > number of buffer slots so lack of memory isn't the problem (else the
> > tail entry would also be screwed).
> >
> > We have been unable to find any code that would cause this behavior
> > (the list is confined to commands.c), especially when this section
> > is called and used as often as it is and the crashes are few and far
> > between (in comparison).
> >
> > The nagios binary has been compiled with "-ggdb -O0" for debugging
> > purposes and is running on Debian Etch i386 with 4x Intel Xeon
> > 1.86GHz CPUs and 4GB of memory. The core dump, nagios binary and
> > commands.c are available at
> > http://resources.opsview.org/nagios_crash.tar.gz
> >
> > Any insight or help would be appreciated.
> >
> > Duncs
> >
> >
> > -------------------------------------------------------------------------
> > This SF.net email is sponsored by: Splunk Inc.
> > Still grepping through log files to find problems? Stop.
> > Now Search log events and configuration files using AJAX and a browser.
> > Download your FREE copy of Splunk now >> http://get.splunk.com/
> > _______________________________________________
> > Nagios-devel mailing list
> > Nagios-devel at lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/nagios-devel
>
>
> --
> Andreas Ericsson andreas.ericsson at op5.se
> OP5 AB www.op5.se
> Tel: +46 8-230225 Fax: +46 8-230231
>