Unexplained nagios crashes
Andreas Ericsson
ae at op5.se
Tue Aug 21 10:45:11 CEST 2007
Which thread library is the customer using (make, model, version, everything...)?
What's the uname -a output?
If Linux, which scheduler is being used in the kernel?
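
For the first two questions, a minimal sketch of how the answers could be collected from C, assuming a glibc-based system. The helper names thread_library_version and kernel_summary are made up for illustration; confstr(), _CS_GNU_LIBPTHREAD_VERSION and uname() are the real interfaces. On a non-glibc libc the first helper may simply report failure. It does not cover the kernel scheduler question.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/utsname.h>
#include <unistd.h>

/* Hypothetical helper: copies the thread library name and version
 * (something like "NPTL 2.x" on glibc) into buf.
 * Returns 0 on success, -1 if unavailable or if buf is too small. */
int thread_library_version(char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
#ifdef _CS_GNU_LIBPTHREAD_VERSION
    /* confstr() returns the size needed, including the trailing NUL,
     * or 0 if the name is not supported. */
    size_t needed = confstr(_CS_GNU_LIBPTHREAD_VERSION, buf, len);
    if (needed == 0 || needed > len)
        return -1;
    return 0;
#else
    buf[0] = '\0';
    return -1;
#endif
}

/* Hypothetical helper: formats the main fields that `uname -a` prints.
 * Returns 0 on success, -1 on error or truncation. */
int kernel_summary(char *buf, size_t len)
{
    struct utsname u;
    int n;

    assert(buf != NULL && len > 0);
    if (uname(&u) != 0)
        return -1;
    n = snprintf(buf, len, "%s %s %s %s %s",
                 u.sysname, u.nodename, u.release, u.version, u.machine);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Calling both helpers with a buffer of a few hundred bytes and printing the results gives roughly the information asked for above. From a shell, `getconf GNU_LIBPTHREAD_VERSION` and `uname -a` report the same thing.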
Duncan Ferguson wrote:
> Hiya Ethan, list.
>
> We are hoping someone may be able to help diagnose what is going on
> with an obscure problem we have. After going cross-eyed from looking
> at this over the last few weeks I thought it best to see if anyone
> else has seen/experienced the same thing.
>
> We have a single customer that has been suffering sporadic nagios
> daemon crashes since June - nothing is unique about their setup that
> we have been able to find and other customers have the exact same
> binaries (and distributed setup with same number of slaves) on the
> same OS and have had no crashes in the same period of time.
>
> Salient points:
> * this is using a patched nagios 2.8 binary, a patched 1.4b2 ndoutils
> broker module and an in-house broker module
> * the crashes are intermittent and irregular, at no fixed time of
> day. Might have three crashes one day, then nothing for two days,
> then one crash a day for four days
> * Studying the core dump, the code bombs out in
> commands.c:process_passive_service_checks while traversing the
> passive_check_result_list linked list
>
> We have added in a bit of extra code to print out the entire
> passive_check_result_list structure before the fork, and from what we
> can see in the core dump the list is corrupted midway through: the
> last readable record has a 'next' pointing to what looks like a valid
> area of memory, but nothing is there. However,
> passive_check_result_list_tail has a valid entry, which implies
> everything was added into the list OK in the first place.
>
> So between being added into the linked list and being read from the
> linked list a record is removed. The list has well below the maximum
> number of buffer slots so lack of memory isn't the problem (else the
> tail entry would also be screwed).
>
> We have been unable to find any code that would cause this behavior
> (the list is confined to commands.c), especially given that
> this section is called and used as often as it is and the crashes are
> few and far between (in comparison).
>
> The nagios binary has been compiled with "-ggdb -O0" for debugging
> purposes and is running on Debian Etch i386 with 4x Intel Xeon 1.86GHz
> CPUs and 4GB of memory. The core dump, nagios binary and commands.c
> are available at http://resources.opsview.org/nagios_crash.tar.gz
>
> Any insight or help would be appreciated.
>
> Duncs
>
> -------------------------------------------------------------------------
> This SF.net email is sponsored by: Splunk Inc.
> Still grepping through log files to find problems? Stop.
> Now Search log events and configuration files using AJAX and a browser.
> Download your FREE copy of Splunk now >> http://get.splunk.com/
> _______________________________________________
> Nagios-devel mailing list
> Nagios-devel at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-devel
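
The symptom described above (a valid tail entry that can no longer be reached from the head) can be tested for with a bounded walk of the list. The following is only a sketch: struct pcr_node, pcr_check and the status names are hypothetical stand-ins, not the actual definitions from commands.c. It catches a chain that ends before the tail and a chain that loops or overruns the expected maximum. It cannot safely detect a 'next' pointer into freed memory, since following such a pointer is undefined behaviour.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the real list node; only `next` matters here. */
struct pcr_node {
    struct pcr_node *next;
    int id;
};

enum pcr_status {
    PCR_OK = 0,        /* walk ended exactly at the recorded tail */
    PCR_BROKEN_CHAIN,  /* walk ended on a node that is not the tail */
    PCR_TOO_LONG       /* more than max_nodes nodes, or a cycle */
};

/* Walks from head, following at most max_nodes links, and checks that
 * the last node reached is the recorded tail. The number of nodes
 * visited is stored in *visited so it can be logged before the fork. */
enum pcr_status pcr_check(const struct pcr_node *head,
                          const struct pcr_node *tail,
                          size_t max_nodes, size_t *visited)
{
    const struct pcr_node *cur = head;
    const struct pcr_node *last = NULL;
    size_t n = 0;

    assert(visited != NULL);
    while (cur != NULL && n < max_nodes) {
        last = cur;
        cur = cur->next;
        n++;
    }
    *visited = n;

    if (cur != NULL)
        return PCR_TOO_LONG;
    if (last != tail)
        return PCR_BROKEN_CHAIN;
    return PCR_OK;
}
```

Running a check like this both when a result is appended and just before the list is consumed would narrow down the window in which the chain gets damaged, assuming the damage is a truncated or looping chain rather than a dangling pointer.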
--
Andreas Ericsson andreas.ericsson at op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231