Unexplained nagios crashes
Duncan Ferguson
duncan.ferguson at altinity.com
Wed Aug 15 17:45:22 CEST 2007
Hiya Ethan, list.
We are hoping someone may be able to help diagnose what is going on
with an obscure problem we have. After going cross-eyed from looking
at this over the last few weeks I thought it best to see if anyone
else has seen/experienced the same thing.
We have a single customer that has been suffering sporadic nagios
daemon crashes since June. As far as we have been able to find,
nothing is unique about their setup: other customers have the exact
same binaries (and a distributed setup with the same number of slaves)
on the same OS and have had no crashes in the same period of time.
Salient points:
* this is using a patched nagios 2.8 binary, a patched 1.4b2 ndoutils
broker module and an in-house broker module
* the crashes are intermittent and irregular, at no fixed time of
day. Might have three crashes one day, then nothing for two days,
then one crash a day for four days
* studying the core dump, the code bombs out in
commands.c:process_passive_service_checks while traversing the
passive_check_result_list linked list
We have added a bit of extra code to print out the entire
passive_check_result_list structure before the fork. From what we
can see in the core dump, the list is corrupted midway through: the
last readable record has a 'next' pointer to what looks like a valid
area of memory, but nothing is there. However,
passive_check_result_list_tail holds a valid entry, which implies
everything was added to the list OK in the first place.
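
For illustration, a head-to-tail consistency check of this kind could
be sketched roughly as follows. This is a hypothetical stand-alone
sketch, not the code actually added to commands.c: the pcr_node struct,
its field names, check_list() and the max_nodes limit are all
assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a passive check result record; the field
   names are assumptions, not copied from the Nagios source. */
typedef struct pcr_node {
    const char *host_name;
    const char *svc_description;
    int return_code;
    struct pcr_node *next;
} pcr_node;

/* Walk the list from head, log every record to stderr, and verify that
   the last record reached is the recorded tail.
   Returns the number of records visited, or -1 if the walk ends
   somewhere other than tail or visits more than max_nodes records
   (which would suggest a cycle). */
int check_list(const pcr_node *head, const pcr_node *tail, int max_nodes)
{
    const pcr_node *n;
    const pcr_node *last = NULL;
    int count = 0;

    assert(max_nodes >= 0);

    for (n = head; n != NULL; n = n->next) {
        if (count >= max_nodes)
            return -1; /* too many records: possible cycle */
        fprintf(stderr, "[%d] node=%p host=%s svc=%s rc=%d next=%p\n",
                count, (const void *)n,
                n->host_name ? n->host_name : "(null)",
                n->svc_description ? n->svc_description : "(null)",
                n->return_code, (const void *)n->next);
        last = n;
        count++;
    }

    /* An empty list is consistent only if tail is NULL as well. */
    return (last == tail) ? count : -1;
}
```

A check like this only catches a list that ends early or loops; if a
'next' pointer refers to unmapped or reused memory, the walk itself may
crash or print garbage, so the last line logged before the crash is the
useful part.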
So between being added to the linked list and being read from the
linked list, a record is removed. The list has well below the maximum
number of buffer slots, so lack of memory isn't the problem (else the
tail entry would also be screwed).
We have been unable to find any code that would cause this behavior,
especially as the list is confined to commands.c, and this section is
called and used very often while the crashes are few and far between
(in comparison).
The nagios binary has been compiled with "-ggdb -O0" for debugging
purposes and is running on Debian Etch i386 with 4x Intel Xeon 1.86GHz
CPUs and 4GB of memory. The core dump, nagios binary and commands.c
are available at http://resources.opsview.org/nagios_crash.tar.gz
Any insight or help would be appreciated.
Duncs