[Nagios-users] external commands and segfault -- again
Andreas Ericsson
ae at op5.se
Mon Jan 8 18:40:20 CET 2007
bobi at netshel.net wrote:
> Hey Fellow Nagios-ites:
>
> I've been having this *exact* same segfault problem for the last couple o'
> months.
>
> And, after looking at David's stack trace output, it is segfaulting for
> him in the exact same way/place as it is for me.
>
> Here's what I've found:
>
> The core dumps that I've examined are all segfaulting when handling the
> expiration of a scheduled downtime.
>
> Since David's stack trace looks identical to mine, I don't think it is in
> the external command processing, as he believes, but it is in the downtime
> expiration handling, as well.
>
> Having examined about a dozen of these identical core dumps, I see that it
> is a corruption of the entire scheduled_downtime structure that is being
> passed into the handle_scheduled_downtime() function.
>
> The handle_scheduled_downtime() function is being invoked by the high
> priority event processing logic in the event_execution_loop(). So it
> pulls an EVENT_SCHEDULED_DOWNTIME timed_event structure off of the high
> priority event list, and then hands it to handle_timed_event(), which in
> turn invokes the handle_scheduled_downtime() routine to handle the
> expiration of the specified downtime event.
>
> The problem is, the scheduled_downtime structure is already corrupted
> while sitting in the high_priority list - well before it is dequeued by
> the event_execution_loop() logic.
>
> I've walked the high priority list in memory with gdb to examine other
> timed_event structures and have noticed that only the scheduled_downtime
> structures associated with EVENT_SCHEDULED_DOWNTIME timed events are
> affected by the memory corruption. In fact, one time, I found nine
> scheduled downtime expiration events sequentially listed in the high
> priority list and the first three had their scheduled_downtime structures
> corrupted and the remaining six were in pristine condition.
>
>
> So, I've narrowed it down to a couple of possibilities (feel free to add
> your own!):
>
> 1. The scheduled_downtime structure is already corrupted when it is being
> added to the high priority timed event scheduling list, or
>
>
> 2. The scheduled_downtime structure is OK when it is added to the high
> priority list, but perhaps a bad pointer access is overwriting it with
> garbage at some other point in the program. This might be somewhat
> painful to track down.
>
>
> Of the two, I suspect that the second one is the more likely candidate.
>
I think the first, as it only happens with scheduled downtime stuff.
Otherwise you'd see it on other high-prio events as well (unless you're
extremely unlucky each time the crash happens).
>
> Some other notes:
>
> 1. The timed event expirations that segfault Nagios seem to be "randomly"
> chosen.
>
> We have some regularly submitted (via cron) scheduled downtimes that will
> work fine for weeks, and then one of them will come up for expiration and
> trigger this scheduled-downtime-expiration bug. I've also seen it happen
> with ad-hoc scheduled downtime submissions via the CGI interface.
>
> I've seen it happen with "regular" scheduled downtimes as well as the new
> "triggered" scheduled downtime. We thought it might have been related to
> the new triggered downtime, since that was one of the first events causing
> a segfault. But then after eliminating the use of triggered downtimes
> altogether, the segfaults still occur with the regular scheduled downtime
> expirations.
>
> 2. I've had this problem with Nagios 2.4, 2.5 and 2.6. So, "upgrading"
> hasn't gotten rid of it.
>
> 3. We are currently running Nagios 2.6 on a 64-bit Linux platform: SLES-9
> x86-64, Kernel 2.6.5-7.267-smp
>
This is the culprit, I guess. As this isn't a widespread problem, I
wouldn't be surprised if it's related to 64-bit archs (kernel-2.6.5 is
fairly ancient too, but that shouldn't matter as this is the only app
you're seeing it in).
I'm guessing this actually is an SMP-system and that SuSE doesn't
install SMP kernels on all systems, correct? If so, this could also be a
source of problems for you. Nagios doesn't follow the pthread guidelines
very closely and does some pretty inappropriate things post-fork() for
a threaded application. This could be one of those problems that
doesn't happen on single-cpu systems because the only cpu doesn't have
anything to compete with when racing for the memory.
> 4. We don't have any other segfault problems with other apps on this
> system.
>
>
> So I'm still trying to figure out *what* is overwriting the
> scheduled_downtime structures with garbage in memory.
>
> Any ideas, based upon this additional information?
>
Upgrade glibc and the kernel and pray. Other than that, I guess running
it in valgrind and/or gdb for a long period of time, or chucking
assert()s and printf()s at the Nagios code and seeing where it breaks,
are the only options.
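
To make the assert()/printf() idea a bit more concrete, here is a rough
sketch of a sanity check that could be called both where a downtime is
scheduled and again at the top of handle_scheduled_downtime(). Everything
in it is an assumption made for illustration: the struct is a cut-down
stand-in and not the real scheduled_downtime definition, the canary
fields would have to be added to the real struct by hand, and the names
sketch_downtime, sketch_downtime_init() and sketch_downtime_check() are
hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Arbitrary canary value ("DOWN" in ASCII); any unlikely pattern works. */
#define DOWNTIME_MAGIC 0x444f574eUL

/* Hypothetical, cut-down stand-in for the real scheduled_downtime struct.
 * Field names and the type encoding are assumptions for this sketch. */
typedef struct sketch_downtime {
    unsigned long magic_head;   /* canary placed before the payload */
    int type;                   /* assumed: 1 = host, 2 = service */
    char *host_name;
    time_t start_time;
    time_t end_time;
    unsigned long downtime_id;
    unsigned long magic_tail;   /* canary placed after the payload */
} sketch_downtime;

/* Fill in a structure and stamp both canaries. */
void sketch_downtime_init(sketch_downtime *dt, int type, char *host_name,
                          time_t start_time, time_t end_time,
                          unsigned long downtime_id)
{
    dt->magic_head = DOWNTIME_MAGIC;
    dt->type = type;
    dt->host_name = host_name;
    dt->start_time = start_time;
    dt->end_time = end_time;
    dt->downtime_id = downtime_id;
    dt->magic_tail = DOWNTIME_MAGIC;
}

/* Returns 0 if the structure looks sane, otherwise a small non-zero code
 * naming the first check that failed. 'where' labels the call site in the
 * log line so the two call sites can be told apart. */
int sketch_downtime_check(const sketch_downtime *dt, const char *where)
{
    int rc = 0;

    if (dt == NULL)
        rc = 1;
    else if (dt->magic_head != DOWNTIME_MAGIC)
        rc = 2;
    else if (dt->magic_tail != DOWNTIME_MAGIC)
        rc = 3;
    else if (dt->type != 1 && dt->type != 2)
        rc = 4;
    else if (dt->host_name == NULL)
        rc = 5;
    else if (dt->end_time < dt->start_time)
        rc = 6;

    if (rc != 0)
        fprintf(stderr, "downtime check failed at %s: code %d\n", where, rc);
    return rc;
}
```

If the check passes when the event is added to the high priority list but
fails when the event is dequeued, the structure was trashed while it sat
in the list (possibility 2 above). If it already fails at scheduling
time, that points to possibility 1. Wrapping the call in assert() instead
of just logging gives you a core dump at the first point the damage is
visible.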
btw, thanks for the nicely detailed problem report.
--
Andreas Ericsson andreas.ericsson at op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231