BUG? Segfault & coredump with scheduled downtime, downtime scheduled horked
karl.kornel at mindspeed.com
Fri Aug 18 02:14:53 CEST 2006
Dear fellow Nagios users,
Ever since downloading my first Nagios tarball (2.0rc2), and
continuing to version 2.5, I have been noticing a big problem with
downtimes. It appears that if more than a couple of downtimes are
scheduled, Nagios will crash partway through the list. This is a really
big problem, for two reasons. First, I've written a tool that automatically
schedules recurring downtime. It integrates into the Nagios web interface:
anyone with access to schedule downtime for a host/service can schedule
recurring downtime, and see the other recurring downtimes that have been
scheduled by their fellow contacts. I want to release it publicly, but I
can't finish testing because Nagios keeps crashing (several downtimes a day
== one crash per day). Second, Nagios was the first
open-source tool to become widespread in our IT department. I'd like
Nagios to continue to gain acceptance in our group (most of whom work with
SAP/Oracle/Windows/etc.), and this problem doesn't help.
I don't want to look like I'm barging in and saying "Fix this!",
so I brought some stuff along with me, and I did what I could to diagnose
the problem. First of all, I have two core files. When I installed
Nagios, I used the 'install-unstripped' target, so these core files, and
my copy of Nagios, include debugging symbols. Here's what the backtrace
looks like from the most recent coredump:
(gdb) bt
#0  0x00002aaaab20dd20 in strlen () from /lib/libc.so.6
#1  0x000000000042866f in hashfunc2 (
    name1=0x44e4f697 <Address 0x44e4f697 out of bounds>,
    name2=0x4e202c6900000000 <Address 0x4e202c6900000000 out of bounds>,
    hashslots=1024) at utils.c:4285
#2  0x0000000000437d15 in find_service (
    host_name=0x44e4f697 <Address 0x44e4f697 out of bounds>,
    svc_desc=0x4e202c6900000000 <Address 0x4e202c6900000000 out of bounds>)
    at ../common/objects.c:5016
#3  0x00000000004518cf in handle_scheduled_downtime (temp_downtime=0xfe6500)
    at ../common/downtime.c:311
#4  0x000000000042130e in handle_timed_event (event=0x722320) at events.c:1289
#5  0x0000000000421893 in event_execution_loop () at events.c:964
#6  0x000000000040eeb2 in main (argc=Variable "argc" is not available.
) at nagios.c:710
(gdb)
I tried to look through the code, and the coredump, and the most I
could determine is this: It looks like the scheduled downtime event
struct was corrupted at some point during its life in the high-priority
event queue (for one thing, between the time Nagios was started and the
time it crashed, no more than 10 downtimes had ever been scheduled, yet
the downtime ID is 81, and no downtime had ever been scheduled that was
2072 hours long):
(gdb) frame 3
#3  0x00000000004518cf in handle_scheduled_downtime (temp_downtime=0xfe6500)
    at ../common/downtime.c:311
311     svc=find_service(temp_downtime->host_name,temp_downtime->service_description);
(gdb) print *temp_downtime
$1 = {type = 0, host_name = 0x44e4f697 <Address 0x44e4f697 out of bounds>,
  service_description = 0x4e202c6900000000 <Address 0x4e202c6900000000 out of bounds>,
  entry_time = 0, start_time = 2334111869775642625, end_time = 0,
  fixed = 6488400, triggered_by = 0, duration = 7459712, downtime_id = 81,
  author = 0x2aaa00000000 <Address 0x2aaa00000000 out of bounds>,
  comment = 0x44e4f6bf <Address 0x44e4f6bf out of bounds>, comment_id = 0,
  is_in_effect = 0, start_flex_downtime = 0, incremented_pending_downtime = 1,
  next = 0x0}
(gdb)
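As a quick sanity check on the numbers in that struct, here is a small Python sketch. The values are copied straight from the gdb output; the variable names are only for illustration. The last step is nothing more than a guess: it reads the out-of-bounds host_name pointer as if it were a time_t, to see whether it looks like a timestamp that landed in the wrong field.

```python
from datetime import datetime, timezone

# Values copied from the "print *temp_downtime" output.
duration = 7459712                 # seconds
start_time = 2334111869775642625   # would be seconds since the epoch, if valid
host_name_ptr = 0x44E4F697         # pointer value gdb reports as out of bounds

# The duration field, expressed in hours.
duration_hours = duration / 3600
print("duration: %.2f hours" % duration_hours)

# Any real start time in 2006 fits comfortably in 31 bits.
start_time_plausible = start_time < 2**31
print("start_time plausible:", start_time_plausible)

# Guess only: interpret the bad pointer as epoch seconds instead.
as_time = datetime.fromtimestamp(host_name_ptr, tz=timezone.utc)
print("host_name read as a UTC timestamp:", as_time.isoformat())
```

The duration works out to roughly 2072 hours, and the start_time is far outside any sensible range. If the pointer value really does decode to a date in August 2006, that might hint at a timestamp being written over the struct, but I can't confirm that from the core file alone.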
So, I've got two coredumps. When the second coredump took place,
and before restarting Nagios, I tarballed the entire Nagios directory,
including all log files, cache files, etc. I don't know if the object
cache or downtimes data files would be of any help, but I've got them in
storage.
So, what else? Well, I've looked at the event log for today, and
I did notice something weird: My recurring downtime scheduler schedules
the day's downtimes every day at midnight, writing commands out to the
Nagios external command file. The event log records 6
SCHEDULE_SVC_DOWNTIME commands being received, which is correct. The first
downtime started correctly, and ended correctly. However (here's the weird part),
the other downtimes started at the exact same moment the first downtime
ended. Even weirder, the second, third, and fourth downtimes ended
when they should have started. Here are all of the downtime-related entries
from the event log, with the time values converted into readable
dates/times:
[2006-08-15 00:11:29] EXTERNAL COMMAND: SCHEDULE_SVC_DOWNTIME;szlnmail1.shenzhen;CPU;[Tue Aug 15 17:55:00 2006];[Tue Aug 15 19:30:00 2006];1;0;0;kornelak;Weekday backup.
[2006-08-15 00:11:29] EXTERNAL COMMAND: SCHEDULE_SVC_DOWNTIME;shenzhendc1.shenzhen;CPU;[Tue Aug 15 14:55:00 2006];[Tue Aug 15 16:00:00 2006];1;0;0;kornelak;Daily backup
[2006-08-15 00:11:29] EXTERNAL COMMAND: SCHEDULE_SVC_DOWNTIME;westborod2.westboro;CPU;[Tue Aug 15 14:55:00 2006];[Tue Aug 15 16:00:00 2006];1;0;0;kornelak;Daily Backup
[2006-08-15 00:11:29] EXTERNAL COMMAND: SCHEDULE_SVC_DOWNTIME;westborom2.westboro;CPU;[Tue Aug 15 15:55:00 2006];[Tue Aug 15 18:30:00 2006];1;0;0;kornelak;Daily backup
[2006-08-15 00:11:29] EXTERNAL COMMAND: SCHEDULE_SVC_DOWNTIME;hillsborom1.hillsboro;CPU;[Tue Aug 15 22:55:00 2006];[Tue Aug 15 23:15:00 2006];1;0;0;kornelak;Daily Backup
[2006-08-15 00:11:29] EXTERNAL COMMAND: SCHEDULE_SVC_DOWNTIME;sophiad1.nice;CPU;[Tue Aug 15 07:55:00 2006];[Tue Aug 15 09:30:00 2006];1;0;0;kornelak;Daily Backup
[2006-08-15 07:55:03] SERVICE DOWNTIME ALERT: sophiad1.nice;CPU;STARTED; Service has entered a period of scheduled downtime
[2006-08-15 09:30:09] SERVICE DOWNTIME ALERT: sophiad1.nice;CPU;STOPPED; Service has exited from a period of scheduled downtime
[2006-08-15 09:30:09] SERVICE DOWNTIME ALERT: shenzhendc1.shenzhen;CPU;STARTED; Service has entered a period of scheduled downtime
[2006-08-15 09:30:09] SERVICE DOWNTIME ALERT: westborod2.westboro;CPU;STARTED; Service has entered a period of scheduled downtime
[2006-08-15 09:30:09] SERVICE DOWNTIME ALERT: westborom2.westboro;CPU;STARTED; Service has entered a period of scheduled downtime
[2006-08-15 09:30:09] SERVICE DOWNTIME ALERT: szlnmail1.shenzhen;CPU;STARTED; Service has entered a period of scheduled downtime
[2006-08-15 09:30:09] SERVICE DOWNTIME ALERT: hillsborom1.hillsboro;CPU;STARTED; Service has entered a period of scheduled downtime
[2006-08-15 14:55:00] SERVICE DOWNTIME ALERT: shenzhendc1.shenzhen;CPU;STOPPED; Service has exited from a period of scheduled downtime
[2006-08-15 14:55:00] SERVICE DOWNTIME ALERT: westborod2.westboro;CPU;STOPPED; Service has exited from a period of scheduled downtime
[2006-08-15 15:55:02] SERVICE DOWNTIME ALERT: westborom2.westboro;CPU;STOPPED; Service has exited from a period of scheduled downtime
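For anyone who wants to reproduce the scheduling side, the SCHEDULE_SVC_DOWNTIME entries in the log above follow the usual external command layout (host, service, start, end, fixed, trigger ID, duration, author, comment), with the times sent as epoch seconds rather than the readable dates shown. The Python sketch below builds one such line. It is not my scheduler tool itself: the function name is hypothetical, and it assumes the times are UTC, which may not match the server's time zone.

```python
from datetime import datetime, timezone

def svc_downtime_command(now, host, service, start, end, author, comment,
                         fixed=True, trigger_id=0, duration=0):
    """Build one SCHEDULE_SVC_DOWNTIME line (hypothetical helper).

    now, start and end are timezone-aware datetimes; Nagios expects
    them as integer seconds since the epoch.
    """
    fields = [
        "SCHEDULE_SVC_DOWNTIME",
        host,
        service,
        str(int(start.timestamp())),
        str(int(end.timestamp())),
        "1" if fixed else "0",
        str(trigger_id),
        str(duration),
        author,
        comment,
    ]
    return "[%d] %s" % (int(now.timestamp()), ";".join(fields))

utc = timezone.utc  # assumption: the real server may use a local time zone
line = svc_downtime_command(
    now=datetime(2006, 8, 15, 0, 11, 29, tzinfo=utc),
    host="sophiad1.nice",
    service="CPU",
    start=datetime(2006, 8, 15, 7, 55, 0, tzinfo=utc),
    end=datetime(2006, 8, 15, 9, 30, 0, tzinfo=utc),
    author="kornelak",
    comment="Daily Backup",
)
print(line)
```

The resulting line, followed by a newline, is what gets appended to the external command file. The location of that file depends on the installation, so it is left out here.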
Nagios crashed at 2006-08-15 16:00, which happens to be the time that the
shenzhendc1.shenzhen->CPU and westborod2.westboro->CPU downtimes were
supposed to end.
Notice how Nagios was fine until sophiad1.nice came out of
downtime, and suddenly everything else went into downtime!
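To make the mismatch easier to see, here is a short Python sketch that compares each SERVICE DOWNTIME ALERT from the log above against the window that was scheduled for that service. The data is transcribed from the log entries; the function name and the 60-second tolerance are arbitrary choices for illustration.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# host -> (scheduled start, scheduled end), from the SCHEDULE_SVC_DOWNTIME entries
scheduled = {
    "sophiad1.nice": ("2006-08-15 07:55:00", "2006-08-15 09:30:00"),
    "shenzhendc1.shenzhen": ("2006-08-15 14:55:00", "2006-08-15 16:00:00"),
    "westborod2.westboro": ("2006-08-15 14:55:00", "2006-08-15 16:00:00"),
    "westborom2.westboro": ("2006-08-15 15:55:00", "2006-08-15 18:30:00"),
    "szlnmail1.shenzhen": ("2006-08-15 17:55:00", "2006-08-15 19:30:00"),
    "hillsborom1.hillsboro": ("2006-08-15 22:55:00", "2006-08-15 23:15:00"),
}

# (time, host, event), from the SERVICE DOWNTIME ALERT entries
alerts = [
    ("2006-08-15 07:55:03", "sophiad1.nice", "STARTED"),
    ("2006-08-15 09:30:09", "sophiad1.nice", "STOPPED"),
    ("2006-08-15 09:30:09", "shenzhendc1.shenzhen", "STARTED"),
    ("2006-08-15 09:30:09", "westborod2.westboro", "STARTED"),
    ("2006-08-15 09:30:09", "westborom2.westboro", "STARTED"),
    ("2006-08-15 09:30:09", "szlnmail1.shenzhen", "STARTED"),
    ("2006-08-15 09:30:09", "hillsborom1.hillsboro", "STARTED"),
    ("2006-08-15 14:55:00", "shenzhendc1.shenzhen", "STOPPED"),
    ("2006-08-15 14:55:00", "westborod2.westboro", "STOPPED"),
    ("2006-08-15 15:55:02", "westborom2.westboro", "STOPPED"),
]

def classify(alerts, scheduled, tolerance=60):
    """Label each alert as on time, stopped at its start time, or off by N seconds."""
    results = []
    for stamp, host, event in alerts:
        seen = datetime.strptime(stamp, FMT)
        start, end = (datetime.strptime(s, FMT) for s in scheduled[host])
        expected = start if event == "STARTED" else end
        offset = (seen - expected).total_seconds()
        if abs(offset) <= tolerance:
            verdict = "on time"
        elif event == "STOPPED" and abs((seen - start).total_seconds()) <= tolerance:
            verdict = "stopped at scheduled START"
        else:
            verdict = "off by %d s" % offset
        results.append((host, event, verdict))
    return results

results = classify(alerts, scheduled)
for host, event, verdict in results:
    print("%-24s %-8s %s" % (host, event, verdict))
```

Only the two sophiad1.nice events come out as on time. The five other STARTED events all fall hours early, at the moment sophiad1.nice stopped, and the three remaining STOPPED events line up with the scheduled start times rather than the end times.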
So, that's all I've got. Hopefully it's enough for someone to run
with it and figure out what's going on. Up to now I've been running
Nagios 2.5. At the time this email goes out, I'll be running the version
of Nagios in CVS (copied from the daily tarball). I'll let you know if
the version in CVS works, but for now I'm going to assume that it does
not. Hopefully this is the right place to ask for help (and to ask if
anyone else has seen this behavior). I'd be happy to resubmit this info
somewhere else, if needed. Thanks in advance for your help!
-- A. Karl Kornel, Mindspeed Technologies, Inc.
karl.kornel at mindspeed.com -- (949) 579-3503
"Remember the Rules: Separation & Optimization"