<br><font size=2 face="sans-serif">Dear fellow Nagios users,</font>
<br>
<br><font size=2 face="sans-serif"> Ever
since downloading my first Nagios tarball (2.0rc2), and continuing to version
2.5, I have been noticing a big problem with downtimes. It appears
that if there are more than a couple of downtimes scheduled, Nagios will
crash partway through the list. This is a really big problem, for
several reasons: I've written a tool that automatically schedules
recurring downtime. It integrates with the Nagios web interface, and anyone
with access to schedule downtime for a host/service can schedule recurring
downtime, and see the other recurring downtimes that have been scheduled
by their fellow contacts. I want to release it publicly, but I can't
finish testing because of Nagios crashing (several downtimes a day == one
crash per day). On a different note, this (Nagios) was the first
open-source tool to become widespread in our IT department. I'd like
Nagios to continue to gain acceptance in our group (most of whom are SAP/Oracle/Windows/etc.),
and this problem doesn't help.</font>
<br>
<br><font size=2 face="sans-serif"> I
don't want to look like I'm barging in and saying "Fix this!",
so I brought some stuff along with me, and I did what I could to diagnose
the problem. First of all, I have two core files. When I installed
Nagios, I used the 'install-unstripped' target, so these core files, and
my copy of Nagios, include debugging symbols. Here's what the backtrace
looks like from the most recent coredump:</font>
<br>
<br><font size=2><tt>(gdb) bt</tt></font>
<br><font size=2><tt>#0 0x00002aaaab20dd20 in strlen () from /lib/libc.so.6</tt></font>
<br><font size=2><tt>#1 0x000000000042866f in hashfunc2 (</tt></font>
<br><font size=2><tt> name1=0x44e4f697 <Address 0x44e4f697
out of bounds>,</tt></font>
<br><font size=2><tt> name2=0x4e202c6900000000 <Address
0x4e202c6900000000 out of bounds>,</tt></font>
<br><font size=2><tt> hashslots=1024) at utils.c:4285</tt></font>
<br><font size=2><tt>#2 0x0000000000437d15 in find_service (</tt></font>
<br><font size=2><tt> host_name=0x44e4f697 <Address 0x44e4f697
out of bounds>,</tt></font>
<br><font size=2><tt> svc_desc=0x4e202c6900000000 <Address
0x4e202c6900000000 out of bounds>)</tt></font>
<br><font size=2><tt> at ../common/objects.c:5016</tt></font>
<br><font size=2><tt>#3 0x00000000004518cf in handle_scheduled_downtime
(temp_downtime=0xfe6500)</tt></font>
<br><font size=2><tt> at ../common/downtime.c:311</tt></font>
<br><font size=2><tt>#4 0x000000000042130e in handle_timed_event
(event=0x722320) at events.c:1289</tt></font>
<br><font size=2><tt>#5 0x0000000000421893 in event_execution_loop
() at events.c:964</tt></font>
<br><font size=2><tt>#6 0x000000000040eeb2 in main (argc=Variable
"argc" is not available.</tt></font>
<br><font size=2><tt>) at nagios.c:710</tt></font>
<br><font size=2><tt>(gdb) </tt></font>
<br><font size=2 face="sans-serif"><br>
I tried to look through the code, and
the coredump, and the most I could determine is this: It looks like
the scheduled downtime event struct was corrupted at some point during
its life in the high-priority event queue (for one thing, between the time
Nagios was started and the time it crashed, no more than 10 downtimes had
ever been scheduled, yet the downtime ID is 81, and no downtime had ever
been scheduled that was 2072 hours long):</font>
<br>
<br><font size=2><tt>(gdb) frame 3</tt></font>
<br><font size=2><tt>#3 0x00000000004518cf in handle_scheduled_downtime
(temp_downtime=0xfe6500)</tt></font>
<br><font size=2><tt> at ../common/downtime.c:311</tt></font>
<br><font size=2><tt>311
svc=find_service(temp_downtime->host_name,temp_downtime->service_description);</tt></font>
<br><font size=2><tt>(gdb) print *temp_downtime</tt></font>
<br><font size=2><tt>$1 = {type = 0, host_name = 0x44e4f697 <Address
0x44e4f697 out of bounds>,</tt></font>
<br><font size=2><tt> service_description = 0x4e202c6900000000 <Address
0x4e202c6900000000 out of bounds>, entry_time = 0, start_time = 2334111869775642625,
end_time = 0,</tt></font>
<br><font size=2><tt> fixed = 6488400, triggered_by = 0, duration
= 7459712, downtime_id = 81,</tt></font>
<br><font size=2><tt> author = 0x2aaa00000000 <Address 0x2aaa00000000
out of bounds>,</tt></font>
<br><font size=2><tt> comment = 0x44e4f6bf <Address 0x44e4f6bf
out of bounds>, comment_id = 0,</tt></font>
<br><font size=2><tt> is_in_effect = 0, start_flex_downtime = 0,
incremented_pending_downtime = 1,</tt></font>
<br><font size=2><tt> next = 0x0}</tt></font>
<br><font size=2><tt>(gdb) </tt></font>
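<br>
<br><font size=2 face="sans-serif">As a cross-check on the "2072 hours" figure, and as a possible hint about what overwrote the struct, here is a minimal Python sketch. It assumes a little-endian 64-bit host (an assumption on my part; the addresses in the backtrace are consistent with x86-64 Linux), and the helper name as_bytes is made up for illustration. It converts the duration to whole hours and reinterprets two of the 8-byte fields as raw bytes, to see whether they look like text rather than times or pointers.</font>

```python
def as_bytes(value, width=8):
    """Return the in-memory bytes of an unsigned integer, assuming little-endian."""
    return value.to_bytes(width, "little")

# Values copied from the "print *temp_downtime" output above.
duration_seconds = 7459712
start_time = 2334111869775642625
service_description = 0x4E202C6900000000

print(duration_seconds // 3600)        # whole hours of "downtime"
print(as_bytes(start_time))            # raw bytes of the start_time field
print(as_bytes(service_description))   # raw bytes of the pointer field
```

<br><font size=2 face="sans-serif">If the byte-order assumption holds, the upper halves of both fields decode to printable ASCII ("and " and "i, N"). That would be consistent with string data having been written over the struct, though this is only a guess.</font>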
<br>
<br><font size=2 face="sans-serif"> So,
I've got two coredumps. When the second coredump took place, and
before restarting Nagios, I tarballed the entire Nagios directory, including
all log files, cache files, etc. I don't know if the object cache
or downtimes data files would be of any help, but I've got them in storage.</font>
<br>
<br><font size=2 face="sans-serif"> So,
what else? Well, I've looked at the event log for today, and I did
notice something weird: My recurring downtime scheduler schedules
the day's downtimes every day at midnight, writing commands out to the
Nagios command socket. The event logs record receiving 6 SCHEDULE_SVC_DOWNTIME
commands, which is correct. The first downtime started correctly,
and ended correctly. However (here's the weird part), the other downtimes
started at the exact same moment the first downtime ended. Even
weirder, the second, third, and fourth downtimes ended when they should have
started. Here are all of the downtime-related entries from the event
log, with the time values converted into readable dates/times:</font>
<br>
<br><font size=2><tt>[2006-08-15 00:11:29] EXTERNAL COMMAND: SCHEDULE_SVC_DOWNTIME;szlnmail1.shenzhen;CPU;[Tue
Aug 15 17:55:00 2006];[Tue Aug 15 19:30:00 2006];1;0;0;kornelak;Weekday
backup.</tt></font>
<br><font size=2><tt>[2006-08-15 00:11:29] EXTERNAL COMMAND: SCHEDULE_SVC_DOWNTIME;shenzhendc1.shenzhen;CPU;[Tue
Aug 15 14:55:00 2006];[Tue Aug 15 16:00:00 2006];1;0;0;kornelak;Daily backup</tt></font>
<br><font size=2><tt>[2006-08-15 00:11:29] EXTERNAL COMMAND: SCHEDULE_SVC_DOWNTIME;westborod2.westboro;CPU;[Tue
Aug 15 14:55:00 2006];[Tue Aug 15 16:00:00 2006];1;0;0;kornelak;Daily Backup</tt></font>
<br><font size=2><tt>[2006-08-15 00:11:29] EXTERNAL COMMAND: SCHEDULE_SVC_DOWNTIME;westborom2.westboro;CPU;[Tue
Aug 15 15:55:00 2006];[Tue Aug 15 18:30:00 2006];1;0;0;kornelak;Daily backup</tt></font>
<br><font size=2><tt>[2006-08-15 00:11:29] EXTERNAL COMMAND: SCHEDULE_SVC_DOWNTIME;hillsborom1.hillsboro;CPU;[Tue
Aug 15 22:55:00 2006];[Tue Aug 15 23:15:00 2006];1;0;0;kornelak;Daily Backup</tt></font>
<br><font size=2><tt>[2006-08-15 00:11:29] EXTERNAL COMMAND: SCHEDULE_SVC_DOWNTIME;sophiad1.nice;CPU;[Tue
Aug 15 07:55:00 2006];[Tue Aug 15 09:30:00 2006];1;0;0;kornelak;Daily Backup</tt></font>
<br><font size=2><tt>[2006-08-15 07:55:03] SERVICE DOWNTIME ALERT: sophiad1.nice;CPU;STARTED;
Service has entered a period of scheduled downtime</tt></font>
<br><font size=2><tt>[2006-08-15 09:30:09] SERVICE DOWNTIME ALERT: sophiad1.nice;CPU;STOPPED;
Service has exited from a period of scheduled downtime</tt></font>
<br><font size=2><tt>[2006-08-15 09:30:09] SERVICE DOWNTIME ALERT: shenzhendc1.shenzhen;CPU;STARTED;
Service has entered a period of scheduled downtime</tt></font>
<br><font size=2><tt>[2006-08-15 09:30:09] SERVICE DOWNTIME ALERT: westborod2.westboro;CPU;STARTED;
Service has entered a period of scheduled downtime</tt></font>
<br><font size=2><tt>[2006-08-15 09:30:09] SERVICE DOWNTIME ALERT: westborom2.westboro;CPU;STARTED;
Service has entered a period of scheduled downtime</tt></font>
<br><font size=2><tt>[2006-08-15 09:30:09] SERVICE DOWNTIME ALERT: szlnmail1.shenzhen;CPU;STARTED;
Service has entered a period of scheduled downtime</tt></font>
<br><font size=2><tt>[2006-08-15 09:30:09] SERVICE DOWNTIME ALERT: hillsborom1.hillsboro;CPU;STARTED;
Service has entered a period of scheduled downtime</tt></font>
<br><font size=2><tt>[2006-08-15 14:55:00] SERVICE DOWNTIME ALERT: shenzhendc1.shenzhen;CPU;STOPPED;
Service has exited from a period of scheduled downtime</tt></font>
<br><font size=2><tt>[2006-08-15 14:55:00] SERVICE DOWNTIME ALERT: westborod2.westboro;CPU;STOPPED;
Service has exited from a period of scheduled downtime</tt></font>
<br><font size=2><tt>[2006-08-15 15:55:02] SERVICE DOWNTIME ALERT: westborom2.westboro;CPU;STOPPED;
Service has exited from a period of scheduled downtime</tt></font>
<br><font size=2 face="sans-serif">Nagios crashed at 2006-08-15 16:00,
which happens to be the time that the shenzhendc1.shenzhen->CPU and
westborod2.westboro->CPU downtimes were supposed to end.</font>
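<br>
<br><font size=2 face="sans-serif">For anyone who wants to reproduce this, here is a minimal Python sketch of how a scheduler could build one of the SCHEDULE_SVC_DOWNTIME lines shown in the log above. The field order follows the documented external command format (host, service, start, end, fixed, trigger ID, duration, author, comment). The function name and the timestamps are placeholders for illustration, not taken from my tool. Nagios expects the times as Unix timestamps; the log above shows them converted to readable form.</font>

```python
import time

def format_schedule_svc_downtime(host, service, start, end, author, comment,
                                 fixed=1, trigger_id=0, duration=0, now=None):
    """Build one SCHEDULE_SVC_DOWNTIME external command line.

    start, end and now are Unix timestamps (seconds since the epoch).
    """
    if now is None:
        now = int(time.time())
    fields = [host, service, int(start), int(end),
              fixed, trigger_id, duration, author, comment]
    return "[%d] SCHEDULE_SVC_DOWNTIME;%s\n" % (
        now, ";".join(str(field) for field in fields))

# Placeholder timestamps: a fixed window 95 minutes long.
line = format_schedule_svc_downtime(
    "sophiad1.nice", "CPU", 1155628500, 1155634200,
    "kornelak", "Daily Backup", now=1155600689)
print(line, end="")
```

<br><font size=2 face="sans-serif">Writing the line to the Nagios command file is left out, since its path depends on the installation.</font>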
<br>
<br><font size=2 face="sans-serif"> Notice
how Nagios was fine until sophiad1.nice came out of downtime, and suddenly
everything else went into downtime!</font>
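<br>
<br><font size=2 face="sans-serif">To make the mismatch easier to see, here is a small Python sketch that compares the scheduled windows with the alerts actually logged. The data is copied from the log above; the function name and the 60-second tolerance are arbitrary choices for illustration.</font>

```python
from datetime import datetime, timedelta

def t(hms):
    """Parse a time of day on 2006-08-15, the date of the log above."""
    return datetime.strptime("2006-08-15 " + hms, "%Y-%m-%d %H:%M:%S")

# Scheduled (start, end) windows from the SCHEDULE_SVC_DOWNTIME entries.
SCHEDULED = {
    "szlnmail1.shenzhen": (t("17:55:00"), t("19:30:00")),
    "shenzhendc1.shenzhen": (t("14:55:00"), t("16:00:00")),
    "westborod2.westboro": (t("14:55:00"), t("16:00:00")),
    "westborom2.westboro": (t("15:55:00"), t("18:30:00")),
    "hillsborom1.hillsboro": (t("22:55:00"), t("23:15:00")),
    "sophiad1.nice": (t("07:55:00"), t("09:30:00")),
}

# Alerts from the SERVICE DOWNTIME ALERT entries.
OBSERVED = [
    (t("07:55:03"), "sophiad1.nice", "STARTED"),
    (t("09:30:09"), "sophiad1.nice", "STOPPED"),
    (t("09:30:09"), "shenzhendc1.shenzhen", "STARTED"),
    (t("09:30:09"), "westborod2.westboro", "STARTED"),
    (t("09:30:09"), "westborom2.westboro", "STARTED"),
    (t("09:30:09"), "szlnmail1.shenzhen", "STARTED"),
    (t("09:30:09"), "hillsborom1.hillsboro", "STARTED"),
    (t("14:55:00"), "shenzhendc1.shenzhen", "STOPPED"),
    (t("14:55:00"), "westborod2.westboro", "STOPPED"),
    (t("15:55:02"), "westborom2.westboro", "STOPPED"),
]

def find_anomalies(scheduled, observed, tolerance=timedelta(seconds=60)):
    """Return alerts that are further than `tolerance` from their scheduled time."""
    anomalies = []
    for when, host, kind in observed:
        start, end = scheduled[host]
        expected = start if kind == "STARTED" else end
        if abs(when - expected) > tolerance:
            anomalies.append((host, kind, when, expected))
    return anomalies

if __name__ == "__main__":
    for host, kind, when, expected in find_anomalies(SCHEDULED, OBSERVED):
        print(f"{host}: {kind} at {when:%H:%M:%S}, expected {expected:%H:%M:%S}")
```

<br><font size=2 face="sans-serif">Only the sophiad1.nice alerts fall within the tolerance. Every other alert is flagged, and each of the three early STOPPED alerts lands on that service's scheduled start time.</font>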
<br>
<br><font size=2 face="sans-serif"> So,
that's all I've got. Hopefully it's enough for someone to run with
it and figure out what's going on. Up to now I've been running Nagios
2.5. At the time this email goes out, I'll be running the version
of Nagios in CVS (copied from the daily tarball). I'll let you know
if the version in CVS works, but for now I'm going to assume that it does
not. Hopefully this is the right place to ask for help (and to ask
if anyone else has seen this behavior). I'd be happy to resubmit
this info somewhere else, if needed. Thanks in advance for your help!</font>
<br>
<br><font size=2 face="sans-serif">-- A. Karl Kornel, Mindspeed Technologies,
Inc.<br>
karl.kornel@mindspeed.com -- (949) 579-3503<br>
"Remember the Rules: Separation & Optimization"</font>