Dear fellow Nagios users,

Ever since downloading my first Nagios tarball (2.0rc2), and continuing through version 2.5, I have been noticing a big problem with downtimes. It appears that if more than a couple of downtimes are scheduled, Nagios will crash partway through the list. This is a really big problem, for several reasons:

I've written a tool that automatically schedules recurring downtime. It is integrated into the Nagios web site, and anyone with access to schedule downtime for a host/service can schedule recurring downtime, and see the other recurring downtimes that have been scheduled by their fellow contacts. I want to release it publicly, but I can't finish testing because Nagios keeps crashing (several downtimes a day == one crash per day).

On a different note, this (Nagios) was the first open-source tool to become widespread in our IT department. I'd like Nagios to continue to gain acceptance in our group (most of whom work with SAP/Oracle/Windows/etc.), and this problem doesn't help.

I don't want to look like I'm barging in and saying "Fix this!", so I brought some stuff along with me, and I did what I could to diagnose the problem.

First of all, I have two core files. When I installed Nagios, I used the 'install-unstripped' target, so these core files, and my copy of Nagios, include debugging symbols.
Here's what the backtrace looks like from the most recent coredump:

<tt>(gdb) bt</tt>
<tt>#0 0x00002aaaab20dd20 in strlen () from /lib/libc.so.6</tt>
<tt>#1 0x000000000042866f in hashfunc2 (</tt>
<tt> name1=0x44e4f697 <Address 0x44e4f697 out of bounds>,</tt>
<tt> name2=0x4e202c6900000000 <Address 0x4e202c6900000000 out of bounds>,</tt>
<tt> hashslots=1024) at utils.c:4285</tt>
<tt>#2 0x0000000000437d15 in find_service (</tt>
<tt> host_name=0x44e4f697 <Address 0x44e4f697 out of bounds>,</tt>
<tt> svc_desc=0x4e202c6900000000 <Address 0x4e202c6900000000 out of bounds>)</tt>
<tt> at ../common/objects.c:5016</tt>
<tt>#3 0x00000000004518cf in handle_scheduled_downtime (temp_downtime=0xfe6500)</tt>
<tt> at ../common/downtime.c:311</tt>
<tt>#4 0x000000000042130e in handle_timed_event (event=0x722320) at events.c:1289</tt>
<tt>#5 0x0000000000421893 in event_execution_loop () at events.c:964</tt>
<tt>#6 0x000000000040eeb2 in main (argc=Variable "argc" is not available.</tt>
<tt>) at nagios.c:710</tt>
<tt>(gdb) </tt>

I tried to look through the code and the coredump, and the most I could determine is this: it looks like the scheduled downtime event struct was corrupted at some point during its life in the high-priority event queue. For one thing, between the time Nagios was started and the time it crashed, no more than 10 downtimes had ever been scheduled, yet the downtime ID is 81. For another, no downtime had ever been scheduled that was 2072 hours long, which is what the duration of 7459712 seconds works out to:

<tt>(gdb) frame 3</tt>
<tt>#3 0x00000000004518cf in handle_scheduled_downtime (temp_downtime=0xfe6500)</tt>
<tt> at ../common/downtime.c:311</tt>
<tt>311 svc=find_service(temp_downtime->host_name,temp_downtime->service_description);</tt>
<tt>(gdb) print *temp_downtime</tt>
<tt>$1 = {type = 0, host_name = 0x44e4f697 <Address 0x44e4f697 out of bounds>,</tt>
<tt> service_description = 0x4e202c6900000000 <Address 0x4e202c6900000000 out of bounds>, entry_time = 0, start_time = 2334111869775642625, end_time = 0,</tt>
<tt> fixed = 6488400, triggered_by = 0, duration = 7459712, downtime_id = 81,</tt>
<tt> author = 0x2aaa00000000 <Address 0x2aaa00000000 out of bounds>,</tt>
<tt> comment = 0x44e4f6bf <Address 0x44e4f6bf out of bounds>, comment_id = 0,</tt>
<tt> is_in_effect = 0, start_flex_downtime = 0, incremented_pending_downtime = 1,</tt>
<tt> next = 0x0}</tt>
<tt>(gdb) </tt>

So, I've got two coredumps. When the second coredump took place, and before restarting Nagios, I tarballed the entire Nagios directory, including all log files, cache files, etc. I don't know if the object cache or downtimes data files would be of any help, but I've got them in storage.

So, what else? Well, I've looked at the event log for today, and I did notice something weird. My recurring downtime scheduler schedules the day's downtimes every day at midnight, writing commands out to the Nagios command socket. The event log records receiving 6 SCHEDULE_SVC_DOWNTIME commands, which is correct. The first downtime started correctly, and ended correctly. However (here's the weird part), the other downtimes started at the exact same moment the first downtime ended. Even weirder, the second, third, and fourth downtimes ended when they should have started.
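For reference, here is a minimal sketch of how one of these command lines can be built. It is not the scheduler's actual code: the helper name `schedule_svc_downtime` and the use of UTC datetimes are assumptions made for illustration. The field order (host, service, start, end, fixed, trigger ID, duration, author, comment) matches the commands recorded in the event log, with the start and end times written as seconds since the Unix epoch.

```python
from datetime import datetime, timezone


def schedule_svc_downtime(now, host, service, start, end, author, comment,
                          fixed=True, trigger_id=0, duration=0):
    """Return one newline-terminated SCHEDULE_SVC_DOWNTIME command line.

    now, start and end must be timezone-aware datetimes; they are written
    out as whole seconds since the Unix epoch.
    """
    if end <= start:
        raise ValueError("downtime must end after it starts")
    # Defensive assumption: fields are ';'-separated, one command per line,
    # so separators inside a field would corrupt the command.
    for value in (host, service, author):
        if ";" in value or "\n" in value:
            raise ValueError("field contains a separator: %r" % value)
    if "\n" in comment:
        raise ValueError("comment must be a single line")

    fields = [
        "SCHEDULE_SVC_DOWNTIME",
        host,
        service,
        str(int(start.timestamp())),
        str(int(end.timestamp())),
        "1" if fixed else "0",
        str(trigger_id),
        str(duration),
        author,
        comment,
    ]
    return "[%d] %s\n" % (int(now.timestamp()), ";".join(fields))


# Example: the sophiad1.nice downtime from the log, with the times
# treated as UTC purely for illustration.
utc = timezone.utc
line = schedule_svc_downtime(
    now=datetime(2006, 8, 15, 0, 11, 29, tzinfo=utc),
    host="sophiad1.nice",
    service="CPU",
    start=datetime(2006, 8, 15, 7, 55, tzinfo=utc),
    end=datetime(2006, 8, 15, 9, 30, tzinfo=utc),
    author="kornelak",
    comment="Daily Backup",
)
print(line, end="")
```

The resulting string would then be appended to the Nagios external command file, whose location depends on the installation. Validating `end > start` before writing is cheap and rules out the scheduler itself as the source of a nonsensical downtime.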
Here are all of the downtime-related entries from the event log, with the time values converted into readable dates/times:

<tt>[2006-08-15 00:11:29] EXTERNAL COMMAND: SCHEDULE_SVC_DOWNTIME;szlnmail1.shenzhen;CPU;[Tue Aug 15 17:55:00 2006];[Tue Aug 15 19:30:00 2006];1;0;0;kornelak;Weekday backup.</tt>
<tt>[2006-08-15 00:11:29] EXTERNAL COMMAND: SCHEDULE_SVC_DOWNTIME;shenzhendc1.shenzhen;CPU;[Tue Aug 15 14:55:00 2006];[Tue Aug 15 16:00:00 2006];1;0;0;kornelak;Daily backup</tt>
<tt>[2006-08-15 00:11:29] EXTERNAL COMMAND: SCHEDULE_SVC_DOWNTIME;westborod2.westboro;CPU;[Tue Aug 15 14:55:00 2006];[Tue Aug 15 16:00:00 2006];1;0;0;kornelak;Daily Backup</tt>
<tt>[2006-08-15 00:11:29] EXTERNAL COMMAND: SCHEDULE_SVC_DOWNTIME;westborom2.westboro;CPU;[Tue Aug 15 15:55:00 2006];[Tue Aug 15 18:30:00 2006];1;0;0;kornelak;Daily backup</tt>
<tt>[2006-08-15 00:11:29] EXTERNAL COMMAND: SCHEDULE_SVC_DOWNTIME;hillsborom1.hillsboro;CPU;[Tue Aug 15 22:55:00 2006];[Tue Aug 15 23:15:00 2006];1;0;0;kornelak;Daily Backup</tt>
<tt>[2006-08-15 00:11:29] EXTERNAL COMMAND: SCHEDULE_SVC_DOWNTIME;sophiad1.nice;CPU;[Tue Aug 15 07:55:00 2006];[Tue Aug 15 09:30:00 2006];1;0;0;kornelak;Daily Backup</tt>
<tt>[2006-08-15 07:55:03] SERVICE DOWNTIME ALERT: sophiad1.nice;CPU;STARTED; Service has entered a period of scheduled downtime</tt>
<tt>[2006-08-15 09:30:09] SERVICE DOWNTIME ALERT: sophiad1.nice;CPU;STOPPED; Service has exited from a period of scheduled downtime</tt>
<tt>[2006-08-15 09:30:09] SERVICE DOWNTIME ALERT: shenzhendc1.shenzhen;CPU;STARTED; Service has entered a period of scheduled downtime</tt>
<tt>[2006-08-15 09:30:09] SERVICE DOWNTIME ALERT: westborod2.westboro;CPU;STARTED; Service has entered a period of scheduled downtime</tt>
<tt>[2006-08-15 09:30:09] SERVICE DOWNTIME ALERT: westborom2.westboro;CPU;STARTED; Service has entered a period of scheduled downtime</tt>
<tt>[2006-08-15 09:30:09] SERVICE DOWNTIME ALERT: szlnmail1.shenzhen;CPU;STARTED; Service has entered a period of scheduled downtime</tt>
<tt>[2006-08-15 09:30:09] SERVICE DOWNTIME ALERT: hillsborom1.hillsboro;CPU;STARTED; Service has entered a period of scheduled downtime</tt>
<tt>[2006-08-15 14:55:00] SERVICE DOWNTIME ALERT: shenzhendc1.shenzhen;CPU;STOPPED; Service has exited from a period of scheduled downtime</tt>
<tt>[2006-08-15 14:55:00] SERVICE DOWNTIME ALERT: westborod2.westboro;CPU;STOPPED; Service has exited from a period of scheduled downtime</tt>
<tt>[2006-08-15 15:55:02] SERVICE DOWNTIME ALERT: westborom2.westboro;CPU;STOPPED; Service has exited from a period of scheduled downtime</tt>

Nagios crashed at 2006-08-15 16:00, which happens to be the time that the shenzhendc1.shenzhen->CPU and westborod2.westboro->CPU downtimes were supposed to end. Notice how Nagios was fine until sophiad1.nice came out of downtime, and suddenly everything else went into downtime!

So, that's all I've got. Hopefully it's enough for someone to run with it and figure out what's going on. Up to now I've been running Nagios 2.5. By the time this email goes out, I'll be running the version of Nagios in CVS (copied from the daily tarball). I'll let you know if the version in CVS works, but for now I'm going to assume that it does not.

Hopefully this is the right place to ask for help (and to ask if anyone else has seen this behavior). I'd be happy to resubmit this info somewhere else, if needed.

Thanks in advance for your help!

--
A. Karl Kornel, Mindspeed Technologies, Inc.
karl.kornel@mindspeed.com -- (949) 579-3503
"Remember the Rules: Separation & Optimization"