Bug report: nagios shutdown removing lock file too early
Ton Voon
ton.voon at altinity.com
Tue Jun 20 16:56:44 CEST 2006
On 19 Jun 2006, at 21:46, Ethan Galstad wrote:
> Ton Voon wrote:
>> Ethan,
>>
>> I think I've seen a problem with the nagios shutdown routine. If
>> nagios is doing a host check and an INT signal is sent, it seems to
>> take a long time before the nagios daemon dies. It looks like the
>> child nagios process is trying to complete all the retries for a host
>> check before going back into the main loop.
>>
>> Also, it appears that the lockfile is being removed before the main
>> process dies. Below is the output for a 'while true; do ps -p 728; ls
>> -l /usr/local/nagios/var/nagios.lock; sleep 1; done' during a kill
>> 728.
>>
>> [snipped]
>> PID TT STAT TIME COMMAND
>> 728 ?? Ss 0:01.95 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
>> -rw-r--r-- 1 nagios nagios 4 Jun 13 17:20 /usr/local/nagios/var/nagios.lock
>> PID TT STAT TIME COMMAND
>> 728 ?? Ss 0:01.95 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
>> -rw-r--r-- 1 nagios nagios 4 Jun 13 17:20 /usr/local/nagios/var/nagios.lock
>> PID TT STAT TIME COMMAND
>> 728 ?? Ss 0:01.95 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
>> ls: /usr/local/nagios/var/nagios.lock: No such file or directory
>> PID TT STAT TIME COMMAND
>> 728 ?? Ss 0:01.95 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
>> ls: /usr/local/nagios/var/nagios.lock: No such file or directory
>>
>> This shows the lockfile gets removed before the main daemon dies.
>> (This is from a kill 728, not using any init scripts.) Eventually the
>> daemon dies.
>>
>> I've tested this on Nagios 2.2 on MacOSX 10.4, Nagios 2.0 on Debian
>> and Nagios 2.4 on Debian.
>>
>> Sorry, not had time to delve into the source code.
>
> Yep, this is a bug. It's been present for several years now, so I
> suppose we could get around to fixing it. :-) Is the early lockfile
> removal causing noticeable problems with anything?
I think the lockfile removal is the source of the "multiple Nagios
processes running" problem. The example daemon-init script uses the
lockfile as the status of the process. If you were to do a restart,
the stop would be reported as complete because the signal was sent,
but Nagios would actually still be in the process of shutting down.
Meanwhile a start would run, so another Nagios process is kicked off.
Then, as both Nagios processes are trying to access the same files,
mayhem can ensue :)

We've got our own startup script and we've changed the stop routine to
wait until nagios has actually stopped before moving out of the stop
function. Much more stable, but there's a long delay if Nagios is in
the middle of a host check.
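
A minimal sketch of a stop routine along those lines, not the actual
script in use here. The function names wait_for_exit and stop_nagios,
the 60 second default timeout, and the assumption that the lock file
holds the daemon's PID are all illustrative. It also assumes the
script runs as a user allowed to signal the daemon, since kill -0
fails on a permission error as well as on a missing process.

```shell
# Hypothetical helper: poll until PID no longer exists.
# Returns 0 once the process is gone, 1 if TIMEOUT seconds pass first.
wait_for_exit() {
    pid=$1
    timeout=${2:-60}
    # kill -0 sends no signal; it only tests whether the PID can be signalled.
    while kill -0 "$pid" 2>/dev/null; do
        [ "$timeout" -gt 0 ] || return 1
        sleep 1
        timeout=$((timeout - 1))
    done
    return 0
}

# Hypothetical stop routine: read the PID first (the daemon may remove
# the lock file early), send the default TERM signal, then wait.
stop_nagios() {
    lockfile=$1
    [ -r "$lockfile" ] || return 0          # no lock file: nothing to stop
    pid=$(cat "$lockfile")
    kill "$pid" 2>/dev/null || return 0     # process already gone
    wait_for_exit "$pid" "${2:-60}"
}
```

A restart would then call something like
stop_nagios /usr/local/nagios/var/nagios.lock and only run the start
step if it returns 0. The timeout bounds the delay when a host check
is in progress, at the cost of having to decide what to do when it
expires.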
> The file gets
> deleted immediately upon receiving a SIGHUP/etc. to prevent it from
> staying around if Nagios has problems shutting down.
I see why, but I think it is probably better to leave the lock file
around if there was a problem shutting down, and handle the existence
of the lock file on startup.
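
As a rough illustration of handling the lock file on startup, here is
a sketch at the init-script level. It is not how Nagios itself
behaves. The function name check_lockfile is hypothetical, and it
assumes the lock file contains the daemon's PID (the 4-byte size in
the listing above is consistent with that). It cannot tell a reused
PID from the original daemon, so treat it as a starting point only.

```shell
# Hypothetical startup check. Returns 0 if it is safe to start,
# 1 if a live process still owns the lock file.
# A lock file whose owner is gone is treated as stale and removed.
check_lockfile() {
    lockfile=$1
    [ -e "$lockfile" ] || return 0          # no lock file: safe to start
    pid=$(cat "$lockfile" 2>/dev/null)
    case $pid in
        ''|*[!0-9]*)
            rm -f "$lockfile"               # empty or non-numeric: stale
            return 0
            ;;
    esac
    if kill -0 "$pid" 2>/dev/null; then
        echo "lock file $lockfile is held by running process $pid" >&2
        return 1
    fi
    rm -f "$lockfile"                       # owner has exited: stale
    return 0
}
```

With a check like this in the start step, a lock file left behind by
a failed shutdown blocks a second start only while the old process is
still alive.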
Ton
http://www.altinity.com
T: +44 (0)870 787 9243
F: +44 (0)845 280 1725
Skype: tonvoon