Bug report: nagios shutdown removing lock file too early
Ethan Galstad
nagios at nagios.org
Wed Jun 21 02:57:45 CEST 2006
Ton Voon wrote:
> On 19 Jun 2006, at 21:46, Ethan Galstad wrote:
>> Ton Voon wrote:
>>> Ethan,
>>>
[snip]
>
> I think the lockfile removal is the source of the "multiple Nagios
> processes running". The example daemon-init script uses the lockfile
> as the status of the process. If you were to do a restart, Nagios
> would complete the stop because the signal was sent, but Nagios would
> actually be in the process of shutting down. Meanwhile a start would
> run, so another Nagios process is kicked off. Then, as both Nagios
> processes are trying to access the same files, mayhem can ensue :)
>
> We've got our own startup script and we've changed the stop routine to
> wait until nagios has actually stopped before moving out of the stop
> function. Much more stable, but there's a long delay if Nagios is in
> the middle of a host check.
>
>> The file gets
>> deleted immediately upon receiving a SIGHUP/etc. to prevent it from
>> staying around if Nagios has problems shutting down.
>
> I see why, but I think it is probably better to leave the lock file
> around if there was a problem shutting down, and handle the existence
> of the lock file on startup.
>
> Ton
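For anyone wanting to try the approach Ton describes (have the stop routine wait until the process has really exited before returning), here is a minimal sketch in Python. It is an illustration only, not the shipped daemon-init script and not Ton's script; the names pid_alive and wait_for_exit, the timeout, and the poll interval are all made up for the example. It assumes a Unix-like system, where sending signal 0 only tests whether the process exists.

```python
import os
import time


def pid_alive(pid):
    """Return True if a process with this pid exists (Unix only)."""
    try:
        os.kill(pid, 0)  # signal 0: existence/permission check, nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def wait_for_exit(pid, timeout=60.0, poll=0.5, is_alive=pid_alive):
    """Block until the process is gone; return False if the timeout expires first."""
    deadline = time.monotonic() + timeout
    while is_alive(pid):
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
    return True
```

A stop routine would send the signal, call wait_for_exit() with the pid read from the lock file, and only let a subsequent start proceed once it returns True.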
From looking at the code, it looks like I intended to clean this up at
some point, but never did. main() in nagios.c deletes the lock file as
one of the last things it does before exiting, but the file was still
being prematurely removed in sighandler() in utils.c. I just
commented out the removal calls in sighandler(), so this should be fixed.
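To illustrate the ordering described above, here is a rough sketch in Python rather than the actual C code in nagios.c and utils.c. The names handle_signal, install_handlers, and run are hypothetical. The point is only that the signal handler records the request and leaves the lock file alone, while the main routine removes the file as its very last step.

```python
import os
import signal

shutdown_requested = False


def handle_signal(signum, frame):
    """Only record the shutdown request; do not touch the lock file here."""
    global shutdown_requested
    shutdown_requested = True


def install_handlers():
    """Register the handler for the usual stop signals (call from the main thread)."""
    for sig in (signal.SIGTERM, signal.SIGINT):
        signal.signal(sig, handle_signal)


def run(lock_path, work_step, cleanup):
    """Hypothetical main routine: the lock file exists until shutdown has finished."""
    global shutdown_requested
    shutdown_requested = False
    with open(lock_path, "w") as f:
        f.write("%d\n" % os.getpid())
    while not shutdown_requested:
        work_step()
    cleanup()             # the lock file is still present during cleanup
    os.remove(lock_path)  # removed as the last thing before returning
```

With this ordering, an init script that treats the lock file as the process status will not see "stopped" until the shutdown work is actually done.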
Also, I added some checks in base/checks.c to bail out of the host
check logic at reasonable points if a SIGHUP/SIGINT is encountered. A
stop/restart may still not be immediate, because the signal doesn't
interrupt a host check command that is already executing, but it should
prevent Nagios from re-checking a host (or propagating checks to
parents/children) when a signal is encountered.
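The idea of bailing out at safe points can be sketched roughly as follows. This is Python for illustration, not the code in base/checks.c, and check_host, run_check, max_attempts, and stop_requested are invented names. A check that is already running is allowed to finish; the loop only declines to start another attempt once a stop has been requested.

```python
def check_host(host, run_check, max_attempts, stop_requested):
    """Return (last_state, attempts_made) for a host.

    run_check(host) runs one check command to completion and returns a state
    string; stop_requested() reports whether a stop signal has been seen.
    """
    state = None
    attempts = 0
    while attempts < max_attempts:
        if stop_requested():
            break  # safe point: do not start a re-check during shutdown
        state = run_check(host)  # never interrupted once started
        attempts += 1
        if state == "UP":
            break  # no re-check needed
    return state, attempts
```

The same test would go in front of any step that propagates checks to parent or child hosts.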
I'll be posting the patches to CVS shortly, so if anyone has a chance to
test this, please let me know how it works. Thanks again Ton for the
heads up on this!
Ethan Galstad,
Nagios Developer
---
Email: nagios at nagios.org
Website: http://www.nagios.org