Fix for host dependency checks
Ethan Galstad
nagios at nagios.org
Wed Mar 22 02:17:58 CET 2006
On 22 Mar 2006 at 1:48, Holger Weiss wrote:
> * Ethan Galstad <nagios at nagios.org> [2006-03-21 12:50]:
> > On 24 Feb 2006 at 19:04, Holger Weiss wrote:
> > > * Holger Weiss <holger at CIS.FU-Berlin.DE> [2006-01-30 16:54]:
> > > > There is a timing problem in the host[*] dependency check logic:
> > > > If host B is configured to be dependent on host A being up and
> > > > host A goes down, the dependency will only fail if host A
> > > > "incidentally" was checked _prior_ to host B after going down.
> > > > Hence, the host dependency logic will sometimes work and
> > > > sometimes not. I'd therefore suggest to explicitly (re-)check
> > > > host A during the dependency checking for host B, as the
> > > > attached patch does.
> > >
> > > Okay, this introduces a new problem: If host B is checked
> > > immediately before and host A (during the dependency check) after
> > > a recovery of both hosts, the dependency won't fail. Hence,
> > > notifications for host B won't be suppressed (been there, got the
> > > t-shirt).
> > >
> > > Next try: The attached patch lets the dependency fail if either
> > > the current or the previous (hard) state of A matches the failure
> > > criteria. AFAICS, this should reliably suppress notifications for
> > > host B if the dependency fails.
> >
> > I'll keep this on the TODO list for Nagios 3.x, but I think it might
> > require some more thought. The last hard state of the host should
> > only be used in the dependency logic if a state change occurred
> > relatively recently. If, for example, the last hard state change
> > occurred two days ago, you don't want that value used in the logic.
>
> Okay, but the current Nagios code uses _only_ the last hard state (no
> matter how "old" it is), which is the reason why I've encountered the
> problem in the first place. I thought about checking the freshness of
> the last hard state myself (the information is available in the host
> struct already, so this would be easy), but then I omitted that since
> letting the dependency fail if either the current or the last hard
> state matches the criteria seemed sufficiently safe to me. This way,
> "false alarms" for the (dependent) host B should reliably be
> prevented, while the risk of suppressing legitimate notifications for
> B because the dependency fails due to an outdated last hard state of A
> is the same as with the current Nagios code. I believe that in
> practice, this risk is very low: I suppose that in almost all cases,
> the configured dependency criteria will be a down and/or unreachable
> state. So the risk would be that an outdated down or unreachable
> state lets the dependency fail, but down and unreachable states should
> normally be more or less up-to-date.
>
> In any case, many thanks for looking into this issue!
>
> Holger
>
Aha - I think we're using different terms. :-) The Nagios 2.x code
uses host->current_state in the dependency logic, but that's not
necessarily "current" in terms of time.
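
To make the rule under discussion concrete, here is a rough C sketch of
"fail the dependency if either the current or the last hard state of
host A matches the failure criteria", with an optional freshness limit
on the last hard state. It is only an illustration: the struct, the
enum values, the function names and the max_age parameter are all
made up for this example and are not the actual Nagios definitions or
the attached patch.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical state values; NOT the real Nagios constants. */
enum host_state { HOST_UP = 0, HOST_DOWN = 1, HOST_UNREACHABLE = 2 };

/* Hypothetical, reduced view of the host being depended upon (host A). */
struct host_view {
    enum host_state current_state;    /* result of the most recent check */
    enum host_state last_hard_state;  /* previous hard state */
    time_t last_hard_state_change;    /* when that hard state was recorded */
};

/* Hypothetical failure criteria of one host dependency. */
struct dep_criteria {
    bool fail_on_up;
    bool fail_on_down;
    bool fail_on_unreachable;
};

static bool state_matches(const struct dep_criteria *c, enum host_state s)
{
    switch (s) {
    case HOST_UP:          return c->fail_on_up;
    case HOST_DOWN:        return c->fail_on_down;
    case HOST_UNREACHABLE: return c->fail_on_unreachable;
    }
    return false;
}

/*
 * Returns true if the dependency should be treated as failed.
 * max_age is an assumed tuning knob in seconds: a value <= 0 means
 * "always consider the last hard state"; a positive value ignores a
 * last hard state that was recorded longer ago than max_age.
 */
bool dependency_failed(const struct host_view *master,
                       const struct dep_criteria *c,
                       time_t now, double max_age)
{
    if (state_matches(c, master->current_state))
        return true;
    if (max_age > 0 && difftime(now, master->last_hard_state_change) > max_age)
        return false;   /* last hard state is too old to be trusted */
    return state_matches(c, master->last_hard_state);
}
```

With max_age set to 0 this behaves like the "either state matches"
proposal; a positive max_age adds the "only if the state change was
relatively recent" restriction.
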
I made some major overhauls to the host check logic in the Nagios 3.x
CVS code. Those changes include parallel host checks and "predictive
dependency checks". The predictive checks idea came from your
earlier suggestion that all hosts that are depended upon for
notification be checked before the notification gets sent out.

Here's how the Nagios 3.x code does this: on the second-to-last host
check attempt (the one before the maximum number of attempts is
reached), Nagios will execute a parallel check of all hosts that are
being depended upon. In Nagios 3.x, host checks are no longer
performed immediately one after another, but at a retry_interval,
just as services are re-checked. That means that, theoretically, all
hosts that are being depended upon will have been checked before the
dependency logic is tested and a decision to notify is made.
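
As a rough illustration of the trigger described above (and not the
actual Nagios 3.x code), the following C sketch queues a check for
every host that is depended upon once the dependent host reaches its
second-to-last attempt. The struct, its fields and both function names
are hypothetical, and attempts are assumed to be numbered from 1.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced host record; NOT the real Nagios host struct. */
struct host_sketch {
    const char *name;
    int current_attempt;            /* assumed to be 1-based */
    int max_attempts;               /* maximum number of check attempts */
    struct host_sketch **masters;   /* hosts this host depends upon */
    size_t master_count;
    bool check_queued;              /* stands in for a real scheduling queue */
};

/* True on the attempt just before the final one. */
bool is_second_to_last_attempt(const struct host_sketch *h)
{
    return h->max_attempts > 1 && h->current_attempt == h->max_attempts - 1;
}

/*
 * Marks every depended-upon host for a check so that its state is known
 * before the final attempt of h decides about notifications.
 * Returns the number of checks newly queued.
 */
size_t queue_predictive_checks(struct host_sketch *h)
{
    size_t queued = 0;

    if (!is_second_to_last_attempt(h))
        return 0;

    for (size_t i = 0; i < h->master_count; i++) {
        if (!h->masters[i]->check_queued) {
            h->masters[i]->check_queued = true;
            queued++;
        }
    }
    return queued;
}
```

Because the final attempt only happens one retry_interval later, the
queued checks have time to finish before the dependency logic runs.
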
I'll be working on the doc updates for Nagios 3.x over the next 2-3
weeks. Once I'm done, I'll have folks try out the new code and see how
it performs. From my limited testing, things should be much snappier
with regard to host checks, both during outages and in general.
Ethan Galstad,
Nagios Developer
---
Email: nagios at nagios.org
Website: http://www.nagios.org