Multi-level service dependencies... not working
Tim Carr
tcarr at somanetworks.com
Tue Jun 17 16:01:35 CEST 2003
OK, I've sent this to the Master :)
> I must say that the setting I mentioned earlier has never let me down
> before. I guess that if you set your check interval of the service you
> are depending on small enough, you should be fine.
Unfortunately, I've always had that option enabled (soft-state
dependencies), so it's not fixing things for me.
> But it might be best to take this up with Ethan or anyone else who can
> shed some light...
Right, time to fill him in... Ethan, the problem I'm having is that even
with soft-state dependencies turned on, service dependencies aren't
being properly enforced.
For instance, if I have Nagios monitoring services A and B, with A
dependent upon B, and I shut down _both_ A and B, quite often what
happens is:
1. Nagios checks A first (ok)
2. determines it is down (ok)
3. immediately calls A's eventhandler (bad)
I realize that it can't know exactly when something goes down (nor the
order in which things went bad), but the behavior that I want is this
(well, this is the behavior I want *given* step 1; I realize sometimes
it will check B first):
1. Nagios checks A first
2. determines it is down
3. immediately actively re-checks B
4. (immediately actively re-checks any other services that A depends upon)
5. if all services that A depends-upon are OK, it calls A's
eventhandler; otherwise, it suspends checking/eventhandling for A until
this is so.
I realize that it's possible that A will be checked before B, but upon
finding error I want the services that A depends-upon checked right
away.
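
To make the desired behavior concrete, here is a minimal sketch in Python. It is not Nagios code and does not reflect how Nagios is implemented internally; the `Service` class, the `process` function, and the state constants are all hypothetical names made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Plugin-style state codes (assumed here for illustration only).
OK, CRITICAL = 0, 2


@dataclass
class Service:
    """Hypothetical stand-in for a monitored service; not a Nagios structure."""
    name: str
    check: Callable[[], int]        # runs an active check, returns a state code
    handler: Callable[[], None]     # the service's event handler
    depends_on: List["Service"] = field(default_factory=list)
    suspended: bool = False


def process(service: Service) -> str:
    """Check one service; if it is down, re-check its dependencies first."""
    # Steps 1-2: check the service and see whether it is down.
    if service.check() == OK:
        service.suspended = False
        return "ok"

    # Steps 3-4: actively re-check every service this one depends upon,
    # rather than trusting whatever state was recorded for it earlier.
    failed = [dep.name for dep in service.depends_on if dep.check() != OK]

    # Step 5: call the event handler only if all dependencies are OK;
    # otherwise suspend event handling for this service.
    if failed:
        service.suspended = True
        return "suspended"

    service.suspended = False
    service.handler()
    return "handled"
```

With A depending on B and both down, `process(a)` returns "suspended" and A's handler is never called; once B's check comes back OK, the next `process(a)` call runs the handler.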
Is this possible in Nagios 1.0/1.1? Is this planned for 2.x?
Stanley Hopcroft writes the following (figured I'd put this all into one
email):
--- begin stanley ---
(Stanley pastes some Nagios documentation here)
Point 1 says that _before_ a dependent service is checked, the
__current__ state of the dependency is checked and the action taken as
specified in the dependency def.
Therefore it seems to me that this allows the dependent service to
fail without Nagios following the dependency specification because the
dependent service has failed but has not been checked.
You could try and reduce or eliminate the impact of this by having the
rate of checks of the dependency greater than the dependent checks
(extra marks to relate the ratio of dependency to dependent checks to
the risk/probability of a false alarm).
--- end stanley ---
Stanley's suggestion is that by increasing the rate of checks, hopefully
one would catch the failed service that is depended-upon before catching
the service that depends upon it. This doesn't seem right to me, as
there will always be a margin of error no matter how you manipulate the
check rates. In my case, no margin of error is acceptable (seriously,
I'm not making this up; ask for more details if you want 'em).
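
A back-of-the-envelope sketch of why the margin never reaches zero, under some loud assumptions: both services fail at the same instant, each is checked at a fixed interval, and the time until each service's next check is uniformly random and independent of the other. Real Nagios scheduling (interleaving, retries, check latency) is more complicated than this, and the function name is hypothetical.

```python
def p_dependent_checked_first(dependent_interval: float,
                              dependency_interval: float) -> float:
    """Probability that the dependent service (A) is checked before the
    service it depends upon (B), after both fail at the same moment.

    Assumes the wait until A's next check is uniform on [0, dependent_interval]
    and the wait until B's next check is uniform on [0, dependency_interval],
    independently. This is a simplification, not Nagios's real scheduler.
    """
    a, b = dependent_interval, dependency_interval
    if a <= 0 or b <= 0:
        raise ValueError("check intervals must be positive")
    if b <= a:
        # B is checked at least as often as A: P(A first) = b / (2a)
        return b / (2 * a)
    # B is checked less often than A: P(A first) = 1 - a / (2b)
    return 1 - a / (2 * b)
```

Under these assumptions, checking B five times as often as A still leaves a 10% chance that A is checked first, and no finite ratio brings that to zero.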
The solution, it seems to me, is to have Nagios actively re-check the
service that is depended upon, instead of just using its current state
(as suggested above in the second ordered list).
Many thanks for your time,
Tim Carr