Why distinguish hosts from services?
Holger Weiss
holger at CIS.FU-Berlin.DE
Thu Aug 7 14:22:11 CEST 2008
* Andreas Ericsson <ae at op5.se> [2008-08-07 10:32]:
> Holger Weiss wrote:
> > Nagios implements a basic design decision I never quite understood: the
> > distinction between hosts and services. This distinction seems to add
> > quite a lot of complexity, such as duplicated code, four different types
> > of dependencies (parents, host dependencies, service dependencies, and
> > the implicit service->host dependencies), and so on. I don't really see
> > the gain over simply dealing with arbitrary "objects" and dependencies
> > between them, which would reduce complexity and provide more flexibility
> > (such as the possibility to let some service depend on a host it's not
> > running on, or the other way round).
> >
> > Note that I don't doubt the usefulness of syntactic configuration sugar,
> > such as the implicit service->host dependencies or the nice and simple
> > way of mapping the network topology using the "parents" directive. The
> > thing I don't really understand is why Nagios distinguishes hosts from
> > services internally (outside the configuration parser). However, I may
> > well be overlooking something, so I figured I'd ask what it is :-)
> >
> > In any case, giving up such basic distinction would of course require
> > dramatic changes to Nagios' core, I'm not seriously suggesting to do
> > something like that anytime soon (so this posting probably isn't very
> > constructive, sorry). I'm just asking out of curiosity.
>
> I believe it originated from the fact that object dependencies originally
> consisted almost solely of the implicit service->host dependencies, which
> came naturally from just thinking about the network in the first place.
>
> Anyway, I'm not convinced that re-arranging the dependency stuff will make
> things any easier. It's not exactly hard to do it properly in the nagios
> core today, and I'm having trouble imagining a simple enough config syntax
> without the host->parent dependency stuff. Have you thought anything about
> that? If so, what's your suggestion?

I wouldn't want to give up stuff like the "parents" directive or
implicit service->host dependencies. While I can imagine a syntax which
would give the user more control over their semantics, increasing the
user's flexibility isn't really my main point. My question is why
hosts, services, and the various dependency types are handled separately
in Nagios' core, as opposed to being mere syntactic sugar which the
configuration parser resolves into generic objects and dependencies.
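
To illustrate what I mean by "resolved by the configuration parser", here
is a rough sketch in Python (not Nagios' C, and not taken from the Nagios
source). All names, such as MonitoredObject and resolve(), are made up for
illustration. It also flattens "parents" into plain dependencies, which
ignores redundant paths.

```python
from dataclasses import dataclass, field


@dataclass
class MonitoredObject:
    """Hypothetical generic object; the core would not care whether it
    came from a host or a service definition."""
    name: str
    depends_on: set = field(default_factory=set)


def resolve(hosts, services, host_deps=(), service_deps=()):
    """Flatten the four dependency kinds into one name -> object map.

    hosts:        {host_name: [parent_name, ...]}
    services:     [(host_name, service_description), ...]
    host_deps:    [(dependent_host, master_host), ...]
    service_deps: [((host, svc), (master_host, master_svc)), ...]
    """
    objects = {}
    # "parents" become ordinary dependencies (simplification, see above).
    for host, parents in hosts.items():
        objects[host] = MonitoredObject(host, set(parents))
    # The implicit service->host dependency is made explicit.
    for host, svc in services:
        name = f"{host}/{svc}"
        objects[name] = MonitoredObject(name, {host})
    for dependent, master in host_deps:
        objects[dependent].depends_on.add(master)
    for (dep_host, dep_svc), (mst_host, mst_svc) in service_deps:
        objects[f"{dep_host}/{dep_svc}"].depends_on.add(f"{mst_host}/{mst_svc}")
    return objects
```

After this step, the rest of the code would only ever see objects and
their depends_on sets, whatever the configuration keyword was.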

This question first came to my mind when I stumbled over the issues with
Nagios 2.x's host check logic and some problems with host dependencies.
While thinking about how they should be fixed, I noticed that the
service check and dependency logic already works quite well, and as I
couldn't really see the inherent difference between host and service
objects, I wondered whether the separate logic for hosts could maybe
just be dropped in favour of a generic logic for all monitored objects
and their dependencies. (IIRC, there even was some project which
suggested avoiding host checks entirely by replacing them with service
checks/dependencies?) Anyway, with Nagios 3.x, these issues are mostly
solved, so had that been the first Nagios release I used, I might never
have thought about it :-)

However, my (naive?) thought would still be that dealing with generic
objects and dependencies between them could significantly reduce
complexity and duplication of code. Nagios' core includes loads of
host_foo() and service_foo() functions which do similar stuff (or
different stuff, but I've yet to see a case where I really understand
why the difference is necessary), and it includes separate code for the
different dependency types.

To give a concrete example of a problem I still have with Nagios 3.x
which gives me the feeling that these distinctions sometimes complicate
things unnecessarily:

We use separate host definitions for separate interfaces (so for us, the
"host" keyword should really be named "interface" ;-]). For each system,
there's a "primary" interface which all other interfaces depend on via
host dependencies. Now, for example, if we upgrade a system, we'd like
to just schedule a downtime for the primary interface to make sure that
no host or service notifications are generated at all. If we just reboot
the system, things work as expected. But during an upgrade, some
services will usually go into a hard problem state while the system is
still UP. In this case, only the notifications for the services running
on the primary interface are suppressed, because Nagios suppresses
service notifications if the host the service runs on is in a downtime,
but not if only a host which that host depends on is in a downtime.

Similar problems can occur with parents: if a parent is in a downtime,
but the parent's host check returns UP because the parent still answers
pings although it has already stopped routing, notifications for the
children won't be suppressed. The same goes for service dependencies
(though that's maybe less likely): if the dependent-upon service is in a
downtime and the dependent service is stopped before the dependent-upon
service is stopped, notifications for the dependent service won't be
suppressed.

Apart from that, it would be nice if objects which directly or
indirectly depend on an object which is in a downtime also had some
"downtime" status flag set, so that tools such as the web interface
could easily mark them as such. But that's just cosmetic.
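
With generic objects, such a flag could be computed in one pass over the
dependency graph. A rough Python sketch, assuming a plain mapping from
object names to the names they depend on (effective_downtime() and the
data layout are hypothetical, nothing Nagios provides):

```python
from collections import deque


def effective_downtime(depends_on, in_downtime):
    """Return every object that is in a downtime itself or depends,
    directly or indirectly, on an object that is.

    depends_on:  {name: set of names it depends on}
    in_downtime: set of names with a scheduled downtime
    """
    # Invert the edges so we can walk from a downtimed object
    # to the objects that depend on it.
    dependents = {}
    for name, masters in depends_on.items():
        for master in masters:
            dependents.setdefault(master, set()).add(name)

    flagged = set(in_downtime)
    queue = deque(in_downtime)
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, ()):
            if child not in flagged:  # visit each object only once
                flagged.add(child)
                queue.append(child)
    return flagged
```

For our setup, putting only the primary interface into a downtime would
then flag the other interfaces and all services on any of them, while
unrelated objects stay untouched.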

To fix such problems once and for all, I'd have to implement separate
logic in several places in the code: (1) don't notify on a host if a
directly or indirectly dependent-upon host is in a downtime; (2) don't
notify on the services running on this host; (3) don't notify on a
service if a directly or indirectly dependent-upon service is in a
downtime; (4) don't notify on a host if a direct or indirect parent is
in a downtime (with redundant paths accounted for); (5) maybe don't
notify on the services running on this host, either, just to make sure.

My dream is that with generic object types and dependencies, I could
implement a recursive check for downtimes of dependent-upon objects in a
single place in the code and be done with it, which would be much
simpler and less error-prone.
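
Roughly what I have in mind, as a Python sketch rather than a patch. The
data layout ("requires" for ordinary dependencies, "any_of" for redundant
parents) and the function names are my own assumptions, not anything in
Nagios:

```python
def covered_by_downtime(name, objects, in_downtime, _seen=frozenset()):
    """True if `name` is in a downtime or is cut off by downtimes upstream.

    objects: {name: {"requires": set, "any_of": set}}
      requires: dependencies which must each be available (host/service
                dependencies, the implicit service->host link)
      any_of:   alternative paths such as redundant parents; the object
                counts as cut off only if every alternative is covered
    """
    if name in in_downtime:
        return True
    if name in _seen:  # guard against dependency cycles
        return False
    seen = _seen | {name}
    obj = objects[name]
    # One covered hard dependency is enough.
    if any(covered_by_downtime(dep, objects, in_downtime, seen)
           for dep in obj["requires"]):
        return True
    # Redundant paths: all of them must be covered.
    parents = obj["any_of"]
    return bool(parents) and all(
        covered_by_downtime(p, objects, in_downtime, seen) for p in parents)


def should_notify(name, objects, in_downtime):
    """Single place where downtime-based suppression would be decided."""
    return not covered_by_downtime(name, objects, in_downtime)
```

Cases (1) to (5) above would then all go through should_notify(), and a
host with two parents would only be silenced if both are in a downtime.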

Hol-"you may say, I'm a dreamer"-ger