Why distinguish hosts from services?

Holger Weiss holger at CIS.FU-Berlin.DE
Sat Aug 9 16:22:18 CEST 2008


* Andreas Ericsson <ae at op5.se> [2008-08-09 14:35]:
> Holger Weiss wrote:
> > We use separate host definitions for separate interfaces (so for us, the
> > "host" keyword should really be named "interface" ;-]).  For each host,
> > there's a "primary" interface which all other interfaces depend on using
> > host dependencies.  Now, for example, if we upgrade a system, we'd like
> > to just specify a downtime for the primary interface to make sure that
> > no host or service notifications will be generated whatsoever.  If we
> > just reboot the host, things work as expected.  But during an upgrade,
> > some services will usually go into a hard problem state while the system
> > is still UP.  In this case, only the notifications for the services
> > running on the primary interface will be suppressed, because Nagios does
> > suppress service notifications if the host the service runs on is in a
> > downtime, but not if only a host this host depends on is in a downtime.
> > 
> > Similar problems can occur with parents: if a parent is in a downtime,
> > but the parent's host check returns an UP because the parent still pings
> > although it has already stopped routing, notifications for the children
> > won't be suppressed.  Or for service dependencies (though maybe less
> > likely): if the dependent-upon service is in a downtime and the
> > dependent service is stopped before the dependent-upon service is
> > stopped, notifications for the dependent service won't be suppressed.
> > 
> > Apart from that, it would be nice if objects which directly or
> > indirectly depend on an object which is in a downtime would also have
> > some "downtime" status flag set, so that tools such as the web interface
> > could easily mark them as such.  But that's just cosmetic.
> > 
> > To fix such problems once and for all, I'd have to implement various
> > logics at different places in the code: (1) don't notify on a host if a
> > directly or indirectly dependent-upon host is in a downtime; (2) don't
> > notify on the services running on this host; (3) don't notify on a
> > service if a directly or indirectly dependent-upon service is in a
> > downtime; (4) don't notify on a host if a direct or indirect parent is
> > in a downtime (with redundant paths accounted for); (5) maybe don't
> > notify on the services running on this host, either, just to make sure.
> > My dream is that with generic object types and dependencies, I could
> > implement a recursive check for downtimes of dependent-upon objects at a
> > single place in the code and be done with it, which would be much
> > simpler and less error-prone.
>
> A much simpler way of doing it is to set the "notification_options" field
> in the host and service-objects to flags (well, everything that could be
> flags should be flags, really), then it becomes a matter of doing bitfield
> comparisons to see if a notification should be suppressed or not,
> regardless of which type of object it is.

If it were done this way, I'd still have to implement the various checks
I mentioned in order to set the "dependent-upon object is in a downtime"
flag.  So, while your suggestion would save some memory and allow for
using generic macros to compare the current state of an object with the
configured notification_options, it wouldn't really solve my problem.
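
To make the "single place in the code" idea a bit more concrete, here is
a rough sketch of the recursive check I have in mind.  Everything in it
is hypothetical: struct mon_object, its fields and downtime_applies()
are names made up for illustration, not existing Nagios code.  It also
treats parents like any other dependency, so redundant paths are not
accounted for here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical generic object: stands in for today's hosts and services. */
struct mon_object {
    const char *name;
    bool in_downtime;           /* a scheduled downtime is currently active */
    struct mon_object **deps;   /* objects this one directly depends on */
    size_t n_deps;
    bool visiting;              /* guard against dependency loops */
};

/*
 * Return true if obj itself, or any object it depends on directly or
 * indirectly, is in a downtime.
 */
static bool downtime_applies(struct mon_object *obj)
{
    assert(obj != NULL);

    if (obj->in_downtime)
        return true;
    if (obj->visiting)          /* already on the current path: a loop */
        return false;

    obj->visiting = true;
    bool found = false;
    for (size_t i = 0; i < obj->n_deps && !found; i++)
        found = downtime_applies(obj->deps[i]);
    obj->visiting = false;

    return found;
}
```

The notification code would then ask downtime_applies() once per
object, whatever its type, instead of carrying five separate checks.
The same result could also feed a "downtime" status flag for the web
interface.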

> One trouble is that to make this generic regardless of which type of object
> you're checking it against means both hosts and services would need to
> understand the same sort of check results

Yes, I just fail to see the trouble.

> as well as the same kind of notification options and everything that
> gets affected by such things

Same here.

> the data structs for both types of objects would need to be identical, which
> would waste memory on a O(n) scale, rather than the fixed-price overhead of
> almost duplicating some of the code.
> 
> Now consider this instead:
> if ((host->notification_options & contact->notification_options) & (1 << host->status))
> 	send_notification;
> 
> And then think you've got a macro for it, which goes like this:
> #define should_notify(obj, contact) \
> 	(((obj)->notification_options & (contact)->notification_options) & (1 << (obj)->status))
> 
> which means you can get the best of both worlds for the things that are
> actually the same (or at least similar enough), while maintaining the
> implicit dependencies without wasting memory in such a horrible
> non-scalable way.

Your argument depends on the assumption that there's some inherent
difference between host and service objects.  If this is true, then
memory would indeed be wasted by putting object-type-specific data into
generic data structures.  However, my question was specifically whether
this assumption actually holds.  I know that Nagios currently believes
that only a service can be in a WARNING or UNKNOWN state and that only a
host can be in an UNREACHABLE state.  So far, I'm not convinced these
dogmas are true :-)
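
As a purely illustrative sketch of what dropping that assumption could
look like: one state set shared by all objects, with your bitfield
comparison working unchanged on top of it.  The enum, the struct names
and state_bit() are invented for this example and don't exist in Nagios.

```c
#include <assert.h>

/* Hypothetical unified state set: any object may be in any of these. */
enum obj_state { ST_OK, ST_WARNING, ST_CRITICAL, ST_UNKNOWN, ST_UNREACHABLE };

/* Map a state to its bit in a notification_options mask. */
static inline unsigned state_bit(enum obj_state s)
{
    assert(s <= ST_UNREACHABLE);
    return 1u << s;
}

/* Hypothetical generic object: no host- or service-specific fields. */
struct generic_object {
    unsigned notification_options;  /* OR of state_bit() values */
    enum obj_state status;
};

struct contact {
    unsigned notification_options;  /* OR of state_bit() values */
};

/* Notify only if both sides opted in for the object's current state. */
#define should_notify(obj, contact) \
    ((((obj)->notification_options & (contact)->notification_options) \
      & state_bit((obj)->status)) != 0)
```

With this layout, an UNREACHABLE "service" or a WARNING "host" needs no
special casing and no extra per-object memory; whether such states are
meaningful in practice is of course exactly the open question.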

Holger
