Why separate hosts and services
Andreas Ericsson
ae at op5.se
Thu Apr 15 22:07:26 CEST 2004
Chris Wilson wrote:
> Hi Andreas,
>
>
>>I can think of at least two good reasons.
>>
>>1) Problem localisation. When a service fails, someone has to fix it. If
>>they don't know what machine it's on, the purpose of a monitoring system
>>is soundly defeated.
>>
>>Of course, you could type in the host_address and host_alias in every
>>service description, but keeping things the way they are really saves a
>>lot of typing compared to that.
>
>
> OK, that's a good point, but it could also be handled by inheriting
> hostname from service to dependent service, unless overridden by the
> dependent service.
>
Not a very good idea, since many service dependencies span several
hosts (switch interface operability connects to db loadbalancer, which
connects to database servers).
> Another way would be to report the "path" through the "service tree" to
> the failed service in the notification message. This might actually help
> fault diagnosis. For example, if you receive separate notifications that 4
> machines behind the same router have gone down at the same time, then you
> might assume that the router might be at fault.
>
Great idea. Adding a $PARENTS$ macro would accomplish this easily,
without modifying any core logic.
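
To illustrate the kind of information a parents-based "path" could carry,
here is a minimal sketch in Python. It is not Nagios code: the host names,
the PARENTS mapping and both function names are hypothetical, and the
mapping only mimics what the 'parents' variable in host definitions
expresses. It follows the first parent of each host up to the top, and finds
the nearest host shared by the paths of several failed hosts.

```python
# Hypothetical host -> parents mapping (an assumption for illustration),
# mirroring what the 'parents' variable in host definitions expresses.
PARENTS = {
    "router1": [],
    "switch1": ["router1"],
    "web1": ["switch1"],
    "web2": ["switch1"],
    "db1": ["switch1"],
}


def parent_path(host, parents=PARENTS):
    """Return the chain from host to the top, following the first parent."""
    path = [host]
    seen = {host}
    while parents.get(path[-1]):
        nxt = parents[path[-1]][0]
        if nxt in seen:  # guard against loops in a misconfigured mapping
            break
        path.append(nxt)
        seen.add(nxt)
    return path


def common_ancestor(hosts, parents=PARENTS):
    """Return the nearest host found on every given host's path, or None."""
    paths = [parent_path(h, parents)[1:] for h in hosts]
    if not paths:
        return None
    for candidate in paths[0]:
        if all(candidate in p for p in paths):
            return candidate
    return None


if __name__ == "__main__":
    print(" -> ".join(parent_path("web1")))
    print(common_ancestor(["web1", "web2", "db1"]))
```

With several hosts down at once, the shared ancestor is the first place to
look, which is the diagnosis Chris describes doing by hand.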
> At the moment, with the current notification architecture, I don't think
> you can have enough information to do that, without looking at the status
> CGIs or knowing from memory that the hosts are all behind the same router
> (which doesn't scale well :-)
>
In larger networks there are usually different people handling different
parts of it, and with a proper naming-standard (with a little help from
the 'alias' variable in the host object definition), this has never been
a problem for any of our customers. Some of them have really huge networks.
>
>>2) Notification suppression. If a service fails, Nagios immediately
>>checks if the host is down. If it is, no more service checks will be
>>scheduled until the host pops back up.
>
>
> But we already do the same thing for dependent services, don't we? I don't
> understand why the logic is different, and why they can't be combined into
> a single, simple if-down-then-check-parent-service algorithm.
>
Check out the 'parents' variable in the host object definition.
>
>>Check out (host and service) dependencies. It's all properly documented.
>
>
> To my mind, service dependency is not the same as meta-services (which is
> what I'm talking about).
>
> For example, let's assume we have three services, A, B and C. A is a
> meta-service, and B and C "depend" on it. A does not have any check of its
> own; its state is entirely determined from the states of its dependent
> services. If B and C both fail, then A is determined to have failed, and
> not otherwise.
>
This can be done today, using service dependencies.
> This is not the same as B and C both depending on A, because if B and C
> both fail, then how does one make A fail automatically in Nagios? I don't
> think it's possible, do you?
Yes. What you're talking about is modifications to the core logic.
Having plugins checking this would be 'the long way around'.
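
For reference, the aggregation rule described above (A fails only when all
of its dependent services fail) is small when written out. The sketch below
is an illustration in Python, not part of Nagios: the function name
meta_state is hypothetical, "failed" is assumed to mean CRITICAL, and how
the child states are obtained is left out entirely. The numeric values are
the standard plugin return codes.

```python
# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def meta_state(child_states):
    """Derive a meta-service state from the states of its children.

    Assumption: the meta-service has failed only if every child is
    CRITICAL; with no children the state cannot be determined.
    """
    child_states = list(child_states)
    if not child_states:
        return UNKNOWN
    if all(state == CRITICAL for state in child_states):
        return CRITICAL
    return OK


if __name__ == "__main__":
    import sys

    # Usage sketch: pass child states as numeric arguments, e.g. "2 2".
    states = [int(arg) for arg in sys.argv[1:]]
    sys.exit(meta_state(states))
```

Run as a plugin, the exit status would carry the derived state, so the hard
part remains getting timely child states, not combining them.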
> I guess it might involve writing a plugin to
> check the status of all children, and I don't know if Nagios would update
> the status.sav quickly enough that we would be able to determine this
> reliably in the parent check. Do you know if it does?
>
status.sav is the default state retention file, so we can't even count
on it being there. status.log gets updated about 1 second after a state
changes and should be more interesting for something like this.
> Besides which, we would have to parse both the configuration files and
> status.sav to determine this, and neither of those is easy to do.
Not a problem, really. Especially considering the fact that all the code
to do both is right under the nose of anybody who cares to download the
sources.
>
> Cheers, Chris.
--
Mvh / Best Regards
Sourcerer / Andreas Ericsson
OP5 AB
+46 (0)733 709032
andreas.ericsson at op5.se