Monitoring cross-server services?
Steven Grimm
koreth-nagios at midwinter.com
Wed Jan 22 06:56:12 CET 2003
Ran a quick test with service dependencies as you suggested, and I'm now
convinced they aren't sufficient for monitoring a peer-to-peer application.
And really there's no way they can be, because they don't take the details
of error reports into account.
Here's what happens with service dependencies, again using hosts A, B, and
C which are all connected to each other. I kill the app on host B as you
describe, and Nagios notifies me that it's died. Host A reports an error
condition because it isn't connected to host B, and thanks to the service
dependency, that error doesn't cause a second notification. So far so good.
Now I tweak host A so the app there can't reach its counterpart on host C,
a condition which *should* trigger a notification. (The monitoring host
can still reach the app on host C.) Host A reports that it can't reach
hosts B or C. And notification of that error gets suppressed by the
dependency on host B's service, which is still down. Nagios knows that
host A's service is broken and sees that a depended-on service isn't
running, so it suppresses the notification without regard to
*why* the failure is happening.
Hope that example makes more sense than my previous ones.
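To make the failure mode concrete, here is a deliberately simplified model
of dependency-based suppression. It is a sketch of the behavior I observed,
not Nagios's actual notification logic, and the function name and arguments
are made up for illustration.

```python
# Simplified model of dependency-based suppression (an assumption for
# illustration; this is not how Nagios is implemented internally).
def should_notify(service_ok, depended_on_ok):
    """Return True if a notification would go out for a dependent service.

    service_ok:     True if the dependent service's own check passed.
    depended_on_ok: list of booleans, one per service it depends on.
    """
    if service_ok:
        return False  # nothing to report
    # Any failed depended-on service suppresses the notification. The
    # text of the dependent service's error is never looked at.
    return all(depended_on_ok)


# Host A fails only because it cannot reach B, and B is known to be down.
print(should_notify(service_ok=False, depended_on_ok=[False]))  # False

# Host A fails because it cannot reach B *or* C. The inputs are identical,
# so the result is too, even though losing C is a new problem.
print(should_notify(service_ok=False, depended_on_ok=[False]))  # False
```

Both calls get the same inputs, so the second failure can't be told apart
from the first.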
What I need here is a "service" that's really a comparison between Nagios'
view of the current state of the world (whether the P2P app looks alive
on all peers from the point of view of the monitoring host) and a plugin's
view of the state of the world (whether the app looks alive on all peers
from the point of view of the particular host being checked). If the
plugin and Nagios agree about what's up and what's down, it's not a
failure, but any discrepancy between those two views of the world *does*
indicate a problem.
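A minimal sketch of that comparison, assuming both views have already been
collected into dictionaries mapping peer name to up/down. The function names
and data layout are hypothetical; the only Nagios-specific detail is the
standard plugin exit codes (0 for OK, 2 for CRITICAL).

```python
def compare_views(monitor_view, host_view):
    """Return the peers whose up/down state differs between the two views.

    Both arguments map peer name -> True (up) or False (down). A peer
    missing from one view shows up as None and counts as a discrepancy.
    """
    discrepancies = {}
    for peer in sorted(set(monitor_view) | set(host_view)):
        seen_by_monitor = monitor_view.get(peer)
        seen_by_host = host_view.get(peer)
        if seen_by_monitor != seen_by_host:
            discrepancies[peer] = (seen_by_monitor, seen_by_host)
    return discrepancies


def check_status(monitor_view, host_view):
    """Return (exit_code, message) in the style of a Nagios plugin."""
    diffs = compare_views(monitor_view, host_view)
    if not diffs:
        return 0, "OK: host agrees with monitor about all peers"
    return 2, "CRITICAL: views disagree about " + ", ".join(diffs)


# The monitoring host sees B down and C up.
monitor = {"B": False, "C": True}

# Host A can't reach B either: the views agree, so no alarm.
print(check_status(monitor, {"B": False, "C": True}))

# Host A can't reach B or C: the views disagree about C, so alarm.
print(check_status(monitor, {"B": False, "C": False}))
```

The point of keying on disagreement rather than on failure is that host B
being down everywhere stays quiet, while host A losing C gets reported even
while B is still down.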
Even setting aside my particular setup, that ability would be of value on
large networks with complex routing, anywhere it's possible for hosts to
lose connectivity to each other while remaining reachable from the
monitoring host.
Like I said in my original message, I can work around this by parsing
the status file myself, which is not a big problem. Once it's in a presentable
state I'll post my workaround, which this discussion has convinced me
to make a bit more general-purpose than I'd originally planned.
-Steve
On Tue, Jan 21, 2003 at 11:18:33AM -0600, Carroll, Jim P [Contractor] wrote:
> I've got quite a number of service dependencies defined, and they add
> absolutely nothing to the service detail page. Basically I set up a
> rudimentary check for NRPE ('echo "NRPE is OK"' in nrpe.cfg), and made all
> the other NRPE checks for that host dependent on the rudimentary check. If
> NRPE is down, I want *one* page, not however many NRPE checks I'm doing.
> Ordinarily I wouldn't expect NRPE to be down (since it's kicked off from
> (x)inetd), but if an admin rebuilds a host or makes some other unfortunate
> change to (x)inetd, we don't want to be flooded with notifications; a simple
> "um... excuse me? NRPE doesn't seem to be up" is just fine.
>
> My case differs from your case in that my dependencies occur on the same
> host, and yours occur on different hosts. Having said that, if I
> acknowledge that NRPE is down (the depended-on service), that doesn't
> automatically flag any other services, or the host itself for that matter,
> as being down/acknowledged/ignored.
>
> If you're still uncertain, your best bet is to create a trivial case on 2 or
> 3 of your lesser hosts. Use netcat to listen on some arbitrary ports, and
> have Nagios poke at those 'services'. Then kill netcat on the 'depended-on'
> host. Wait for Nagios to notify you. Acknowledge it. Kill netcat on one
> of the 'dependent' hosts. Wait for Nagios to notify you. And wait and
> wait, because you shouldn't hear a peep. Bring netcat back up on the first
> host. Eventually you should get a notification that the 'service' on the
> second host is down.
>
> In this scenario, any other services should be completely independent;
> Nagios should still notify you if one of those goes down. (Feel free to
> test this in whatever permutations/combinations with this scenario, as
> well.)
>
> HTH.
>
> jc