Host Schedule Queuing Issue Nagios 1.2
Shawn Iverson
shawn at nccsc.k12.in.us
Wed Mar 16 16:44:33 CET 2005
Please see the postings linked at the end of this message for
reference, and please read carefully if this topic is of interest to you.
Nagios 1.2 does not seem to reorder its queue intelligently, or spawn a
new queue, when it discovers that a host is apparently down. I would
expect it to do so in order to confirm that a device between nagios and
the host is not the real reason the host cannot be reached.
What I see happening instead is that nagios continues to follow its
predetermined queue and will only mark child hosts as unreachable after
it has already found a parent host down in that queue. The farther
down the queue the parent host is (or the actual problem, for that
matter), the longer it takes for nagios to realize that the downed child
host is actually unreachable instead of down, and the more DOWN alerts a
person will receive for any other child hosts that should be unreachable
instead of down. The reason is that these hosts precede the
parent host check in the queue. This scenario doesn't always happen, of
course, because it depends on the order of hosts in the queue, but I see
that the larger the network and the wider the outage, the worse this
problem becomes.
Here's a simplified example.
Say I have nagios connected to a switch called switch A; a router called
router A with switch A as its parent; a second router called router B
with router A as its parent; a switch called switch B with router B as
its parent; and finally a host with switch B as its parent.
Say that router A goes down, and the queue looks like this at that time
(almost a worst case scenario):
host
router B
switch B
router A
switch A
This is the order of events that seems to occur:
1) A DOWN alert is sent for the host (because it precedes all other
hosts in the queue).
2) A DOWN alert is sent for router B (because it precedes router A in
the queue).
3) An UNREACHABLE alert is sent for switch B (because router B precedes
it in the queue and has already been found down).
4) A DOWN alert is sent for router A (because it precedes switch A in
the queue).
5) switch A is ok. No alert.
Nagios does revise the DOWN events for the host and router B but it does
so only after literally stumbling across the source of the problem.
Here is the sequence of events I would prefer:
Router A goes down, and nagios has the same queue as above.
1) nagios checks the host and finds it apparently DOWN.
2) Instead of sending an alert immediately, nagios temporarily suspends
its queue (or reorders it)
3) nagios then checks each host along the dependency tree
4) nagios discovers that router A is actually DOWN and sends an alert
5) the host is revised to UNREACHABLE status
6) all other hosts behind router A become marked as UNREACHABLE
7) nagios resumes its normal queue, perhaps revising it to exclude the
checks it just performed (unless it was reordered)
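The steps above could be sketched roughly as follows. This is a
minimal illustration in Python under stated assumptions, not a patch
for nagios: the PARENT table, responds(), is_behind() and
run_queue_with_parent_walk() are all hypothetical names, each device is
assumed to have a single parent, and a check is assumed to fail
whenever the device or any of its ancestors is down.

```python
# Hypothetical model of the example network: child -> parent
# (None means the device is attached directly to nagios).
PARENT = {
    "switch A": None,
    "router A": "switch A",
    "router B": "router A",
    "switch B": "router B",
    "host": "switch B",
}


def responds(name, failed):
    """A device answers a check only if it and every ancestor are up."""
    while name is not None:
        if name in failed:
            return False
        name = PARENT[name]
    return True


def is_behind(name, ancestor):
    """True if `ancestor` lies on the path between nagios and `name`."""
    node = PARENT[name]
    while node is not None:
        if node == ancestor:
            return True
        node = PARENT[node]
    return False


def run_queue_with_parent_walk(queue, failed):
    state, alerts = {}, []
    for name in queue:
        if name in state:
            continue  # step 7: already classified during a parent walk
        if responds(name, failed):
            state[name] = "UP"
            continue
        # Steps 2-3: suspend the queue and check each parent in turn,
        # moving toward nagios until one of them answers.
        culprit, node = name, PARENT[name]
        while node is not None and not responds(node, failed):
            culprit, node = node, PARENT[node]
        # The parent that answered, and everything above it, must be up.
        while node is not None:
            state[node] = "UP"
            node = PARENT[node]
        # Step 4: the failing device closest to nagios is the one DOWN.
        state[culprit] = "DOWN"
        alerts.append((culprit, "DOWN"))
        # Steps 5-6: everything behind it is UNREACHABLE, not DOWN.
        for other in PARENT:
            if is_behind(other, culprit):
                state[other] = "UNREACHABLE"
                alerts.append((other, "UNREACHABLE"))
    return state, alerts


queue = ["host", "router B", "switch B", "router A", "switch A"]
state, alerts = run_queue_with_parent_walk(queue, {"router A"})
# alerts: one DOWN for router A, then UNREACHABLE for router B,
#         switch B and the host
```

With the same queue and the same failure, this model produces a single
DOWN alert regardless of where router A sits in the queue.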
If the latter scenario occurred, I would receive only one DOWN alert and
three UNREACHABLE alerts, instead of three DOWN alerts and one
UNREACHABLE alert. If I have nagios configured to send email or other
notifications only for DOWN alerts, I would prefer the latter scenario
so that I receive only one alert instead of three.
If nagios 2.0 already has this feature, then my apologies for bringing
this issue up. If this scenario is occurring for another reason and
nagios is not supposed to function this way, then I invite someone to
contact me and help me resolve the issue. Otherwise, I would like to
discuss it with anyone interested and possibly request it as a feature
in future releases of nagios.
References:
https://sourceforge.net/mailarchive/message.php?msg_id=11127907
https://sourceforge.net/mailarchive/message.php?msg_id=11128758
https://sourceforge.net/mailarchive/message.php?msg_id=11132179
--
Shawn Iverson
Technology Associate
MCP W2K3S and W2KP, Linux+, Network+, A+
New Castle Community School Corporation
shawn at nccsc.k12.in.us