Hierarchical host schedule queuing

Shawn Iverson shawn at nccsc.k12.in.us
Fri Mar 11 01:38:33 CET 2005


Greetings!

While simulating a network failure to test my nagios setup, I noticed
that nagios (using version 1.2) does not proceed hierarchically to check
upstream hosts when it concludes that a host is down hard.

For example, take this theoretical scenario.  My nagios is in a network
of five hundred monitored nodes, spread across many subnets and WAN
links, and nagios is connected to the core router.  A peer router
(Router A) goes down one hop away from the core router.  Many more
routers and devices that are being monitored are downstream from this
router at various hops.  Nagios discovers that it cannot reach Server A
located 3 hops away on the other side of the downed router, so it sends
a notification of the server being down.

What actually happens next is that nagios continues down its predetermined
scheduling queue, finding more devices down (or unreachable, if it has
found it to be behind a downed host it has already discovered) behind
this router.  Note that it has not yet discovered the router itself, and
the source of the problem, to be down because it is much further down in
the scheduling queue in this scenario.

What would perhaps be more efficient in terms of outage discovery
is the following.  Server A is discovered to be down, but nagios
withholds sending an alert for the moment.  It halts its normal
scheduling queue and begins a temporary hierarchical scheduling queue,
scheduling the hosts between nagios and the suspect server, starting
with the closest one and ending with the farthest one and taking into
account redundant links.  It then processes this queue, discovers that
Router A is the actual problem, sees that no other path exists to Server
A, sends an alert for Router A, and revises the alert for Server A as
unreachable.  It then revises its normal queue to exclude the hosts just
checked hierarchically and proceeds normally.  Anything else found
behind Router A is henceforth correctly marked as unreachable.  In fact,
everything can logically be determined to be unreachable behind Router A
after such a test and can then be updated instantly.
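
To make the proposal concrete, here is a minimal sketch of the idea in
Python.  It is not nagios code.  The names classify, check and the
parents mapping are hypothetical, the host check is simulated rather
than a real ping, and the sketch assumes the parent relationships
contain no loops.  Each host lists its upstream hosts (more than one
entry models a redundant link), and the walk resolves every upstream
host before the host that depends on it.

```python
def classify(parents, check, monitor, suspect):
    """Resolve the hosts between the monitor and a suspect host.

    parents: dict mapping a host to the list of its upstream hosts
             (hypothetical structure; several entries = redundant links).
    check:   callable(host) -> bool, a stand-in for a real host check.
    suspect: the host whose check has already failed.
    Returns a dict of host -> "UP", "DOWN" or "UNREACHABLE", filled in
    the order the hosts were resolved (upstream hosts first).
    Assumes the parent graph has no cycles.
    """
    state = {monitor: "UP"}  # the monitoring box itself is taken as up

    def resolve(host):
        if host in state:
            return state[host]
        # Settle every upstream host before judging this one.
        parent_states = [resolve(p) for p in parents.get(host, [])]
        if "UP" not in parent_states:
            # No working path from the monitor: no check is needed.
            state[host] = "UNREACHABLE"
        elif host == suspect or not check(host):
            # A working path exists, so the failure is the host's own.
            state[host] = "DOWN"
        else:
            state[host] = "UP"
        return state[host]

    resolve(suspect)
    del state[monitor]
    return state


# Simulated topology: nagios - core - router_a - router_b - server_a
parents = {
    "core": ["nagios"],
    "router_a": ["core"],
    "router_b": ["router_a"],
    "server_a": ["router_b"],
}
down = {"router_a"}  # simulated outage


def simulated_check(host):
    return host not in down


result = classify(parents, simulated_check, "nagios", "server_a")
print(result)
# {'core': 'UP', 'router_a': 'DOWN', 'router_b': 'UNREACHABLE',
#  'server_a': 'UNREACHABLE'}
```

In this simulated run only core and router_a are actually checked.
Everything behind router_a is marked unreachable without a further
check, and only router_a would warrant a down alert.  If router_b had
a second upstream host that was still up, server_a would be reported
as down rather than unreachable.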

In my real network, I set up nagios to send me alerts only on hard
downed hosts and recovered hosts.  When I simulated a failure of the
large network at the core router (I unplugged nagios from the core
router; my nagios box is multihomed to a dedicated Internet link for
alerts, BTW), I received about 30 host down alerts before the core
router outage was discovered over 45 minutes later.  I could have
received the problem alert from the source of the problem much sooner
with hierarchical queuing.


If a newer version of nagios already supports this, then great!  If not,
perhaps I can assist in creating the necessary code to make this extra
logic possible.  This program is too good not to have such a feature.



--

Shawn Iverson
Technology Associate
MCP W2K3S and W2KP, Linux+, Network+, A+
New Castle Community School Corporation
shawn at nccsc.k12.in.us

                __ (.)(.) __            
***************(||)**()**(||)***************
  Please leave the subject line blank when 
    sending me new emails that originate  
   outside of nccsc.k12.in.us (to prevent 
     dropped messages during delivery).
********************************************


_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users