<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-7">
<META content="MSHTML 6.00.2800.1400" name=GENERATOR>
<STYLE></STYLE>
</HEAD>
<BODY bgColor=#ffffff>
<DIV><SPAN class=173050219-08042004><FONT face=Arial color=#0000ff size=2>The 
problem with a high host max_check_attempts value is that ALL other functions 
stop during this particular type of check. This is by design, since Nagios 
has to determine how widespread an outage is as it walks the dependency 
tree using host checks. The monitoring queue gets pushed back, and 
the latency of every other check suffers. With a lower max_check_attempts, you'll 
know sooner (in the console) when the bad link is acting up, 
and notification spam can be tuned with escalations if 
needed. In the end, whatever works well for you is what you should 
do. Our shop has had good results with this setup.</FONT></SPAN></DIV>
<DIV><SPAN class=173050219-08042004><FONT face=Arial color=#0000ff
size=2></FONT></SPAN> </DIV>
<DIV><SPAN class=173050219-08042004><FONT face=Arial color=#0000ff 
size=2>Another avenue to look into is defining a more lenient check-host-alive 
command. Our company manages some networks throughout the world 
with very poor links, where ping times often exceed 4000 ms (4 seconds!). 
So we cloned check-host-alive into a more lenient check-host-alive2, and hosts 
get one or the other depending on their ISP line quality.</FONT></SPAN></DIV>
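<DIV><SPAN class=173050219-08042004><FONT face=Arial color=#0000ff 
size=2>As a rough sketch, a cloned command along these lines might look like 
the following. The check_ping thresholds, packet count, and timeout are 
illustrative assumptions, not values taken from a real setup, so tune them to 
your own links.</FONT></SPAN></DIV>
<PRE>
# Hypothetical lenient variant of check-host-alive for slow links.
# All numbers are placeholders: warn at 5000 ms RTT or 80% loss,
# critical at 8000 ms RTT or 100% loss, 5 packets, 30 s plugin timeout.
define command{
        command_name    check-host-alive2
        command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w 5000.0,80% -c 8000.0,100% -p 5 -t 30
        }
</PRE>
<DIV><SPAN class=173050219-08042004><FONT face=Arial color=#0000ff 
size=2>Hosts behind poor links would then use check-host-alive2 as their 
check_command, while everything else keeps the stock 
check-host-alive.</FONT></SPAN></DIV>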
<BLOCKQUOTE dir=ltr
style="PADDING-LEFT: 5px; MARGIN-LEFT: 5px; BORDER-LEFT: #0000ff 2px solid; MARGIN-RIGHT: 0px">
<DIV class=OutlookMessageHeader dir=ltr align=left><FONT face=Tahoma
size=2>-----Original Message-----<BR><B>From:</B> Anastasios Zafeiropoulos
[mailto:mls@freemail.gr]<BR><B>Sent:</B> Thursday, April 08, 2004 10:33
AM<BR><B>To:</B> Tedman Eng; nagios-users<BR><B>Subject:</B> Re:
[Nagios-users] Dependency problem<BR><BR></FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>Mr. Tedman, thank you very much for your response 
amid all this flaming.</FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>I have to disagree with you regarding 
max_check_attempts = 30. This is tested and works as it should. 
When RT3 is unreachable or down, Nagios starts pinging it with a 30-attempt 
countdown. When that ends without success, it moves on to the parent. And 
so on...</FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>But I think you gave me a new kick start 
with the escalations idea. I'd better go read a little more of the 
documentation and see if this option works for my case!</FONT></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV><FONT face=Arial size=2>Thanks again</FONT></DIV>
<BLOCKQUOTE dir=ltr
style="PADDING-RIGHT: 0px; PADDING-LEFT: 5px; MARGIN-LEFT: 5px; BORDER-LEFT: #000000 2px solid; MARGIN-RIGHT: 0px">
<DIV style="FONT: 10pt arial">----- Original Message ----- </DIV>
<DIV
style="BACKGROUND: #e4e4e4; FONT: 10pt arial; font-color: black"><B>From:</B>
<A title=teng@dataway.com href="mailto:teng@dataway.com">Tedman Eng</A>
</DIV>
<DIV style="FONT: 10pt arial"><B>To:</B> <A title=mls@freemail.gr
href="mailto:mls@freemail.gr">'Anastasios Zafeiropoulos'</A> ; <A
title=nagios-users@lists.sourceforge.net
href="mailto:nagios-users@lists.sourceforge.net">nagios-users</A> </DIV>
<DIV style="FONT: 10pt arial"><B>Sent:</B> Thursday, April 08, 2004 11:15
AM</DIV>
<DIV style="FONT: 10pt arial"><B>Subject:</B> RE: [Nagios-users] Dependency
problem</DIV>
<DIV><BR></DIV>
<DIV><SPAN class=806084508-08042004><FONT face=Arial color=#0000ff 
size=2>Try lowering the host max_check_attempts. When Nagios detects that a 
service is bad, it will run a host check on each parent up the 
tree and will not do ANYTHING else during the 30 check attempts you've 
set while it tries to determine whether RT1, RT2, and/or RT3 is 
down. This can adversely affect your other monitored devices if those 
links are always flapping. It's better to monitor faster and make 
notifications slower than to slow down the entire monitoring. 
<SPAN class=806084508-08042004><FONT face=Arial color=#0000ff size=2>The 
host will show up in the console as up/down/flapping a lot, which 
is its true state. You can artificially slow down 
notifications by using escalations.</FONT></SPAN></FONT></SPAN></DIV>
<DIV><SPAN class=806084508-08042004><FONT face=Arial color=#0000ff
size=2></FONT></SPAN> </DIV>
<DIV><SPAN class=806084508-08042004><FONT face=Arial color=#0000ff
size=2>For example:</FONT></SPAN></DIV>
<DIV><SPAN class=806084508-08042004><FONT face=Arial color=#0000ff
size=2>set notification interval to 5</FONT></SPAN></DIV>
<DIV><SPAN class=806084508-08042004><FONT face=Arial color=#0000ff
size=2>set no contact for the normal notification (use the escalation
instead)</FONT></SPAN></DIV>
<DIV><SPAN class=806084508-08042004><FONT face=Arial color=#0000ff
size=2>set the escalation to notify starting at alert
#2</FONT></SPAN></DIV>
<DIV><SPAN class=806084508-08042004><FONT face=Arial color=#0000ff
size=2></FONT></SPAN> </DIV>
<DIV><SPAN class=806084508-08042004><FONT face=Arial color=#0000ff
size=2>This would in effect make it so the device would have to be down for
a full 5 minutes before you get notified.</FONT></SPAN></DIV>
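<DIV><SPAN class=806084508-08042004><FONT face=Arial color=#0000ff 
size=2>In config terms, the steps above might be sketched roughly like this. 
The host name RT3 and its parent RT2 come from your topology, but the contact 
group name network-admins and the max_check_attempts value of 3 are 
placeholders, and the other required host directives are left out. How the 
normal (non-escalated) notification ends up with no real contact depends on 
your setup and is not shown here.</FONT></SPAN></DIV>
<PRE>
# Sketch only; not a complete host definition.
define host{
        host_name               RT3
        parents                 RT2
        max_check_attempts      3       ; placeholder: low, so state changes show up quickly
        notification_interval   5       ; minutes, assuming the default interval_length of 60
        }

# Real contacts are only notified from the 2nd notification onward.
define hostescalation{
        host_name               RT3
        first_notification      2
        last_notification       0       ; 0 = escalation stays in effect for all later notifications
        contact_groups          network-admins
        notification_interval   5
        }
</PRE>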
<DIV><SPAN class=806084508-08042004><FONT face=Arial color=#0000ff
size=2></FONT></SPAN> </DIV>
<DIV><FONT face=Tahoma><FONT size=2><SPAN class=806084508-08042004><FONT
face=Arial color=#0000ff></FONT></SPAN></FONT></FONT> </DIV>
<DIV><FONT face=Tahoma><FONT size=2><SPAN
class=806084508-08042004></SPAN></FONT></FONT> </DIV>
<DIV><FONT face=Tahoma><FONT size=2><SPAN
class=806084508-08042004> </SPAN>-----Original
Message-----<BR><B>From:</B> Anastasios Zafeiropoulos
[mailto:mls@freemail.gr]<BR><B>Sent:</B> Wednesday, April 07, 2004 12:59
PM<BR><B>To:</B> nagios-users<BR><B>Subject:</B> [Nagios-users] Dependency
problem<BR><BR></DIV></FONT></FONT>
<BLOCKQUOTE>
<DIV><FONT face=Arial size=2>Hello world,</FONT></DIV>
<DIV><FONT face=Arial></FONT> </DIV>
<DIV><FONT face=Arial size=2>I'm having trouble with either a host dependency 
misconfiguration or, why not, a bug in Nagios' dependency logic 
and notification.</FONT></DIV>
<DIV><FONT face=Arial></FONT> </DIV>
<DIV><FONT face=Arial size=2>I am using version nagios-1.2-0.rhfc1.dag, 
a prebuilt package from the Dag Apt repository 
site.<BR>===================================================<BR>My
Topology:<BR>===================================================</FONT></DIV>
<DIV><FONT face=Arial></FONT> </DIV>
<DIV><FONT face=Arial size=2>Nagios machine --- RT1 -- RT2 -- RT3
</FONT></DIV>
<DIV><FONT face=Arial></FONT> </DIV><FONT size=2>
<DIV><FONT face=Arial></FONT><BR><FONT
face=Arial>====================================================<BR>The
problem<BR>====================================================</FONT></DIV>
<DIV><FONT face=Arial></FONT> </DIV>
<DIV><FONT face=Arial>When RT1 goes down, or the RT1-RT2 link goes down, 
Nagios notices it at some arbitrary point, while it is running a service check or 
a HOST_ALIVE check against any part of the network that 
is down. Let's assume that the first host Nagios found dead was RT3. 
</FONT></DIV>
<DIV><FONT face=Arial></FONT> </DIV>
<DIV><FONT face=Arial>Nagios didn't get any reply from RT3, so RT3 is 
put into a SOFT DOWN state. </FONT></DIV>
<DIV><FONT face=Arial></FONT> </DIV>
<DIV><FONT face=Arial>Next, the RETRY process takes place. 
max_check_attempts is 30 for each host. That's because the links are not 
reliable at all, so we want to be a little lenient 
with the notifications.</FONT></DIV>
<DIV><FONT face=Arial></FONT> </DIV>
<DIV><FONT face=Arial>When we reach retry #30, Nagios 
assumes that RT3 IS DOWN, puts it in a HARD DOWN state, and looks for any 
dependencies associated with RT3. If you look 
below, RT3 is dependent upon RT2. So it continues by trying to ping 
RT2.</FONT></DIV>
<DIV><FONT face=Arial></FONT> </DIV>
<DIV><FONT face=Arial>While Nagios is trying to determine whether RT2 
is alive or not, the RT1-RT2 link suddenly comes up and the whole network 
is reachable by Nagios again. Note that at this point the 
max_check_attempts for RT2 have not run out. So Nagios gets a response from 
RT2 and puts it in a HARD OK 
state.</FONT></DIV>
<DIV><FONT face=Arial></FONT> </DIV>
<DIV><FONT
face=Arial></FONT></DIV></FONT></BLOCKQUOTE></BLOCKQUOTE></BLOCKQUOTE></BODY></HTML>