Host Check Clarification

Kevin Miller kmiller at inflow.com
Tue Oct 22 18:42:48 CEST 2002


I have already been experimenting with max_check_attempts over the past
week.  I initially had it set to 3 and have now raised it to 10.  That is
better but still not ideal.  The outages that I am trying to ride out
last from 1 to 4 mins.

In response to Darren,
All devices that I am monitoring do have parent devices defined and this
does help to limit the number of notifications.  

I have already thought about the multiple notification option using
notification_interval and escalations but doing this is not ideal in all
situations.  

I think that the best answer might be disabling host notifications and
adding a ping service check.  With this I could rely purely on service
notifications, which can be controlled properly.
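
As a rough way to size max_check_attempts for such a ping service, here is
a small sketch.  It assumes the first failed check lands right at the start
of the outage, that retries are spaced exactly retry_check_interval minutes
apart, and that the time a check takes to run is negligible.  The function
name is made up for illustration and is not part of Nagios.

```python
import math


def min_attempts_to_ride_out(outage_minutes: float, retry_check_interval: float) -> int:
    """Smallest max_check_attempts whose final retry lands after the outage ends."""
    if outage_minutes < 0 or retry_check_interval <= 0:
        raise ValueError("outage must be >= 0 and retry interval must be > 0")
    # Attempt 1 is the first failure at t = 0.  Attempt n runs at
    # (n - 1) * retry_check_interval, and that must be strictly greater
    # than the outage length for the last check to see the service back up.
    return math.floor(outage_minutes / retry_check_interval) + 2


# Assumed values: a 4 minute outage with 1 minute retries.
print(min_attempts_to_ride_out(4, 1))
```

Under these assumptions a 4 minute outage with 1 minute retries needs at
least 6 attempts; a longer retry interval lowers the count.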

Does anyone agree with me that host checks should be able to be controlled
differently?

-----Original Message-----
From: Bishop, Dean [mailto:dean.bishop at tcdsb.org] 
Sent: Tuesday, October 22, 2002 10:16 AM
To: 'Kevin Miller'; Bishop, Dean
Cc: 'nagios-users at lists.sourceforge.net'
Subject: RE: [Nagios-users] Host Check Clarification

no, there is a max_check_attempts option.  you could increase this and/or
change the check_command to something that is a bit less sensitive.

what sort of time is involved in "temporarily unavailable"?

please glance at the documentation.



-----Original Message-----
From: Kevin Miller [mailto:kmiller at inflow.com]
Sent: Tuesday, October 22, 2002 11:51 AM
To: 'Bishop, Dean'; Kevin Miller
Cc: 'nagios-users at lists.sourceforge.net'
Subject: RE: [Nagios-users] Host Check Clarification


So there really is no way to keep from being notified on temporary network
outages?  This seems to me to be a very important oversight in the
development of Nagios.  

If hosts could be set up with retry_check_intervals this problem would be
solved.  Possibly host checks could occur after each service check, since
services already support retry_check_intervals.  Once the
retry_check_interval expires, both the host and service could be assumed
to be down.

   

-----Original Message-----
From: Bishop, Dean [mailto:dean.bishop at tcdsb.org] 
Sent: Tuesday, October 22, 2002 9:38 AM
To: 'Kevin Miller'; Bishop, Dean
Cc: 'nagios-users at lists.sourceforge.net'
Subject: RE: [Nagios-users] Host Check Clarification

Yep, you are quite right.  i missed that in my explanation.  attached is a
flowchart i did for a presentation that should clear things up.  if not, let
me know.

later,
dean



-----Original Message-----
From: Kevin Miller [mailto:kmiller at inflow.com]
Sent: Tuesday, October 22, 2002 11:01 AM
To: 'Bishop, Dean'; Kevin Miller
Cc: 'nagios-users at lists.sourceforge.net'
Subject: RE: [Nagios-users] Host Check Clarification


Though Nagios should work as you explain it, I believe that it works
differently.  

In the documentation it states:  
http://nagios.sourceforge.net/docs/1_0/statetypes.html
>>"Hard States 

>>Hard states occur for services in the following situations (hard host
>>states are discussed later)... 


>>When a service check results in a non-OK state and it has been (re)checked
>>the number of times specified by the <max_check_attempts> option in the
>>service definition. This is a hard error state. 

>>When a service recovers from a hard error state. This is considered to be
>>a hard recovery. 

>>When a service check results in a non-OK state and its corresponding host
>>is either DOWN or UNREACHABLE. This is an exception to the general
>>monitoring logic, but makes perfect sense. If the host isn't up why should
>>we try and recheck the service? 

>>Hard states occur for hosts in the following situations... 

>>When a host check results in a non-OK state and it has been (re)checked
>>the number of times specified by the <max_check_attempts> option in the
>>host definition. This is a hard error state. 

>>When a host recovers from a hard error state. This is considered to be a
>>hard recovery."

From this, it looks to me like Nagios ignores the retry_check_interval if
the host is down.  The retry_check_interval is only used for services as
long as the host appears to be up.  

Please let me know if I am missing something.  If Nagios really works the
way that I think it does, that is not good: host checks should behave like
service checks.




-----Original Message-----
From: Bishop, Dean [mailto:dean.bishop at tcdsb.org] 
Sent: Tuesday, October 22, 2002 6:06 AM
To: 'Kevin Miller'
Cc: 'nagios-users at lists.sourceforge.net'
Subject: RE: [Nagios-users] Host Check Clarification

The way to do this is to use the retry_check_interval and
max_check_attempts.

upon failure (this applies to both services and hosts) the
normal_check_interval is not used.  Rather the retry_check_interval is used.
The host/service will not become hard Non-OK until/unless max_check_attempts
is reached without getting an OK result.

so to avoid notifications for temporary outages, retry [max_check_attempts]
times every [retry_check_interval] minutes.
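
As a back-of-the-envelope illustration of that rule, the sketch below
computes how long a problem has to persist before the state goes hard.  It
assumes intervals are expressed in minutes, that the first failure counts as
attempt 1, and that check execution time can be ignored.  The function name
is hypothetical, not a Nagios API.

```python
def minutes_until_hard_state(max_check_attempts: int, retry_check_interval: float) -> float:
    """Minutes from the first failed check to the check that makes the state hard."""
    if max_check_attempts < 1:
        raise ValueError("max_check_attempts must be at least 1")
    # The first failure is attempt 1; every further attempt waits one
    # retry_check_interval, so there are (max_check_attempts - 1) waits.
    return (max_check_attempts - 1) * retry_check_interval


print(minutes_until_hard_state(3, 1))   # 2
print(minutes_until_hard_state(10, 1))  # 9
```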

hope this helps,
dean

-----Original Message-----
From: Kevin Miller [mailto:kmiller at inflow.com]
Sent: Monday, October 21, 2002 6:53 PM
To: 'Bishop, Dean'
Subject: RE: [Nagios-users] Host Check Clarification


Thanks, that is what I assumed.  What I am actually looking for is a way to
suppress host down alerts from notifying me so quickly.  I am monitoring
hosts across the internet and therefore cannot control everything.  Very
often there will be a temporary routing problem that will clear up after 1
or 2 mins.  I would like nagios to keep trying for a few mins before paging
me.  


Any ideas?

Thanks

-----Original Message-----
From: Bishop, Dean [mailto:dean.bishop at tcdsb.org] 
Sent: Monday, October 21, 2002 3:03 PM
To: 'Kevin Miller '; 'nagios-users at lists.sourceforge.net '
Subject: RE: [Nagios-users] Host Check Clarification



i am away from my docs right now but here is how it works.


if a service check, any service check (this would include the first of
many), returns a Non-OK status, then the host is checked.

if the host checks OK, then the service is scheduled for rechecks using the
service's retry_check_interval.  If the service stays Non-OK until
max_check_attempts, then the service notification is sent.

if the host check is Non-OK, then the host is pounded.  If it stays Non-OK
until max_check_attempts (for the host) then the host notification is sent.

under both of these circumstances the service is now rescheduled at its
normal_check_interval.

the difference is that if the host is down, then service notifications are
squelched.
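
The flow described above can be sketched roughly as follows.  This models
only the description in this message, not the actual Nagios scheduler: the
function and its parameters are invented for illustration, the checks are
plain callables that return True for OK, and the wait between service
retries is not simulated.

```python
from typing import Callable, List


def handle_service_failure(
    host_check: Callable[[], bool],
    service_check: Callable[[], bool],
    host_max_attempts: int,
    service_max_attempts: int,
) -> List[str]:
    """Return the events that follow a service check first returning Non-OK."""
    events: List[str] = []
    # A Non-OK service result triggers host checks, repeated back to back.
    for attempt in range(1, host_max_attempts + 1):
        if host_check():
            events.append(f"host OK on attempt {attempt}")
            break
    else:
        # The host stayed Non-OK for every attempt: only the host notifies.
        events.append("host notification sent")
        events.append("service notification squelched")
        return events
    # The host is up, so the service is retried at its retry interval.
    # The failure that started this flow counts as service attempt 1.
    for attempt in range(2, service_max_attempts + 1):
        if service_check():
            events.append(f"service recovered on attempt {attempt}")
            return events
    events.append("service notification sent")
    return events


# Host never answers: the host notification is all that goes out.
print(handle_service_failure(lambda: False, lambda: False, 10, 3))
```

Note that in this model nothing spaces out the host rechecks, so raising
the host attempt count adds checks rather than a guaranteed amount of time.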



later,
dean


-----Original Message-----
From: Kevin Miller
To: nagios-users at lists.sourceforge.net
Sent: 10/21/2002 4:14 PM
Subject: [Nagios-users] Host Check Clarification

Looking for some clarification on Nagios host checking.  I am monitoring
the SSH service on multiple hosts.  From what I understand, when the SSH
service check has problems, Nagios then tries to do a host check.
 
From the documentation:
"One instance where Nagios checks the status of a host is when a service
check results in a non-OK status. Nagios checks the host to decide
whether or not the host is up, down, or unreachable. If the first host
check returns a non-OK state, Nagios will keep pounding out checks of
the host until either (a) the maximum number of host checks (specified
by the max_attempts option in the host definition) is reached or (b) a
host check results in an OK state. "
 
The documentation states that Nagios dedicates all resources to checking
this host and then sends a notification that the host is down.  The part
that seems a little strange to me is that often I will get a Host Down
notification while Nagios is still doing test 1 out of 3 for the SSH
service.  I have max_check_attempts set to 10 for each host.  What is the
interval between these attempts?  Is there any way to tell Nagios to
perform host checks that are a certain interval apart (just like in
service checks) before sending a notification?
 
 
Thanks
 
  
 
 

