Host Check Clarification
Kevin Miller
kmiller at inflow.com
Tue Oct 22 18:42:48 CEST 2002
I have been experimenting with max_check_attempts over the past week. I
initially had it set to 3 and have now raised it to 10. That is better but
still not ideal. The outages that I am trying to ride out last from 1 to 4
minutes.
In response to Darren,
All devices that I am monitoring do have parent devices defined and this
does help to limit the number of notifications.
I have already thought about the multiple notification option using
notification_interval and escalations but doing this is not ideal in all
situations.
I think that the best answer might be disabling host notifications and
adding a ping service check. With this I could rely purely on service
notifications, which can be controlled properly.
Does anyone agree with me that host checks should be able to be controlled
differently?
-----Original Message-----
From: Bishop, Dean [mailto:dean.bishop at tcdsb.org]
Sent: Tuesday, October 22, 2002 10:16 AM
To: 'Kevin Miller'; Bishop, Dean
Cc: 'nagios-users at lists.sourceforge.net'
Subject: RE: [Nagios-users] Host Check Clarification
no, there is a max_check_attempts option. you could increase this and/or
change the check_command to something that is a bit less sensitive.
what sort of time is involved in "temporarily unavailable"?
please glance at the documentation.
-----Original Message-----
From: Kevin Miller [mailto:kmiller at inflow.com]
Sent: Tuesday, October 22, 2002 11:51 AM
To: 'Bishop, Dean'; Kevin Miller
Cc: 'nagios-users at lists.sourceforge.net'
Subject: RE: [Nagios-users] Host Check Clarification
So there really is no way to keep from being notified on temporary network
outages? This seems to me to be a very important oversight in the
development of Nagios.
If hosts could be set up with retry_check_intervals this problem would be
solved. Possibly host checks could occur after each service check, which
already supports retry_check_intervals... Once the retry_check_interval
expires, both the host and service could be assumed to be down.
-----Original Message-----
From: Bishop, Dean [mailto:dean.bishop at tcdsb.org]
Sent: Tuesday, October 22, 2002 9:38 AM
To: 'Kevin Miller'; Bishop, Dean
Cc: 'nagios-users at lists.sourceforge.net'
Subject: RE: [Nagios-users] Host Check Clarification
Yep, you are quite right. i missed that in my explanation. attached is a
flowchart i did for a presentation that should clear things up. if not, let
me know.
later,
dean
-----Original Message-----
From: Kevin Miller [mailto:kmiller at inflow.com]
Sent: Tuesday, October 22, 2002 11:01 AM
To: 'Bishop, Dean'; Kevin Miller
Cc: 'nagios-users at lists.sourceforge.net'
Subject: RE: [Nagios-users] Host Check Clarification
Though Nagios should work as you explain it, I believe that it works
differently.
In the documentation it states:
http://nagios.sourceforge.net/docs/1_0/statetypes.html
>>"Hard States
>>Hard states occur for services in the following situations (hard host
>>states are discussed later)...
>>When a service check results in a non-OK state and it has been (re)checked
>>the number of times specified by the <max_check_attempts> option in the
>>service definition. This is a hard error state.
>>When a service recovers from a hard error state. This is considered to be
>>a hard recovery.
>>When a service check results in a non-OK state and its corresponding host
>>is either DOWN or UNREACHABLE. This is an exception to the general
>>monitoring logic, but makes perfect sense. If the host isn't up why should
>>we try and recheck the service?
>>Hard states occur for hosts in the following situations...
>>When a host check results in a non-OK state and it has been (re)checked
>>the number of times specified by the <max_check_attempts> option in the
>>host definition. This is a hard error state.
>>When a host recovers from a hard error state. This is considered to be a
>>hard recovery."
From this, it looks to me like Nagios ignores the retry_check_interval if
the host is down. The retry_check_interval is only used for services as
long as the host appears to be up.
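
The quoted rules for a non-OK service result can be sketched as a small
function. This is only an illustration of the documentation text above, not
Nagios source code; the function name and its parameters are hypothetical.

```python
def service_state_type(attempt: int, max_check_attempts: int, host_state: str) -> str:
    """Classify a non-OK service result as SOFT or HARD per the quoted rules."""
    # Exception in the quoted docs: if the host is not up, the service goes
    # straight to a hard state and is not rechecked at the retry interval.
    if host_state in ("DOWN", "UNREACHABLE"):
        return "HARD"
    # General rule: hard once the configured number of checks is used up.
    if attempt >= max_check_attempts:
        return "HARD"
    return "SOFT"


print(service_state_type(1, 3, "UP"))    # SOFT
print(service_state_type(1, 3, "DOWN"))  # HARD
```

The second call shows the case described here: a first failed service check
becomes hard immediately once the host is considered down.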
Please let me know if I am missing something. If Nagios is really working
the way that I think it is, that is not good; host checks should behave
similarly to service checks.
-----Original Message-----
From: Bishop, Dean [mailto:dean.bishop at tcdsb.org]
Sent: Tuesday, October 22, 2002 6:06 AM
To: 'Kevin Miller'
Cc: 'nagios-users at lists.sourceforge.net'
Subject: RE: [Nagios-users] Host Check Clarification
The way to do this is to use the retry_check_interval and
max_check_attempts.
upon failure (this applies to both services and hosts) the
normal_check_interval is not used. Rather the retry_check_interval is used.
The host/service will not become hard Non-OK until/unless max_check_attempts
is reached without getting an OK result.
so to avoid notifications for temporary outages, retry [max_check_attempts]
times every [retry_check_interval] minutes.
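
As a rough illustration of that arithmetic, the delay before a check reaches
a hard state can be estimated as below. This is a sketch, not Nagios code:
the function names are hypothetical, and it assumes that the first failed
check counts as attempt 1 and that retry_check_interval is given in minutes
(that is, interval_length is left at 60 seconds).

```python
import math


def minutes_until_hard_state(max_check_attempts: int, retry_check_interval: float) -> float:
    """Minutes from the first non-OK result to the check that makes it hard."""
    if max_check_attempts < 1:
        raise ValueError("max_check_attempts must be at least 1")
    # The first failure is attempt 1, so only (max - 1) retries are spaced out.
    return (max_check_attempts - 1) * retry_check_interval


def attempts_needed(outage_minutes: float, retry_check_interval: float) -> int:
    """Smallest max_check_attempts whose last retry lands at or after the outage end."""
    return math.ceil(outage_minutes / retry_check_interval) + 1


print(minutes_until_hard_state(3, 1))  # 2
print(attempts_needed(4, 1))           # 5
```

Under these assumptions, 3 attempts at a 1 minute retry interval give up
after 2 minutes, while riding out a 4 minute outage needs at least 5.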
hope this helps,
dean
-----Original Message-----
From: Kevin Miller [mailto:kmiller at inflow.com]
Sent: Monday, October 21, 2002 6:53 PM
To: 'Bishop, Dean'
Subject: RE: [Nagios-users] Host Check Clarification
Thanks, that is what I assumed. What I am actually looking for is a way to
suppress host down alerts from notifying me so quickly. I am monitoring
hosts across the internet and therefore cannot control everything. Very
often there will be a temporary routing problem that will clear up after 1
or 2 mins. I would like nagios to keep trying for a few mins before paging
me.
Any ideas?
Thanks
-----Original Message-----
From: Bishop, Dean [mailto:dean.bishop at tcdsb.org]
Sent: Monday, October 21, 2002 3:03 PM
To: 'Kevin Miller '; 'nagios-users at lists.sourceforge.net '
Subject: RE: [Nagios-users] Host Check Clarification
i am away from my docs right now but here is how it works.
if a service check, any service check (this would include the first of
many), returns a Non-OK status, then the host is checked.
if the host checks OK, then the services are scheduled for check using the
service's retry_check_interval. If the service stays Non-OK until
max_attempts, then the service notification is sent.
if the host check is Non-OK, then the host is pounded. If it stays Non-OK
until max_attempts (for the host) then the host notification is sent.
under both of these circumstances the service is now rescheduled at its
normal_check_interval.
the difference is that if the host is down, then service notifications are
squelched.
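
The flow described above can be sketched roughly as follows. This follows
the description in this message only and is not taken from the Nagios
source; the function name, its parameters and the returned action strings
are all hypothetical.

```python
from typing import Callable


def on_service_non_ok(
    check_host: Callable[[], bool],
    host_max_attempts: int,
    service_attempt: int,
    service_max_attempts: int,
) -> str:
    """Decide what happens after a service check returns Non-OK."""
    # The host is checked back to back, with no retry interval in between.
    for _ in range(host_max_attempts):
        if check_host():
            # Host is up: the service follows its own retry schedule.
            if service_attempt >= service_max_attempts:
                return "notify_service"
            return "retry_service"
    # Host never answered: host notification, service notifications squelched.
    return "notify_host"


flaky = iter([False, False, True])
print(on_service_non_ok(lambda: next(flaky), 10, 1, 3))  # retry_service
print(on_service_non_ok(lambda: False, 10, 1, 3))        # notify_host
```

In this sketch the host loop has no delay between attempts, which is why a
short outage can still produce a host notification before the service has
finished its own retries.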
later,
dean
-----Original Message-----
From: Kevin Miller
To: nagios-users at lists.sourceforge.net
Sent: 10/21/2002 4:14 PM
Subject: [Nagios-users] Host Check Clarification
Looking for some clarification on Nagios host checking. I am monitoring
the SSH service on multiple hosts. From what I understand, when the SSH
service check has problems, Nagios then tries to do a host check.
From the documentation:
"One instance where Nagios checks the status of a host is when a service
check results in a non-OK status. Nagios checks the host to decide
whether or not the host is up, down, or unreachable. If the first host
check returns a non-OK state, Nagios will keep pounding out checks of
the host until either (a) the maximum number of host checks (specified
by the max_attempts option in the host definition) is reached or (b) a
host check results in an OK state. "
The documentation states that Nagios dedicates all resources to checking
this host and then sends a notification that the host is down. The part
that seems a little strange to me is that often I will get a Host Down
notification while Nagios is still doing test 1 out of 3 for the SSH
service. I have my max_attempts set to 10 for each host; what is the
interval between these attempts? Is there any way to tell Nagios to
perform host checks that are a certain interval apart (just like in
service checks) before sending a notification?
Thanks