What the...
Bishop, Dean
dean.bishop at tcdsb.org
Thu Oct 10 21:45:28 CEST 2002
yeah, i am thinking along the same lines right now.
no, aggressive checks are not enabled.
i was just reading about service_reaper. Perhaps someone can tell me a bit
more about it, but is it possible that, as we are presuming, service checks
fail but the results aren't entered into the log until the reaper runs?
That would mean that some other portion of Nagios already knows about the
service failure, causing the host checks to be initiated.
i need clarification now though on these:
If a service fails once out of several possible attempts
(max_check_attempts >= 2), do the host checks then run without interruption
from any other checks?
If a service fails once out of several possible attempts
(max_check_attempts >= 2), are the remaining service checks run? If so, when?
Immediately, or at the next normal_check_interval?
thanks all,
dean
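For what it's worth, the SOFT/HARD question above can be pictured with a small model. This is only a sketch of the state-type rules as the Nagios documentation describes them, not Nagios code: a non-OK result stays SOFT until max_check_attempts is reached, unless the host is found DOWN or UNREACHABLE, in which case the service goes HARD at once (which would match the "CRITICAL;HARD;1" entries further down the thread). The names ServiceState and apply_result are made up for illustration, and soft recoveries and retry timing are deliberately not modelled.

```python
from dataclasses import dataclass


@dataclass
class ServiceState:
    """Hypothetical, simplified view of one service's check state."""
    max_check_attempts: int
    attempt: int = 1
    state: str = "OK"
    state_type: str = "HARD"


def apply_result(svc, result, host_up):
    """Apply one check result to svc and return it.

    Assumption: host_up is the outcome of the host check that Nagios
    runs when a service check comes back non-OK.
    """
    if result == "OK":
        # Simplification: every recovery is treated as a HARD OK.
        svc.state, svc.state_type, svc.attempt = "OK", "HARD", 1
        return svc

    already_hard_problem = svc.state != "OK" and svc.state_type == "HARD"
    if svc.state == "OK":
        svc.attempt = 1  # first failure after an OK state
    elif not already_hard_problem:
        svc.attempt = min(svc.attempt + 1, svc.max_check_attempts)
    svc.state = result

    if already_hard_problem or not host_up:
        # Host down/unreachable, or problem already confirmed: no retries.
        svc.state_type = "HARD"
    elif svc.attempt >= svc.max_check_attempts:
        svc.state_type = "HARD"
    else:
        svc.state_type = "SOFT"
    return svc


if __name__ == "__main__":
    svc = ServiceState(max_check_attempts=3)
    for _ in range(3):
        apply_result(svc, "CRITICAL", host_up=True)
        print(svc.state, svc.state_type, svc.attempt)
```

With the host up, the three failures step through SOFT, SOFT, HARD. With host_up=False the very first failure is already HARD at attempt 1.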
-----Original Message-----
From: Russell Scibetti [mailto:russell at quadrix.com]
Sent: Thursday, October 10, 2002 3:28 PM
To: Bishop, Dean
Cc: 'nagios-users at lists.sourceforge.net'
Subject: Re: [Nagios-users] RE: What the...
Here's what I think...
The reason you see the two host checks before the service check isn't that
no service check occurred before the host check, but that it wasn't logged.
The service check occurred and came back non-OK. This means the service
was in a SOFT non-OK state. For some reason, Nagios didn't log this. (Here
might be the real problem in all this. Ethan, would that service failure get
logged if the host check that followed also failed?) My guess is that since
the follow-up host checks (to see if the host is the problem, not the
service) both failed, the initial service check failure didn't get into the
log. But for the host check to even occur, the service had to be checked.
Now at the next scheduled normal_check_interval, the service is checked
again. If you have use_aggressive_host_checking on and the service fails
again, I believe the host will also get checked again. BTW, are you using
use_aggressive_host_checking?
My educated guess in all this is that it's not a check execution problem, but
just that the first service check failure (the SOFT non-OK state) didn't
get entered into the log since the host also failed.
-Russell
Bishop, Dean wrote:
yes, this is interesting.
honestly i haven't even touched servicedependencies.
that being said, the snippet that i sent was a tail -n 3000 of the
nagios.log, grepped for testserver01. On the second line you can see that the
service is OK. There is no mention of the service until _after_ two host
checks (two is my host max_check_attempts). Why was the host checked to begin
with here? And why then is the service checked? Perhaps, as you suggest, as
part of its normal_check_interval....perhaps.
i'm soooo confused.
on the last few lines Nagios does what i would have expected.
confused in configs,
dean
-----Original Message-----
From: Russell Scibetti [mailto:russell at quadrix.com]
Sent: Thursday, October 10, 2002 2:41 PM
To: Bishop, Dean
Cc: 'nagios-users at lists.sourceforge.net'
Subject: Re: [Nagios-users] RE: What the...
The only time Nagios will stop running service checks at the
normal_check_interval for a service is if that service has a
servicedependency whose execution failure criteria are met.
Otherwise, service checks will continue as planned. The way Nagios knows
that a host has come back up is that a service on that host has recovered to
OK. While a host and its services are down, a scheduled service check
won't go through all the retries (it is already in a hard state, so there is
no need to retry), but it will still check the service once.
Also, do you have use_aggressive_host_checking enabled in your nagios.cfg?
The only reason I can guess that the host check is also occurring when the
service check occurs is that you have that setting enabled. Otherwise a
host will only get checked after the first service check failure (while the
host is still considered up).
Hope this helps.
-Russell
Bishop, Dean wrote:
First, sorry about the subject; i realize that it is inappropriate. it does,
however, capture my initial response.
We are in the midst of many nightmares concurrently: smoking servers,
irreplaceable data lost, network latency, cold lunch, sore finger, you know,
the whole gamut at once.
apologies to all.
here is another entry from my logs. Each host is dependent on the
previously numbered host (e.g. Marshall-McLuhan-0561SW2A_4-HS7 is the parent
of Marshall-McLuhan-0561SW2A_5-HS7, which is the parent of
Marshall-McLuhan-0561SW2A_6-HS7, etc.).
why, once Marshall-McLuhan-0561SW2A_14-HS7 is determined to be UNREACHABLE
(due to the failure of Marshall-McLuhan-0561SW2A_4-HS7), is the service
checked on Marshall-McLuhan-0561SW2A_14-HS7?
[1034172479] HOST ALERT: Marshall-McLuhan-0561SW2A_14-HS7;DOWN;SOFT;1;CRITICAL - Plugin timed out after 18 seconds
[1034172516] HOST ALERT: Marshall-McLuhan-0561SW2A_7-HS7;DOWN;SOFT;1;CRITICAL - Plugin timed out after 18 seconds
[1034172552] HOST ALERT: Marshall-McLuhan-0561SW2A_6-HS7;DOWN;SOFT;1;CRITICAL - Plugin timed out after 18 seconds
[1034172588] HOST ALERT: Marshall-McLuhan-0561SW2A_5-HS7;DOWN;SOFT;1;CRITICAL - Plugin timed out after 18 seconds
[1034172624] HOST ALERT: Marshall-McLuhan-0561SW2A_4-HS7;DOWN;SOFT;1;CRITICAL - Plugin timed out after 18 seconds
[1034172644] HOST ALERT: Marshall-McLuhan-0561SW2A_4-HS7;DOWN;HARD;2;CRITICAL - Plugin timed out after 18 seconds
[1034172644] HOST NOTIFICATION: nagiosadmin;Marshall-McLuhan-0561SW2A_4-HS7;DOWN;host-notify-by-email;CRITICAL - Plugin timed out after 18 seconds
[1034172645] HOST NOTIFICATION: Marco;Marshall-McLuhan-0561SW2A_4-HS7;DOWN;host-notify-by-email;CRITICAL - Plugin timed out after 18 seconds
[1034172645] HOST NOTIFICATION: Kevin-NonCritical;Marshall-McLuhan-0561SW2A_4-HS7;DOWN;notify-by-epager;CRITICAL - Plugin timed out after 18 seconds
[1034172645] HOST NOTIFICATION: Kevin;Marshall-McLuhan-0561SW2A_4-HS7;DOWN;host-notify-by-email;CRITICAL - Plugin timed out after 18 seconds
[1034172646] HOST NOTIFICATION: Keith-NonCritical;Marshall-McLuhan-0561SW2A_4-HS7;DOWN;notify-by-epager;CRITICAL - Plugin timed out after 18 seconds
[1034172646] HOST NOTIFICATION: Keith;Marshall-McLuhan-0561SW2A_4-HS7;DOWN;host-notify-by-email;CRITICAL - Plugin timed out after 18 seconds
[1034172646] HOST NOTIFICATION: Ben;Marshall-McLuhan-0561SW2A_4-HS7;DOWN;host-notify-by-email;CRITICAL - Plugin timed out after 18 seconds
[1034172647] HOST ALERT: Marshall-McLuhan-0561SW2A_5-HS7;UNREACHABLE;HARD;2;CRITICAL - Plugin timed out after 18 seconds
[1034172647] HOST ALERT: Marshall-McLuhan-0561SW2A_6-HS7;UNREACHABLE;HARD;2;CRITICAL - Plugin timed out after 18 seconds
[1034172647] HOST ALERT: Marshall-McLuhan-0561SW2A_7-HS7;UNREACHABLE;HARD;2;CRITICAL - Plugin timed out after 18 seconds
[1034172647] HOST ALERT: Marshall-McLuhan-0561SW2A_14-HS7;UNREACHABLE;HARD;2;CRITICAL - Plugin timed out after 18 seconds
[1034172647] SERVICE ALERT: Marshall-McLuhan-0561SW2A_14-HS7;Port Check-23;CRITICAL;HARD;1;Socket timeout after 10 seconds
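The DOWN versus UNREACHABLE split in the log above is consistent with the documented parent rule: a host whose check fails is DOWN if it has no parents or at least one parent is UP, and UNREACHABLE if every parent is itself DOWN or UNREACHABLE. A rough Python sketch of that rule follows. The short host names (sw4, sw5, sw6) and the function classify are made up for illustration and stand in for a three-switch chain. This is not how Nagios implements it internally.

```python
def classify(host, parents, responds):
    """Return 'UP', 'DOWN' or 'UNREACHABLE' for host.

    parents:  dict mapping a host to the list of its parent hosts
              (missing or empty means it hangs directly off the
              monitoring box). Assumed to contain no cycles.
    responds: dict mapping a host to True if its own check succeeds.
    """
    if responds[host]:
        return "UP"
    host_parents = parents.get(host, [])
    if not host_parents:
        # Nothing upstream to blame, so the host itself is the problem.
        return "DOWN"
    if any(classify(p, parents, responds) == "UP" for p in host_parents):
        # A path to the host exists, yet it does not answer.
        return "DOWN"
    # Every parent is DOWN or UNREACHABLE: the host cannot be judged.
    return "UNREACHABLE"


if __name__ == "__main__":
    # Hypothetical chain: sw4 -> sw5 -> sw6, with sw4 failed.
    parents = {"sw5": ["sw4"], "sw6": ["sw5"]}
    responds = {"sw4": False, "sw5": False, "sw6": False}
    for name in ("sw4", "sw5", "sw6"):
        print(name, classify(name, parents, responds))
```

In this model only the top switch comes out DOWN and everything behind it is UNREACHABLE, the same pattern as the HARD entries in the log.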
-----Original Message-----
From: Bishop, Dean
Sent: Thursday, October 10, 2002 1:04 PM
To: 'nagios-users at lists.sourceforge.net'
Subject: What the *&#( !!
Importance: High
Can someone explain this to me??
why in the world is the service for testserver01.tcdsb.org being checked
after the host has been determined down?
also why is the host being checked before the service??
[root@NMS var]# tail nagios.log -n 3000 | grep testserver01
[1034266896] HOST ALERT: testserver01.tcdsb.org;UP;HARD;1;(Host assumed to be up)
[1034266896] SERVICE ALERT: testserver01.tcdsb.org;Misc Servers - Port Check 135;OK;HARD;1;TCP OK - 0 second response time on port 135
[1034267924] HOST ALERT: testserver01.tcdsb.org;DOWN;SOFT;1;CRITICAL - Plugin timed out after 8 seconds
[1034267933] HOST ALERT: testserver01.tcdsb.org;DOWN;HARD;2;CRITICAL - Plugin timed out after 8 seconds
[1034267933] HOST NOTIFICATION:nagiosadmin;testserver01.tcdsb.org;DOWN;host-notify-by-email;CRITICAL - Plugin timed out after 8 seconds
[1034267934] HOST NOTIFICATION:Keith;testserver01.tcdsb.org;DOWN;host-notify-by-email;CRITICAL - Plugin timed out after 8 seconds
[1034267934] SERVICE ALERT: testserver01.tcdsb.org;Misc Servers - Port Check 135;CRITICAL;HARD;1;Socket timeout after 2 seconds
[1034268938] HOST ALERT: testserver01.tcdsb.org;UP;HARD;1;PING OK - Packet loss = 0%, RTA = 0.61 ms
[1034268938] HOST NOTIFICATION:nagiosadmin;testserver01.tcdsb.org;UP;host-notify-by-email;PING OK - Packet loss = 0%, RTA = 0.61 ms
[1034268938] HOST NOTIFICATION:Keith;testserver01.tcdsb.org;UP;host-notify-by-email;PING OK - Packet loss = 0%, RTA = 0.61 ms
[1034268938] SERVICE ALERT: testserver01.tcdsb.org;Misc Servers - Port Check 135;OK;HARD;1;TCP OK - 0 second response time on port 135
[root@NMS var]#
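Since the ordering and spacing of these entries is the whole question, it can help to turn the epoch timestamps into readable times and split out the fields. Below is a small Python sketch, assuming each entry sits on a single line as it does in nagios.log itself (the mail wrapped them). The names LINE_RE and parse_line are made up here, and the field layout is inferred only from the entries shown above.

```python
import re
from datetime import datetime, timezone

# "[epoch] KIND: field;field;..."; the space after the colon is
# optional because some of the pasted notification lines lack it.
LINE_RE = re.compile(r"^\[(\d+)\]\s+([A-Z ]+?):\s*(.*)$")


def parse_line(line):
    """Parse one log entry; return None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    epoch, kind, rest = m.groups()
    return {
        "time": datetime.fromtimestamp(int(epoch), tz=timezone.utc),
        "kind": kind,
        # Plugin output is the last field; it is assumed here not to
        # contain ';' itself.
        "fields": rest.split(";"),
    }


if __name__ == "__main__":
    sample = ("[1034267924] HOST ALERT: testserver01.tcdsb.org;DOWN;SOFT;1;"
              "CRITICAL - Plugin timed out after 8 seconds")
    entry = parse_line(sample)
    print(entry["time"].isoformat(), entry["kind"], entry["fields"][:4])
```

Subtracting the "time" values of two parsed entries gives the gap between, say, the last OK service result and the first host alert.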
--
Russell Scibetti
Quadrix Solutions, Inc.
http://www.quadrix.com
(732) 235-2335, ext. 7038