What the...

Russell Scibetti russell at quadrix.com
Thu Oct 10 21:27:33 CEST 2002


Here's what I think...

The reason you see the two host checks before the service check isn't that 
a service check didn't occur before the host check, but that it wasn't 
logged.  The service check occurred and came back non-OK.  This 
means the service was in a SOFT non-OK state.  For some reason, Nagios 
didn't log this (here might be the real problem in all this - Ethan, 
would that service failure get logged if the host check that followed 
also failed?).  My guess is that since the follow-up host checks (to see 
if the host is the problem, not the service) both failed, the initial 
service check failure didn't get into the log.  But for the host check to 
even occur, the service had to be checked.
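
If you want to see how often this pattern shows up, here is a rough Python sketch (not part of Nagios; the function names are made up for illustration). It assumes each log entry sits on a single line, as in the real nagios.log rather than the wrapped copy in this mail, and it tracks only the last logged service state per host. It flags every first soft DOWN host alert that has no logged service failure before it, using a few entries from Dean's excerpt as sample data.

```python
import re

# Matches the two alert types seen in the quoted nagios.log excerpt.
ALERT = re.compile(r"^\[(\d+)\] (HOST|SERVICE) ALERT: (.*)$")

def parse(line):
    """Return (timestamp, kind, host, state, state_type), or None if no match."""
    m = ALERT.match(line.strip())
    if not m:
        return None
    ts, kind, fields = int(m.group(1)), m.group(2), m.group(3).split(";")
    if kind == "HOST":
        # HOST: host;state;state type;attempt;output
        host, state, state_type = fields[0], fields[1], fields[2]
    else:
        # SERVICE: host;service description;state;state type;attempt;output
        host, state, state_type = fields[0], fields[2], fields[3]
    return ts, kind, host, state, state_type

def unexplained_host_checks(lines):
    """Yield (timestamp, host) for soft DOWN host alerts that have no
    logged non-OK service alert before them."""
    last_service_state = {}  # host -> last service state seen in the log
    for line in lines:
        entry = parse(line)
        if entry is None:
            continue
        ts, kind, host, state, state_type = entry
        if kind == "SERVICE":
            last_service_state[host] = state
        elif state == "DOWN" and state_type == "SOFT":
            if last_service_state.get(host, "OK") == "OK":
                yield ts, host

# Entries taken from the quoted log, unwrapped onto single lines.
SAMPLE = [
    "[1034266896] HOST ALERT: testserver01.tcdsb.org;UP;HARD;1;(Host assumed to be up)",
    "[1034266896] SERVICE ALERT: testserver01.tcdsb.org;Misc Servers - Port Check 135;OK;HARD;1;TCP OK - 0 second response time on port 135",
    "[1034267924] HOST ALERT: testserver01.tcdsb.org;DOWN;SOFT;1;CRITICAL - Plugin timed out after 8 seconds",
    "[1034267933] HOST ALERT: testserver01.tcdsb.org;DOWN;HARD;2;CRITICAL - Plugin timed out after 8 seconds",
    "[1034267934] SERVICE ALERT: testserver01.tcdsb.org;Misc Servers - Port Check 135;CRITICAL;HARD;1;Socket timeout after 2 seconds",
]

for ts, host in unexplained_host_checks(SAMPLE):
    print(ts, host)
```

On the sample it reports the host alert at 1034267924, which is the entry Dean was asking about.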

Now at the next scheduled normal_check_interval, the service is again 
checked.  If you have aggressive_host_checking on and the service failed 
again, I believe the host will also get checked again.  BTW, are you 
using aggressive_host_checking?
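
To compare the log against your normal_check_interval, it may help to look at the spacing between the logged alerts for that one service. The following is only a sketch in Python with a hypothetical helper name, again assuming one entry per line. As far as I know, alerts are written when a state change is seen, so the gaps show when the state was observed to change and not necessarily every check that ran.

```python
import re

# "[epoch-seconds] SERVICE ALERT: host;service description;..." prefix.
SERVICE_ALERT = re.compile(r"^\[(\d+)\] SERVICE ALERT: ([^;]+);([^;]+);")

def service_alert_gaps(lines, host, service):
    """Return the seconds between successive logged alerts for one service."""
    times = []
    for line in lines:
        m = SERVICE_ALERT.match(line.strip())
        if m and m.group(2) == host and m.group(3) == service:
            times.append(int(m.group(1)))
    return [later - earlier for earlier, later in zip(times, times[1:])]

# The three service alerts from the quoted log, unwrapped onto single lines.
SERVICE_LINES = [
    "[1034266896] SERVICE ALERT: testserver01.tcdsb.org;Misc Servers - Port Check 135;OK;HARD;1;TCP OK - 0 second response time on port 135",
    "[1034267934] SERVICE ALERT: testserver01.tcdsb.org;Misc Servers - Port Check 135;CRITICAL;HARD;1;Socket timeout after 2 seconds",
    "[1034268938] SERVICE ALERT: testserver01.tcdsb.org;Misc Servers - Port Check 135;OK;HARD;1;TCP OK - 0 second response time on port 135",
]

gaps = service_alert_gaps(
    SERVICE_LINES, "testserver01.tcdsb.org", "Misc Servers - Port Check 135"
)
print(gaps)  # [1038, 1004]
```

If those gaps line up with multiples of the interval you configured, that would fit the idea that the service was simply checked on schedule.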

My educated guess in all this is that it's not a check execution problem, 
but just that the first service check failure (the SOFT non-OK state) 
never got entered into the log since the host also failed.

-Russell
 
Bishop, Dean wrote:

> yes, this is interesting.
>
>  
>
> honestly i haven't even touched servicedependencies.
>
>  
>
> that being said, the snippet that i sent was a tail -n 3000 | grep of 
> the nagios.log (for testserver).  On the second line you can see that 
> the service is OK.  There is no mention of the service until _after_ two 
> host checks (two is my host max_check).  Why was the host checked to 
> begin with here?  And why then is the service checked?  Perhaps, as 
> you suggest, as part of its normal_check_interval....perhaps.
>
>  
>
> i'm soooo confused.
>
>  
>
> on the last few lines Nagios does what i would have expected.
>
>  
>
> confused in configs,
>
> dean
>
>     -----Original Message-----
>     From: Russell Scibetti [mailto:russell at quadrix.com]
>     Sent: Thursday, October 10, 2002 2:41 PM
>     To: Bishop, Dean
>     Cc: 'nagios-users at lists.sourceforge.net'
>     Subject: Re: [Nagios-users] RE: What the...
>
>     The only time nagios will stop doing service checks at the
>     normal_check_interval for that service is if that service has a
>     servicedependency whose execution failure criteria are met.
>
>     Otherwise, service checks will continue as planned.  The way
>     nagios knows that a host has come back up is if any service on
>     that host has recovered to OK.  While a host and its services are
>     down, when a service check occurs, it won't go through all the
>     retries (already in a hard state - no need to retry), but it will
>     check the service once.
>
>     Also, do you have aggressive_host_checking enabled in your
>     nagios.cfg?  The only reason I can guess that the host check is
>     also occurring when the service check occurs is that you have that
>     setting enabled.  Otherwise a host will only get checked after the
>     first service check failure (when the host is still up).
>
>     Hope this helps.
>
>     -Russell
>
>     Bishop, Dean wrote:
>
>>     First, sorry bout the subject i realize that it is
>>     inappropriate.  it does, however capture my initial response.
>>
>>     We are in the midst of many nightmares concurrently: smoking
>>     servers, irreplaceable data lost, network latency, cold lunch,
>>     sore finger, you know the whole gamut at once.
>>
>>     apologies to all.
>>
>>     here is another entry from my logs.  Each host is dependent on
>>     the previously numbered host (e.g.
>>     Marshall-McLuhan-0561SW2A_4-HS7 is the parent of
>>     Marshall-McLuhan-0561SW2A_5-HS7 who is the parent of
>>     Marshall-McLuhan-0561SW2A_6-HS7, etc.).
>>
>>     why, once Marshall-McLuhan-0561SW2A_14-HS7 is determined to be
>>     UNREACHABLE (due to the failure of
>>     Marshall-McLuhan-0561SW2A_4-HS7), is the service checked on
>>     Marshall-McLuhan-0561SW2A_14-HS7?
>>
>>
>>
>>     [1034172479] HOST ALERT:
>>     Marshall-McLuhan-0561SW2A_14-HS7;DOWN;SOFT;1;CRITICAL - Plugin
>>     timed out after 18 seconds
>>     [1034172516] HOST ALERT:
>>     Marshall-McLuhan-0561SW2A_7-HS7;DOWN;SOFT;1;CRITICAL - Plugin
>>     timed out after 18 seconds
>>     [1034172552] HOST ALERT:
>>     Marshall-McLuhan-0561SW2A_6-HS7;DOWN;SOFT;1;CRITICAL - Plugin
>>     timed out after 18 seconds
>>     [1034172588] HOST ALERT:
>>     Marshall-McLuhan-0561SW2A_5-HS7;DOWN;SOFT;1;CRITICAL - Plugin
>>     timed out after 18 seconds
>>     [1034172624] HOST ALERT:
>>     Marshall-McLuhan-0561SW2A_4-HS7;DOWN;SOFT;1;CRITICAL - Plugin
>>     timed out after 18 seconds
>>     [1034172644] HOST ALERT:
>>     Marshall-McLuhan-0561SW2A_4-HS7;DOWN;HARD;2;CRITICAL - Plugin
>>     timed out after 18 seconds
>>     [1034172644] HOST NOTIFICATION:
>>     nagiosadmin;Marshall-McLuhan-0561SW2A_4-HS7;DOWN;host-notify-by-email;CRITICAL
>>     - Plugin timed out after 18 seconds
>>     [1034172645] HOST NOTIFICATION:
>>     Marco;Marshall-McLuhan-0561SW2A_4-HS7;DOWN;host-notify-by-email;CRITICAL
>>     - Plugin timed out after 18 seconds
>>     [1034172645] HOST NOTIFICATION:
>>     Kevin-NonCritical;Marshall-McLuhan-0561SW2A_4-HS7;DOWN;notify-by-epager;CRITICAL
>>     - Plugin timed out after 18 seconds
>>     [1034172645] HOST NOTIFICATION:
>>     Kevin;Marshall-McLuhan-0561SW2A_4-HS7;DOWN;host-notify-by-email;CRITICAL
>>     - Plugin timed out after 18 seconds
>>     [1034172646] HOST NOTIFICATION:
>>     Keith-NonCritical;Marshall-McLuhan-0561SW2A_4-HS7;DOWN;notify-by-epager;CRITICAL
>>     - Plugin timed out after 18 seconds
>>     [1034172646] HOST NOTIFICATION:
>>     Keith;Marshall-McLuhan-0561SW2A_4-HS7;DOWN;host-notify-by-email;CRITICAL
>>     - Plugin timed out after 18 seconds
>>     [1034172646] HOST NOTIFICATION:
>>     Ben;Marshall-McLuhan-0561SW2A_4-HS7;DOWN;host-notify-by-email;CRITICAL
>>     - Plugin timed out after 18 seconds
>>     [1034172647] HOST ALERT:
>>     Marshall-McLuhan-0561SW2A_5-HS7;UNREACHABLE;HARD;2;CRITICAL -
>>     Plugin timed out after 18 seconds
>>     [1034172647] HOST ALERT:
>>     Marshall-McLuhan-0561SW2A_6-HS7;UNREACHABLE;HARD;2;CRITICAL -
>>     Plugin timed out after 18 seconds
>>     [1034172647] HOST ALERT:
>>     Marshall-McLuhan-0561SW2A_7-HS7;UNREACHABLE;HARD;2;CRITICAL -
>>     Plugin timed out after 18 seconds
>>     [1034172647] HOST ALERT:
>>     Marshall-McLuhan-0561SW2A_14-HS7;UNREACHABLE;HARD;2;CRITICAL -
>>     Plugin timed out after 18 seconds
>>     [1034172647] SERVICE ALERT: Marshall-McLuhan-0561SW2A_14-HS7;Port
>>     Check-23;CRITICAL;HARD;1;Socket timeout after 10 seconds
>>
>>
>>     -----Original Message-----
>>     From: Bishop, Dean
>>     Sent: Thursday, October 10, 2002 1:04 PM
>>     To: ' nagios-users at lists.sourceforge.net '
>>     Subject: What the *&#( !!
>>     Importance: High
>>
>>
>>     Can someone explain this to me??
>>
>>
>>     why in the world is the service for testserver01.tcdsb.org being
>>     checked after the host has been determined down?
>>     also why is the host being checked before the service??
>>
>>
>>
>>
>>     [root at NMS var]# tail nagios.log -n 3000 |grep testserver01
>>
>>     [1034266896] HOST ALERT: testserver01.tcdsb.org;UP;HARD;1;(Host
>>     assumed to be up)
>>     [1034266896] SERVICE ALERT: testserver01.tcdsb.org;Misc Servers -
>>     Port Check 135;OK;HARD;1;TCP OK - 0 second response time on port 135
>>     [1034267924] HOST ALERT:
>>     testserver01.tcdsb.org;DOWN;SOFT;1;CRITICAL - Plugin timed out
>>     after 8 seconds
>>     [1034267933] HOST ALERT:
>>     testserver01.tcdsb.org;DOWN;HARD;2;CRITICAL - Plugin timed out
>>     after 8 seconds
>>     [1034267933] HOST
>>     NOTIFICATION:nagiosadmin;testserver01.tcdsb.org;DOWN;host-notify-by-email;CRITICAL
>>     - Plugin timed out after 8 seconds
>>     [1034267934] HOST
>>     NOTIFICATION:Keith;testserver01.tcdsb.org;DOWN;host-notify-by-email;CRITICAL
>>     - Plugin timed out after 8 seconds
>>     [1034267934] SERVICE ALERT: testserver01.tcdsb.org;Misc Servers -
>>     Port Check 135;CRITICAL;HARD;1;Socket timeout after 2 seconds
>>     [1034268938] HOST ALERT: testserver01.tcdsb.org;UP;HARD;1;PING OK
>>     - Packet loss = 0%, RTA = 0.61 ms
>>     [1034268938] HOST
>>     NOTIFICATION:nagiosadmin;testserver01.tcdsb.org;UP;host-notify-by-email;PING
>>     OK - Packet loss = 0%, RTA = 0.61 ms
>>     [1034268938] HOST
>>     NOTIFICATION:Keith;testserver01.tcdsb.org;UP;host-notify-by-email;PING
>>     OK - Packet loss = 0%, RTA = 0.61 ms
>>     [1034268938] SERVICE ALERT: testserver01.tcdsb.org;Misc Servers -
>>     Port Check 135;OK;HARD;1;TCP OK - 0 second response time on port 135
>>
>>     [root at NMS var]#
>>
>
>-- 
>Russell Scibetti
>Quadrix Solutions, Inc.
>http://www.quadrix.com
>(732) 235-2335, ext. 7038
>
>

-- 
Russell Scibetti
Quadrix Solutions, Inc.
http://www.quadrix.com
(732) 235-2335, ext. 7038

