checks, notifications don't work after time period exception
Mark Young
myoung at nagios.org
Mon Aug 25 17:43:42 CEST 2008
Hi Seth,
On Aug 25, 2008, at 8:05 AM, Seth Simmons wrote:
> We have a qa group overseas that will work on our customer sites
> during the US overnight. To avoid false alerts, I added a time
> exception so notifications are not sent out between 4am and 5:30am.
> The problem is, after the exception, Nagios (3.0.3) won’t send
> notifications, neither are checks performed for any sites with an
> exception. If a site is in a critical state either shortly after 4
> or (if they start early) right before 4, checks do not continue
> after 5:30. When I look at Nagios later, it shows it in critical
> and the last check was done at 3:58am with the next check at
> midnight the next day.
When I start dealing with time problems in Nagios, I have a short
checklist that I run through first, just to rule things out.
* Check date/time of monitoring server and that it is the right
timezone (UTC or whatever you want it as).
* Check that the Nagios web interface is displaying the time you
expect (top left corner in most CGIs). You may also have set
additional time options in nagios.cfg, such as use_timezone.
* Stop the Nagios process and check that no other running instances
are left: 'service nagios stop', then 'ps aux | grep nagios'.
* Restart the nagios process.
Sometimes duplicate Nagios daemons end up running, and they can
cause many odd problems like this. I also hope we are not dealing
with any time zone conversions for the "overseas" group.
>
> Let me give some more specific examples:
> Server-A is running abc.customer.com for us and our qa group takes
> the site down at 3:55am, before the 4am exception. Nagios will show
> as critical until either midnight the next day, or you force a check
> on the service. So, say at 8am I look at it, the service is
> critical with last check at 3:55am and next scheduled check at 12am
> tomorrow. When I force a check, it will continue on normal check
> schedule and send notice that the service is ok.
So you are saying that "Server-A" is supposed to be checked 24x7
except 4:00am-5:30am each day, but once checks stop at 4:00am they do
not resume until the next day unless you force a check through an
external command? There could be a bug, but your timeperiod
definition is a very common type. I normally suggest that users
always run the checks 24x7 and only restrict the notification
period (like you did with 'Server-B'). But I would try it with a
simple time definition first.
# Test timeperiod for the recycle service.
define timeperiod{
        timeperiod_name recycle
        alias           recycle
        sunday          00:00-04:00,05:30-24:00
        monday          00:00-04:00,05:30-24:00
        tuesday         00:00-04:00,05:30-24:00
        wednesday       00:00-04:00,05:30-24:00
        thursday        00:00-04:00,05:30-24:00
        friday          00:00-04:00,05:30-24:00
        saturday        00:00-04:00,05:30-24:00
        }
Also, what do your "generic-service" and "local-service" templates
look like? Some settings could be inherited through those
templates. You may also have modified settings in nagios.cfg that
change how Nagios deals with time.
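As a rough illustration of why the templates matter: a service
inherits any check_period or notification_period set in its template
unless the service definition overrides it. The values below are
assumptions modeled on a typical template, not the actual
configuration under discussion.

```cfg
# Hypothetical template; the real generic-service may differ.
define service{
        name                    generic-service ; template name referenced by 'use'
        check_period            24x7            ; inherited unless the service sets its own
        notification_period     24x7            ; likewise inherited
        register                0               ; template only, not a real service
        }
```

If a template already sets one of these periods, that value applies
to every service that uses the template and does not set its own.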
>
> Server-B is also running a site and tomcat is stopped at 4:10am.
> This service has notification period with the same time period with
> exceptions from 4am – 5:30am. After that it will not send
> notifications. At 8am it is still doing checks and saying is
> critical, but when looking at the details it says it has not sent
> any notifications. When I force a check it still won’t do it. If I
> restart Nagios then it does a check it will send first notice. I
> don’t see anything wrong with my time period so not sure where the
> issue is. Not sure if anyone else has noticed this before.
The difference between the two is that they use different service
templates. Server-B is using 'local-service'.
>
> Here is what I have for that time period and checks for the above
> examples:
>
> define timeperiod{
> timeperiod_name url-monitor
> alias url-monitor
> sunday 00:00-23:59
> monday 00:00-23:59
> tuesday 00:00-23:59
> wednesday 00:00-23:59
> thursday 00:00-23:59
> friday 00:00-23:59
> saturday 00:00-23:59
> exclude recycle
> }
This is how I would have written the timeperiod definitions to make
them clearer. I've used the exclude method many times, so I am sure
that it works the way you are expecting.
define timeperiod{
        name            24x7    ; template name, so that 'use 24x7' can resolve
        timeperiod_name 24x7
        alias           24 Hours A Day, 7 Days A Week
        sunday          00:00-24:00
        monday          00:00-24:00
        tuesday         00:00-24:00
        wednesday       00:00-24:00
        thursday        00:00-24:00
        friday          00:00-24:00
        saturday        00:00-24:00
        }
# Down timeperiod for Server-A
define timeperiod{
        timeperiod_name recycle
        alias           recycle
        sunday          04:00-05:30
        monday          04:00-05:30
        tuesday         04:00-05:30
        wednesday       04:00-05:30
        thursday        04:00-05:30
        friday          04:00-05:30
        saturday        04:00-05:30
        }
# 24x7 minus the recycle window
define timeperiod{
        timeperiod_name url-monitor
        alias           url-monitor
        use             24x7
        exclude         recycle
        }
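To tie this back to the earlier suggestion of checking around the
clock and restricting only notifications, a service definition could
look roughly like the sketch below. The service description, the
check command, and the use of the generic-service template are
placeholders, not taken from the real configuration.

```cfg
# Sketch only: host, description and check command are placeholders.
define service{
        use                     generic-service  ; assumed template
        host_name               Server-A
        service_description     abc.customer.com
        check_command           check_http       ; hypothetical check command
        check_period            24x7             ; keep checking around the clock
        notification_period     url-monitor      ; no notifications 04:00-05:30
        }
```

The idea is that checks never pause, so the state and next-check
time stay current, and only the notifications are held back during
the recycle window.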
Good luck with your plight! I hope someone else can give you a
simpler solution.
Mark Young
___
Nagios Enterprises, LLC
Web: www.nagios.com
_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
::: Messages without supporting info will risk being sent to /dev/null