checks, notifications don't work after time period exception

Mark Young myoung at nagios.org
Mon Aug 25 17:43:42 CEST 2008
Hi Seth,

On Aug 25, 2008, at 8:05 AM, Seth Simmons wrote:

> We have a qa group overseas that will work on our customer sites  
> during the US overnight.  To avoid false alerts, I added a time  
> exception so notifications are not sent out between 4am and 5:30am.   
> The problem is, after the exception, Nagios (3.0.3) won’t send  
> notifications, neither are checks performed for any sites with an  
> exception.  If a site is in a critical state either shortly after 4  
> or (if they start early) right before 4, checks do not continue  
> after 5:30.  When I look at Nagios later, it shows it in critical  
> and the last check was done at 3:58am with the next check at  
> midnight the next day.

When I start dealing with time problems in Nagios, I have a short  
checklist that I run through first, just to rule out the basics.
* Check the date/time on the monitoring server and that it is set to  
the right timezone (UTC or whatever you want it to be).
* Check that the Nagios web interface is displaying the time you  
expect (top left corner in most CGIs).  You may have set additional  
time-related options in nagios.cfg.
* Stop the nagios process, then check that there are no other  
running instances left:  'service nagios stop', then  
'ps aux | grep nagios'.
* Restart the nagios process.

Sometimes you can get duplicate Nagios daemons running and they can  
cause many odd problems like this.  Also I hope we are not dealing  
with any time translations with the "overseas" group.
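
For the second checklist item, these are the nagios.cfg options I know  
of that affect how times are interpreted and displayed.  This is only a  
sketch: the values are examples, not taken from Seth's setup, and either  
option may simply be absent from your file.

```cfg
# nagios.cfg (main configuration file)

# Overrides the timezone this Nagios instance runs in.  When unset,
# Nagios uses the system timezone.  Example value only.
use_timezone=US/Eastern

# Controls how dates are displayed.
# Accepted values: us, euro, iso8601, strict-iso8601.
date_format=us
```

If use_timezone is set, my understanding is that the CGIs also need a  
matching TZ in the Apache configuration, so a mismatch there is worth  
ruling out as well.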


>
> Let me give some more specific examples:
> Server-A is running abc.customer.com for us and our qa group takes  
> the site down at 3:55am, before the 4am exception.  Nagios will show  
> as critical until either midnight the next day, or you force a check  
> on the service.  So, say at 8am I look at it, the service is  
> critical with last check at 3:55am and next scheduled check at 12am  
> tomorrow.  When I force a check, it will continue on normal check  
> schedule and send notice that the service is ok.

So you are saying that "Server-A" is supposed to be checked in the  
timerange 24x7 minus 4:00am-5:30am each day, but when it stops at  
4:00am it will not start checking until the next day, unless you force  
it through an external command to start checking again?  It is  
possible that there could be a bug, but you seem to have a really  
common timeperiod definition type.  I normally suggest that users  
always run the checks 24x7 and then just modify the notification  
periods (like you did with 'Server-B').  But I would try it with a  
simple time definition first.

# Test timeperiod for the recycle service.
define timeperiod{
         timeperiod_name recycle
         alias           recycle
         sunday          00:00-04:00,05:30-24:00
         monday          00:00-04:00,05:30-24:00
         tuesday         00:00-04:00,05:30-24:00
         wednesday       00:00-04:00,05:30-24:00
         thursday        00:00-04:00,05:30-24:00
         friday          00:00-04:00,05:30-24:00
         saturday        00:00-04:00,05:30-24:00
         }


Also, what do your "generic-service" and "local-service" templates  
look like?  There could be some settings that are following you  
through those templates.  You may also have modified some settings in  
nagios.cfg that change how Nagios deals with time.
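
These are the kinds of template settings I mean.  The template below is  
hypothetical and its values are illustrative, not Seth's; the point is  
only which directives are worth checking in your own templates.

```cfg
# Hypothetical service template, for illustration only.
define service{
        name                    generic-service
        active_checks_enabled   1       ; 0 would stop scheduled checks
        check_period            24x7    ; when active checks may run
        notifications_enabled   1       ; 0 would suppress all notifications
        notification_period     24x7    ; when notifications may be sent
        notification_interval   60      ; time units (minutes by default); 0 = notify once
        register                0       ; template only, not a real service
        }
```

Anything a service does not set itself is inherited from the template  
it names with 'use', so a restrictive check_period or  
notification_period here would apply without showing up in the service  
definition.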


>
> Server-B is also running a site and tomcat is stopped at 4:10am.   
> This service has notification period with the same time period with  
> exceptions from 4am – 5:30am.  After that it will not send  
> notifications.  At 8am it is still doing checks and saying is  
> critical, but when looking at the details it says it has not sent  
> any notifications.  When I force a check it still won’t do it.  If I  
> restart Nagios then it does a check it will send first notice.  I  
> don’t see anything wrong with my time period so not sure where the  
> issue is.  Not sure if anyone else has noticed this before.

The difference between the two is that they use different service  
templates.  Server-B is using 'local-service'.

>
> Here is what I have for that time period and checks for the above  
> examples:
>
> define timeperiod{
>                 timeperiod_name           url-monitor
>                 alias                       url-monitor
>                 sunday                 00:00-23:59
>                 monday               00:00-23:59
>                 tuesday                00:00-23:59
>                 wednesday        00:00-23:59
>                 thursday              00:00-23:59
>                 friday                    00:00-23:59
>                 saturday              00:00-23:59
>                 exclude                recycle
>                 }



This is how I would have written the timeperiod definitions to make them  
clearer.  I've used the exclude method many times, so I am sure that  
it works the way you are expecting.

define timeperiod{
         name            24x7    ; template name, so 'use 24x7' can resolve
         timeperiod_name 24x7
         alias           24 Hours A Day, 7 Days A Week
         sunday          00:00-24:00
         monday          00:00-24:00
         tuesday         00:00-24:00
         wednesday       00:00-24:00
         thursday        00:00-24:00
         friday          00:00-24:00
         saturday        00:00-24:00
         }

# down timeperiod for Server-A
define timeperiod{
         timeperiod_name recycle
         alias           recycle
         sunday          04:00-05:30
         monday          04:00-05:30
         tuesday         04:00-05:30
         wednesday       04:00-05:30
         thursday        04:00-05:30
         friday          04:00-05:30
         saturday        04:00-05:30
         }


define timeperiod{
         timeperiod_name url-monitor
         alias           url-monitor
         use             24x7
         exclude         recycle
         }
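
As a sketch of the earlier suggestion to run checks 24x7 and restrict  
only the notifications, a service could reference the timeperiods above  
like this.  The check command and service description are assumptions  
for illustration; substitute whatever Server-A actually uses.

```cfg
# Sketch only: check_http and the service description are assumptions.
define service{
        use                     generic-service
        host_name               Server-A
        service_description     abc.customer.com
        check_command           check_http
        check_period            24x7        ; keep checking through the recycle window
        notification_period     url-monitor ; 24x7 minus 04:00-05:30
        }
```

With this arrangement the checks are never suspended, so the service  
state should stay current through the 4:00am-5:30am window and only the  
alerts are held back.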





Good luck with your plight!  I hope someone else can give you a  
simpler solution.

Mark Young
___
Nagios Enterprises, LLC
Web:    www.nagios.com