monitoring critical servers - best practices

Marc Powell marc at ena.com
Thu Apr 16 16:35:50 CEST 2009
On Apr 16, 2009, at 7:02 AM, alfonso baldaserra wrote:

> Greetings,
>
> We are using Nagios version 3.0.6 on Fedora core 9.
>
> I was just looking for some ideas on how you guys monitor critical
> servers and services, and what the best practices are.

Ping, plus whichever services I consider critical on the specific
server: smtp for smtp servers, http for http servers, filtering on
filtering servers, disk space on all, and so on.
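
As a rough illustration of that per-role approach, service definitions
along these lines would cover the ping, smtp, and http cases. This is only
a sketch: the hostgroup names are hypothetical, and the generic-service
template and the check_ping, check_smtp, and check_http commands are
assumed to be defined as in the stock sample config files. Disk space on
remote hosts needs an agent or a remote-execution plugin, so it is left
out here.

```
# Sketch only. Hostgroup names below are hypothetical placeholders.
# Assumes generic-service, check_ping, check_smtp and check_http exist
# as in the sample templates.cfg / commands.cfg.

define service{
        use                      generic-service
        hostgroup_name           all-servers         ; hypothetical hostgroup
        service_description      PING
        check_command            check_ping!100.0,20%!500.0,60%
        }

define service{
        use                      generic-service
        hostgroup_name           smtp-servers        ; hypothetical hostgroup
        service_description      SMTP
        check_command            check_smtp
        }

define service{
        use                      generic-service
        hostgroup_name           http-servers        ; hypothetical hostgroup
        service_description      HTTP
        check_command            check_http
        }
```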

> On a related note, I just realized we have been missing a lot of
> alerts lately.  Today we had to reboot a couple of AIX servers, which
> usually takes 5+ minutes.  The interesting thing is that we did not
> receive any notification for these servers.  Below is the host
> configuration entry:
>
> define host{
>         name                     aix-server      ; The name of this host template
>         use                      generic-host    ; This template inherits other values from the generic-host template
>         check_period             24x7            ; By default, Linux hosts are checked round the clock
>         check_interval           2               ; Actively check the host every 5 minutes
>         retry_interval           1               ; Schedule host check retries at 1 minute interval
>         max_check_attempts       2               ; Check each Linux host 10 times (max)
>         check_command            check-host-alive ; Default command to check aix hosts
>         notification_interval    10              ; Resend notifications every 2 hours
>         notification_options     d,u,r           ; Only send notifications for specific host states
>         contact_groups           aix-team        ; Notifications get sent to the admins by default
>         register                 0               ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE!
>         }
>
> I was just wondering what do I need to change if:
>
> a server goes down
> nagios check after 1 minute, as usual, and finds the server is down

check_interval 1

> nagios checks again after a minute and finds the server is still down

retry_interval 1

> nagios sends notification and keep on sending notification after  
> every 10 minutes until the server comes up again

max_check_attempts 2 and notification_interval 10, both of which you
already have. It looks like you only need to change check_interval
from 2 to 1. For comparison, I use --

         check_interval           5
         retry_interval           3
         max_check_attempts       3
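
Putting the answers above together, the template for the scenario you
describe might look roughly like the sketch below. It reuses the names
from your post and assumes interval_length is left at its default of 60
seconds, so every interval is in minutes. The comments are rewritten to
match the values and are mine, not from any official sample.

```
# Sketch only: values taken from the scenario in this thread.
# Assumes interval_length=60 (the default), so 1 time unit = 1 minute.
define host{
        name                     aix-server        ; Template name (from the original post)
        use                      generic-host      ; Inherit defaults from generic-host
        check_period             24x7              ; Check around the clock
        check_interval           1                 ; Regular check every 1 minute
        retry_interval           1                 ; Recheck 1 minute after a failed check
        max_check_attempts       2                 ; 2nd consecutive failure = HARD DOWN, notify
        check_command            check-host-alive
        notification_interval    10                ; Re-notify every 10 minutes while still down
        notification_options     d,u,r             ; Notify on down, unreachable, recovery
        contact_groups           aix-team
        register                 0                 ; Template only, not a real host
        }
```

With these values the first failed check puts the host in a soft DOWN
state, the retry a minute later makes it hard, and the notification goes
out then. In the worst case that is roughly two minutes after the host
actually went down, plus however long the checks themselves take.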

> I have checked nagios archives for check_interval, retry_interval  
> and max_check_attempts and as a result I got totally confused.
>
> Any help is much appreciated.
>
> P.S.  I request nagios developers to either change these options to  
> something more meaningful or provide some real life examples.   
> Apparently there are many users which have been confused by these  
> options as seen in archives.

What would you suggest? The names seem obvious to me but I may be  
jaded. The documentation is pretty clear on what they mean/do as well,  
at least to me.

Real life examples are provided in the sample config files (or used to  
be). The documentation links that may help you are (some redundancy  
between them) --

http://nagios.sourceforge.net/docs/3_0/statetypes.html
http://nagios.sourceforge.net/docs/3_0/activechecks.html
http://nagios.sourceforge.net/docs/3_0/servicechecks.html
http://nagios.sourceforge.net/docs/3_0/objectdefinitions.html#service

And this one from the 2.0 docs, which nicely explains much of the
scheduling. It's a little dated with respect to host checks, but for
the most part it is a good read --

http://nagios.sourceforge.net/docs/2_0/checkscheduling.html

--
Marc

