monitoring critical servers - best practices
Marc Powell
marc at ena.com
Thu Apr 16 16:35:50 CEST 2009
On Apr 16, 2009, at 7:02 AM, alfonso baldaserra wrote:
> Greetings,
>
> We are using Nagios version 3.0.6 on Fedora core 9.
>
> I was just looking for some ideas how do you guys monitor critical
> servers and services, what are the best practices etc.?
Ping and those services that I consider critical on the specific
server; smtp for smtp servers, http for http servers, filtering on
filtering servers, disk space on all, etc...
> On a related note, I just realized we have been missing a lot of
> alerts lately. Today we had to reboot a couple of AIX servers, which
> usually takes 5+ minutes. Interestingly, we did not receive
> any notification for these servers. Below is the host configuration
> entry:
>
> define host{
>     name                   aix-server        ; The name of this host template
>     use                    generic-host      ; This template inherits other values from the generic-host template
>     check_period           24x7              ; By default, Linux hosts are checked round the clock
>     check_interval         2                 ; Actively check the host every 5 minutes
>     retry_interval         1                 ; Schedule host check retries at 1 minute interval
>     max_check_attempts     2                 ; Check each Linux host 10 times (max)
>     check_command          check-host-alive  ; Default command to check aix hosts
>     notification_interval  10                ; Resend notifications every 2 hours
>     notification_options   d,u,r             ; Only send notifications for specific host states
>     contact_groups         aix-team          ; Notifications get sent to the admins by default
>     register               0                 ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE!
> }
>
> I was just wondering what do I need to change if:
>
> a server goes down
> nagios checks after 1 minute, as usual, and finds the server is down
check_interval 1
> nagios checks again after a minute and finds the server is still down
retry_interval 1
> nagios sends a notification and keeps sending notifications
> every 10 minutes until the server comes up again
max_check_attempts 2, notification_interval 10. Both are already set
that way in your template, as is retry_interval 1, so it looks like you
only need to change check_interval from 2 to 1. I use --
check_interval 5
retry_interval 3
max_check_attempts 3
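
To make the interaction of these three settings concrete, here is a rough Python sketch of the worst-case delay between a host failing and Nagios reaching a hard DOWN state (the point at which the first notification can go out). It is only back-of-the-envelope arithmetic, not anything Nagios ships: the function name is made up for illustration, and it assumes the default interval_length of 60 seconds, that the failure happens just after a successful check, and that check latency, check timeouts and on-demand host checks can be ignored.

```python
def worst_case_minutes_to_hard_down(check_interval, retry_interval,
                                    max_check_attempts, interval_length=60):
    """Rough upper bound, in minutes, from a failure to the hard DOWN state.

    Hypothetical helper for illustration only. Assumes the host fails
    right after a successful check, so the first failing check arrives
    one full check_interval later, followed by (max_check_attempts - 1)
    retries spaced retry_interval apart.
    """
    if max_check_attempts < 1:
        raise ValueError("max_check_attempts must be at least 1")
    # Intervals are expressed in units of interval_length seconds.
    units = check_interval + (max_check_attempts - 1) * retry_interval
    return units * interval_length / 60


if __name__ == "__main__":
    settings = {
        "requested (1/1/2)": (1, 1, 2),
        "posted template (2/1/2)": (2, 1, 2),
        "my values (5/3/3)": (5, 3, 3),
    }
    for label, (ci, ri, mca) in settings.items():
        minutes = worst_case_minutes_to_hard_down(ci, ri, mca)
        print(f"{label}: up to {minutes:g} minutes before the first notification")
```

Under those assumptions the requested settings give roughly 2 minutes, the posted template roughly 3, and the 5/3/3 values roughly 11, so a reboot that completes faster than that bound may never produce a notification.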
> I have checked nagios archives for check_interval, retry_interval
> and max_check_attempts and as a result I got totally confused.
>
> Any help is much appreciated.
>
> P.S. I request that the nagios developers either change these options to
> something more meaningful or provide some real life examples.
> Apparently there are many users who have been confused by these
> options, as seen in the archives.
What would you suggest? The names seem obvious to me but I may be
jaded. The documentation is pretty clear on what they mean/do as well,
at least to me.
Real life examples are provided in the sample config files (or used to
be). The documentation links that may help you are (some redundancy
between them) --
http://nagios.sourceforge.net/docs/3_0/statetypes.html
http://nagios.sourceforge.net/docs/3_0/activechecks.html
http://nagios.sourceforge.net/docs/3_0/servicechecks.html
http://nagios.sourceforge.net/docs/3_0/objectdefinitions.html#service
And this one from the 2.0 docs, which nicely explains much of the
scheduling. It's a little dated with respect to host checks, but for
the most part it is a good read --
http://nagios.sourceforge.net/docs/2_0/checkscheduling.html
--
Marc