monitoring critical servers - best practices
alfonso baldaserra
alfonso.baldaserra at gmail.com
Thu Apr 16 14:02:48 CEST 2009
Greetings,
We are using Nagios version 3.0.6 on Fedora Core 9.
I am looking for ideas on how you monitor critical servers and services,
and what the best practices are.
On a related note, I just realized we have been missing a lot of alerts
lately. Today we had to reboot a couple of AIX servers, which usually takes
5+ minutes. Interestingly, we did not receive any notifications for these
servers. Below is the host configuration entry:
define host{
        name                    aix-server        ; The name of this host template
        use                     generic-host      ; This template inherits other values from the generic-host template
        check_period            24x7              ; AIX hosts are checked round the clock
        check_interval          2                 ; Actively check the host every 2 minutes
        retry_interval          1                 ; Schedule host check retries at 1 minute intervals
        max_check_attempts      2                 ; Check each AIX host 2 times (max)
        check_command           check-host-alive  ; Default command to check AIX hosts
        notification_interval   10                ; Resend notifications every 10 minutes
        notification_options    d,u,r             ; Only send notifications for specific host states
        contact_groups          aix-team          ; Notifications get sent to the aix-team contact group
        register                0                 ; DON'T REGISTER THIS DEFINITION - IT'S NOT A REAL HOST, JUST A TEMPLATE!
        }
What I would like to know is what I need to change to get this behaviour:
1. A server goes down.
2. Nagios runs its regular check (every 1 minute) and finds the server down.
3. Nagios checks again a minute later and finds the server still down.
4. Nagios sends a notification, and keeps sending one every 10 minutes
   until the server comes back up.
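To make the question concrete, here is my best guess at a template that
would behave this way. It is only a sketch and has not been verified against
a running Nagios 3.0.6. It assumes interval_length in nagios.cfg is left at
its default of 60 seconds, so that all intervals are in minutes. The
template name aix-server-fast is made up for illustration; everything else
is taken from the entry above.

```cfg
define host{
        name                    aix-server-fast   ; hypothetical template name, for illustration only
        use                     generic-host      ; inherit remaining values from generic-host
        check_period            24x7              ; check round the clock
        check_interval          1                 ; regular check every 1 minute (assumes interval_length=60)
        retry_interval          1                 ; re-check 1 minute after a failed check
        max_check_attempts      2                 ; the 2nd consecutive failed check makes the DOWN state hard
        check_command           check-host-alive
        notification_interval   10                ; re-notify every 10 minutes while the host stays down
        notification_options    d,u,r             ; notify on down, unreachable and recovery
        contact_groups          aix-team
        register                0                 ; template only, not a real host
        }
```

If I read the documentation correctly, the first notification should go out
once the second consecutive check fails, that is, about one minute after
the first failed check. Only check_interval differs from my current entry.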
I have searched the Nagios archives for check_interval, retry_interval and
max_check_attempts, and came away totally confused.
Any help is much appreciated.
P.S. I would ask the Nagios developers to either rename these options to
something more meaningful or provide some real-life examples. Judging by
the archives, many users have been confused by these options.
Thanks