monitoring critical servers - best practices
Sean McAfee
smcafee at collaborativefusion.com
Wed May 6 15:58:11 CEST 2009
alfonso baldaserra wrote:
> i had been waiting for people to share their experience monitoring
> mission critical systems but it seems there are not many people who do
> that.
I sure am, and I'd be willing to bet that a lot of people here are.
You've got everything "right" in your configs. You likely missed alerts
because of a queue backup, which is usually caused by trying to run too
many checks.
Every time a service check times out, the host is immediately checked.
With a default value of 20 for max_concurrent_checks and a typical
timeout of 10 seconds for plugins, it could take 20 seconds (the service
check timeout plus the host check timeout) before Nagios registers the
first non-OK state during a server reboot. If multiple servers are being
rebooted, Nagios may never get through enough checks while the servers
are down to notice the outage.
See
http://nagios.sourceforge.net/docs/2_0/checkscheduling.html#problem_scheduling
for more info.
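To put rough numbers on the queue backup, here is a minimal sketch of
the arithmetic. It is a simplified model, not Nagios's actual scheduler:
it assumes every pending check blocks for the full plugin timeout and
that checks run in waves of at most max_concurrent_checks. The function
name and the figure of 200 pending checks are made up for illustration.

```python
import math


def drain_time(pending_checks, max_concurrent, seconds_per_check):
    """Seconds needed to run every pending check once.

    Simplified model: checks run in waves of at most max_concurrent,
    and every check in a wave takes the full seconds_per_check
    (for example, a plugin that runs into its timeout).
    """
    if pending_checks <= 0:
        return 0
    if max_concurrent <= 0:
        # Treat 0 as "no limit": everything runs in a single wave.
        return seconds_per_check
    waves = math.ceil(pending_checks / max_concurrent)
    return waves * seconds_per_check


# Hypothetical load: 200 checks that all time out after 10 seconds,
# with at most 20 running at once.
print(drain_time(200, 20, 10))
```

Under those assumptions the queue needs 100 seconds to work through one
pass, so a server that reboots in less time than that can easily come
back before its checks are ever run.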
> p.s. now i am counting on nagios developers to expand on this topic
> possibly by giving some real life examples.
What do you mean by real-life examples?
Generically, here's what I've done to make sure I'm promptly alerted
when things go wrong:
- three facilities with a custom master + slave setup in which each
slave checks its own facility's private LAN as well as all publicly
accessible corporate resources (public SMTP, DNS, etc.)
- customized self-promotion/self-demotion for the slaves if they lose
contact with the master
- direct SMS and fallback email-to-SMS and email-to-email alerting for
critical hosts and services
- sane configuration settings
The last one makes the most difference. Because of the possibility of
queue delays, you can't check everything all of the time. Individual
services are what's critical, not the hosts or everything they run. If
you have a "critical" machine that serves up a webapp, run check_http
every minute, but there's no need to do the same for check_ssh or
check_ntp.
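As a rough illustration of why this helps, the sketch below compares
the average check load for one machine under two schedules. The
10-minute interval for the non-critical checks is an assumption picked
for the example, not a recommendation, and the helper name is
hypothetical.

```python
def checks_per_minute(intervals_minutes):
    """Average number of checks started per minute.

    intervals_minutes maps a check name to its interval in minutes.
    """
    return sum(1.0 / interval for interval in intervals_minutes.values())


# Hypothetical schedules for one "critical" web server.
everything_every_minute = {"check_http": 1, "check_ssh": 1, "check_ntp": 1}
only_http_every_minute = {"check_http": 1, "check_ssh": 10, "check_ntp": 10}

print(checks_per_minute(everything_every_minute))
print(checks_per_minute(only_http_every_minute))
```

With these made-up intervals the load per machine drops from 3 checks a
minute to about 1.2, which leaves that much more room in the queue for
the checks that actually matter.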
--
Sean McAfee
System Engineer