Scaling question for all
Jim Cheetham
jim at inode.co.nz
Sun Aug 20 05:37:30 CEST 2006
On Aug 20, 2006, at 8:49 AM, Patrick Mannion wrote:
> The environment is large - 10,000 Windows servers and 6,000
> Linux/Solaris/Tru64 servers (and a dozen VMS boxes) - a total of
> 120,000 managed objects in all, from CPUs to processes to filesystems
> and services, located around the world in seven main locations with
> connections from dark fiber to 256k leased lines.
At that size, it must surely be tempting to just purchase another
company's toolset, one with well-understood requirements for that scale.
At least, that's the sort of decision many companies make, which
probably explains Tivoli :-)
> I know that will mean
> a distributed Nagios architecture, but I'm not sure just how it should
> be done.
Well, think about the rate of alerts for a start. With that number of
objects, what alert rate do you expect in normal operations? If you
can't pump out alert messages fast enough, it's pointless monitoring
them.
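
To put rough numbers on that, here is a back-of-envelope sketch in
Python. The 5-minute check interval and the figure of one state change
per hundred objects per day are purely my assumptions, not measurements
from your environment; substitute your own.

```python
# Back-of-envelope load estimate for a monitoring estate.
# All rates below are illustrative assumptions, not measured values.

def checks_per_second(objects: int, check_interval_s: float) -> float:
    """Average check executions per second if every object is
    checked once per interval."""
    return objects / check_interval_s


def alerts_per_hour(objects: int, changes_per_object_per_day: float) -> float:
    """Average alerts per hour if each state change raises one alert."""
    return objects * changes_per_object_per_day / 24


if __name__ == "__main__":
    total_objects = 120_000   # figure quoted in the original post
    interval_s = 300          # assumed: one check every 5 minutes
    change_rate = 0.01        # assumed: 1 state change per 100 objects per day

    print(f"checks/s : {checks_per_second(total_objects, interval_s):.1f}")
    print(f"alerts/h : {alerts_per_hour(total_objects, change_rate):.1f}")
```

If the alert figure that comes out is more than your notification path
and your on-call people can absorb, that is the constraint to design
around first.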
Also, at that size, a metric telling you how many filesystems are over
threshold is useless. Correlation is extremely important instead. What
you need to be doing is concentrating on business impact, which means
that your monitoring zones should be designed around discrete
per-project boundaries (well, as discrete as possible), and each one is
capable of presenting an overview to the next layer up. For example, if
I'm wanting to know "is Oracle up" I don't need to know if one of a
RAID set's disks is down. However, the hardware people need to know
about the disks; they're not so concerned with the applications.
Split up your environment along project/responsibility lines, into as
many small chunks as possible, and for each one have a monitoring
solution. Instead of one Nagios with 120,000 objects, you'll hopefully
end up with a hundred-odd Nagios installs with ~1000 objects in each.
Each business unit can look at their own local/dedicated view, and can
provide a calculated "business unit status" view to a central overview
Nagios.
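
As a minimal sketch of what a calculated "business unit status" could
look like, the Python below reduces a unit's local service states to a
single state for the overview. The numeric states are the standard
Nagios plugin return codes; the ServiceState type, the
business_critical flag and the severity ordering are hypothetical
choices of mine, not anything Nagios provides.

```python
from dataclasses import dataclass
from typing import Iterable

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


@dataclass
class ServiceState:
    """Hypothetical record of one locally monitored service."""
    name: str
    state: int
    business_critical: bool  # does this service affect the unit's business view?


def business_unit_status(services: Iterable[ServiceState]) -> int:
    """Roll local states up into one state for the central overview.

    Only services flagged business_critical are considered, so a failed
    RAID member does not turn the "is Oracle up" view red.
    Assumed severity order: CRITICAL > UNKNOWN > WARNING > OK.
    """
    relevant = [s.state for s in services if s.business_critical]
    if not relevant:
        return UNKNOWN  # nothing to judge the unit by
    for level in (CRITICAL, UNKNOWN, WARNING):
        if level in relevant:
            return level
    return OK


if __name__ == "__main__":
    unit = [
        ServiceState("oracle-listener", OK, business_critical=True),
        ServiceState("raid-disk-3", CRITICAL, business_critical=False),
    ]
    print(business_unit_status(unit))
```

The hardware team's own Nagios would still alert on raid-disk-3; only
the rolled-up view handed to the next layer ignores it.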
As far as Xen or VMware is concerned, do whatever you need in order to
make your monitoring as available as possible; if the monitoring goes
down, so does your knowledge of your own operations. A Xen object can
be migrated between hardware platforms without significant downtime
IIRC; I expect that VMware ESX can do the same. This allows you to keep
the monitors running while doing essential maintenance on their
hardware; essential unless you have big iron (by your description you
don't).
-jim