Scaling question for all

Jim Cheetham jim at inode.co.nz
Sun Aug 20 05:37:30 CEST 2006

On Aug 20, 2006, at 8:49 AM, Patrick Mannion wrote:
> The environment is large - 10,000 Windows servers and 6,000
> Linux/Solaris/Tru64 servers (and a dozen VMS boxes) - a total of
> 120,000 managed objects in all, from CPUs to processes to filesystems
> and services, located around the world in seven main locations with
> connections from dark fiber to 256k leased lines.

At that size, it must surely be tempting to just purchase another
company's toolset, one built around well-understood requirements for
an environment that large. At least, that's the sort of decision many
companies make, which probably explains Tivoli :-)

> I know that will mean
> a distributed Nagios architecture, but I'm not sure just how it should
> be done.

Well, think about the rate of alerts for a start. With that number of
objects, what alert rate do you expect in normal operations? If you
can't pump out alert messages as fast as they are generated, there is
no point monitoring the objects in the first place.
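
As a rough back-of-envelope sketch of that question (only the 120,000
figure comes from your mail; the 5-minute check interval and the 0.1%
state-change fraction are numbers I made up purely for illustration):

```python
def alerts_per_minute(objects, check_interval_min, state_change_fraction):
    """Estimate alerts per minute, assuming every object is checked once
    per interval and a fixed fraction of checks triggers an alert."""
    checks_per_minute = objects / check_interval_min
    return checks_per_minute * state_change_fraction


# 120,000 objects is from the original post. The interval and the
# fraction below are ASSUMED values, not measurements.
rate = alerts_per_minute(objects=120_000,
                         check_interval_min=5,
                         state_change_fraction=0.001)
print(f"{rate:.0f} alerts/minute")
```

Plug in your own interval and failure fraction; what matters is whether
your notification path can keep up with the number that comes out.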

Also, at that size, a metric telling you how many filesystems are over
threshold is useless; correlation matters far more. What you need to
concentrate on is business impact, which means that your monitoring
zones should be designed around discrete per-project boundaries (well,
as discrete as possible), and each one should be capable of presenting
an overview to the next layer up. For example, if I want to know "is
Oracle up" I don't need to know whether one of a RAID set's disks is
down. However, the hardware people do need to know about the disks;
they're not so concerned with the applications.
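
A minimal sketch of the Oracle/RAID example, assuming each check is
tagged with the layer it belongs to. The check names, the "layer" tag
and the view_for() helper are all hypothetical, not anything Nagios
provides; the state numbers follow the usual plugin return codes
(0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN):

```python
from collections import namedtuple

# Hypothetical record: check name, the layer that owns it, and its
# state as a Nagios-style return code (0=OK, 1=WARNING, 2=CRITICAL,
# 3=UNKNOWN).
Check = namedtuple("Check", ["name", "layer", "state"])

checks = [
    Check("oracle_listener", "application", 0),
    Check("raid_disk_3", "hardware", 2),
    Check("raid_array", "hardware", 1),
]


def view_for(layer, checks):
    """Return only the checks that the given audience cares about."""
    return [c for c in checks if c.layer == layer]


app_view = view_for("application", checks)  # what the DBA sees
hw_view = view_for("hardware", checks)      # what the hardware team sees
```

The application view stays green while one disk in the set is dead,
and the hardware team still gets the disk failure in their own view.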

Split up your environment along project/responsibility lines, into as
many small chunks as possible, and give each one its own monitoring
solution. Instead of one Nagios with 120,000 objects, you'll hopefully
end up with a hundred-odd Nagios installs with ~1000 objects in each.
Each business unit can look at its own local/dedicated view, and can
provide a calculated "business unit status" view to a central overview
Nagios.
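
One way the calculated "business unit status" could work is a simple
worst-state roll-up, sketched below. This is only an illustration: the
unit names are invented, the severity ordering (CRITICAL worse than
UNKNOWN worse than WARNING) is my assumption, and how the result is
actually fed to the central Nagios is left out entirely.

```python
# Usual Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# ASSUMED severity ordering for the roll-up; adjust to taste.
SEVERITY = {OK: 0, WARNING: 1, UNKNOWN: 2, CRITICAL: 3}


def unit_status(states):
    """Collapse one unit's check states into its single worst state.
    A unit with no checks is reported as OK."""
    return max(states, key=SEVERITY.__getitem__, default=OK)


def central_overview(units):
    """Map each business unit name to its rolled-up status."""
    return {name: unit_status(states) for name, states in units.items()}


# Hypothetical business units and check results.
units = {
    "billing": [OK, OK, WARNING],
    "trading": [OK, CRITICAL, UNKNOWN],
    "intranet": [],
}
overview = central_overview(units)

# 120,000 objects at ~1000 per install gives the "hundred-odd" figure.
installs = 120_000 // 1_000
```

The central overview then carries one object per business unit rather
than thousands of individual checks.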

As far as Xen or VMware is concerned, do whatever you need in order to
make your monitoring as available as possible; if the monitoring goes
down, so does your knowledge of your own operations. A Xen guest can
be migrated between hardware platforms without significant downtime
IIRC; I expect that VMware ESX can do the same. This allows you to keep
the monitors running while doing essential maintenance on their
hardware; essential unless you have big iron (by your description you
don't).

-jim

_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
