[Nagios-users] Large scale network monitoring limits with nagios
Jason Lancaster
jason at teklabs.net
Thu Mar 11 16:29:44 CET 2004
Noah Leaman wrote:
> Using the concept of one service per up/down trap for each network
> interface, I tested a little by creating a very simple set of nagios
> configs, but with about 8000 PASSIVE service checks and no active
> service checks. Of course there was no problem in terms of scheduling
> issues, but the CGIs all slowed to a snail's pace. In my setup (Nagios
> 1.2, dual G4 first-gen Xserve) it takes about 30 seconds to display the
> Status Summary page.
>
> ... So 9236 services all together but this is really just a small
> subset of what I would like to be able to do. The plan is to throw
> hardware at it to spread out the real work being done (i.e. the active
> checks).
>
> But with just this setup, a single CGI takes up an entire CPU to run,
> often for a few minutes at a time... and the plan was to have a
> good handful of GUI users (5-ish at a time)... it's just about
> unusable with one GUI user.
I'm using a distributed environment of 4 servers to monitor 6200
services, so I'm not displaying quite as much as you, but I am close. My
designated central server that runs the CGIs is a dual AMD 2200 with
3 GB of RAM. I am not using 1.2; I am using 1.1 with a CGI patch
submitted to the devel list by David Parrish. Viewing the CGIs as an admin
user who has access to all services/hosts causes no problems for me. I
have not tested 1.2 because 1.1 works quite well for me and I have not
wanted any headaches.
The only complaint I have about the CGIs after the patch is that they
take up between 20% and 50% of a CPU every time someone loads them. If too
many people in the company are browsing around, things can get really
slow. I used to cache some of the pages every few minutes, but I just
didn't like the idea of caching the data.
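
For anyone who does want to try the caching approach, here is a minimal sketch in Python of a time-based page cache. It is not what I actually ran; the PageCache class, the render callable, and the 180-second lifetime are all made up for illustration. It also shows the trade-off: readers can see data up to ttl seconds old.

```python
import time
from typing import Callable, Dict, Tuple


class PageCache:
    """Serve a stored copy of a page until it is older than ttl seconds."""

    def __init__(self, render: Callable[[str], str], ttl: float = 180.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        # render is a hypothetical callable that produces the page body,
        # e.g. by running the CGI and capturing its output.
        self._render = render
        self._ttl = ttl
        self._clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, page: str) -> str:
        now = self._clock()
        hit = self._store.get(page)
        if hit is not None and now - hit[0] < self._ttl:
            # Fresh enough: skip the expensive render entirely.
            return hit[1]
        body = self._render(page)
        self._store[page] = (now, body)
        return body
```

With something like this in front of the CGIs, the expensive render runs at most once per ttl per page, however many people are browsing.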
> How to monitor traps for hundreds of network hosts and tens of
> thousands different interfaces each of which could generate up/down
> traps along with other traps. I tried setting up a single "catch-all"
> trap service per host, but notification would need to occur when going
> from an OK to another OK (with a different output). Shouldn't this
> work with is_volatile on and stalking_options set to o,w,u,c? (Every
> test I've done to get this working from OK to OK has failed... but
> maybe I missed something.)
Mmmm, this is definitely a users question. Personally, I do not use the
volatile option because we rely entirely on web interfaces (no email
notifications) to let us know what is going on. I have a "trap server"
running a "snmptrapd log watcher" program which watches the snmptrapd
log for events. If a failure on a device triggers a trap with an OID that
is recognized, it flags the service as critical until someone
acknowledges it in the web interface.
Lots of people have other ways of accomplishing this.
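
To give a rough idea of the log watcher approach, here is a minimal sketch in Python. It is not my actual program. The log line layout, the KNOWN_OIDS table, and the service name are assumptions made up for illustration, and your snmptrapd output format will almost certainly differ. The one piece taken from Nagios itself is the PROCESS_SERVICE_CHECK_RESULT external command, written to the Nagios command file with return code 2 for CRITICAL. The sketch only submits the critical result; it does not handle the acknowledgement side.

```python
import re
import time
from typing import Dict, Optional

# Hypothetical table: recognized trap OID -> Nagios service description.
KNOWN_OIDS: Dict[str, str] = {
    ".1.3.6.1.6.3.1.1.5.3": "linkDown trap",
}

# Hypothetical log layout: "<date> <time> <host> ... <oid> ...".
LINE_RE = re.compile(r"^\S+ \S+ (?P<host>\S+) .*?(?P<oid>(?:\.\d+){4,})")


def command_for(line: str, now: Optional[int] = None) -> Optional[str]:
    """Return a passive CRITICAL result for a recognized trap, else None."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    service = KNOWN_OIDS.get(m.group("oid"))
    if service is None:
        return None
    stamp = int(time.time()) if now is None else now
    output = "Trap %s received" % m.group("oid")
    # Format: [time] PROCESS_SERVICE_CHECK_RESULT;host;service;code;output
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;2;%s\n" % (
        stamp, m.group("host"), service, output)


def watch(log_path: str, command_file: str, poll: float = 1.0) -> None:
    """Follow the log like 'tail -f' and submit a result for each match."""
    with open(log_path) as log:
        log.seek(0, 2)  # start at the end; only new events matter
        while True:
            line = log.readline()
            if not line:
                time.sleep(poll)
                continue
            cmd = command_for(line)
            if cmd is not None:
                # command_file is the Nagios external command pipe; its
                # location depends on your install (an assumption here).
                with open(command_file, "w") as pipe:
                    pipe.write(cmd)
```

The host and service named in the command have to exist in the Nagios config as a service that accepts passive checks, and external commands must be enabled, or the result is ignored.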
> So the higher-level question here is: am I in over my head in what I
> want to do with Nagios, or in how I am doing it? After tackling the network monitoring
> needs, the plan was to then start the server monitoring (around 1000
> servers of many platforms).
If I ever migrate to 1.2, I'll be sure to let the list know if I have
CGI slowness.
Jason