Feedback on Nagios
Marcus Vogt
mgvogt at bigpond.com.au
Sat Dec 13 13:55:11 CET 2003
Hi guys,
Just a bit of feedback on Nagios and the issues we have and are running
into with it.
The issues I raise may be because I do not understand Nagios or its
plugins sufficiently
or I am trying to use it in a way it is not intended. I am more than
happy with any
constructive feedback or criticism on this.
First of all, thanks for a really good product. It does a pretty darn
good job of presenting the status of a service-type network.
We previously used HP OpenView Network Node Manager (NNM) exclusively,
however
it is designed specifically for monitoring network type devices and has
no real concept of
services and service dependencies.
We are in the initial phases of trying Nagios out (about 1-2 months).
It is important to note
that I have not tinkered with the internals other than a small change
for permissions. I have
been focused on getting it to monitor and report on things.
We continue to use NNM for network discovery and also network display
(drill down and
containers) as this is an area where Nagios is weak in comparison.
** Nagios has poor network discovery facilities - particularly Layer 2.
** Nagios does not have a network "drill down" type map à la NNM.
** Nagios cannot handle really large layouts - it does not fully draw
the map. I have not investigated this issue yet though.
** I have not yet found a way to have Nagios understand complex network
dependencies. We have a large number of redundant paths, and we cannot
draw these correctly. This is very probably a lack of understanding on
my part though.
I know that there is nmap discovery, however it does not give you the
layer 2 type dependency
information that is important for correctly configuring dependencies.
Additionally this does
not discover SNMP variables for CPU, Memory, Storage, or Networking.
Now I understand that there are 1001 different SNMP MIBs out there,
but there are some rather obvious ones that hold a fair amount of
market share:
Cisco and HP for Networking equipment.
HOST MIBs (Covers MS Win2k, NetSNMP and others)
IF MIBs for network interfaces (IP & Layer 2) and routing.
** Nagios has poor SNMP discovery of common services on MIBs. Is this
really a problem though?
Perhaps this is the responsibility of individual deployments.
Now this isn't a major problem for us - I wrote discovery scripts in
Perl that, given a list of hosts, will interrogate their SNMP services
and provide all the goodies - services, service dependencies, service
extended information (discussion later) - but that led us to the next
problem.
Currently, with the first sweep of discovery (excluding networking type
queries), we ended up with around 300 hosts and 1500 services, with
services checked every 5 minutes. This absolutely hammered the CPU of
the box it was on (Sun E250, dual CPU and 2 GB memory). This was okay:
we used the embedded Perl option and this got us to just under 100%
utilisation. Yes, this is an issue with the plugins and I'll discuss
this later.
One of the problems with the embedded Perl is that it has a rather
large memory leak. It has to be reset every four to five days as it
creeps up to 300-400 MB of RAM. I know this is being addressed in the
next version, but I do point it out. N.B. We use caching as well.
** Nagios embedded Perl (with caching) leaks memory a lot - a
workaround is in the next version.
To compound this we use performance monitoring to feed data into RRD
tool for further processing.
We use RRD (RRDcgi is really neat) to provide historical trends and also
handle non-gauge type collections
such as counters. Admittedly we run this at maximum nice levels to
ensure it does not impact primary
data collection work.
** Nagios does not natively deal with counters - not really a Nagios
problem, just an observation. i.e. write your
own plugins (we have).
** Nagios does not natively collect data that can be graphically
displayed "out of the box" - again not really a problem
just an observation. Everyone can roll their own, but it would be
nice if something was provided.
(I can provide my simple prototype Perl scripts if you like, but I
think Perl is a bit of a problem as it is not going to scale well.)
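As a rough illustration of the counter observation above, this is the core of what a counter-aware plugin has to do: turn two raw counter samples into a per-second rate, allowing for the counter wrapping between polls. It is a sketch only, not the Perl prototype mentioned above; the function name, the 32-bit default and the sample values are assumptions made for the example.

```python
def counter_rate(prev, curr, interval_s, bits=32):
    """Per-second rate between two samples of an ever-increasing counter.

    Hypothetical helper: assumes the counter wrapped at most once
    between the two samples and was not reset by a reboot.
    """
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    modulus = 1 << bits
    # Modular subtraction gives the right delta even after a single wrap.
    delta = (curr - prev) % modulus
    return delta / interval_s


# Two polls taken 300 seconds apart, no wrap.
normal = counter_rate(1_000_000, 1_600_000, 300)

# The 32-bit counter wrapped past 2**32 between the two polls.
wrapped = counter_rate(4_294_967_000, 704, 300)

print(normal, wrapped)
```

A gauge (temperature, free disk) can be reported as read; a counter needs the previous sample kept somewhere between runs, which is part of why this tends to end up in a separate collector or in RRD.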
Now I mentioned that I wrote my own discovery scripts for SNMP. These
are targeted at HOST & vendor MIBs for Win2K and HOST MIBs for Unix
hosts. This works very nicely, giving us the 1500-odd services. The
problem came when I went to deploy the network discovery tools that
monitor interfaces via the IF MIBs. I discovered in excess of 7000
services on network devices alone. Given that I monitor %utilisation,
%errors, %drops, and one other for both in and out, this gives you an
idea of the number of interfaces.
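To put rough numbers on that, here is the arithmetic spelled out. It assumes the "one other" metric is also collected in both directions, and that the network services would be checked on the same 5-minute cycle as the host services; both are assumptions, not measured figures.

```python
METRICS = 4           # %utilisation, %errors, %drops, and one other
DIRECTIONS = 2        # in and out
network_services = 7000   # "in excess of", so this is a lower bound
host_services = 1500
check_interval_s = 5 * 60

services_per_interface = METRICS * DIRECTIONS
interfaces = network_services // services_per_interface

# Sustained check rate if everything runs as an active check.
checks_per_second = (host_services + network_services) / check_interval_s

print(services_per_interface, interfaces, round(checks_per_second, 1))
```

So roughly 875 interfaces or more, and somewhere around 28 plugin executions per second if every service is an active check.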
Suffice to say, this caused the box (already heavily loaded) to have
kittens. Things ran very, very slowly. The thing I found really
interesting was that, because each service pretty much had its own
dependency back to SNMP, running the Nagios config check (with nothing
else running) would take 40+ seconds. This is the real kicker. This
means that Nagios will not scale well to even medium sites (I think we
fall under small/medium).
This is a real concern as NNM can do this without even breaking a sweat
- admittedly it does not have the dependency type
information included.
** Nagios reading of configuration files appears to be expensive.
Given that each CGI reads the config file every time it runs (refreshes)
it means that there is this huge delay - to the point
where stupid IE will show a previously cached page because it timed out.
I have seen patches to improve performance on this (I have not yet
implemented or tested them) and I think this is improved in the next
version.
** Nagios CGIs re-read configs on every execution - this leads to poor
scaling.
If the interface CGI could be run as some sort of daemon along with
Nagios, it could drastically improve performance by removing
that need to re-read config on every connection. Otherwise, this will
not be able to scale well to a larger number of end
users.
** I really like the interface - particularly how you can do customised
views per user.
This is a really big benefit of Nagios - gets over information
overload when people
only need to see one thing. A good example of this for us is
facilities management -
they do not need to see all the details, but they do want to see
any environmental
information (temp, humidity, voltage, etc.) from any device.
The joy of Plugins.
I think the plugin concept, in conjunction with passive monitoring,
makes Nagios a really powerful tool.
I have prototyped all my plugins and discovery tools in Perl. The
reason is that I am comfortable with it and find it a really good tool
to knock up quick prototypes with. I usually then write this up in C
after I am happy with the workings of and lessons from the prototype.
One of the reasons my plugins are slowish (aside from the fact that
they are written in Perl) is that I do SNMP gets based on labels. This
means that I will ask for disk utilisation on the filesystem /var or
C:\ etc.
This is particularly important to me as this may have different instance
numbers depending on what
machine you are on. Plus you don't want to have N service definitions
for just one type of collection.
I'll fix this by having caching of instance numbers in the production
version.
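The instance-number caching idea can be sketched as follows. This is an illustration only, not the planned production code: `InstanceCache` and `walk_labels` are hypothetical names, and the walk itself (for example over hrStorageDescr in the HOST MIB) is left to a callable supplied by the caller.

```python
class InstanceCache:
    """Cache of (host, label) -> SNMP table instance number.

    `walk_labels` is a hypothetical callable: given a host it returns
    a dict {label: instance}, e.g. built from a walk of the storage
    description column.
    """

    def __init__(self, walk_labels):
        self._walk_labels = walk_labels
        self._cache = {}

    def instance_for(self, host, label):
        key = (host, label)
        if key not in self._cache:
            # One walk fills in every label for this host at once.
            for lbl, inst in self._walk_labels(host).items():
                self._cache[(host, lbl)] = inst
        return self._cache[key]  # KeyError if the label does not exist

    def invalidate(self, host):
        """Forget a host, e.g. after a reboot renumbers its tables."""
        for key in [k for k in self._cache if k[0] == host]:
            del self._cache[key]


# Stand-in for a real SNMP walk, counting how often it is called.
calls = []

def fake_walk(host):
    calls.append(host)
    return {"/": 4, "/var": 6}

cache = InstanceCache(fake_walk)
var_instance = cache.instance_for("unixbox", "/var")
root_instance = cache.instance_for("unixbox", "/")
print(var_instance, root_instance, len(calls))
```

The service definition keeps using the label; only the first check on a host pays for the walk, and later checks can go straight to a single get on the cached instance.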
Because we also have varied SNMP communities all over the place - don't
you love security? - I also
have to handle this on the fly as well to again limit the number of
service definitions.
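Handling the community string on the fly could look something like the sketch below: try a short list of candidates once per host and remember the one that answered. The function and the `probe` callable are hypothetical; in practice the probe would be a single cheap SNMP get against the host.

```python
def find_community(host, candidates, probe, known):
    """Return the community string that works for `host`, or None.

    `probe(host, community) -> bool` is a hypothetical callable, and
    `known` is a dict used to remember earlier answers per host.
    """
    if host in known:
        return known[host]
    for community in candidates:
        if probe(host, community):
            known[host] = community
            return community
    return None


# Stand-in probe: pretend each host accepts exactly one community.
ACCEPTS = {"sw1": "netops", "srv1": "public"}
attempts = []

def fake_probe(host, community):
    attempts.append((host, community))
    return ACCEPTS.get(host) == community

known = {}
first = find_community("sw1", ["public", "netops"], fake_probe, known)
again = find_community("sw1", ["public", "netops"], fake_probe, known)
print(first, again, len(attempts))
```

This keeps the community out of the service definition, at the cost of a few failed requests the first time a host is seen.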
** I am not sure that active checks scale at all well under Nagios.
I am planning to convert all the active SNMP checks into passive ones
and run a daemon to schedule
and collect the data and then feed it to Nagios. This will resolve the
issue of having lots of processes
being kicked off. I'm looking at snmpp, but it is not quite what I am
after.
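For reference, feeding a passive result to Nagios means writing a PROCESS_SERVICE_CHECK_RESULT line to its external command file. The sketch below only formats and writes such lines; the helper names and the sample host and service are made up, and it assumes external commands and passive service checks are enabled in the Nagios configuration. In real use the file object would be the command file opened for writing; an in-memory buffer stands in for it here.

```python
import io
import time

STATES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


def passive_result(host, service, state, output, now=None):
    """Format one passive service check result as an external command."""
    ts = int(time.time() if now is None else now)
    text = output.replace("\n", " ")  # a command must stay on one line
    return (f"[{ts}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{host};{service};{STATES[state]};{text}\n")


def submit(cmd_file, results):
    """Write a batch of (host, service, state, output) tuples."""
    for host, service, state, output in results:
        cmd_file.write(passive_result(host, service, state, output))
    cmd_file.flush()


# Hypothetical results gathered by one pass of a collector daemon.
buf = io.StringIO()
submit(buf, [
    ("sw1", "Gi0/1 util in", "OK", "12% utilisation"),
    ("sw1", "Gi0/1 errors in", "WARNING", "0.4% errors"),
])
print(buf.getvalue(), end="")
```

One long-running collector writing lines like these replaces one plugin process per service per interval, which is where the process-spawning cost goes away.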
I'll probably use active checks as a backup when the passive fails, as
a confirmation - though this has inherent scaling risks also. I'd have
to check how Nagios handles things like dependencies and the like
before I comment sensibly on this one.
Anyhow, in all this rambling I'm trying to say I think it is a mighty
fine product with a couple of things
that could be improved to meet our needs (possibly others as well).
I'll be working on fixing the things
I see as issues for us and I'll see if I can get permission to release
those back to the community. Admittedly
all I have at the moment are poorly performing Perl prototypes :)
Cheers,
Marcus.