Load issues with Nagios

Stanley Hopcroft Stanley.Hopcroft at IPAustralia.Gov.AU
Tue Jan 28 07:12:47 CET 2003


Dear Sir,

I am writing to thank you for your letter and to say that your load does 
sound too high, and I can't understand why.

This host

Dec  7 21:47:35 tsitc /kernel: Timecounter "i8254"  frequency 1193182 Hz
Dec  7 21:47:35 tsitc /kernel: CPU: Pentium III/Pentium III Xeon/Celeron (851.94-MHz 686-class CPU)
Dec  7 21:47:35 tsitc /kernel: Origin = "GenuineIntel"  Id = 0x686  Stepping = 6

(a Dell 350 Celeron I think)

runs 334 active service checks on 192 hosts (+ a few passives via 
snmptrapd) and the load is

tsitc> uptime
 4:54PM  up 51 days, 19:08, 4 users, load averages: 0.08, 0.12, 0.14

no more than an average of 20% CPU utilisation.

This host also hosts Apache (for Nagios) and a few other moderately used 
applications.

Unfortunately, I cannot offer any explanation for why the CPU load is
so high, other than that your machine may be being forked to death because 
of your relatively low check interval. But you are getting check results 
sent by NSCA, so your Nagios should only need to check the command queue 
every two minutes and process it.

The minimum check interval here is 5 minutes (1 minute for soft states).
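
To put rough numbers on the "forked to death" idea, here is a minimal Python sketch comparing how many active checks each host starts per minute. The counts come from this thread (334 active checks here, about 50 on your side). The 2-minute interval for your active checks is an assumption, since your message only gives that figure for the passive services, and the helper name checks_per_minute is purely illustrative. It also assumes checks are spread evenly over the interval.

```python
# Rough check-rate comparison. Figures are from this thread unless marked
# as an assumption; checks_per_minute is a hypothetical helper.

def checks_per_minute(active_checks: int, interval_minutes: float) -> float:
    """Average number of active checks started per minute,
    assuming checks are spread evenly across the interval."""
    if interval_minutes <= 0:
        raise ValueError("interval_minutes must be positive")
    return active_checks / interval_minutes

# This host: 334 active service checks at the 5-minute minimum interval.
here = checks_per_minute(334, 5)

# Ken's host: about 50 active checks. The 2-minute interval is an
# ASSUMPTION; the original message states it only for passive services.
there = checks_per_minute(50, 2)

print(f"this host:  {here:.1f} active checks/min")
print(f"Ken's host: {there:.1f} active checks/min")
```

If that assumption is anywhere near right, your host starts fewer active checks per minute than this one does, so the check rate alone would not seem to account for 100% CPU.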



On Mon, Jan 27, 2003 at 09:13:06PM -0500, Ken Snider wrote:
> Greetings all,
> 
> We have Nagios running, about 50 servers, 250 processes monitored. The vast 
> majority of these checks are via nsca/send_nsca, with the only active checks 
> being host-checks, and ping checks. nsca is running via Xinetd, 
> command_check_interval is -1, reaper frequency is 5, and everything works 
> reasonably well. Services are configured to check in every 2 minutes.
> 
> My issue is, the box is *always* at 100% CPU. At first, I figured this was 
> related to the 50 or so send_nsca connections (and subsequent dumps to 
> Nagios' pipe) that occur every two minutes via nsca. However these are dealt 
> with within 20 or so seconds, leaving about a minute and a half where all 
> Nagios is really doing is pinging.. yet the CPU usage remains.
> 
> My next thought was the command_check_interval being -1. setting it to 1 had 
> no difference.
> 
> I tried raising the reaper interval to 10 seconds as well, no difference. I 
> lowered the max_concurrent_checks to 40, no change. consistently 100% use.
> 
> Set the status_update interval to 30 seconds, to make sure it wasn't the 
> writes to the status file, nothing (yep, aggregate writes is on).
> 
> The system, BTW, is RH8, PIII-750, with 256 MB RAM and a Gig of Swap. The 
> box is using a whopping 60MB of ram. The box almost never uses IO (save the 
> mad rush every two minutes from NSCA), and has nothing else of import (or 
> load) running. Load is all from the parent nagios process.
> 
> So, here's my question.. is this load, perhaps, *normal*? Is a P-III 750 
> really, truly maxed out with about 50 active services (ping mainly), and 300 
> passive services running?
> 
> I personally find that difficult to believe, but I like to hear everyone's 
> thoughts on the subject. ;)
> 
> Thanks in advance. :)
> 
> -- 
> Ken Snider
> Senior Systems Administrator
> Datawire Communication Networks Inc.

Good luck,

Yours sincerely.

-- 
------------------------------------------------------------------------
Stanley Hopcroft
------------------------------------------------------------------------

'...No man is an island, entire of itself; every man is a piece of the
continent, a part of the main. If a clod be washed away by the sea,
Europe is the less, as well as if a promontory were, as well as if a
manor of thy friend's or of thine own were. Any man's death diminishes
me, because I am involved in mankind; and therefore never send to know
for whom the bell tolls; it tolls for thee...'

from Meditation 17, J Donne.

