Problems with many hanging Nagios processes (Nagios spawning rogue nagios processes eventually crashing Nagios server)
Andreas Ericsson
ae at op5.se
Mon Nov 28 17:09:58 CET 2005
linux-system-technik at de.man-mn.com wrote:
> Hi everybody,
>
> unfortunately nobody answered Alex from viveconsulting.co.nz, who had a
> problem with "Nagios spawning rogue ..." and mailed the Nagios mailing list
> some months ago.
A link to the mail archives would be helpful.
> Right now we very likely have the same problem he
> described in great detail. I have also tried a lot of different things
> (from configuration changes to tuning) to find the real problem,
> and I guess the real bottleneck is the pipe used for communication between
> Nagios processes.
Most likely. It's the only real bottleneck in nagios today, so...
> But I have not found many reports (e.g. emails) about this
> problem on the web or in the mail archives.
>
> So why am I writing to the list? Maybe someone can give me a hint on how to
> solve or work around this problem. We have 677 services configured and use 350
> RRDs. Our Nagios CMS is a PIII 866 MHz with SCSI RAID 5. The system load is
> a little more than 1.00. As long as we stay below 1.00 there is no problem,
> but otherwise ... (Detailed problem description in Alex's mail)
>
CMS? Content Management System?
Anyways, 677 services shouldn't be a problem.
> This is just our start with Nagios. We want to configure thousands of
> services and more than a hundred hosts. We would also invest in faster
> hardware (dual CPU, 2GB memory and faster SCSI HDDs), but is faster hardware
> an option?
It helps, but not very much I'm afraid. The bottleneck requires a kernel
recompile to be solved on most systems, and that's a very bad thing to
do just to fix this particular problem.
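
To make the pipe limit concrete, here is a minimal sketch (not Nagios code) that measures how many bytes a fresh pipe accepts before a writer would have to wait for the reader. The function name pipe_capacity and the chunk size are made up for illustration, and the sketch assumes that the limit in question is the kernel's pipe buffer size; the thread does not spell that out.

```python
import os


def pipe_capacity(chunk_size=1024):
    """Return how many bytes fit in a fresh pipe before a write would block.

    Hypothetical helper for illustration only; requires a Unix-like system.
    """
    read_fd, write_fd = os.pipe()
    # Non-blocking mode makes a full pipe raise an error instead of stalling.
    os.set_blocking(write_fd, False)
    chunk = b"x" * chunk_size
    total = 0
    try:
        while True:
            try:
                total += os.write(write_fd, chunk)
            except BlockingIOError:
                # The buffer is full. A blocking writer would hang here
                # until the reading side drains some data.
                break
    finally:
        os.close(read_fd)
        os.close(write_fd)
    return total


if __name__ == "__main__":
    print("pipe accepted", pipe_capacity(), "bytes before filling up")
```

The number printed depends on the kernel. Once that many bytes of results are waiting unread, every further writer stalls until the reader catches up, which is the kind of pile-up described above.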
> Looking at this issue with the focus on implementation: If the
> pipe is the bottleneck, it will stay a bottleneck on faster hardware too.
> But maybe faster hardware will allow us to configure 3000 services, which
> would be enough for the Nagios instance. And then, we deploy another Nagios
> instance ...
>
This is definitely a solution. Otherwise you could keep your eyes open
in the somewhat near future for a mail with
[PATCH] checks: Multiplex running checks.
in the subject. I'm working on it right now, but perhaps Ethan won't let
it in for the 2.x branch since it's a fairly massive change.
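
The patch itself is not part of this thread, so the following is only a rough illustration of what multiplexing running checks could mean in general: one loop reads the output of several running commands as it arrives, instead of dedicating a helper process to each one. It uses Python's selectors module rather than anything from Nagios, the name run_checks and the dictionary layout are hypothetical, and it assumes a Unix-like system.

```python
import os
import selectors
import subprocess
import sys


def run_checks(commands):
    """Run all commands at once; return {name: (exit_code, stdout_text)}.

    Hypothetical helper for illustration; `commands` maps a name to an argv list.
    """
    sel = selectors.DefaultSelector()
    procs = {}
    output = {}
    for name, argv in commands.items():
        proc = subprocess.Popen(argv, stdout=subprocess.PIPE)
        procs[name] = proc
        output[name] = b""
        sel.register(proc.stdout, selectors.EVENT_READ, data=name)

    # A single loop services every pipe that has data ready.
    while sel.get_map():
        for key, _events in sel.select():
            chunk = os.read(key.fileobj.fileno(), 4096)
            if chunk:
                output[key.data] += chunk
            else:
                # EOF: this check has closed its stdout.
                sel.unregister(key.fileobj)
                key.fileobj.close()
    sel.close()

    return {
        name: (procs[name].wait(), output[name].decode().strip())
        for name in commands
    }


if __name__ == "__main__":
    results = run_checks({
        "ok": [sys.executable, "-c", "print('OK - all good')"],
        "crit": [sys.executable, "-c",
                 "import sys; print('CRITICAL - down'); sys.exit(2)"],
    })
    for name, (code, text) in results.items():
        print(name, code, text)
```

The exit codes 0 and 2 in the usage example follow the usual Nagios plugin convention for OK and CRITICAL.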
--
Andreas Ericsson andreas.ericsson at op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231