Problems with many hanging Nagios processes (Nagios spawning rogue nagios processes eventually crashing Nagios server)

Andreas Ericsson ae at op5.se
Tue Dec 19 10:08:46 CET 2006


Mahesh Kunjal wrote:
> 
> We had a similar issue. We have a distributed environment with one master and 4 slaves. In total we monitor 1900+ hosts and
> 20000+ services spread across the 4 slaves.
> 
> At times we saw 14K or more results sent from the slaves in a single second. This resulted in 100+ nagios processes being created.
> 
> Changed reaper frequency to 2 seconds and played with all tunables.
> Nothing seemed to help.
> 
> Looking at the nagios source,
> This is what I found out was happening...
> 
> Nagios has a command file worker thread. When it wakes up, it checks whether there is data in the pipe (nagios.cmd) and, if there is, forks a child process. It does this in a loop, checking the pipe for data each time.
> 
> Now what does the forked nagios child process do?
> It reads all the data from the pipe one message at a time and puts it in the command buffer. If it is able to write to the buffer, it just exits.
> 
> The problem here was that the command buffer had a limited size of 1024. This is the default setting in include/nagios.h.in, in the line #define COMMAND_BUFFER_SLOTS 1024.

This is the number of buffers that will be available for writing into, 
not the number of total bytes available. Each command buffer slot holds 
MAX_INPUT_BUFFER bytes.
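
To make the slots-versus-bytes distinction concrete, here is a small sketch of the arithmetic. The value used for MAX_INPUT_BUFFER is an assumption for illustration only (check include/nagios.h.in in your own tree), and buffer_capacity_bytes() is a hypothetical helper, not a Nagios function. It gives an upper bound on what the buffer can hold when every slot is full.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative values only; verify both against include/nagios.h.in. */
#define COMMAND_BUFFER_SLOTS 1024  /* default quoted in this thread */
#define MAX_INPUT_BUFFER     1024  /* ASSUMPTION: bytes held by one slot */

/* Hypothetical helper: upper bound, in bytes, on what the buffer can
 * hold when every slot is in use. */
static size_t buffer_capacity_bytes(size_t slots, size_t bytes_per_slot)
{
    /* Guard against size_t overflow in the multiplication. */
    assert(bytes_per_slot == 0 || slots <= (size_t)-1 / bytes_per_slot);
    return slots * bytes_per_slot;
}
```

Under that assumption, buffer_capacity_bytes(COMMAND_BUFFER_SLOTS, MAX_INPUT_BUFFER) works out to 1 MiB for the default, while 60000 slots would allow up to roughly 61 MB of queued commands.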

> 
> This was not enough, and the child process started to wait for memory to be freed so that the data retrieved from the pipe could be put in the buffer.
> 
> While this child process waited for memory to be freed, the command worker thread woke up again, saw that there was data in the pipe and forked another child. This repeated, and eventually the server ran out of memory.
> 

A very concise and correct description of what's going on. Thanks.
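
For anyone who wants to see the effect in isolation, the following is a toy model of a fixed-size slot buffer, not the actual Nagios code. All names (struct ring, ring_try_push, burst_overflow) are hypothetical, and the model assumes no reader drains the buffer while the burst arrives, which is the worst case.

```c
#include <assert.h>
#include <stddef.h>

#define SLOTS 1024  /* mirrors the default COMMAND_BUFFER_SLOTS */

/* Hypothetical ring buffer that only tracks occupancy; it stores no data. */
struct ring {
    size_t head;   /* index of the next slot to write */
    size_t items;  /* number of slots currently in use */
};

/* Returns 1 if a slot was free, 0 if the writer would have to wait. */
static int ring_try_push(struct ring *r)
{
    assert(r != NULL);
    if (r->items == SLOTS)
        return 0;
    r->head = (r->head + 1) % SLOTS;
    r->items++;
    return 1;
}

/* Push a burst of messages with no reader running and return how many
 * could not be queued, i.e. how many writes would block. */
static size_t burst_overflow(size_t burst)
{
    struct ring r = {0, 0};
    size_t blocked = 0;

    for (size_t i = 0; i < burst; i++) {
        if (!ring_try_push(&r))
            blocked++;
    }
    return blocked;
}
```

With the burst of about 14000 results mentioned above, burst_overflow(14000) leaves 12976 messages with nowhere to go, which is the point at which a writer has to wait and further children get forked behind it.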

> Here is what we did to resolve.
> 
> 1. Edit the include/nagios.h.in
> change
> #define COMMAND_BUFFER_SLOTS 1024
> to
> #define COMMAND_BUFFER_SLOTS 60000
> 
> And change
> #define SERVICE_BUFFER_SLOTS 1024
> to
> #define SERVICE_BUFFER_SLOTS 60000
> 

This would indeed solve the problem, although you could have gotten away 
with as many SERVICE_BUFFER_SLOTS as there are services configured on 
the system, and as many COMMAND_BUFFER_SLOTS as there are hosts and 
services combined. That assumes the slaves also send passive host checks, 
of course; otherwise you can set it to the number of services instead.
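
As a sketch of that sizing rule, with hypothetical names (struct buffer_sizes and suggest_slots are not part of Nagios) and the poster's rounded figures of 1900 hosts and 20000 services as example input:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical container for the two compile-time settings. */
struct buffer_sizes {
    size_t command_slots;  /* candidate COMMAND_BUFFER_SLOTS */
    size_t service_slots;  /* candidate SERVICE_BUFFER_SLOTS */
};

/* One service slot per configured service. One command slot per service,
 * plus one per host when the slaves also submit passive host checks. */
static struct buffer_sizes suggest_slots(size_t hosts, size_t services,
                                         int passive_host_checks)
{
    struct buffer_sizes s;

    /* Guard against size_t overflow in the addition below. */
    assert(hosts <= (size_t)-1 - services);

    s.service_slots = services;
    s.command_slots = passive_host_checks ? hosts + services : services;
    return s;
}
```

For the setup described in this thread, suggest_slots(1900, 20000, 1) gives 21900 command slots and 20000 service slots, well under the 60000 used above. Since the real counts were "1900+" and "20000+", round up to the actual configured totals.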

It should also be noted that these settings shouldn't be modified unless 
needed, as raising them will make Nagios use quite a bit more memory by default.

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231
