Problems with many hanging Nagios processes (Nagios spawning rogue nagios processes eventually crashing Nagios server)
Ethan Galstad
nagios at nagios.org
Thu Dec 21 17:56:49 CET 2006
Good work on nailing down the problem to the command buffer slots!
Sounds like this problem might affect a number of users, so I think we
need to patch Nagios. There are two possible solutions:
1. Bump up the default buffer slots to something larger. Since Nagios
only immediately allocates memory for pointers, the additional memory
overhead is fairly small. Allocated memory = (sizeof(char **)) * (# of
slots).
2. Move the slot definitions out to config file variables. This is
a better solution than having to edit the code and recompile.
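
To put a number on the overhead in option 1, here is a minimal sketch of the formula above. The helper name slot_array_bytes is made up for illustration and is not part of Nagios; it only covers the pointer array allocated up front, not the buffered items themselves.

```c
#include <stddef.h>

/* Hypothetical helper (not a Nagios function): bytes allocated up front
 * for the slot pointer array, i.e. sizeof(char **) * (# of slots). */
size_t slot_array_bytes(size_t slots)
{
    return sizeof(char **) * slots;
}
```

Assuming 8-byte pointers, slot_array_bytes(1024) comes to 8192 bytes (8 KiB) and slot_array_bytes(60000) to 480000 bytes (about 469 KiB); with 4-byte pointers both figures halve.
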
Thoughts?
Ton Voon wrote:
> Hi Mahesh,
>
> On 19 Dec 2006, at 00:42, Mahesh Kunjal wrote:
>
>> Here is what we did to resolve.
>>
>> 1. Edit the include/nagios.h.in
>> change
>> #define COMMAND_BUFFER_SLOTS 1024
>> to
>> #define COMMAND_BUFFER_SLOTS 60000
>>
>> And change
>> #define SERVICE_BUFFER_SLOTS 1024
>> to
>> #define SERVICE_BUFFER_SLOTS 60000
>>
>
> I was intrigued by this as we have a performance issue, but not with the
> same symptoms. Our problem is that NSCA processes increase when the
> nagios server is under load. They appear to be blocking on writing to
> the command pipe. Switching NSCA to single daemon mitigates the problem
> (slaves will timeout their passive results), but we wanted to know where
> any slow downs could be.
>
> From your findings, we've created a performance statistics patch, attached.
> It collects the maximum and current values for the command and service
> buffer slots and writes them to status.dat (by default every 10
> seconds). What I found with a fake slave sending 128 results every 5
> seconds was that the maximum values were fairly low (under 100), but
> when I put the server under load, the maximum_command_buffer_items shot
> up to 1969 and the maximum_service_buffer_items shot up to 2156 (had
> changed from defaults to your 60000).
>
> This could show if the buffer is filled at various points or if there is
> not enough data ready for Nagios to process further down the chain.
>
> I'd be interested in figures from other systems.
>
> Warning: the patch is not thread-safe, so there are no guarantees that
> the statistic data will not be corrupted (but should not affect usual
> Nagios operation). Applies onto Nagios 2.5. Tested on Debian with 2.6
> kernel.
>
> Ton
>
> http://www.altinity.com
> T: +44 (0)870 787 9243
> F: +44 (0)845 280 1725
> Skype: tonvoon
>
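
For anyone who wants to see the idea behind the statistics Ton describes (current and maximum buffer items), here is a rough sketch of a fixed-slot ring buffer that keeps a high-water mark. This is not the attached patch and not the Nagios source: the type demo_buffer, the functions demo_push and demo_pop, and the size DEMO_SLOTS are all made up for illustration. Like the patch, it does no locking, so a multi-threaded version would need its own synchronization.

```c
#include <stddef.h>

#define DEMO_SLOTS 8  /* hypothetical size; the Nagios default quoted above is 1024 */

/* Hypothetical ring buffer with usage counters (illustration only). */
typedef struct {
    void  *slot[DEMO_SLOTS];
    size_t head;       /* index of the next slot to write */
    size_t tail;       /* index of the next slot to read */
    size_t items;      /* items currently queued */
    size_t max_items;  /* highest value items has reached so far */
} demo_buffer;

/* Queue an item. Returns 0 on success, -1 if every slot is in use. */
int demo_push(demo_buffer *b, void *item)
{
    if (b->items == DEMO_SLOTS)
        return -1;  /* buffer full: the writer would have to wait or drop */
    b->slot[b->head] = item;
    b->head = (b->head + 1) % DEMO_SLOTS;
    b->items++;
    if (b->items > b->max_items)
        b->max_items = b->items;  /* record the new high-water mark */
    return 0;
}

/* Dequeue the oldest item, or return NULL if the buffer is empty. */
void *demo_pop(demo_buffer *b)
{
    void *item;

    if (b->items == 0)
        return NULL;
    item = b->slot[b->tail];
    b->tail = (b->tail + 1) % DEMO_SLOTS;
    b->items--;
    return item;
}
```

A max_items value that sits at the slot count suggests the buffer is filling up under load, while a value that stays low suggests the slowdown is elsewhere.
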
Ethan Galstad,
Nagios Developer
---
Email: nagios at nagios.org
Website: http://www.nagios.org