Problems with many hanging Nagios processes (Nagios spawning rogue nagios processes eventually crashing Nagios server)
Mahesh Kunjal
mkunjal at gmail.com
Thu Dec 21 18:14:12 CET 2006
On 12/21/06, Ethan Galstad <nagios at nagios.org> wrote:
> Good work on nailing down the problem to the command buffer slots!
> Sounds like this problem might affect a number of users, so I think we
> need to patch Nagios. There are two possible solutions:
>
> 1. Bump up the default buffer slots to something larger. Since Nagios
> only immediately allocates memory for pointers, the additional memory
> overhead is fairly small. Allocated memory = (sizeof(char **)) * (# of
> slots).
>
Since nagios.h is generated by the configure script, this number could
also be derived at configure time, based on the RAM available.
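
The overhead really is small. As a minimal sketch (hypothetical code,
not the actual Nagios allocation path):

    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        size_t slots = 60000;

        /* A buffer of this style only pre-allocates the pointer
         * array; the message strings themselves are allocated as
         * entries arrive. */
        char **buffer = malloc(sizeof(char *) * slots);
        if (buffer == NULL)
            return 1;

        /* ~480 KB on a 64-bit system, ~240 KB on 32-bit */
        printf("pointer array for %zu slots: %zu bytes\n",
               slots, sizeof(char *) * slots);

        free(buffer);
        return 0;
    }

So even 60000 slots costs well under a megabyte up front.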
> 2. Moving the slot definitions out to config file variables. This is
> a better solution than having to edit the code and recompile.
Yes, that would be the better solution. Could we also have nagiostats
display additional information, such as the command and service buffers
in use (how many slots are occupied), the number of messages pending in
the pipe (nagios.cmd), and the number of messages in the message queue?
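
If the slot definitions do move to the main config file, the variables
might look something like this (hypothetical names, sketch only):

    # proposed nagios.cfg additions
    external_command_buffer_slots=4096
    service_check_buffer_slots=4096

and nagiostats could then report something along these lines (mocked-up
output, not an existing feature):

    Command Buffer Slots (in use / total):   130 / 4096
    Service Buffer Slots (in use / total):    85 / 4096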
>
> Thoughts?
>
>
> Ton Voon wrote:
> > Hi Mahesh,
> >
> > On 19 Dec 2006, at 00:42, Mahesh Kunjal wrote:
> >
> >> Here is what we did to resolve.
> >>
> >> 1. Edit the include/nagios.h.in
> >> change
> >> #define COMMAND_BUFFER_SLOTS 1024
> >> to
> >> #define COMMAND_BUFFER_SLOTS 60000
> >>
> >> And change
> >> #define SERVICE_BUFFER_SLOTS 1024
> >> to
> >> #define SERVICE_BUFFER_SLOTS 60000
> >>
> >
> > I was intrigued by this as we have a performance issue, but not with the
> > same symptoms. Our problem is that the number of NSCA processes increases
> > when the nagios server is under load. They appear to be blocking on
> > writes to the command pipe. Switching NSCA to single-daemon mode
> > mitigates the problem (slaves will time out their passive results), but
> > we wanted to know where any slowdowns could occur.
> >
> > From your findings, we've created a performance statistics patch,
> > attached. It collects the maximum and current values for the command and
> > service buffer slots, which are then written to status.dat (by default
> > every 10 seconds). What I found with a fake slave sending 128 results
> > every 5 seconds was that the maximum values were fairly low (under 100),
> > but when I put the server under load, maximum_command_buffer_items shot
> > up to 1969 and maximum_service_buffer_items shot up to 2156 (I had
> > changed the defaults to your 60000).
> >
> > This could show whether the buffers fill up at various points, or whether
> > there is not enough data ready for Nagios to process further down the chain.
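> >
> > For reference, the values land in status.dat in roughly this shape (a
> > sketch: the maximum_* names are from the patch, the current_* names and
> > figures are illustrative):
> >
> >     maximum_command_buffer_items=1969
> >     current_command_buffer_items=42
> >     maximum_service_buffer_items=2156
> >     current_service_buffer_items=17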
> >
> > I'd be interested in figures from other systems.
> >
> > Warning: the patch is not thread safe, so there are no guarantees that
> > the statistics data will not be corrupted (though this should not affect
> > normal Nagios operation). It applies to Nagios 2.5 and was tested on
> > Debian with a 2.6 kernel.
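> >
> > If anyone wants to make the counters safe, the updates would need a
> > lock along these lines (a sketch with hypothetical names, not part of
> > the attached patch):
> >
> >     #include <pthread.h>
> >
> >     static pthread_mutex_t stats_lock = PTHREAD_MUTEX_INITIALIZER;
> >     static int current_command_buffer_items = 0;
> >     static int maximum_command_buffer_items = 0;
> >
> >     /* Call whenever an item is added to the command buffer. */
> >     void record_command_buffer_add(void) {
> >             pthread_mutex_lock(&stats_lock);
> >             current_command_buffer_items++;
> >             if (current_command_buffer_items > maximum_command_buffer_items)
> >                     maximum_command_buffer_items = current_command_buffer_items;
> >             pthread_mutex_unlock(&stats_lock);
> >     }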
> >
> > Ton
> >
> > http://www.altinity.com
> > T: +44 (0)870 787 9243
> > F: +44 (0)845 280 1725
> > Skype: tonvoon
> >
>
>
> Ethan Galstad,
> Nagios Developer
> ---
> Email: nagios at nagios.org
> Website: http://www.nagios.org
>