Number of Nagios Processes Distributed Monitoring
Mooney, Ryan
ryan.mooney at pnl.gov
Mon Aug 25 18:05:55 CEST 2003
No official fix that I've seen, although I haven't been tracking CVS.
At least one other solution was proposed; this may be a "better" solution, I don't
know (both are a bit ugly IMHO, but hey, can't argue too hard w/ success).
Jay 'Whip' Grizzard [elfchief at lupine.org] explained the problem thusly:
After much investigation, the best conclusion I've been able to draw is that
the process scheduling on our system (RedHat 8.0) is behaving in such a way
that, after the parent process is able to clear out its pipe a bit (thus
freeing some buffer), some processes are always scheduled -last-, giving
other (newer) processes a chance to fill up the pipe's buffer again before
the 'hung' processes get a chance to run again.
Since I didn't feel like rewriting the Linux process scheduler, I instead
opted to increase the size of the pipe buffer in the kernel (there's a
define for PIPE_SIZE that's normally set to PAGE_SIZE -- 4k on x86). I
increased it to (8 * PAGE_SIZE) and rebuilt my kernel, under the theory
that a larger buffer would give processes a much larger chance of being
able to get some data into the buffer before it filled ... and, indeed,
the 'hung' processes seem to have gone away -- After 24 hours, the oldest
nagios subprocesses on the box are, at worst, one minute old.
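
The finite pipe buffer described above can be observed from user space. The following is only a rough sketch, not anything from Nagios itself: the function name measure_pipe_capacity and the 1024-byte chunk size are made up for illustration. It fills a fresh pipe with non-blocking writes until the kernel refuses more data, which is the point at which a blocking writer (like the 'hung' subprocesses) would stall. The number it reports depends on the kernel it runs on.

```python
import os


def measure_pipe_capacity(chunk_size=1024):
    """Return how many bytes fit into an empty pipe before a write would block.

    Hypothetical helper for illustration only. chunk_size is kept small so
    that each write is all-or-nothing on a non-blocking pipe.
    """
    read_fd, write_fd = os.pipe()
    try:
        # Non-blocking mode turns "would block forever" into an exception.
        os.set_blocking(write_fd, False)
        chunk = b"x" * chunk_size
        total = 0
        while True:
            try:
                total += os.write(write_fd, chunk)
            except BlockingIOError:
                # The buffer is full: a blocking writer would hang here
                # until the reader drains some data.
                return total
    finally:
        os.close(read_fd)
        os.close(write_fd)


if __name__ == "__main__":
    print("pipe capacity in bytes:", measure_pipe_capacity())
```

Nothing reads from the pipe in this sketch, so the total is simply the buffer size the kernel grants to one pipe.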
> -----Original Message-----
> From: Mike Benoit [mailto:mikeb at netnation.com]
> Sent: Monday, August 25, 2003 8:59 AM
> To: Mooney, Ryan
> Cc: nagios-users; nagios-users at lists.sourceforge.net
> Subject: RE: [Nagios-users] Number of Nagios Processes Distributed
> Monitoring
>
>
> I'm having the exact same problem with Nagios 1.1. There hasn't been any
> official fix for this released yet, correct? It sure makes using passive
> checks difficult. :(
>
> On Fri, 2003-07-25 at 09:56, Mooney, Ryan wrote:
> > I had a similar problem when doing lots of external checks. The subprocess that
> > gets forked to read the results from the .cmd pipe and then write them to the
> > shared fd to the master process would block (forever) on the write call. I never
> > did figure out why, since the code appeared to be correct. I ended up putting an
> > alarm around the write call and timing it out if it hung too long. I figured that
> > losing a few passive checks was worth not having memory fill up & having the
> > machine die. Based on the behavior I saw, I'm not really convinced that the
> > problem is 100% limited to the passive checks though, as a very similar set of
> > routines is used by the active checks code.
> >
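
The "alarm around the write call" workaround described above can be sketched roughly as follows. This is not the actual Nagios change (which would be in C and is not shown in this thread); it is a minimal Python illustration of the same idea, and the names WriteTimeout and write_with_timeout are invented here. It relies on SIGALRM, so it only works on Unix-like systems and only from the main thread.

```python
import os
import signal


class WriteTimeout(Exception):
    """Raised by the alarm handler when the write takes too long."""


def _on_alarm(signum, frame):
    raise WriteTimeout()


def write_with_timeout(fd, data, timeout_seconds):
    """Write data to fd, giving up if the call blocks longer than the timeout.

    Returns the number of bytes written, or None if the write timed out
    (in which case the data is dropped, as with the lost passive checks).
    """
    previous_handler = signal.signal(signal.SIGALRM, _on_alarm)
    # Arm a one-shot timer; the handler's exception interrupts os.write.
    signal.setitimer(signal.ITIMER_REAL, timeout_seconds)
    try:
        return os.write(fd, data)
    except WriteTimeout:
        return None
    finally:
        # Always disarm the timer and restore the old handler.
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, previous_handler)
```

The trade-off is the one stated above: a result that cannot be written in time is thrown away, in exchange for the writing process being able to exit instead of piling up.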
> > If you compile nagios with debugging (export "CFLAGS=-g";
> > ./configure --whatever-options-you-use; make; make install) and then watch the
> > "ps aux" output, you'll notice that there is one really long-running process that
> > takes a fair bit of CPU (which is the good master), and then over time you'll
> > start seeing some other processes that have a start time a fair bit in the past
> > that never die. If you attach to one of these with a debugger (say
> > "cd /wherever/you/compiled/nagios/; gdb base/nagios [pid]", where [pid] is the
> > process ID of one of the processes with a start time more than 1hr ago that is
> > not the master process) and do a "bt" to get a call trace out of it, that would
> > likely help determine where the processes are getting stuck.
> >
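
Picking the stuck processes out of the ps output, as described above, can be partly automated. The sketch below is an assumption-laden illustration: it assumes a ps that supports the "etimes" (elapsed seconds) column, as procps on Linux does, and the function names and the one-hour threshold are chosen here for the example. It does not know which process is the master, so the master will show up in the list as well and has to be excluded by hand.

```python
import subprocess


def parse_stale_processes(ps_output, name="nagios", min_age_seconds=3600):
    """Return (pid, age_seconds, command) for matching processes older than the threshold.

    Expects lines of the form "<pid> <elapsed seconds> <command line>".
    """
    stale = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        if len(fields) < 3:
            continue
        pid, elapsed, command = fields
        # Skip the header line and anything else that is not numeric.
        if not (pid.isdigit() and elapsed.isdigit()):
            continue
        if name in command and int(elapsed) >= min_age_seconds:
            stale.append((int(pid), int(elapsed), command))
    return stale


def list_stale_processes(name="nagios", min_age_seconds=3600):
    """Run ps and return the old processes whose command line contains name."""
    result = subprocess.run(
        ["ps", "-eo", "pid,etimes,args"],  # assumes ps supports "etimes"
        capture_output=True, text=True, check=True,
    )
    return parse_stale_processes(result.stdout, name, min_age_seconds)
```

Each PID it returns (other than the master) is a candidate for the gdb "bt" step described above.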
> > If you are having the same problem I was, you will likely see
> > "process_passive_service_checks" and/or "check_for_external_commands" in the
> > call trace (sometimes the stack looks munged, so the call stack may not be 100%
> > accurate, leading me to believe that some corruption is what's causing the write
> > to hang, but I wasn't able to figure out what was causing the corruption easily
> > and had to "get things working").
> >
> > I'd be curious to see if it's the same problem.
> >
> > > >Jasmine
> > > I am pretty sure it was not nagios itself, but memory ran out and the server
> > > stopped.
> > > At the moment I have a nagios uptime of:
> > >
> > > Total Running Time: 0d 6h 6m 15s
> > > And this...
> > > Check Command Output: Nagios ok: located 1677 processes, status log
> > > updated 170 seconds ago
> > >
> > > I am pretty sure this is not ok.
> > >
> > > Any ideas?
> > >
> > > I will let the server run over the weekend; when it crashes again, I will
> > > give detailed information to the list.
_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
::: Messages without supporting info will risk being sent to /dev/null