Nagios2 process overwhelmed by NSCA daemon?
Thomas Guyot-Sionnest
dermoth at aei.ca
Mon Dec 14 05:22:48 CET 2009
On 09/12/09 06:06 PM, Jonathan Call wrote:
> I recently added two new slaves to a distributed Nagios system. The
> central server now passively processes 17,000+ service checks on 3000+
> servers.
>
> It's been over an hour and a half since I brought those new slaves
> online and I have about 150 hosts still stuck in 'Pending' and about
> 1300 services in the same state. In addition to that it seems that the
> service check results from the other slaves that were working normally
> are now arbitrarily disappearing. For example, on one host three of the
> service checks have been updated relatively recently (i.e. 5-30 minutes
> ago) but three other service checks haven't been updated for almost an
> hour. The slaves all appear operational and the hosts are being checked
> on time. Is it possible I've overwhelmed Nagios' ability to process data
> from the NSCA daemon or struck some internal Nagios bottleneck? Any
> suggestions would be appreciated.
Hmm, very interesting. Which Nagios version are you using?
This sounds a lot like a problem I encountered a few years ago with
passive checks. I had about 50-60 servers returning cron-scheduled check
results to the Nagios server. 120 results ain't that much, but it seemed
that with all the servers fully time-synced (using NTP), out of these
~120 results I was often missing some of them, which would eventually
cause false alarms due to stale services.
I could easily reproduce the problem by feeding lots of results to
Nagios right when I was expecting a batch of passive results - this
would cause random results to be dropped. I spent some time trying to
debug this but I couldn't figure out where commands were dropped. My
primary target was the ring buffer used by the command reaper. As far as
I can remember I tested with versions of Nagios ranging from 2.3 to 2.5;
I never tried with a recent version.
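For anyone who wants to try reproducing this, here is a rough sketch of
how a burst of passive results could be generated. It is not the script
I used back then. The host name, service names and command file path are
made up for illustration; the command file location in particular
depends on your install, so check "command_file" in nagios.cfg. The
line format is the standard PROCESS_SERVICE_CHECK_RESULT external
command.

```python
import sys
import time


def format_passive_result(host, service, return_code, output, timestamp=None):
    """Build one PROCESS_SERVICE_CHECK_RESULT external command line."""
    if timestamp is None:
        timestamp = int(time.time())
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n" % (
        timestamp, host, service, return_code, output)


def build_burst(host, count, timestamp=None):
    """Return `count` OK results for hypothetical services svc0..svcN-1."""
    return [
        format_passive_result(host, "svc%d" % i, 0, "OK - burst test", timestamp)
        for i in range(count)
    ]


def submit(command_file, lines):
    """Write all lines to the external command file in one go.

    On a real server the command file is a named pipe, so this call
    blocks until Nagios has the pipe open for reading.
    """
    with open(command_file, "w") as cmd:
        cmd.writelines(lines)


if __name__ == "__main__":
    # Usage (path and host are assumptions, adjust to your setup):
    #   python burst.py /usr/local/nagios/var/rw/nagios.cmd testhost 500
    path, host, count = sys.argv[1], sys.argv[2], int(sys.argv[3])
    submit(path, build_burst(host, count))
```

Running this right when a batch of real passive results is expected,
then comparing what was submitted against what shows up in Nagios,
should tell you whether results are being dropped.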
If you're running a recent version of Nagios, what do you get for
"Used/High/Total Command Buffers" in the "nagiostats" command output?
(You can also get these numbers from the web interface, under "Performance
Info" in the left bar.) If it seems to be maxed out, you may try
setting "command_check_interval" to "-1" and raising the
"external_command_buffer_slots" option in nagios.cfg.
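If you want to watch the buffer numbers over time rather than eyeball
them, something like the sketch below could pull them out of the
nagiostats output. It assumes the values appear on one line as
"used / high / total" after the label quoted above; the exact spacing
may differ between versions, so treat the pattern as a starting point.
The function names are mine, not part of Nagios.

```python
import re

# Assumed layout: "Used/High/Total Command Buffers:   0 / 12 / 4096"
BUFFER_LINE = re.compile(
    r"Used/High/Total Command Buffers:\s*(\d+)\s*/\s*(\d+)\s*/\s*(\d+)")


def parse_command_buffers(stats_text):
    """Return (used, high, total) as ints, or None if the line is absent."""
    match = BUFFER_LINE.search(stats_text)
    if match is None:
        return None
    used, high, total = (int(group) for group in match.groups())
    return used, high, total


def buffers_maxed_out(stats_text):
    """True when the high-water mark has reached the total slot count."""
    parsed = parse_command_buffers(stats_text)
    if parsed is None:
        return False
    _, high, total = parsed
    return total > 0 and high >= total
```

Feed it the text captured from running nagiostats. If the high-water
mark equals the total, the buffer has filled up at some point and
raising "external_command_buffer_slots" is worth trying.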
If you're still having this problem with Nagios v3 and up I might try to
reproduce this as well, and maybe I'll be able to figure out what's
wrong this time.
--
Thomas
_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
::: Messages without supporting info will risk being sent to /dev/null