High latencies problem.
D. Emmanuel Feinsmith
daniel at danielemmanuelfeinsmith.com
Tue Feb 17 17:32:32 CET 2009
Alessandro,
The limitation you are running into is not in the poller code proper.
Generally speaking, the internal nagios code is plenty efficient. The
inefficiency arises primarily when it needs to receive passive checks
(the command pipe bottleneck), execute active checks (the fork/exec
bottleneck), or perform other fork/exec's, such as performance data
handlers, OCSP handlers, etc.
To answer your question, I need to ask some questions to understand
your configuration.
1. What is the breakdown between passive and active checks? For
passive checks, there are many ways to increase the # of services
by bypassing the command pipe (which nsca writes to). With passive
checks done in this way I've gone to 50,000 services with under
10 seconds of latency.
2. How many of those services are check_icmp or check_ping? If there
is a good number of those, you can use fping to reduce the # of
fork/exec's that nagios has to perform, which is a major area of
resource utilization within the nagios server.
3. Are you using a performance data handler or OCSP? If so, you might
find a way to get rid of these entirely, or at the very minimum be
sure you are using file-based performance data handling.
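As an illustration of point 2, here is a minimal Python sketch of the
fping idea: one fping process pings every host, and its output is turned
into passive service check results, so nagios does not fork/exec one
check_ping per host. The function names, the "PING" service description
and the 500 ms timeout are assumptions made for this example. It relies
on fping's default "<host> is alive" / "<host> is unreachable" output and
on the standard PROCESS_SERVICE_CHECK_RESULT external command format. How
the resulting lines are handed to nagios is deliberately left out.

```python
import subprocess
import time


def run_fping(hosts, timeout_ms=500):
    """Ping all hosts with a single fping process (one fork/exec in total)."""
    # fping exits non-zero when any host is unreachable, so do not raise on it.
    proc = subprocess.run(
        ["fping", "-t", str(timeout_ms)] + list(hosts),
        capture_output=True, text=True, check=False,
    )
    return proc.stdout


def parse_fping_output(text):
    """Map each host to a nagios return code: 0 (OK) or 2 (CRITICAL)."""
    results = {}
    for line in text.splitlines():
        host, sep, status = line.partition(" is ")
        if not sep:
            continue  # skip anything that is not "<host> is <status>"
        results[host.strip()] = 0 if status.strip() == "alive" else 2
    return results


def to_passive_results(results, service="PING", now=None):
    """Format results as PROCESS_SERVICE_CHECK_RESULT external command lines."""
    now = int(time.time()) if now is None else now
    lines = []
    for host, code in sorted(results.items()):
        output = "PING OK" if code == 0 else "PING CRITICAL - host unreachable"
        lines.append(
            f"[{now}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{code};{output}"
        )
    return lines
```

A caller would pass the output of run_fping(hosts) through
parse_fping_output() and to_passive_results(), then submit the lines by
whichever passive-check route is in use.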
The key to nagios scalability and latency reduction is to reduce the #
of fork/exec's to the smallest amount possible and keep away from the
command pipe as much as you can if you are passive-check heavy. If you
are using all active checks, then you can balance the load between
active and passive checks and thereby gain some speed.
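To put rough numbers on the fork/exec load, here is a back-of-envelope
Python sketch using the figures from the quoted message (11160 services,
each checked every 5 minutes). The function names are made up for this
example, and the "dispatch budget" assumes a deliberately simple model in
which checks are started one after another at a steady rate. It is meant
to show the scale of the problem, not to describe how nagios schedules
checks internally.

```python
def required_check_rate(services, interval_s):
    """Checks per second needed to run every service once per interval."""
    return services / interval_s


def dispatch_budget_ms(services, interval_s):
    """Average time available to start each check, in milliseconds.

    Simplifying assumption: checks are dispatched serially and evenly
    spread over the interval.
    """
    return 1000.0 * interval_s / services


if __name__ == "__main__":
    rate = required_check_rate(11160, 300)
    budget = dispatch_budget_ms(11160, 300)
    print(f"{rate:.1f} checks/s, about {budget:.1f} ms to start each one")
```

Under this model, every check that is moved out of the fork/exec path
(for example by batching pings) leaves more of that budget for the rest.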
Daniel.
On Feb 17, 2009, at 8:17 AM, Alessandro Ren wrote:
>
> Hello,
>
> I have a nagios system running with 427 hosts and 11160 services, and
> since I reached 8000 services I have been having problems with the
> latency being between 100s and 200s.
> use_large_installation_tweaks is enabled, max_concurrent_checks has
> been tested with 0 and higher values, and I have tested this setup on
> two different HWs: a dual core with 4GB RAM, 32 bits, and a Dual Xeon
> dual core, 64 bits, with 8GB of RAM. We are using Red Hat Enterprise 5.
> Also, the reaper is already at 2s, host checks with cache horizon are
> enabled with a max retry of 3, and all services check every 5min.
> I have no service dependency set up.
> I've noticed that nagios is not spawning as many processes as
> another nagios I have running, which has far fewer services, and it
> seems from my debugs that the event loop is lagging behind.
> Any ideas what I could do to fix that? Have I reached a limit in the
> nagios poller code?
>
> Tks.
>
> --
> Alessandro Ren
> http://www.opservices.com.br
> alessandro.ren at opservices.com.br
>
> ------------------------------------------------------------------------------
> Open Source Business Conference (OSBC), March 24-25, 2009, San
> Francisco, CA
> -OSBC tackles the biggest issue in open source: Open Sourcing the
> Enterprise
> -Strategies to boost innovation and cut costs with open source
> participation
> -Receive a $600 discount off the registration fee with the source
> code: SFAD
> http://p.sf.net/sfu/XcvMzF8H
> _______________________________________________
> Nagios-devel mailing list
> Nagios-devel at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-devel