High latencies problem.
Alessandro Ren
alessandro.ren at opservices.com.br
Tue Feb 17 20:06:10 CET 2009
On 2/17/2009 3:15 PM, D. Emmanuel Feinsmith wrote:
> Dear Alessandro,
>
> You are more than likely eating up CPU and memory with the
> memcpy's executed by each fork of your check_nrpe and check_icmp
> services. You can prove this to yourself by using top to observe
> the behaviour of the nagios processes. I would also suggest making
> sure that nothing else is eating up CPU and memory on your nagios
> server box, and keeping the box dedicated. Running top will show if
> there is resource contention on your monitoring server. Keep in mind
> that check_nrpe is amongst the slowest possible commands nagios can
> execute because it has to wait for the nrpe daemon to respond, for up
> to the timeout period you entered in your client nrpe.cfg. This can
> take seconds in some cases. A much more scalable solution is to enable
> passive checks (using nsca/send_nsca) on some or all of your clients.
>
> I would suggest the following things (from the nagios performance
> tuning guide):
>
> # *Check service latencies* to determine best value for maximum
> concurrent checks. Nagios can restrict the number of maximum
> concurrently executing service checks to the value you specify with
> the max_concurrent_checks option. This is good because it gives you
> some control over how much load Nagios will impose on your monitoring
> host, but it can also slow things down. If you are seeing high latency
> values (> 10 or 15 seconds) for the majority of your service checks
> (via the extinfo CGI), you are probably starving Nagios of the checks
> it needs. That's not Nagios's fault - it's yours. Under ideal
> conditions, all service checks would have a latency of 0, meaning they
> were executed at the exact time that they were scheduled to be
> executed. However, it is normal for some checks to have small latency
> values. I would recommend taking the minimum number of maximum
> concurrent checks reported when running Nagios with the -s command
> line argument and doubling it. Keep increasing it until the average
> check latency for your services is fairly low.
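A rough way to judge whether a max_concurrent_checks value (or the number of check processes actually seen in top) is high enough is to estimate how many checks must be in flight at once. The sketch below is only back-of-the-envelope arithmetic: the service count and the 5-minute interval come from this thread, while the average plugin run times are assumed values, not measurements.

```python
# Back-of-the-envelope estimate of how many checks must run in parallel.
# Only the service count and the 5-minute interval come from the thread;
# the average plugin run times are assumed values for illustration.


def checks_per_second(num_services: int, interval_seconds: float) -> float:
    """Rate at which checks must start to keep every service on schedule."""
    return num_services / interval_seconds


def required_concurrency(num_services: int, interval_seconds: float,
                         avg_runtime_seconds: float) -> float:
    """Little's law: checks in flight = start rate * average run time."""
    return checks_per_second(num_services, interval_seconds) * avg_runtime_seconds


if __name__ == "__main__":
    services = 11160           # from the original post
    interval = 5 * 60          # all services are checked every 5 minutes
    for runtime in (0.5, 1.0, 2.0):    # assumed average plugin run times
        needed = required_concurrency(services, interval, runtime)
        print(f"avg runtime {runtime:.1f}s -> ~{needed:.0f} checks in flight")
```

If the number of running check processes stays well below such an estimate while latency keeps climbing, the scheduler is probably not starting checks fast enough, rather than being capped by max_concurrent_checks.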
>
> # *Optimize host check commands*. If you're checking host states using
> the check_ping plugin you'll find that host checks will be performed
> much faster if you break up the checks. Instead of specifying a
> max_attempts value of 1 in the host definition and having the
> check_ping plugin send 10 ICMP packets to the host, it would be much
> faster to set the max_attempts value to 10 and only send out 1 ICMP
> packet each time. This is due to the fact that Nagios can often
> determine the status of a host after executing the plugin once, so you
> want to make the first check as fast as possible. This method does
> have its pitfalls in some situations (e.g. hosts that are slow to
> respond may be assumed to be down), but you'll see faster host checks
> if you use it. Another option would be to use a faster plugin (e.g.
> check_fping) as the host_check_command instead of check_ping.
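To see why one packet per attempt is faster for a host that is up, the timing can be sketched as follows. This is an illustration only: it assumes packets go out about one second apart (a common ping default) and a 50 ms round trip, neither of which comes from the tuning guide.

```python
# Approximate wall time of one ping-based host check for a host that answers.
# The 1 s packet spacing and 50 ms round trip are assumptions for illustration.


def up_host_check_seconds(packets: int, packet_interval: float,
                          rtt: float) -> float:
    """Time until the last reply arrives when every packet is answered."""
    return (packets - 1) * packet_interval + rtt


if __name__ == "__main__":
    interval, rtt = 1.0, 0.05
    ten_packets = up_host_check_seconds(10, interval, rtt)
    one_packet = up_host_check_seconds(1, interval, rtt)
    print(f"10 packets, 1 attempt : ~{ten_packets:.2f}s per check")
    print(f"1 packet, first attempt: ~{one_packet:.2f}s per check")
```

The trade-off is the one the guide mentions: with a single packet per attempt, a slow or lossy host is more likely to fail an individual attempt and need retries.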
>
> # *Schedule regular host checks.* Scheduling regular checks of hosts
> can actually help performance in Nagios. This is due to the way the
> cached check logic works (see below). Prior to Nagios 3, regularly
> scheduled host checks used to result in a big performance hit. This is
> no longer the case, as host checks are run in parallel - just like
> service checks. To schedule regular checks of a host, set the
> check_interval directive in the host definition to something greater
> than 0.
>
> # *Enable cached host checks*. Beginning in Nagios 3, on-demand host
> checks can benefit from caching. On-demand host checks are performed
> whenever Nagios detects a service state change. These on-demand checks
> are executed because Nagios wants to know if the host associated with
> the service changed state. By enabling cached host checks, you can
> optimize performance. In some cases, Nagios may be able to use the
> old/cached state of the host, rather than actually executing a host
> check command. This can speed things up and reduce load on the monitoring
> server. In order for cached checks to be effective, you need to
> schedule regular checks of your hosts (see above). More information on
> cached checks can be found here.
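The caching idea can be sketched as a freshness test against a horizon. This is a simplified illustration, not Nagios's actual code: the function names are hypothetical, and the 15-second horizon and 5-minute host check interval are assumed values.

```python
# Simplified illustration of cached host check logic (not Nagios's own code).
# Function names, the 15 s horizon and the 300 s interval are assumptions.
from typing import Optional


def can_use_cached_state(last_check: Optional[float], now: float,
                         horizon: float) -> bool:
    """True if a previous host check result is recent enough to reuse."""
    if last_check is None or horizon <= 0:
        return False           # never checked, or caching switched off
    return now - last_check <= horizon


def fresh_fraction(check_interval: float, horizon: float) -> float:
    """Share of time a regularly checked host has a reusable cached state."""
    if check_interval <= 0:
        return 0.0             # no regular checks scheduled
    return max(0.0, min(1.0, horizon / check_interval))


if __name__ == "__main__":
    print(can_use_cached_state(last_check=100.0, now=110.0, horizon=15.0))
    print(can_use_cached_state(last_check=100.0, now=130.0, horizon=15.0))
    print(fresh_fraction(check_interval=300.0, horizon=15.0))
```

Without regularly scheduled host checks there is rarely a recent result to reuse, which is why the guide ties caching to a check_interval greater than 0. The fraction ignores on-demand checks, which also refresh the cached state, so it is a simplification.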
>
> For more, see:
>
> http://nagios.sourceforge.net/docs/3_0/tuning.html
Daniel,
I've read this doc more than once in my search to bring the latency
down.
Passive checks are not a possibility right now; maybe with another
nagios instance this would be OK.
I am trying to avoid having to use another nagios instance for now,
but I have this option in mind as well.
I've already used max_concurrent_checks=0 and I've not noticed any
change in latency times.
Tks.
>
> If none of this works, you may have to use passive checks or multiple
> nagios instances to drop your latency.
>
> Good luck!
> Daniel.
>
> On Feb 17, 2009, at 8:41 AM, Alessandro Ren wrote:
>
>> On 2/17/2009 1:32 PM, D. Emmanuel Feinsmith wrote:
>>
>> Answers below,
>>> Alessandro,
>>>
>>> 1. what is the breakdown between passive and active checks? For
>>> passive checks, there are many ways to increase the # of services
>>> through bypassing the command pipe (which nsca writes to). With
>>> passive checks done in this way I've gone to 50,000 services with
>>> under 10 second latency.
>>>
>> All active checks, no passive.
>>
>>> 2. how many of those services are check_icmp or check_ping? If there
>>> is a good number of those, you can use fping to reduce the # of fork/
>>> exec's that nagios has to perform, which is a major area of resource
>>> utilization within the nagios server.
>>>
>> Less than 5% are ping checks and we use check_icmp for all those.
>> Most checks are check_nrpe.
>>
>>> 3. Are you using a performance data handler or OCSP? If so, you might
>>> either find a way to get rid of these entirely, or be sure you are
>>> using file based performance handling at the very minimum.
>>>
>> I am using perfparse to write to mysql. Disabling it has no effect
>> on the latency.
>>
>>> The key to nagios scalability and latency reduction is to reduce the #
>>> of fork/exec's to the smallest amount possible and keep away from the
>>> command pipe as much as you can if you are passive-check heavy. If you
>>> are using all active checks, then you can balance the load between
>>> active and passive checks and thereby gain some speed.
>>>
>>
>> In my other nagios with just 2600 services, we see around 200
>> nagios processes running on average; in the 11600 services system, the
>> average is 30 processes. It seems that the event loop is lagging: it is
>> not starting enough processes, thus raising the latency.
>>
>> Thank you Daniel.
>>> Daniel.
>>>
>>> On Feb 17, 2009, at 8:17 AM, Alessandro Ren wrote:
>>>
>>>
>>>> Hello,
>>>>
>>>> I have a nagios system running with 427 hosts and 11160 services and
>>>> since I reached 8000 services, I am having problems with the latency
>>>> being around 100s to 200s.
>>>> use_large_installation_tweaks is enabled, max_concurrent_checks has
>>>> been tested with 0 and higher values, and I have tested this setup on
>>>> two different HWs: a 32-bit dual core with 4GB of RAM and a 64-bit
>>>> Dual Xeon dual core with 8GB of RAM. We are using Red Hat Enterprise
>>>> Linux 5.
>>>> Also the reaper is already at 2s, host checks with cache horizon are
>>>> enabled with a max retry of 3, and all services are checked every 5min.
>>>> I have no service dependencies set up.
>>>> I've noticed that nagios is not spawning as many processes as
>>>> another nagios I have running, which has far fewer services, and it
>>>> seems from my debugs that the event loop is lagging behind.
>>>> Any ideas what I could do to fix that? Have I reached a limit in
>>>> the nagios poller code?
>>>>
>>>> Tks.
>>>>
>>>> --
>>>> Alessandro Ren
>>>> http://www.opservices.com.br
>>>> alessandro.ren at opservices.com.br
>>>> <mailto:alessandro.ren at opservices.com.br>
>>>>
>>>> ------------------------------------------------------------------------------
>>>> Open Source Business Conference (OSBC), March 24-25, 2009, San
>>>> Francisco, CA
>>>> -OSBC tackles the biggest issue in open source: Open Sourcing the
>>>> Enterprise
>>>> -Strategies to boost innovation and cut costs with open source
>>>> participation
>>>> -Receive a $600 discount off the registration fee with the source
>>>> code: SFAD
>>>> http://p.sf.net/sfu/XcvMzF8H
>>>> _______________________________________________
>>>> Nagios-devel mailing list
>>>> Nagios-devel at lists.sourceforge.net
>>>> <mailto:Nagios-devel at lists.sourceforge.net>
>>>> https://lists.sourceforge.net/lists/listinfo/nagios-devel
>>>>