trying to fix problem with excessive latency
Frost, Mark {PBC}
mark.frost1 at pepsico.com
Wed May 19 17:43:47 CEST 2010
> -----Original Message-----
> From: Corey Hickey [mailto:bugfood-ml at fatooh.org]
> Sent: Tuesday, May 18, 2010 9:30 PM
> To: nagios-users at lists.sourceforge.net
> Subject: [Nagios-users] trying to fix problem with excessive latency
>
> Hello,
>
> I have inherited maintenance of a medium-sized Nagios installation. We
> currently have 649 hosts and 5415 services. Our setup works nicely, with
> one exception: Nagios falls behind on host/service checks. Our usual
> latency once Nagios has been running for a while is about 190-200
> seconds. Our Nagios host is reasonably powerful and isn't struggling; it
> seems that Nagios itself is limited somehow.
>
<snip>
> Active Service Execution Time: 0.020 / 120.007 / 0.847 sec
> Active Host Execution Time: 0.020 / 11.019 / 0.069 sec
>
<snip>
> I have a feeling I'm missing something.... I would appreciate any
> suggestions.
>
> Thanks,
> Corey
Corey,
I'm not an expert, but I'll relay some of my own experiences here. I did
find that switching on use_large_installation_tweaks made a big difference
to our latencies.
We were also following the pre-Nagios 3.2 practice of not doing active host checks. As
the tuning guide recommends, it's actually more efficient to run active host checks and
enable cached check results. When we did that, latency on the host where we had been
seeing problems leveled out. (It's good to graph those values, by the way.) The
latencies were still high-ish, but with active host checks enabled they stopped
increasing over time.
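For reference, the settings I'm describing live in the main config file and would
look roughly like the fragment below. This is a sketch, not a copy of our config: the
horizon values are illustrative (I believe 15 seconds is what the sample config ships
with), and you would want to tune them against your own check intervals.

```cfg
# nagios.cfg (Nagios 3.x); values are illustrative assumptions, not tested advice
use_large_installation_tweaks=1

# Make sure active host checks are actually being executed
execute_host_checks=1

# Reuse a host/service result that is at most this many seconds old
# instead of launching a fresh on-demand check
cached_host_check_horizon=15
cached_service_check_horizon=15
```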
Additionally, we found that long-running checks were messing up latencies.
As I understand it, if Nagios schedules a check and it takes a lot longer to return
than Nagios expects, that can throw off the scheduling of the other checks. I see you've
got some check(s) that ran for a max of 120 seconds. When I started seeing latency
problems, I also found that I had a service check or two running for several minutes.
I tracked that down and changed the check so that it completed (or timed out, really)
more quickly, returning status to Nagios in a matter of seconds rather than minutes.
The latency plummeted after that. In general, our policy is that most checks should
complete in under 30 seconds, preferably under 10.
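To illustrate the "time out quickly" idea, here is a minimal sketch of a wrapper that
runs an existing plugin command and gives up after a fixed number of seconds. This is
not the check we actually changed; the names run_check and timeout_s are made up for
the example, and I'm assuming Python 3.7 or later is available on the Nagios host. A
timeout is reported as UNKNOWN (exit code 3), but you could just as well map it to
CRITICAL.

```python
import subprocess
import sys

# Standard Nagios plugin exit codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def run_check(cmd, timeout_s=10.0):
    """Run a plugin command and return (exit_code, first_line_of_output).

    If the command runs longer than timeout_s seconds it is killed and
    UNKNOWN is returned, so Nagios gets an answer in bounded time.
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            timeout=timeout_s,
            text=True,
        )
    except subprocess.TimeoutExpired:
        return UNKNOWN, f"UNKNOWN - check timed out after {timeout_s:g}s"
    except OSError as exc:
        return UNKNOWN, f"UNKNOWN - could not run check: {exc}"

    lines = proc.stdout.strip().splitlines()
    output = lines[0] if lines else "(no output)"
    # Anything outside 0..3 (including death by signal) is treated as UNKNOWN
    code = proc.returncode if proc.returncode in (OK, WARNING, CRITICAL, UNKNOWN) else UNKNOWN
    return code, output


# Command-line use: wrapper.py <timeout_seconds> <plugin> [plugin args...]
# (does nothing when called without arguments)
if __name__ == "__main__" and len(sys.argv) > 2:
    code, output = run_check(sys.argv[2:], timeout_s=float(sys.argv[1]))
    print(output)
    sys.exit(code)
```

You would point the command_line of the slow check at the wrapper, passing the timeout
first and the original plugin command after it. One caveat: only the direct child is
killed, so a plugin that spawns its own long-lived children may need extra handling.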
In the same vein, I'm not quite sure how you could have any host checks that take
11 seconds to execute. Are you doing multiple pings/fpings to check that a host is up?
Typically you can get away with a single fping, rather than a series of 10, to tell
you that a host is not reachable.
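As a rough sketch of what I mean, a host check command that sends one packet could
look like the definition below. I'm using check_ping here because its options are the
ones I'm sure of; the thresholds and the 5-second timeout are illustrative assumptions,
and the same idea applies to check_fping.

```cfg
# commands.cfg; hypothetical host check that sends a single ICMP packet
define command{
        command_name    check-host-alive
        command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 1 -t 5
        }
```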
Hope that helps.
Mark