Performance issues, too

Tobias Klausmann klausman at
Thu Dec 21 11:24:13 CET 2006


On Thu, 21 Dec 2006, Daniel Meyer wrote:
> - it is not triggered by any other software on the server
>    (nagios and apache are the only things running there)


> - its not triggered by hourly, daily or weekly cronjobs

With a lot of guessing and estimating, I can make a case for a
slight "plateau" right after the hour, with an increase in the
second half of the hour. Might be completely bogus, though.

> - the big service check latency goes away instantly after a restart
>    of nagios


> - the latency skyrockets after "some time", its not like "six hours
>    after the restart" or something like that

Well, not so much skyrocketing as steadily creeping up. See the
images I reference below.

> - service check execution time does NOT change at all, it stays on
>    the same level all the time

NACK. For me, it starts out at some low-two-digit ms time, then
creeps up to 165.000ms (yes, exactly that value). As far as I can
tell, it stays there forever.

> - changing from a dummy host check to "adaptive" host checks back and
>    forth doesn't make a difference

We didn't try that.

> - i see memory usage rise proportional to the latency, but there is
>    way enough free memory left (this morning it was 150 seconds latency
>    but still 790 Megs free ram, plus one gig cached)

Same (with slightly different figures) here.

> - load on the system rises a little but not much

It's measurable, but definitely not maxed out. Same goes for CPU
utilization (which is something different).

> - network usage goes down (well there are less checks done due to the
>    latency, so no surprise here)

We haven't checked that but as network traffic (both volume and
packet rate) wasn't near any limit, we didn't feel it was

Here are a few graphs we created for yesterday and the day before

and here are the pics of today and yesterday afternoon:

For all graphs, check frequency was every 2 minutes. For the
older set, a SNAFU on my part when setting up the RRDs resulted
in reduced resolution. That was fixed with the second set.

"Queue size" is calculated the following way: look at all objects
in the state file (retention.dat, saved every 20s). Every object
with a check time in the past counts as one queue entry.
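The queue-size calculation described above could be sketched roughly as
follows. This is not the script used for the graphs, just an
illustration under assumptions: that retention.dat consists of
"host {" / "service {" blocks with key=value lines, that the relevant
field is called next_check and holds a Unix timestamp, and that a value
of 0 means no check is scheduled. The function name count_overdue and
the sample data are made up for the example.

```python
import time


def count_overdue(retention_text, now=None):
    """Count host/service blocks whose next_check lies in the past.

    Assumes a retention.dat layout of 'host {' / 'service {' blocks
    containing key=value lines (an assumption, not verified here).
    """
    if now is None:
        now = time.time()
    overdue = 0
    in_block = False
    for raw in retention_text.splitlines():
        line = raw.strip()
        if line in ("host {", "service {"):
            in_block = True
        elif line == "}":
            in_block = False
        elif in_block and line.startswith("next_check="):
            try:
                next_check = int(line.split("=", 1)[1])
            except ValueError:
                continue  # skip malformed values
            # Assumption: next_check=0 means "not scheduled", so skip it.
            if 0 < next_check < now:
                overdue += 1
    return overdue


# Made-up sample data, only to show the expected input shape.
SAMPLE = """info {
created=1166696000
}
host {
host_name=web1
next_check=1166695900
}
service {
host_name=web1
service_description=HTTP
next_check=1166696100
}
service {
host_name=web1
service_description=PING
next_check=1166695000
}
"""

if __name__ == "__main__":
    print(count_overdue(SAMPLE, now=1166696000))
```

Because the state file is only a periodic snapshot, a count like this
depends on when the file was written relative to when it is read, which
may be part of why the resulting curve is noisy.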

"Slots"/"Checks completed" is what nagiostats reports as the number
of checks completed in the four timeframes.

Things I noted:

Queue size oscillates wildly. This might be due to my
methodology. Still, one can read a trend from that curve.

Check execution time converges at 106ms. On the spot. I have no
idea why.

Load average and CPU idleness indicate that we don't have a host
performance problem (I also looked over but did not plot stuff
like interrupt rate and context switches; nothing overly high).

For the older graphs, check latency doesn't budge at all for
some time (or the change is too small to see). For the newer graph,
the initial rise is rather steep, then the increase slows down a bit.
Still, over the course of hours, it seems linear and shows no
sign of converging.
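One simple way to put a number on "linear, no sign of converging" is to
fit a straight line to the first and second half of the latency series
and compare the slopes: if latency were levelling off, the second slope
would be clearly smaller. The sketch below uses plain least squares;
the helper names and the data points are invented for illustration and
are not taken from the RRDs.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def slopes_by_half(xs, ys):
    """Slope of the first and of the second half of the series."""
    mid = len(xs) // 2
    first, _ = linear_fit(xs[:mid], ys[:mid])
    second, _ = linear_fit(xs[mid:], ys[mid:])
    return first, second


# Synthetic example: hours since restart vs. latency in seconds.
hours = [0, 1, 2, 3, 4, 5, 6, 7]
latency = [5 + 10 * h for h in hours]

if __name__ == "__main__":
    first, second = slopes_by_half(hours, latency)
    print(first, second)
```

Similar slopes in both halves point to steady growth; a second slope
near zero would point to convergence.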

If anybody is interested in the RRD files used to generate the
graphs, drop me a line.

The picture all of this paints is rather inconclusive. We've
found an oddity in our config I'll relate in another mail (a
check interval of 86400 minutes, that's two months). We have
eliminated that for the newer graphs, however.
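For reference, the "two months" figure is just the unit conversion
below. It takes the interval as being in minutes, as stated above; the
variable names are only for the example.

```python
MINUTES_PER_DAY = 24 * 60  # 1440

check_interval_minutes = 86400
days = check_interval_minutes / MINUTES_PER_DAY

if __name__ == "__main__":
    print(days)  # 60.0 days, i.e. roughly two months
```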

In conclusion, I'm at a loss as to why this slow deterioration of
check performance happens. 

A colleague of mine is looking at the Nagios scheduling code (he
thinks the description of the algorithm in the docs is rather
strange). He hasn't reported back yet, though.

All in all, every hint is appreciated.


Never touch a burning system.
