Performance issues, too
Tobias Klausmann
klausman at schwarzvogel.de
Tue Dec 19 12:42:43 CET 2006
Hi!
On Tue, 19 Dec 2006, Andreas Ericsson wrote:
> Thanks for an excellently detailed problem report, missing only the
> Nagios version and system type/version info. I've got some comments and
> followup questions. See below.
I'm running Nagios 2.6 now, but I had the trouble with 2.5 initially.
OS is Gentoo Linux, kernel 2.6.15.5 initially, upgraded to
2.6.19 today.
> > ---------------------------
> > Total hosts: 330
> > Total scheduled hosts: 0
>
> No scheduled host-checks. That's good, cause they interfere with normal
> operations in Nagios.
I've read as much. In my separate mail I had a few questions
about it, let's keep them (and the answers) there ;)
> > Host inter-check delay method: SMART
> > Average host check interval: 0.00 sec
> > Host inter-check delay: 0.00 sec
> > Max host check spread: 10 min
> > First scheduled check: N/A
> > Last scheduled check: N/A
> >
> >
> > SERVICE SCHEDULING INFORMATION
> > -------------------------------
> > Total services: 2836
> > Total scheduled services: 2836
> > Service inter-check delay method: SMART
> > Average service check interval: 2225.56 sec
>
> This is, as you point out below, quite odd. What's your _longest_
> normal_check_interval for services?
The longest check_interval is 86400 seconds. It's an SSL cert
freshness check. I figured it wasn't necessary to check that
more often than once a day. I also have check_intervals of 3, 5,
15, 20, 30 and 1440 seconds. The latter is also a cert freshness
check which is lower because the customer wanted it to be that
short.
> > CHECK PROCESSING INFORMATION
> > ----------------------------
> > Service check reaper interval: 5 sec
>
> You could lower this to 2 seconds. I've done so on any number of
> installations and it has no negative impact whatsoever, but seems to
> make Nagios a bit more responsive.
I'll give that a try.
> > Max concurrent service checks: Unlimited
>
> I assume you aren't running into hardware limits on this machine.
> What's the normal load when you're running nagios? If it's > NUM_CPUS
> then you most likely don't have beefy enough hardware. That's hardly
> ever the case though, so don't bother looking into it unless all else fails.
>
> Nvm, question answered below. Hardware resources should be no problem
> whatsoever.
I also noticed that HT was disabled on the machine. I've changed
that (and added support for it to the kernel) when I did the
kernel upgrade today. I'll keep an eye on check latency.
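
For reference, the rule of thumb quoted above (load average compared against the number of CPUs) can be written down as a minimal sketch. The function names are made up for illustration, and reading the live values via os.getloadavg() assumes a Unix-like system; this is not anything Nagios itself does.

```python
import os


def load_exceeds_cpus(load_avg, num_cpus):
    """Rule of thumb from the thread: a load average above the CPU
    count suggests the hardware may be the bottleneck."""
    return load_avg > num_cpus


def current_load_exceeds_cpus():
    """Apply the same check to the live system (Unix-like systems only)."""
    if not hasattr(os, "getloadavg"):
        raise OSError("os.getloadavg() is not available on this platform")
    one_minute_load = os.getloadavg()[0]
    # os.cpu_count() may return None; fall back to 1 in that case.
    num_cpus = os.cpu_count() or 1
    return load_exceeds_cpus(one_minute_load, num_cpus)


# Figures from this thread: load peaks around 1.9 on a dual-CPU box.
peak_load_is_a_problem = load_exceeds_cpus(1.9, 2)
```

With the numbers reported here the check comes out negative, which matches the conclusion that hardware resources are not the problem.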
> > *Or* it is indicative of a misconfiguration on my
> > part. If the latter is the case, I'd be eager, nay ecstatic to
> > hear what I did wrong. Here are a few of the config vars that
> > might influence this:
>
> There has been a slight thinko in Nagios. I don't know if it's still
> there in recent CVS versions. The thinko is that it (used to?) calculate
> average service check interval by adding up all normal_check_interval
> values and dividing it by the number of services configured (or
> something along those lines), which leads to long latencies. This
> normally didn't make those latencies increase though. Humm...
Well, the numbers sure do get wacky after a restart: first the latency
skyrockets for about five minutes, then plummets to 1s. From
there it works its way up the way I described.
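
To illustrate the averaging described in the quoted paragraph (sum of all normal_check_interval values divided by the number of services), here is a minimal sketch. The mix of intervals is made up purely for illustration and is not my real configuration, and the function is not taken from the Nagios source.

```python
from statistics import median


def average_check_interval(intervals):
    """Naive average: sum of all check intervals divided by the
    number of services, as described in the quoted paragraph."""
    return sum(intervals) / len(intervals)


# Hypothetical mix: 95 services checked every 5 units,
# 5 services checked every 1440 units (e.g. cert freshness checks).
intervals = [5] * 95 + [1440] * 5

naive_average = average_check_interval(intervals)
typical_interval = median(intervals)

print(f"naive average: {naive_average:.2f}")
print(f"median:        {typical_interval}")
```

A handful of very long intervals is enough to pull the naive average far away from what most services actually use, which would explain an odd-looking "Average service check interval" figure.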
> > Total Services: 2836
> > Services Checked: 2836
> > Services Scheduled: 2758
> > Active Service Checks: 2836
> > Passive Service Checks: 0
>
> All services aren't being scheduled, but you have no passive service
> checks. Have you disabled checks of 78 services?
Oops, forgot to mention that. Yes, a server farm is currently being
rebuilt. As I didn't want all the host check timeouts to make
matters much, much worse, I disabled them entirely.
> > Hardware is a dual-2.8GHz Xeon, 2G RAM and a 100 FDX interface.
> > LoadAvg is around 1.6, sometimes gets to 1.9. CPUs are both
> > around 40% idle most of the time. I see about 300 context
> > switches and 500 interrupts per second. The network load is
> > negligible, ditto the packet rate.
> >
> > The way these figures look I don't see a performance problem per
> > se, but maybe I have overlooked a metric that describes the
> > "usual" bottleneck of installations.
> >
>
> Are the CPUs 64-bit ones running in 32-bit emulation mode? For Intel
> CPUs, that causes up to 60% performance loss (yes, it really is that bad).
Sheesh. Yes, it is a 32-bit installation. I only ever bothered
with 64-bit installs on Opteron hardware. I might look into
migrating to 64 bits, then.
> I'm puzzled. Please let me know if you find the answer to this problem.
> I'll help you debug it as best I can, but please continue posting
> on-list. Thanks.
Sure. I'll first check if the "processor upgrade" and kernel
update helped anything, then try lowering the reaper interval to
2. I'll post the results as soon as I have them.
Regards & Thanks,
Tobias
--
Never touch a burning system.
_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
::: Messages without supporting info will risk being sent to /dev/null