High CPU utilization at random times
Andreas Ericsson
ae at op5.se
Wed Oct 5 11:57:50 CEST 2005
Dan Wilson wrote:
> I've been looking into a problem for quite some time now and have come
> up stumped. Every time I think I know what the problem is I turn out to
> be wrong.
>
> Sorry, this is LONG but has lots of detail, hopefully all the detail you
> guys need to make a diagnosis or point me in the right direction :-)
>
> PROBLEM:
> Randomly, and for no good reason, the CPU usage on this machine will go
> up to anywhere from .7 to 1.5!?!?!?!?!?
>
> HARDWARE:
> PIII 677
> 384MB ram
> Software RAID 1 with IDE(all partitions except swap, yes, I boot from it
> too... I already took crap for booting from software raid, but it works
> fine, really)
> extra drive for swap and nightly "snapshots" of /usr/local/ and /etc and
> a few other things.
>
> SOFTWARE:
> Mandrake linux 10.1(last updates 45 days ago)
> Nagios 1.2 (no perl interpreter, with perl cache)
I don't think you could have the perl cache without the perl interpreter...
> Plugins 1.3.1
> Optional/custom plugins...
> check_icmp instead of check_ping
Early incarnations of check_icmp could end up in an infinite loop if it
timed out and entered the finish() function. This of course drives the
load up until Nagios kills it off with SIGKILL. Try upgrading it from
the package at http://oss.op5.se/nagios/op5plugins-2005-09-27.tar.gz
AFAIK, this bug was only ever present in a version of check_icmp which
specifically wasn't intended for production use, but was tested by a
number of friendly helpers (all mentioned in check_icmp.c).
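
To illustrate the failure mode, here is a minimal sketch in Python (not how
Nagios itself is implemented; run_plugin and the busy-loop stand-in are
made up for the example). A child that spins forever burns CPU until the
parent's timeout expires and the child is killed:

```python
import subprocess
import sys


def run_plugin(argv, timeout_s):
    """Run a plugin command and return (exit_code, output).

    exit_code is None when the timeout expired and the child was killed.
    """
    try:
        done = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run() has already killed the child (SIGKILL on POSIX)
        # and waited for it by the time this exception is raised.
        return None, "Plugin timed out after %s seconds" % timeout_s
    return done.returncode, done.stdout.decode(errors="replace").strip()


# Hypothetical stand-in for a plugin stuck in an infinite loop.
SPINNER = [sys.executable, "-c", "while True: pass"]

if __name__ == "__main__":
    print(run_plugin(SPINNER, 2))
```

For the whole timeout window the stuck child keeps one CPU fully busy, which
is the kind of short spike described above.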
> custom check_ink script/plugin - this plugin is written in perl and uses
> the netsnmp module for perl. This isn't the problem either, stopped all
> service checks that used it for a few hours, the problem was still
> there.... FYI: This script checks supply levels in network printers, I
> could have used the check_snmp plugin for this but that was too messy (I
> tried!). This way the output is cleaner (ex. Levels OK - C-34% Y-75%
> M-12% K-90%) and there is only one check per printer instead of one for
> each supply :-) [my programming skills suck, really, they do. You have
> to specify the type of printer which has to be put in the script so it
> can correctly read the supplies... I should have written it to
> "explore" the printer to see what kind of supplies it had and what could
> be checked so it would in theory work with any printer... but it works
> the way it is, and I couldn't figure out how to get everything to
> work... I'm learning and will some day get it to work the way I want????]
> check_smart - checks HDD SMART values... not the trouble either, it was
> added recently after a HDD went bad and the box crashed 2 nights in a
> row(the extra drive was bad and failed during the "snapshot")
>
> The following were the latest stable versions as of about Feb-2005
> Apache
> MRTG
> NetSNMP
> PERL
> PHP
> MySQL
>
>
> THINGS I HAVE DONE/LOOKED AT TO TRY AND FIX THIS ISSUE:
>
> Recompiled the kernel... no change, went back to the standard kernel.
>
> Restarted like a MS machine... uptime makes no difference, plenty of
> memory available (150+MB) all the time
>
This seems to indicate an infinite loop problem in some small piece of
software then. Believe me, it can eat load *fast*.
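
One way to catch the culprit in the act is to log the busiest processes
whenever the load crosses a threshold. A rough sketch, assuming a Linux
procps "ps" that accepts "-eo pid,pcpu,comm"; the function names, the 0.7
threshold and the 5 second polling interval are just illustrative choices:

```python
import os
import subprocess
import time


def top_cpu(ps_text, n=3):
    """Return the n busiest (pcpu, pid, command) rows from ps output text."""
    rows = []
    for line in ps_text.splitlines()[1:]:  # first line is the column header
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue
        pid, pcpu, comm = parts
        try:
            rows.append((float(pcpu), int(pid), comm.strip()))
        except ValueError:
            continue  # skip anything that does not parse as numbers
    rows.sort(reverse=True)
    return rows[:n]


def watch(threshold=0.7, interval_s=5):
    """Poll the 1-minute load average and print top CPU users when it is high."""
    while True:
        load1 = os.getloadavg()[0]
        if load1 >= threshold:
            ps_text = subprocess.run(
                ["ps", "-eo", "pid,pcpu,comm"], stdout=subprocess.PIPE
            ).stdout.decode(errors="replace")
            print(time.strftime("%H:%M:%S"), "load", load1, top_cpu(ps_text))
        time.sleep(interval_s)
```

Leaving watch() running in a terminal (or redirecting its output to a file)
for a day should show whether the same command name turns up at every spike.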
> Nagios - stopped the service, no issue, start the service and let it run
> a while, the problem appears... I recompiled(twice), adjusted a few
> options, no luck with the issue though nagios ran a tiny bit faster, maybe
> 1-2%, not worth the wait to recompile IMHO
>
Did you happen to notice if this coincided with a host going down or in
some other way not being able to respond to ping? The host check (or
ping service check) output would be something along the lines of "Plugin
timed out" if it was down to check_icmp.
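
To check for that correlation after the fact, something like the following
could pull the timed-out entries out of the Nagios log so their timestamps
can be compared with the load spikes. This is only a sketch: it assumes log
lines start with an epoch timestamp in square brackets, and it matches any
line containing "timed out" since the exact wording may differ. The sample
lines in the usage below are invented.

```python
import re

# Assumed line shape: "[<epoch seconds>] <message>"
STAMP = re.compile(r"^\[(\d+)\]\s*(.*)$")


def timed_out_entries(lines):
    """Return (epoch, message) for every log line that mentions a timeout."""
    hits = []
    for line in lines:
        match = STAMP.match(line.strip())
        if match and "timed out" in match.group(2).lower():
            hits.append((int(match.group(1)), match.group(2)))
    return hits


if __name__ == "__main__":
    sample = [
        "[1128506270] SERVICE ALERT: printer1;PING;CRITICAL;SOFT;1;Plugin timed out",
        "[1128506300] SERVICE ALERT: web1;HTTP;OK;HARD;1;HTTP OK",
    ]
    for epoch, message in timed_out_entries(sample):
        print(epoch, message)
```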
> MRTG - checking interface on 2 routers, it is using RRD and the
> MRTG-RRD.CGI fast cgi script so the load from this every 5 minutes isn't
> even worth mentioning. Tried removing access from users to stop
> MRTG-RRD.CGI from generating graphs on demand. I even tried stopping
> MRTG and lost 4 hours of data but still had the problem.
>
> Apache - stopped the service, problem still continues.
>
> PERL - recompiled and removed a few options that the documentation said
> could cause trouble, no change. Even ran Nagios without any perl
> scripts/plugins, problem still there.
>
> PHP - nothing is using this at the moment... was only installed for
> testing a Nagios config utility with a web interface...
>
> MySQL - not being used, makes no difference if it is running or not.
>
> I only run X while downloading updates, otherwise it stays off and I
> just SSH in.
>
>
> MORE INFO:
>
> At first I only noticed it when I would SSH in and look at the load
> because it took 15+seconds to log in. I thought it was SSH so I started
> having Nagios check the CPU load, I can look from time to time and catch
> it up nice and high.
>
> It is NOT logs being rotated, excessive swapping, bad hardware(second
> machine it's happened on), too many people accessing the box, too many
> services/hosts down.(I'm checking about 90 hosts and 180+ services,
> after I delete the retention data and start Nagios fresh everything is
> checked and fine in 2 minutes or less.).
>
> It's not to the point where the box is unusable, it clears up in a
> minute or two(always, every time, and that makes it hard to track down).
>
> It is NOT(at least not that I can tell) Nagios making excessive retries
> on problems, it happens when there are no problems and I have the max
> retries set to 3 for all but a few things. Timeouts are 10 seconds or
> less on all but one check. I'm not using obsessive checks, processing
> perf data or anything like that.
>
> When I first installed nagios 2 years ago I tinkered with getting it to
> respond faster, I set the time period to 15 seconds(default is 60?) so I
> could get a few things running every 15 or 30 seconds... works great and
> with little increased overhead.... I just have to remember that 1
> minute is now 4 and not 1... ;-) Nagios responds like a champ now,
> forced checks don't take a minute or longer... 20 seconds at the
> longest. I HATE WAITING! LOL
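
The arithmetic behind "1 minute is now 4" can be spelled out as below,
assuming the 15 second setting refers to interval_length in nagios.cfg
(default 60), which is the unit that check and retry intervals are counted
in. The helper name is made up for the example.

```python
def check_period_seconds(intervals, interval_length=60):
    """Convert a check interval, given in time units, to wall-clock seconds."""
    return intervals * interval_length


if __name__ == "__main__":
    # With 15-second units, an interval of 4 is one minute.
    print(check_period_seconds(4, interval_length=15))
    # An interval of 1 then runs the check every 15 seconds.
    print(check_period_seconds(1, interval_length=15))
```

The same multiplier applies to every interval expressed in those units, so
more checks get scheduled per minute than the raw numbers suggest.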
>
>
>
>
> Any ideas? Or should I just live with it until I upgrade to 2.0? I'll
> be moving to faster hardware then anyway, dual PIII 700 with 2GB ram and
> hardware RAID1... It's not much but it is better :-)
>
>
>
>
>
> -------------------------------------------------------
> This SF.Net email is sponsored by:
> Power Architecture Resource Center: Free content, downloads, discussions,
> and more. http://solutions.newsforge.com/ibmarch.tmpl
> _______________________________________________
> Nagios-users mailing list
> Nagios-users at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-users
> ::: Please include Nagios version, plugin version (-v) and OS when
> reporting any issue. ::: Messages without supporting info will risk
> being sent to /dev/null
>
--
Andreas Ericsson andreas.ericsson at op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231