High CPU utilization at random times
Dan Wilson
drw at adnc.com
Wed Oct 5 06:40:09 CEST 2005
I've been looking into a problem for quite some time now and have come
up stumped. Every time I think I know what the problem is I turn out to
be wrong.
Sorry, this is LONG but has lots of detail, hopefully all the detail you
guys need to make a diagnosis or point me in the right direction :-)
PROBLEM:
Randomly, and for no apparent reason, the CPU load (load average) on this
machine will go up to anywhere from 0.7 to 1.5!?
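Since the spikes clear up within a minute or two, one way to pin down when they happen is to log the load with timestamps and line the results up against the Nagios log. The following is only a rough sketch: the function names, the 5-second interval and the 0.7 threshold are assumptions chosen for illustration, and it relies on Python's os.getloadavg(), which is available on Unix-like systems.

```python
import os
import time


def format_sample(stamp, load1, threshold=0.7):
    """Build one log line; mark it HIGH once the load reaches the threshold."""
    flag = "HIGH" if load1 >= threshold else "ok"
    return f"{stamp} load1={load1:.2f} {flag}"


def watch_load(samples, interval=5.0, threshold=0.7):
    """Take a fixed number of samples of the 1-minute load average."""
    lines = []
    for i in range(samples):
        load1 = os.getloadavg()[0]  # tuple of (1 min, 5 min, 15 min) averages
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        lines.append(format_sample(stamp, load1, threshold))
        if i + 1 < samples:
            time.sleep(interval)  # wait before taking the next sample
    return lines
```

Calling watch_load(720) would cover roughly an hour at the default interval; the lines marked HIGH give the timestamps worth comparing with whatever Nagios was running at that moment.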
HARDWARE:
PIII 677
384MB ram
Software RAID 1 with IDE (all partitions except swap; yes, I boot from it
too... I already took crap for booting from software RAID, but it works
fine, really)
Extra drive for swap and nightly "snapshots" of /usr/local/ and /etc and
a few other things.
SOFTWARE:
Mandrake Linux 10.1 (last updates 45 days ago)
Nagios 1.2 (no perl interpreter, with perl cache)
Plugins 1.3.1
Optional/custom plugins...
check_icmp instead of check_ping
custom check_ink script/plugin - this plugin is written in Perl and uses
the NetSNMP module for Perl. This isn't the problem either: I stopped all
service checks that used it for a few hours and the problem was still
there. FYI: this script checks supply levels in network printers. I
could have used the check_snmp plugin for this, but that was too messy (I
tried!). This way the output is cleaner (ex. Levels OK - C-34% Y-75%
M-12% K-90%) and there is only one check per printer instead of one for
each supply :-) [My programming skills suck, really, they do. You have
to specify the type of printer, which has to be put in the script so it
can correctly read the supplies... I should have written it to
"explore" the printer to see what kind of supplies it had and what could
be checked, so it would in theory work with any printer... but it works
the way it is, and I couldn't figure out how to get everything to
work... I'm learning and will some day get it to work the way I want.]
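For anyone curious how a one-check-per-printer status line like the one above can be put together, here is a minimal sketch. It is not the actual check_ink script (that one is written in Perl and reads the levels over SNMP); the function name and the warn/crit thresholds are hypothetical, and the levels are passed in as a plain dictionary rather than fetched from a printer. The return codes follow the usual Nagios plugin convention of 0 = OK, 1 = WARNING, 2 = CRITICAL.

```python
# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL = 0, 1, 2


def supply_status(levels, warn=10, crit=5):
    """Return (exit_code, status_line) for a dict of supply name -> percent.

    The warn and crit thresholds are illustrative values, not the ones
    used by the real check_ink script.
    """
    lowest = min(levels.values())
    if lowest <= crit:
        code, state = CRITICAL, "CRITICAL"
    elif lowest <= warn:
        code, state = WARNING, "WARNING"
    else:
        code, state = OK, "OK"
    # One "NAME-NN%" entry per supply, in the order the levels were given.
    detail = " ".join(f"{name}-{pct}%" for name, pct in levels.items())
    return code, f"Levels {state} - {detail}"
```

A real plugin would print the status line and exit with the returned code; the worst supply decides the overall state, so one service check covers the whole printer.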
check_smart - checks HDD SMART values... not the trouble either; it was
added recently after a HDD went bad and the box crashed 2 nights in a
row (the extra drive was bad and failed during the "snapshot").
The following were the latest stable versions as of about Feb 2005:
Apache
MRTG
NetSNMP
PERL
PHP
MySQL
THINGS I HAVE DONE/LOOKED AT TO TRY AND FIX THIS ISSUE:
Recompiled the kernel... no change, went back to the standard kernel.
Restarted like a MS machine... uptime makes no difference, plenty of
memory available (150+MB) all the time.
Nagios - stopped the service, no issue; start the service and let it run
a while, and the problem appears... I recompiled (twice) and adjusted a
few options; no luck with the issue, though Nagios ran a tiny bit faster,
maybe 1-2%, not worth the wait to recompile IMHO.
MRTG - checking interface on 2 routers, it is using RRD and the
MRTG-RRD.CGI fast cgi script so the load from this every 5 minutes isn't
even worth mentioning. Tried removing access from users to stop
MRTG-RRD.CGI from generating graphs on demand. I even tried stopping
MRTG and lost 4 hours of data but still had the problem.
Apache - stopped the service, problem still continues.
PERL - recompiled and removed a few options that the documentation said
could cause trouble, no change. Even ran Nagios without any perl
scripts/plugins, problem still there.
PHP - nothing is using this at the moment... was only installed for
testing a Nagios config utility with a web interface...
MySQL - not being used, makes no difference if it is running or not.
I only run X while downloading updates, otherwise it stays off and I
just SSH in.
MORE INFO:
At first I only noticed it when I would SSH in and look at the load
because it took 15+ seconds to log in. I thought it was SSH, so I started
having Nagios check the CPU load; now I can look from time to time and
catch it up nice and high.
It is NOT logs being rotated, excessive swapping, bad hardware (second
machine it's happened on), too many people accessing the box, or too many
services/hosts down. (I'm checking about 90 hosts and 180+ services;
after I delete the retention data and start Nagios fresh, everything is
checked and fine in 2 minutes or less.)
It's not to the point where the box is unusable; it clears up in a
minute or two (always, every time, and that makes it hard to track down).
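Because the spike is gone within a minute or two, it may help to record which processes are contributing to the load at the moment it happens. On Linux the load average counts tasks that are runnable (state R) as well as tasks in uninterruptible sleep (state D, usually waiting on disk I/O), so the load can climb without the CPU being busy. The sketch below reads that state from /proc/<pid>/stat; the function names are made up for illustration, and it assumes a Linux /proc filesystem.

```python
import glob


def parse_stat(text):
    """Split one /proc/<pid>/stat record into (pid, command, state)."""
    # The command sits in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    open_paren = text.index("(")
    close_paren = text.rindex(")")
    pid = int(text[:open_paren])
    command = text[open_paren + 1:close_paren]
    state = text[close_paren + 1:].split()[0]
    return pid, command, state


def busy_processes():
    """List processes currently in state R (runnable) or D (uninterruptible)."""
    found = []
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path) as handle:
                record = parse_stat(handle.read())
        except (OSError, ValueError):
            continue  # the process exited while we were reading it
        if record[2] in ("R", "D"):
            found.append(record)
    return found
```

Running busy_processes() every few seconds while the load is high and logging the result should show whether the same plugin, or something outside Nagios entirely, keeps turning up in the R or D state.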
It is NOT (at least not that I can tell) Nagios making excessive retries
on problems; it happens when there are no problems, and I have the max
retries set to 3 for all but a few things. Timeouts are 10 seconds or
less on all but one check. I'm not using obsessive checks, processing
perf data or anything like that.
When I first installed Nagios 2 years ago I tinkered with getting it to
respond faster: I set the time period to 15 seconds (default is 60?) so I
could get a few things running every 15 or 30 seconds... works great and
with little increased overhead.... I just have to remember that a
1-minute interval is now 4 units and not 1... ;-) Nagios responds like a
champ now, forced checks don't take a minute or longer... 20 seconds at
the longest. I HATE WAITING! LOL
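The "1 minute is now 4" arithmetic can be written out as a tiny sketch. It assumes the setting in question is interval_length in nagios.cfg (default 60 seconds, as far as I know) and that each check interval is expressed as a multiple of it; the function names are invented for illustration.

```python
def check_period_seconds(check_interval, interval_length=60):
    """Seconds between checks for an interval given in time units."""
    return check_interval * interval_length


def units_for(seconds, interval_length):
    """Interval value needed to check every `seconds` seconds."""
    units, remainder = divmod(seconds, interval_length)
    if remainder:
        raise ValueError("not a whole multiple of interval_length")
    return units
```

With a 15-second unit, check_period_seconds(4, 15) gives 60 and units_for(30, 15) gives 2, so every interval written for the default 60-second unit has to be multiplied by 4 to keep its original meaning.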
Any ideas? Or should I just live with it until I upgrade to 2.0? I'll
be moving to faster hardware then anyway, dual PIII 700 with 2GB ram and
hardware RAID1... It's not much but it is better :-)