Determining what is causing a high load reported by check_load plugin

Kaplan, Andrew H. AHKAPLAN at PARTNERS.ORG
Tue Dec 7 20:33:43 CET 2010
Hi there --
 
I ran the command you suggested and redirected its output to a file. When I
checked the file, I noticed a large number of updatedb and slocate instances
that had been running since August of this year. When I tried to kill those
processes, I ran into the same problem that I encountered with the kjournald
instances.

I did some further investigating, and it turns out that a high number of
updatedb and slocate processes is symptomatic of a corrupted filesystem.
Accordingly, I rebooted the server and had it run fsck on all filesystems.
The server is now up, and I will monitor it for the next week to see if the
problem returns.
 
 

________________________________

From: Rick Mangus [mailto:rick.mangus+nagios at gmail.com] 
Sent: Tuesday, December 07, 2010 10:49 AM
To: Nagios Users List
Subject: Re: [Nagios-users] Determining what is causing a high load reported by check_load plugin


Kjournald is needed for journalling on ext3 filesystems.  Be glad you didn't
manage to kill them.

To find something that is running many, many instances, try this:
"ps -ax -o cmd | sort | uniq -c | sort -n"

The output will be like so:
      3 [kjournald]
      3 [sh] <defunct>
      5 -bash
      7 crond

The column on the left is the number of processes with that command line.  I
occasionally have 10,000 instances of nsca that simply need to be killed.  Do
let us know what you find!
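
The same pipeline can be wrapped in small helper functions so that only the worst offenders are shown. This is a minimal sketch: the function names count_commands and top_offenders are hypothetical, and it assumes a POSIX shell with the standard sort, uniq and tail utilities.

```shell
#!/bin/sh
# count_commands: read one command line per input line and print
# "count command" pairs, least frequent first (same order as the
# pipeline quoted above).
count_commands() {
    sort | uniq -c | sort -n
}

# top_offenders N: keep only the N most frequent command lines
# (defaults to 5 when N is not given).
top_offenders() {
    count_commands | tail -n "${1:-5}"
}
```

On a live host this might be fed with something like `ps -e -o args= | top_offenders 10`, where the trailing `=` suppresses the header line so that it is not counted as a command.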

--Rick


On Tue, Dec 7, 2010 at 9:25 AM, Kaplan, Andrew H. <AHKAPLAN at partners.org> wrote:


	Hi there --
	 
	The output shown below shows the top processes on the server:
	 
	439 processes: 438 sleeping, 1 running, 0 zombie, 0 stopped
	CPU0 states: 19.0% user,  9.4% system,  0.0% nice, 71.0% idle
	CPU1 states: 20.1% user, 13.0% system,  0.0% nice, 66.3% idle
	CPU2 states: 27.1% user, 17.3% system,  0.0% nice, 55.0% idle
	Mem:  2064324K av, 2013820K used,   50504K free,       0K shrd,  487764K buff
	Swap: 2096472K av,   12436K used, 2084036K free                  976244K cached
	 
	  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
	 2398 root      15   0  1280 1280   824 R     1.9  0.0   0:00 top
	 5648 root      22   0  1196 1196  1104 S     1.3  0.0   0:00 ASMProServer
	    1 root      15   0   488  484   448 S     0.0  0.0   2:28 init
	    2 root      0K   0     0    0     0 SW    0.0  0.0   0:00 migration_CPU0
	    3 root      0K   0     0    0     0 SW    0.0  0.0   0:00 migration_CPU1
	    4 root      0K   0     0    0     0 SW    0.0  0.0   0:00 migration_CPU2
	    5 root      15   0     0    0     0 SW    0.0  0.0   0:03 keventd
	    6 root      34  19     0    0     0 SWN   0.0  0.0  17:52 ksoftirqd_CPU0
	    7 root      34  19     0    0     0 SWN   0.0  0.0  16:39 ksoftirqd_CPU1
	    8 root      34  19     0    0     0 SWN   0.0  0.0  17:33 ksoftirqd_CPU2
	    9 root      15   0     0    0     0 SW    0.0  0.0  28:22 kswapd
	   10 root      15   0     0    0     0 SW    0.0  0.0  42:39 bdflush
	   11 root      15   0     0    0     0 SW    0.0  0.0   3:08 kupdated
	   12 root      25   0     0    0     0 SW    0.0  0.0   0:00 mdrecoveryd
	   18 root      16   0     0    0     0 SW    0.0  0.0   0:00 scsi_eh_0
	   21 root      15   0     0    0     0 SW    0.0  0.0   4:38 kjournald
	  101 root      15   0     0    0     0 SW    0.0  0.0   0:00 khubd
	  265 root      15   0     0    0     0 SW    0.0  0.0   0:03 kjournald
	  266 root      15   0     0    0     0 SW    0.0  0.0   3:43 kjournald
	  267 root      15   0     0    0     0 SW    0.0  0.0   0:04 kjournald
	  268 root      15   0     0    0     0 SW    0.0  0.0   0:01 kjournald
	  269 root      15   0     0    0     0 SW    0.0  0.0   0:11 kjournald
	  270 root      15   0     0    0     0 SW    0.0  0.0   4:34 kjournald
	  271 root      15   0     0    0     0 SW    0.0  0.0   4:28 kjournald
	  272 root      15   0     0    0     0 SW    0.0  0.0   0:08 kjournald
	  273 root      15   0     0    0     0 SW    0.0  0.0   0:14 kjournald
	  274 root      15   0     0    0     0 SW    0.0  0.0   0:07 kjournald
	  275 root      15   0     0    0     0 SW    0.0  0.0   1:14 kjournald
	  805 root      15   0   588  576   532 S     0.0  0.0   1:39 syslogd
	  810 root      15   0   448  432   432 S     0.0  0.0   0:00 klogd
	  830 rpc       15   0   596  572   508 S     0.0  0.0   0:04 portmap
	  858 rpcuser   19   0   708  608   608 S     0.0  0.0   0:00 rpc.statd
	  970 root      15   0     0    0     0 SW    0.0  0.0   0:21 rpciod
	  971 root      15   0     0    0     0 SW    0.0  0.0   0:00 lockd
	  999 ntp       15   0  1812 1812  1732 S     0.0  0.0   5:04 ntpd
	 1022 root      15   0   772  720   632 S     0.0  0.0   0:00 ypbind
	 1024 root      15   0   772  720   632 S     0.0  0.0   1:16 ypbind
	 
	What caught my eye was the number of processes along with the number of
	sleeping processes.
	I tried running the kill command on the kjournald instances, but that
	did not appear to stop them.
	 
	Aside from rebooting the server, which can be done if necessary, what
	other approach can I try?
	 
	 


________________________________

	
	From: Daniel Wittenberg [mailto:daniel.wittenberg.r0ko at statefarm.com]
	Sent: Tuesday, December 07, 2010 9:11 AM
	To: Nagios Users List
	Subject: Re: [Nagios-users] Determining what is causing a high load reported by check_load plugin
	


	So what are the first few processes listed in top?  That should be what
	is causing your load then.

	 

	Dan

	 

	 

	 

	From: Kaplan, Andrew H. [mailto:AHKAPLAN at PARTNERS.ORG] 
	Sent: Tuesday, December 07, 2010 7:49 AM
	To: Nagios Users List
	Subject: Re: [Nagios-users] Determining what is causing a high load reported by check_load plugin

	 

	Hi there --

	 

	The load values that are displayed in top match those for the check_load
	plugin. This is the case whether the plugin is run automatically or
	interactively. The output for the uptime command is shown below:

	 

	8:48am  up 153 days, 23:21,  1 user,  load average: 73.36, 73.29, 73.21

	 

	 

	 

	 

________________________________

	From: Daniel Wittenberg [mailto:daniel.wittenberg.r0ko at statefarm.com] 
	Sent: Monday, December 06, 2010 4:40 PM
	To: Nagios Users List
	Subject: Re: [Nagios-users] Determining what is causing a high load reported by check_load plugin

	In top, does it show the same load values?  The status of your memory
	shouldn't cause the Nagios plugin to report high CPU.  What does the uptime
	command say?  Try running the check_load script by hand on that host and
	verify it returns the same results.

	
	Dan

	 

	 

	From: Marc Powell [mailto:lists at xodus.org] 
	Sent: Monday, December 06, 2010 3:26 PM
	To: Nagios Users List
	Subject: Re: [Nagios-users] Determining what is causing a high load reported by check_load plugin

	 

	 

	On Mon, Dec 6, 2010 at 1:50 PM, Kaplan, Andrew H. <AHKAPLAN at partners.org> wrote:

	Hi there -- 

	We are running a Nagios 3.1.2 server, and the client that is the subject
	of this e-mail is running version 2.6 of the nrpe client.

	The check_load plugin, version 1.4, indicates that the past three
	readings are the following:

	load average: 71.00, 71.00, 70.95 CRITICAL 

	The critical thresholds of the plugin have been set to 30, 25, 20.
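
	For illustration, the comparison between the three load averages and the three critical thresholds can be sketched as below. This is not the plugin's own code: the function name load_state is hypothetical, and treating a reading as critical only when it is strictly greater than its threshold is an assumption about how check_load behaves.

```shell
#!/bin/sh
# load_state L1 L5 L15 C1 C5 C15
# Print CRITICAL if any load average exceeds its matching critical
# threshold, otherwise print OK.
# Assumption: strict "greater than" comparison; the real plugin may differ.
load_state() {
    awk -v l1="$1" -v l5="$2" -v l15="$3" \
        -v c1="$4" -v c5="$5" -v c15="$6" 'BEGIN {
        # Adding 0 forces numeric rather than string comparison.
        if (l1 + 0 > c1 + 0 || l5 + 0 > c5 + 0 || l15 + 0 > c15 + 0)
            print "CRITICAL"
        else
            print "OK"
    }'
}
```

	With the readings quoted above, `load_state 71.00 71.00 70.95 30 25 20` reports CRITICAL, since every reading is well above its threshold.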

	When I checked the client in question, the first thing I did was to run
	the top command. The results are shown below:

	CPU0 states:  0.0% user,  0.0% system,  0.0% nice, 100.0% idle
	CPU1 states:  0.0% user,  0.0% system,  0.0% nice, 100.0% idle
	CPU2 states:  1.0% user,  4.0% system,  0.0% nice, 93.0% idle
	Mem:  2064324K av, 2032308K used,   32016K free,       0K shrd,  509924K buff
	Swap: 2096472K av,   21432K used, 2075040K free                 1035592K cached

	The one thing that I noticed was that the amount of free memory was at
	thirty-two megabytes. I wanted to know if that was what was causing the
	critical status to occur, or if there is something else that I should
	investigate.

	
	Memory is not a factor in the load calculation, only the number of
	processes running or waiting to run. For at least 15 minutes you had
	approximately 71 processes either running or ready to run and waiting on
	CPU resources. Running top/ps was the right thing to do but you really need
	to do it when the problem is occurring to see what's actually using all the
	CPU resources. There are far too many reasons why load could be high but it
	should be easy for someone familiar with your system to figure it out (at
	least generally) while in the act.
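
	One detail worth keeping in mind: on Linux the load average also counts tasks in uninterruptible sleep (state D, typically waiting on disk I/O), not only runnable ones, which may be why a load of 71 can coexist with nearly idle CPUs. The sketch below tallies both kinds from a list of process state codes. The function name load_contributors is hypothetical, and the `stat` output keyword is assumed to be available in the host's ps (it is in procps).

```shell
#!/bin/sh
# load_contributors: read process state codes, one per line (as printed
# by `ps -e -o stat=` on a procps-based system), and count the two
# states that contribute to the Linux load average:
#   R = running or runnable, D = uninterruptible sleep.
load_contributors() {
    awk '
        /^R/ { r++ }
        /^D/ { d++ }
        END  { printf "runnable=%d uninterruptible=%d\n", r, d }
    '
}
```

	On a live host it might be run as `ps -e -o stat= | load_contributors`. A large uninterruptible count alongside idle CPUs would point toward stuck I/O rather than CPU contention.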
	
	--
	Marc

	
	


	
	_______________________________________________
	Nagios-users mailing list
	Nagios-users at lists.sourceforge.net
	https://lists.sourceforge.net/lists/listinfo/nagios-users
	::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
	::: Messages without supporting info will risk being sent to /dev/null
	

