Check load plugin configuration on a local machine.
Eric Stanley
estanley at nagios.com
Fri Jul 27 14:01:32 CEST 2012
Bryan,
You're on the right track understanding check_load. There are 3 values
for warning level and 3 values for the critical level, one each for the
1-minute, 5-minute, and 15-minute load averages. For the check_load
plugin, a warning or critical state is achieved if any one (not all
three) of the load average thresholds is exceeded.
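
To make the "any one, not all three" rule concrete, here is a minimal Python sketch of the decision. It is not the plugin's source; the function name `load_state` is made up for illustration, and it assumes "exceeded" means strictly greater than the threshold.

```python
def load_state(loads, warn, crit):
    """Classify three load averages (1, 5, 15 min) against thresholds.

    Assumption: a threshold counts as exceeded only when the load is
    strictly greater than it. Critical takes priority over warning.
    """
    # Any single average over its critical threshold is enough.
    if any(load > limit for load, limit in zip(loads, crit)):
        return "CRITICAL"
    # Likewise, any single average over its warning threshold.
    if any(load > limit for load, limit in zip(loads, warn)):
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    # Thresholds and typical load taken from the quoted message below.
    warn = (20, 18, 16)
    crit = (22, 19, 18)
    print(load_state((1.88, 2.08, 2.16), warn, crit))
    print(load_state((21.0, 2.0, 2.0), warn, crit))
```

With the thresholds from the quoted message, the usual load of 1.88 2.08 2.16 stays OK, while a 1-minute average of 21 alone is enough for a warning.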
Depending on what you're trying to measure, you may want to change your
thresholds. Since the load is the number of processes ready to run
(including those running), the ideal situation is that you have one
process ready to run on each core at all times. In other words, on a
24-core box, if your 1-, 5-, and 15-minute load averages are all 24,
you're perfectly utilizing all of your CPU capacity.
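
As a rough illustration of that arithmetic, the sketch below divides each load average by the core count, so 1.0 means "one runnable process per core". The helper name `per_core_load` is hypothetical, and the sample figures are the usual load averages from the quoted message below.

```python
def per_core_load(loads, cores):
    """Return each load average divided by the number of cores.

    A value of 1.0 corresponds to one runnable process per core,
    which is the "perfectly utilized" case described above.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return tuple(load / cores for load in loads)


if __name__ == "__main__":
    # 24 cores, with the usual load averages from the quoted message.
    for ratio in per_core_load((1.88, 2.08, 2.16), 24):
        print(f"{ratio:.2f}")
```

By this measure, a usual load around 2 on 24 cores is under a tenth of one process per core.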
Assuming you're monitoring for excessive load, you'll probably want to
set your thresholds higher than the number of cores. Based on
experience, I've set warning thresholds for systems I monitor to 9n, 6n,
and 3n for 1-, 5-, and 15-minute load averages respectively and the
critical thresholds to 15n, 10n, and 5n, where n is the number of cores.
These may seem like very high thresholds, especially for the
shorter-duration averages, but I can tolerate short spikes in load. It's
long-term excessive load that concerns me. Again, this is based on
experience; prior to implementing these settings, I was getting a lot of
alerts and much less sleep. :-)
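
For anyone who wants to work out those multiples for their own box, here is a small Python sketch. The function names are invented for illustration. The output format assumes check_load's comma-separated `-w` and `-c` options, so adapt it to however your check command passes its arguments.

```python
# Multipliers from the message above, for the 1-, 5-, and 15-minute averages.
WARN_MULTIPLIERS = (9, 6, 3)
CRIT_MULTIPLIERS = (15, 10, 5)


def thresholds(cores):
    """Return (warning, critical) threshold triples for `cores` cores."""
    warn = tuple(m * cores for m in WARN_MULTIPLIERS)
    crit = tuple(m * cores for m in CRIT_MULTIPLIERS)
    return warn, crit


def as_plugin_args(cores):
    """Format the thresholds as comma-separated -w/-c arguments."""
    warn, crit = thresholds(cores)
    warn_arg = ",".join(str(v) for v in warn)
    crit_arg = ",".join(str(v) for v in crit)
    return f"-w {warn_arg} -c {crit_arg}"


if __name__ == "__main__":
    print(as_plugin_args(24))
```

For the 24-core box in question this gives warning thresholds of 216, 144, and 72, and critical thresholds of 360, 240, and 120.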
Hope that helps.
Eric
On 7/26/2012 3:08 PM, bryan hunt wrote:
> I've got a 24 core box over here, obviously I need to tweak the
> configuration of the check_load plugin as it seems designed for a single
> core machine by default.
>
> define service{
>         use                     generic-service
>         host_name               localhost
>         service_description     Current Load
>         check_command           check_load!20!18!16!22!19!18
>         }
>
>
>
> My understanding is that this breaks down as follows
>
> 1, 5, 15 minute load average.
>
> I've set it to the following.
>
> Warning thresholds. (17 is 70% of 24)
> 20!18!16
>
> So warn if it is currently 20, or averaging 17.
>
> Critical thresholds.
> 22!19!18
>
> Only one core, not maxed out, bad. Average above 22, bad.
>
> Anyhow, my question is: is this a sane configuration? It's pretty
> generous with load. My usual load average is actually:
>
> 1.88 2.08 2.16
>
> Any advice appreciated,
>
> Bryan Hunt
>
> _______________________________________________
> Nagios-users mailing list
> Nagios-users at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-users
> ::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
> ::: Messages without supporting info will risk being sent to /dev/null
--
Eric Stanley
___
Developer
Nagios Enterprises, LLC
Email: estanley at nagios.com
Web: www.nagios.com