Scheduled checks falling far behind
Litwin, Matthew
mlitwin at stubhub.com
Mon Oct 25 01:19:33 CEST 2010
On Oct 24, 2010, at 3:02 PM, Andreas Ericsson wrote:
> On 10/24/2010 10:14 PM, Litwin, Matthew wrote:
>> Hi Matthieu (and anyone else who might want to throw their hat into
>> the ring):
>>
>
> I'll chip in. Your MUA seems to not wrap lines at all though, which
> makes replying inline a bit tricky.
Sorry. Blame Apple. :-)
>
> Note that you should wipe your status.sav files between restarts to
> not let old latency affect the numbers you're seeing.
I don't seem to have them on my system.
>
> What system are you running this on? Nagios has been known to have
> issues with older non-linux systems where thread libraries aren't
> as forgiving as the nptl library shipped with glibc. Also, Nagios
> should never run as a virtual guest.
It is an 8-core x86 server running CentOS 5.3.
> As for the check_result_reaper_frequency things, we ship those unset
> so they take the Nagios defaults. We used to have it at 2. I'm unsure
> if removing the setting was a conscious choice or just by accident.
I will give it a try, thanks.
>
> In general, you should keep your performance-data and checkresult
> files on ramdisks. That will help prevent I/O from becoming a
> bottleneck.
I/O wait on the server is 1% on average, so I doubt that is the problem, but it is certainly worth investigating.
>
>
>> So after identifying that I have latency times of around 500-600
>> seconds, I have tried the tuning tips from the Nagios docs. However,
>> while latency drops briefly after a restart, it then climbs back up
>> to the high levels again. At this point I have only been working
>> with check_result_reaper_frequency and max_check_result_reaper_time,
>> doubling and halving them from their default values.
>> max_concurrent_checks remains at 0. Load on the server is very low.
>> The machine is an 8-core machine, so I really wish I could make
>> better use of it. Load is a measly 1.5 on average. Finally, I tried
>> enable_environment_macros=0, which actually made it worse once
>> things quiesced after startup. use_large_installation_tweaks=1 did
>> improve the latency by maybe 30%, and I did actually start seeing
>> RRD data come in solid for about 15 minutes, but then it returned to
>> being sparse again. So while it is a modest improvement, it still
>> doesn't fill in enough RRD data to be useful.
>>
>> Any other tuning suggestions? I think I have done everything in the
>> performance tweaks section that seems relevant, including all of
>> those that have been suggested here.
>>
>
> Make sure you haven't got "parallelize_check" set to 0 anywhere. That
> will make Nagios try to run the checks one at a time, which obviously
> doesn't work too well. If that's the case, you should have a latency
> that corresponds to the amount of checks you're running times the
> average check execution time minus the normal check-interval.
>
> In other words: if you've got 900 checks in total, the average check
> execution time is 1 second and you plan to run all checks in a 5-minute
> interval (300 secs), you should get a latency of roughly 600 seconds.
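
The estimate above can be sketched in a few lines of Python. This is
only a rough illustration, assuming checks run strictly one at a time;
the function name estimated_serial_latency and its parameters are made
up for this note and are not part of Nagios.

```python
def estimated_serial_latency(num_checks, avg_exec_seconds, interval_seconds):
    """Rough latency (in seconds) when checks run one at a time.

    Assumption: every check takes avg_exec_seconds and nothing runs in
    parallel, so one full pass takes num_checks * avg_exec_seconds.
    """
    full_pass = num_checks * avg_exec_seconds
    # Latency is how far the full pass overshoots the intended interval.
    return max(0.0, full_pass - interval_seconds)


# The figures from the example: 900 checks, 1 second each, 300 s interval.
print(estimated_serial_latency(900, 1.0, 300))  # 600.0
```

If the observed latency is close to this figure, serial execution is a
plausible cause; if it is far off, the cause is more likely elsewhere.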
>
> If you've got it set for a few checks, Nagios will still fail to run
> any other checks while the unparallelizable check runs. It doesn't
> look at whether such checks are scheduled at the same time as other
> checks when it schedules them, so latency will always be a bit
> higher when not all checks are run in parallel.
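
One way to look for a stray "parallelize_check 0" is to scan the text
of the object configuration files. The following is a minimal sketch,
assuming the usual "directive value ; comment" layout of object
definitions; find_serialized_checks and the sample definitions are
hypothetical and only meant as an illustration.

```python
import re

# Matches a directive line such as "parallelize_check   0", optionally
# followed by a ';' inline comment. Lines starting with '#' do not match.
_PATTERN = re.compile(r"^\s*parallelize_check\s+0\s*(?:;.*)?$")


def find_serialized_checks(config_text):
    """Return 1-based line numbers where parallelize_check is set to 0."""
    return [
        number
        for number, line in enumerate(config_text.splitlines(), start=1)
        if _PATTERN.match(line)
    ]


# Hypothetical sample configuration, not taken from a real installation.
sample = """\
define service {
    service_description  Disk
    parallelize_check    0   ; forces serial execution
}
define service {
    service_description  Load
    parallelize_check    1
}
# parallelize_check 0
"""

print(find_serialized_checks(sample))  # [3]
```

To use it on real files, read each configuration file into a string
and pass it to the function; templates referenced with "use" need to
be scanned as well, since the setting can be inherited.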
>
>> In summary, I am looking for some way to make Nagios "do more" with
>> the system resources, as the host is barely working at all. I really
>> wish there was some way to make Nagios do things more in parallel
>> for cases where a system has plenty of horsepower and RAM. If I have
>> to resort to compiling things with different settings I would be
>> open to trying it, but I just feel like I am grasping at straws now.
>>
>
> Are you using any eventbroker modules? If so, which ones and what
> happens when you disable them?
Not that I know of.
>
> What happens when you disable performance-data parsing and writing?
Actually, that is what I am trying to get working properly. My RRD data files are sparse as a result.
>
> Is the system running as a virtual guest?
No, it is a physical server.
>
> Do you have any checks with a check_interval that differs wildly
> from the average check_interval?
All of my check_interval settings are 5, with a few that are a little bit less.
I am running Nagios 3.2.1.
The documentation suggests I set the check_interval for hosts to 0. Is that appropriate?
> A while back there was a bug
> that caused Nagios to spread the first service-check in a window
> as big as the largest check_interval. Once all checks had been
> executed, latency slowly normalized again. This doesn't seem to
> match what you're describing, but it could be a similar bug
> somewhere else. Using the same check_interval for all hosts and
> services should tell if that's the case.
>
> --
> Andreas Ericsson andreas.ericsson at op5.se
> OP5 AB www.op5.se
> Tel: +46 8-230225 Fax: +46 8-230231
>
> Considering the successes of the wars on alcohol, poverty, drugs and
> terror, I think we should give some serious thought to declaring war
> on peace.