Nagios checkresults queue grows over time
Justin Hitt
hittjw at gmail.com
Fri Feb 8 16:54:30 CET 2008
I have two Nagios 3.0rc1 systems: (A) a 2.8 GHz Solaris 10 system
with 212 hosts, and (B) a multi-core VPS with 2,916 hosts. On both
systems, after the initial host check, the spool directory
[/usr/local/nagios/var/spool/checkresults] grows in size until Nagios
is unresponsive.
(A) has a modified configuration with longer horizons,
"cached_host_check_horizon=2700" and
"cached_service_check_horizon=1800". I tried to stretch out the time
frame in which check results are accepted.
(B) has a more standard configuration with reasonable cache settings.
Both systems are using "use_large_installation_tweaks=1" and otherwise
use the standard configuration. Each system allows 45 minutes to
finish the host checks. I've also tried this configuration without
host checks.
Both systems have very low CPU utilization after the initial host
check and hardly go over 20% during regular operations.
The number of 'check' files in the checkresults queue does go up and
down, often dropping by as much as 200 and then popping back up by
twice as much. I've tried tuning "max_check_result_file_age=3600",
which tends to make the queue last longer.
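For reference, the non-default settings quoted so far can be collected
into one nagios.cfg excerpt. This is only a summary of the values
named in this message for system (A); whether the file age setting is
identical on (B) is not stated, and the rest of the file is assumed to
be stock.

```
# nagios.cfg excerpt, system (A); values as quoted in this message.
# Both systems use the large installation tweaks.
use_large_installation_tweaks=1
# Horizons are in seconds (2700 s = 45 min, 1800 s = 30 min).
cached_host_check_horizon=2700
cached_service_check_horizon=1800
# Maximum age, in seconds, of a check result file that is still accepted.
max_check_result_file_age=3600
```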
I'm also purging the queue of files older than 90 minutes with ...
0,15,30,45 * * * * ( /usr/local/bin/find /usr/local/nagios/var/spool/checkresults -type f -mmin +90 -exec /bin/rm -f {} \; ) > /dev/null 2>&1
... in the crontab.
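To see how the queue behaves between purges, something like the
following could report the queue depth without deleting anything. It
is a minimal sketch: the function name queue_stats is made up for
illustration, and it assumes a find that supports -mmin (as the
crontab entry above already does).

```shell
#!/bin/sh
# Hypothetical helper (not part of Nagios): count the files waiting in
# the check result spool and how many of them exceed a given age.
# Usage: queue_stats [spool_dir] [age_minutes]
queue_stats() {
    spool_dir=${1:-/usr/local/nagios/var/spool/checkresults}
    age_min=${2:-90}

    # All regular files currently in the spool directory.
    total=$(find "$spool_dir" -type f | wc -l | tr -d ' ')
    # Files last modified more than age_min minutes ago, i.e. the
    # ones the cron purge would remove.
    stale=$(find "$spool_dir" -type f -mmin +"$age_min" | wc -l | tr -d ' ')

    echo "total=$total stale=$stale"
}
```

Calling queue_stats with no arguments uses the spool path and the
90 minute threshold from this message; running it every minute or so
and logging the output would show how fast the queue refills.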
Finally, here's what I see in the log files ...
[1202485459] Warning: The check of host 'FQDN0.com' looks like it was
orphaned (results never came back). I'm scheduling an immediate check
of the host...
[1202485459] Warning: The check of host 'FQDN1.com' looks like it was
orphaned (results never came back). I'm scheduling an immediate check
of the host...
[1202485459] Warning: The check of host 'FQDN2.com' looks like it was
orphaned (results never came back). I'm scheduling an immediate check
of the host...
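To check whether the orphan warnings are spread evenly or concentrated
on a few hosts, the log can be summarized per host. The sketch below
assumes each warning is a single line in the log file (the lines above
are wrapped only by the mail client); the function name
orphans_by_host and the log path in the usage note are illustrative.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nagios): count "orphaned" host
# check warnings per host, most frequent first.
# Usage: orphans_by_host /path/to/nagios.log
orphans_by_host() {
    logfile=$1
    # Keep only the host name from each matching warning line,
    # then count the occurrences of each name.
    sed -n "s/.*The check of host '\([^']*\)' looks like it was orphaned.*/\1/p" "$logfile" |
        sort | uniq -c | sort -rn
}
```

For example, orphans_by_host /usr/local/nagios/var/nagios.log would
print one line per host with its warning count, assuming the log lives
at that path.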
... which again is why I tuned "max_check_result_file_age" and am
purging the queue of really old files. (I've also tested very short
"max_check_result_file_age" values; at the current setting I've
minimized flapping.)
Other things I tried that didn't improve the situation ...
-- Niced the nagios process to give it the highest priority possible.
This increased CPU load a little, but over time I got the same idle
conditions after the checks were complete.
-- Stretched out check intervals to > 15 minutes for critical services
and > 2 hours for "nice to know about" services. This made the queues
fill up less frequently.
-- Looked at disk performance and swapping. Neither system is
swapping, nor does either have disk bottlenecks.
With the purge routine, I won't see a file in the queue older than 90
minutes. Does this mean "max_check_result_file_age" isn't working?
What other parameters can I adjust? Does anyone have any ideas about
what's going on?
Best,
Justin
_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users