nagiostats Bug with Active Service Checks
tanner
tanner at linuxbox.com
Mon Feb 23 23:38:03 CET 2009
Hello,
During the course of a recent distributed deployment, I discovered a bug
in nagiostats (and possibly Nagios) that leads to misleading statistics
in certain situations.
In particular, I set things up so that every distributed server knew
about all of the service checks, but inherited several properties
(active_checks_enabled, notifications, etc) from a single configuration
file that was unique on each Nagios server. After initially loading up a
single monitoring host with a couple thousand service checks, I shuffled
them out to the other distributed hosts. This led to nagiostats
reporting insane numbers for the active check latency of the initially
loaded up host but realistic numbers for the other ones.
It appears that nagiostats uses check_type to determine whether to
process a service as though it is active, rather than
active_checks_enabled. This would be fine if Nagios reset check_type
after a configuration reload, but it doesn't appear to change it.
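To make the difference concrete, here is a rough Python sketch (not the nagiostats C code) of the two ways a service could be selected as "active". It assumes status.dat stores each service as a "servicestatus { ... }" block of key=value lines and that check_type=0 means an active check; the sample data and the function names are made up for illustration.

```python
# Hypothetical status.dat excerpt; the hosts and services are invented.
SAMPLE = """
hoststatus {
	host_name=web01
	check_type=0
	}
servicestatus {
	host_name=web01
	service_description=PING
	check_type=0
	active_checks_enabled=1
	}
servicestatus {
	host_name=web01
	service_description=DISK
	check_type=0
	active_checks_enabled=0
	}
"""


def parse_service_blocks(text):
    """Return one dict per 'servicestatus { ... }' block in the text."""
    services = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("servicestatus") and line.endswith("{"):
            current = {}
        elif line == "}" and current is not None:
            services.append(current)
            current = None
        elif current is not None and "=" in line:
            key, _, value = line.partition("=")
            current[key] = value
    return services


def counted_by_check_type(svc):
    # Selection as described in the post: trust check_type
    # (assumed here: 0 means the last check was active).
    return svc.get("check_type") == "0"


def counted_by_enabled_flag(svc):
    # Alternative selection: trust the current configuration flag.
    return svc.get("active_checks_enabled") == "1"


if __name__ == "__main__":
    for svc in parse_service_blocks(SAMPLE):
        print(svc["service_description"],
              counted_by_check_type(svc),
              counted_by_enabled_flag(svc))
```

In this sample the DISK service is still counted as active when filtering on check_type, even though active checks are disabled for it.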
It looked like, as I changed services to active_checks_enabled = 0, the
active service latency average went higher and higher. Looking in
status.dat, the recently disabled services (which, by the by, still had
an active check scheduled when they were switched to
active_checks_enabled=0) would eventually time out and have a massive
latency, which would be averaged in with the rest of the latencies.
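A small worked example of the averaging effect, with made-up numbers: three enabled services with sub-second latency and one disabled service whose stale scheduled check eventually shows a 600 second latency. The values are purely illustrative and not taken from my installation.

```python
def average(values):
    """Arithmetic mean, or 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


# (latency in seconds, active_checks_enabled) -- hypothetical values
services = [
    (0.25, True),
    (0.5, True),
    (0.75, True),
    (600.0, False),  # disabled service with a stale scheduled check
]

# Average over everything still treated as active.
avg_all = average([lat for lat, _ in services])

# Average over services that actually have active checks enabled.
avg_enabled = average([lat for lat, enabled in services if enabled])

if __name__ == "__main__":
    print(avg_all)      # 150.375
    print(avg_enabled)  # 0.5
```

One stale service is enough to move the reported average from half a second to about two and a half minutes.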
This was specifically with Nagios 3.0.6; my apologies if this has been
fixed since the latest stable release.
The attached patch may be the correct answer or it may be a workaround
for Nagios only setting check_type the first time a service is created
in status.dat. Either way, it was the quickest way for me to get more
accurate latency information, so I thought I'd share it along with the bug.
Feel free to let me know if there are any questions or if my diagnosis was
entirely wrong.
Thanks,
Tanner
--
Tanner Beck
The Linux Box
734.761.4689
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: nagiostats-active-service-check-check.diff
URL: <https://www.monitoring-lists.org/archive/developers/attachments/20090223/c785195f/attachment.ksh>
_______________________________________________
Nagios-devel mailing list
Nagios-devel at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-devel