Some hard state changes missing in NDOUtils
Andreas Ericsson
ae at op5.se
Tue Nov 13 17:13:33 CET 2007
Ton Voon wrote:
> Hi!
>
> We've been doing some work to validate the data in NDOUtils and found a
> bug in Nagios and a missing state change entry. This happens when a
> service is in a failed state and changes to a different state at the
> same time that the host is considered down (or unreachable).
>
> DETAIL
>
> These are the servicecheck results in the database:
>
> mysql> select
> start_time,state,state_type,output,current_check_attempt,max_check_attempts
> from nagios_servicechecks where service_object_id=445 and start_time
> between '2007-11-05 13:40:00' and '2007-11-05 14:00:00';
> +---------------------+-------+------------+-----------------------------------------------------+-----------------------+--------------------+
>
> | start_time | state | state_type |
> output |
> current_check_attempt | max_check_attempts |
> +---------------------+-------+------------+-----------------------------------------------------+-----------------------+--------------------+
>
> | 2007-11-05 13:41:18 | 1 | 1 | DISK WARNING - free space:
> / 1938 MB (10% inode=-): | 3 | 3 |
> | 2007-11-05 13:46:18 | 1 | 1 | DISK WARNING - free space:
> / 1939 MB (10% inode=-): | 3 | 3 |
> | 2007-11-05 13:51:18 | 2 | 1 | CHECK_NRPE: Socket timeout
> after 10 seconds. | 1 | 3 |
> | 2007-11-05 13:56:18 | 1 | 0 | DISK WARNING - free space:
> / 1939 MB (10% inode=-): | 1 | 3 |
> | 2007-11-05 13:57:18 | 1 | 0 | DISK WARNING - free space:
> / 1939 MB (10% inode=-): | 2 | 3 |
> | 2007-11-05 13:58:39 | 0 | 1 | DISK OK - free space: /
> 2639 MB (14% inode=-): | 1 | 3 |
> +---------------------+-------+------------+-----------------------------------------------------+-----------------------+--------------------+
>
>
> 6 rows in set (0.02 sec)
>
>
> Note that the current_check_attempt is 1/3 for the CRITICAL event at
> 13:51:18. This should be 3/3. A side effect of this is that the
> subsequent warning at 13:56:18 is now considered a soft state when it
> should remain as hard.
>
>
> Looking at the state history table, we get:
>
> mysql> select
> state_time,state,state_type,output,current_check_attempt,max_check_attempts
> from nagios_statehistory where object_id=445 and state_time between
> '2007-11-05 11:50:00' and '2007-11-05 14:00:00';
> +---------------------+-------+------------+-----------------------------------------------------+-----------------------+--------------------+
>
> | state_time | state | state_type |
> output |
> current_check_attempt | max_check_attempts |
> +---------------------+-------+------------+-----------------------------------------------------+-----------------------+--------------------+
>
> | 2007-11-05 11:51:05 | 1 | 1 | DISK WARNING - free space:
> / 1902 MB (10% inode=-): | 3 | 3 |
> | 2007-11-05 13:56:39 | 1 | 0 | DISK WARNING - free space:
> / 1939 MB (10% inode=-): | 1 | 3 |
> | 2007-11-05 13:57:19 | 1 | 0 | DISK WARNING - free space:
> / 1939 MB (10% inode=-): | 2 | 3 |
> | 2007-11-05 13:58:41 | 0 | 1 | DISK OK - free space: /
> 2639 MB (14% inode=-): | 3 | 3 |
> +---------------------+-------+------------+-----------------------------------------------------+-----------------------+--------------------+
>
> 4 rows in set (0.00 sec)
>
>
> Note that the state change from warn to critical at 13:51:18 has been
> missed from here.
>
> These are the relevant lines from nagios.log (the first just to show
> that there were no interesting entries before 13:52:07):
>
> Mon Nov 5 13:50:57 2007 SERVICE ALERT:
> unrelatedhost;TCP/IP;CRITICAL;HARD;1;PING CRITICAL - Packet loss = 100%
> Mon Nov 5 13:52:07 2007 HOST ALERT: hostname;DOWN;SOFT;1;PING CRITICAL
> - Packet loss = 100%
> Mon Nov 5 13:52:17 2007 HOST ALERT: hostname;DOWN;SOFT;2;PING CRITICAL
> - Packet loss = 100%
> Mon Nov 5 13:52:37 2007 HOST ALERT: hostname;UNREACHABLE;HARD;3;PING
> CRITICAL - Packet loss = 100%
> Mon Nov 5 13:52:37 2007 SERVICE ALERT:
> hostname;/home;CRITICAL;HARD;1;CHECK_NRPE: Socket timeout after 10 seconds.
> Mon Nov 5 13:52:47 2007 SERVICE ALERT:
> hostname;/;CRITICAL;HARD;1;CHECK_NRPE: Socket timeout after 10 seconds.
> Mon Nov 5 13:53:56 2007 HOST ALERT: hostname;UP;HARD;1;PING OK - Packet
> loss = 0%, RTA = 33.70 ms
> Mon Nov 5 13:56:39 2007 SERVICE ALERT:
> hostname;/home;WARNING;SOFT;1;DISK WARNING - free space: / 1939 MB (10%
> inode=-):
> Mon Nov 5 13:56:39 2007 SERVICE ALERT: hostname;/;WARNING;SOFT;1;DISK
> WARNING - free space: / 1939 MB (10% inode=-):
>
> The log entries show that the service failure is due to a host down state.
>
Yes, but it also shows that the services swap to STATE_CRITICAL during that
time, which correctly resets the current_attempt (since it's a new state).
>
> SOLUTION
>
> Attached is a patch that works for us by changing a few lines in checks.c.
>
> We managed to recreate this manually on a pre-patched system and saw the
> same behaviour. After the patch was applied, the state change from warn
> to critical was correctly added into the state history table, and the
> subsequent return to warn was also added as a hard state.
>
> This is on Nagios 2.9 (with Altinity patches) & NDOUtils 1.4b3.
>
> Does this seem valid?
>
Not really, no. What you describe as a bug (state changes aren't logged),
you fix by touching a much higher logic. The fact that current_attempt
is reset upon state change is about as old as nagios itself, and since
"the right thing to do" there is subject to personal opinion, I might
as well chime in with mine ;-)
It would be best to follow the path of the least surprise. In this case,
I think this simple rule describes that path rather well:
"Whenever a state change occurs, the check attempt value is reset."
Besides that, I'm curious to know how this changes notification behaviour.
Will contacts that want to be notified only for critical services still
get a notification sent to them even if the current_attempt is not reset
and the service was in warning state prior to going critical? If not, your
patch is incomplete. If yes, it would be nice with some confirmation in
the bug report. For the record, I'd still think this is the wrong "fix",
as it changes something completely orthogonal and simply hides the real
bug.
On a side note, I'm a little unclear about what you're actually reporting
as the bug. The fact that obj->current_attempt is reset, or the fact that
state entries are missing from the NDOUtils table. The report seems to
imply both, while common sense suggests the latter and the patch amends
the former.
--
Andreas Ericsson andreas.ericsson at op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231
-------------------------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems? Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now >> http://get.splunk.com/
More information about the Developers
mailing list