Some hard state changes missing in NDOUtils
Ton Voon
ton.voon at altinity.com
Tue Nov 13 14:24:26 CET 2007
Hi!
We've been doing some work to validate the data in NDOUtils and found
a bug in Nagios and a missing state change entry. This happens when a
service is in a failed state and changes to a different state at the
same time that the host is considered down (or unreachable).
DETAIL
These are the servicecheck results in the database:
mysql> select start_time,state,state_type,output,current_check_attempt,max_check_attempts
       from nagios_servicechecks where service_object_id=445
       and start_time between '2007-11-05 13:40:00' and '2007-11-05 14:00:00';
+---------------------+-------+------------+-----------------------------------------------------+-----------------------+--------------------+
| start_time          | state | state_type | output                                              | current_check_attempt | max_check_attempts |
+---------------------+-------+------------+-----------------------------------------------------+-----------------------+--------------------+
| 2007-11-05 13:41:18 |     1 |          1 | DISK WARNING - free space: / 1938 MB (10% inode=-): |                     3 |                  3 |
| 2007-11-05 13:46:18 |     1 |          1 | DISK WARNING - free space: / 1939 MB (10% inode=-): |                     3 |                  3 |
| 2007-11-05 13:51:18 |     2 |          1 | CHECK_NRPE: Socket timeout after 10 seconds.        |                     1 |                  3 |
| 2007-11-05 13:56:18 |     1 |          0 | DISK WARNING - free space: / 1939 MB (10% inode=-): |                     1 |                  3 |
| 2007-11-05 13:57:18 |     1 |          0 | DISK WARNING - free space: / 1939 MB (10% inode=-): |                     2 |                  3 |
| 2007-11-05 13:58:39 |     0 |          1 | DISK OK - free space: / 2639 MB (14% inode=-):      |                     1 |                  3 |
+---------------------+-------+------------+-----------------------------------------------------+-----------------------+--------------------+
6 rows in set (0.02 sec)
Note that the current_check_attempt is 1/3 for the CRITICAL event at
13:51:18. This should be 3/3. A side effect of this is that the
subsequent warning at 13:56:18 is now considered a soft state when it
should remain as hard.
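The expectation described above can be written down as a small Python
sketch. This is only a model of the behaviour we expect, not code from
Nagios; the names ServiceState and expected_after_hard_problem are made
up for illustration, and the sketch deliberately covers only the case
of a hard problem state changing to another problem state.

```python
from dataclasses import dataclass

# State and state_type codes as they appear in the NDOUtils tables above.
OK, WARNING, CRITICAL = 0, 1, 2
SOFT, HARD = 0, 1


@dataclass
class ServiceState:
    state: int
    state_type: int
    attempt: int
    max_attempts: int


def expected_after_hard_problem(prev: ServiceState, new_state: int) -> ServiceState:
    """Expected result when a service in a HARD problem state returns a
    different problem state (hypothetical model, not the Nagios source)."""
    if prev.state_type != HARD or prev.state == OK or new_state == OK:
        raise ValueError("sketch only covers hard problem -> problem changes")
    # The service stays hard and the attempt counter stays at the maximum.
    return ServiceState(new_state, HARD, prev.max_attempts, prev.max_attempts)


# WARNING (hard, 3/3) at 13:46:18 followed by the CRITICAL result at 13:51:18
warn = ServiceState(WARNING, HARD, 3, 3)
crit = expected_after_hard_problem(warn, CRITICAL)
# ... and the return to WARNING at 13:56:18 should then also stay hard.
warn_again = expected_after_hard_problem(crit, WARNING)
```

Under this model the 13:51:18 row would read state=2, state_type=1,
3/3, and the 13:56:18 row would read state=1, state_type=1, 3/3.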
Looking at the state history table, we get:
mysql> select state_time,state,state_type,output,current_check_attempt,max_check_attempts
       from nagios_statehistory where object_id=445
       and state_time between '2007-11-05 11:50:00' and '2007-11-05 14:00:00';
+---------------------+-------+------------+-----------------------------------------------------+-----------------------+--------------------+
| state_time          | state | state_type | output                                              | current_check_attempt | max_check_attempts |
+---------------------+-------+------------+-----------------------------------------------------+-----------------------+--------------------+
| 2007-11-05 11:51:05 |     1 |          1 | DISK WARNING - free space: / 1902 MB (10% inode=-): |                     3 |                  3 |
| 2007-11-05 13:56:39 |     1 |          0 | DISK WARNING - free space: / 1939 MB (10% inode=-): |                     1 |                  3 |
| 2007-11-05 13:57:19 |     1 |          0 | DISK WARNING - free space: / 1939 MB (10% inode=-): |                     2 |                  3 |
| 2007-11-05 13:58:41 |     0 |          1 | DISK OK - free space: / 2639 MB (14% inode=-):      |                     3 |                  3 |
+---------------------+-------+------------+-----------------------------------------------------+-----------------------+--------------------+
4 rows in set (0.00 sec)
Note that the state change from warn to critical at 13:51:18 is
missing from this table.
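A rough way to spot this kind of gap automatically is to walk the
service check results and look for a matching state history row for
every change of state. The Python sketch below does that with the
values from the two tables above. The function name and the 60 second
matching tolerance are assumptions for illustration; they are not part
of NDOUtils.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"

# (start_time, state) taken from the nagios_servicechecks output above
checks = [
    ("2007-11-05 13:41:18", 1),
    ("2007-11-05 13:46:18", 1),
    ("2007-11-05 13:51:18", 2),
    ("2007-11-05 13:56:18", 1),
    ("2007-11-05 13:57:18", 1),
    ("2007-11-05 13:58:39", 0),
]

# (state_time, state) taken from the nagios_statehistory output above
history = [
    ("2007-11-05 11:51:05", 1),
    ("2007-11-05 13:56:39", 1),
    ("2007-11-05 13:57:19", 1),
    ("2007-11-05 13:58:41", 0),
]


def missing_state_changes(checks, history, tolerance=timedelta(seconds=60)):
    """Return start times of checks whose state differs from the previous
    check but has no history row with that state shortly afterwards."""
    hist = [(datetime.strptime(t, FMT), s) for t, s in history]
    missing = []
    prev_state = None
    for start, state in checks:
        when = datetime.strptime(start, FMT)
        if prev_state is not None and state != prev_state:
            matched = any(
                s == state and timedelta(0) <= ht - when <= tolerance
                for ht, s in hist
            )
            if not matched:
                missing.append(start)
        prev_state = state
    return missing


gaps = missing_state_changes(checks, history)
print(gaps)
```

With the data above, the only change without a history row is the
CRITICAL result at 13:51:18.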
These are the relevant lines from nagios.log (the first just to show
that there were no interesting entries before 13:52:07):
Mon Nov 5 13:50:57 2007 SERVICE ALERT: unrelatedhost;TCP/IP;CRITICAL;HARD;1;PING CRITICAL - Packet loss = 100%
Mon Nov 5 13:52:07 2007 HOST ALERT: hostname;DOWN;SOFT;1;PING CRITICAL - Packet loss = 100%
Mon Nov 5 13:52:17 2007 HOST ALERT: hostname;DOWN;SOFT;2;PING CRITICAL - Packet loss = 100%
Mon Nov 5 13:52:37 2007 HOST ALERT: hostname;UNREACHABLE;HARD;3;PING CRITICAL - Packet loss = 100%
Mon Nov 5 13:52:37 2007 SERVICE ALERT: hostname;/home;CRITICAL;HARD;1;CHECK_NRPE: Socket timeout after 10 seconds.
Mon Nov 5 13:52:47 2007 SERVICE ALERT: hostname;/;CRITICAL;HARD;1;CHECK_NRPE: Socket timeout after 10 seconds.
Mon Nov 5 13:53:56 2007 HOST ALERT: hostname;UP;HARD;1;PING OK - Packet loss = 0%, RTA = 33.70 ms
Mon Nov 5 13:56:39 2007 SERVICE ALERT: hostname;/home;WARNING;SOFT;1;DISK WARNING - free space: / 1939 MB (10% inode=-):
Mon Nov 5 13:56:39 2007 SERVICE ALERT: hostname;/;WARNING;SOFT;1;DISK WARNING - free space: / 1939 MB (10% inode=-):
The log entries show that the service failure is due to a host down
state.
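To check the same thing on a larger log, the alert lines can be parsed
and the service alerts that fall inside a host problem window picked
out. The Python sketch below assumes the timestamp format shown in the
excerpt above (the raw nagios.log may use a different one), and the
regular expression and function name are illustrative only.

```python
import re

LOG = """\
Mon Nov 5 13:50:57 2007 SERVICE ALERT: unrelatedhost;TCP/IP;CRITICAL;HARD;1;PING CRITICAL - Packet loss = 100%
Mon Nov 5 13:52:07 2007 HOST ALERT: hostname;DOWN;SOFT;1;PING CRITICAL - Packet loss = 100%
Mon Nov 5 13:52:17 2007 HOST ALERT: hostname;DOWN;SOFT;2;PING CRITICAL - Packet loss = 100%
Mon Nov 5 13:52:37 2007 HOST ALERT: hostname;UNREACHABLE;HARD;3;PING CRITICAL - Packet loss = 100%
Mon Nov 5 13:52:37 2007 SERVICE ALERT: hostname;/home;CRITICAL;HARD;1;CHECK_NRPE: Socket timeout after 10 seconds.
Mon Nov 5 13:52:47 2007 SERVICE ALERT: hostname;/;CRITICAL;HARD;1;CHECK_NRPE: Socket timeout after 10 seconds.
Mon Nov 5 13:53:56 2007 HOST ALERT: hostname;UP;HARD;1;PING OK - Packet loss = 0%, RTA = 33.70 ms
Mon Nov 5 13:56:39 2007 SERVICE ALERT: hostname;/home;WARNING;SOFT;1;DISK WARNING - free space: / 1939 MB (10% inode=-):
Mon Nov 5 13:56:39 2007 SERVICE ALERT: hostname;/;WARNING;SOFT;1;DISK WARNING - free space: / 1939 MB (10% inode=-):
"""

# Timestamp, alert kind, then the semicolon-separated fields.
LINE_RE = re.compile(
    r"^\w{3} \w{3} +\d{1,2} (?P<time>\d\d:\d\d:\d\d) \d{4} "
    r"(?P<kind>HOST|SERVICE) ALERT: (?P<rest>.+)$"
)


def service_alerts_during_host_problem(log_text):
    """Return (time, host, service, state) for every service alert logged
    while its host was DOWN or UNREACHABLE."""
    hosts_in_problem = set()
    hits = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue
        fields = m.group("rest").split(";")
        if m.group("kind") == "HOST":
            host, state = fields[0], fields[1]
            if state == "UP":
                hosts_in_problem.discard(host)
            else:
                hosts_in_problem.add(host)
        else:
            host, service, state = fields[0], fields[1], fields[2]
            if host in hosts_in_problem:
                hits.append((m.group("time"), host, service, state))
    return hits


alerts = service_alerts_during_host_problem(LOG)
for alert in alerts:
    print(alert)
```

For the excerpt above this picks out the two CRITICAL alerts for /home
and / at 13:52:37 and 13:52:47, both logged between the first host
DOWN alert and the host recovery at 13:53:56.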
SOLUTION
Attached is a patch that works for us by changing a few lines in
checks.c.
We managed to recreate this manually on an unpatched system and saw
the same behaviour. After the patch was applied, the state change
from warn to critical was correctly added to the state history
table, and the subsequent return to warn was also recorded as a hard
state.
This is on Nagios 2.9 (with Altinity patches) & NDOUtils 1.4b3.
Does this seem valid?
Ton
http://www.altinity.com
T: +44 (0)870 787 9243
F: +44 (0)845 280 1725
Skype: tonvoon
-------------- next part --------------
A non-text attachment was scrubbed...
Name: nagios_host_failures_cause_incorrect_service_states.patch
Type: application/octet-stream
Size: 1930 bytes
Desc: not available
URL: <https://www.monitoring-lists.org/archive/developers/attachments/20071113/ffac26c7/attachment.obj>