Confusion over current_state and last_hard_state in neb status callback

Ben bench at silentmedia.com
Thu Nov 11 20:58:49 CET 2004
I'm capturing host status and service status callbacks in a neb module,
and I'm not really clear about the logic of how current_state and
last_hard_state get set. Hopefully somebody else is. Below are some table
snippets to show what I'm seeing.

The columns are, in order: 
- the unique id of this service/host check, 
- when this state started,
- the seconds the states remained unchanged (null when they're the current values), 
- the soft_state (this is current_state from the callback), 
- the last_hard_state, 
- the current_attempt, 
- the plugin_output. 

Note that the current_attempt value gets updated in place when the states 
don't change, instead of inserting a new row with the same states but a 
different current_attempt, as you might expect. Also, the plugin_output 
value is the value at the start of the state, not after the most recent 
attempt.
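
In case the bookkeeping isn't clear, here is a minimal sketch in C of how
rows are recorded as described above. Every name in it (state_row,
record_status, rows, and so on) is made up for illustration and is not part
of the nagios or neb API; a real module would call something like this from
its status callback and write to a database instead of an array.

```c
#include <stdio.h>
#include <time.h>

#define MAX_ROWS   64
#define OUTPUT_LEN 128

/* One table row: a run of callbacks that share the same state pair. */
struct state_row {
    int    object_id;         /* unique id of the host/service check */
    time_t started;           /* when this state pair was first seen */
    long   duration;          /* seconds the pair lasted; -1 while current */
    int    current_state;     /* the "soft_state" column */
    int    last_hard_state;
    int    current_attempt;   /* overwritten on every matching callback */
    char   plugin_output[OUTPUT_LEN]; /* output at the start of the run */
};

static struct state_row rows[MAX_ROWS];
static int row_count = 0;

/* Find the still-open (current) row for an object, or NULL if none. */
static struct state_row *open_row(int object_id)
{
    for (int i = row_count - 1; i >= 0; i--)
        if (rows[i].object_id == object_id && rows[i].duration < 0)
            return &rows[i];
    return NULL;
}

/*
 * Record one status callback.
 * Returns 1 if a new row was inserted, 0 if the open row was updated
 * in place, and -1 if the table is full.
 */
int record_status(int object_id, time_t now, int current_state,
                  int last_hard_state, int current_attempt,
                  const char *plugin_output)
{
    struct state_row *cur = open_row(object_id);

    /* Same state pair: only current_attempt changes. */
    if (cur && cur->current_state == current_state
            && cur->last_hard_state == last_hard_state) {
        cur->current_attempt = current_attempt;
        return 0;
    }

    if (row_count >= MAX_ROWS)
        return -1;

    /* State pair changed: close the old row, then open a new one. */
    if (cur)
        cur->duration = (long)difftime(now, cur->started);

    struct state_row *row = &rows[row_count++];
    row->object_id = object_id;
    row->started = now;
    row->duration = -1;
    row->current_state = current_state;
    row->last_hard_state = last_hard_state;
    row->current_attempt = current_attempt;
    snprintf(row->plugin_output, OUTPUT_LEN, "%s", plugin_output);
    return 1;
}
```

The consequence is that a row's current_attempt is the attempt number of
the last callback that matched that state pair, while its plugin_output is
from the first one.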

Clear as mud?  Cool, here we go.....




 484 | 2004-11-10 15:50:45-08 |       | 0 | 0 | 1 | PING OK - Packet loss = 0%, RTA = 88.40 ms

This is pretty obvious and straightforward. A ping check succeeded on its 
first try, and so the current_state is 0. It hasn't had any failures, 
either, so the last_hard_state is also 0.


 113 | 2004-11-10 15:06:59-08 | 29346 | 0 | 0 | 1 | PING OK - Packet loss = 0%, RTA = 32.21 ms
 113 | 2004-11-10 23:16:05-08 |    86 | 1 | 0 | 1 | PING WARNING - Packet loss = 0%, RTA = 250.70 ms
 113 | 2004-11-10 23:17:31-08 |       | 0 | 0 | 1 | PING OK - Packet loss = 0%, RTA = 144.14 ms

Another simple example to verify foundations. We have a ping check that's
working fine for a long time, then blips with a warning, but 86 seconds
and another try later, we return to an ok state. We know there was only 1
try that resulted in a soft error state, because otherwise current_attempt
would have been greater than 1 on that middle row.


 141 | 2004-11-10 15:32:44-08 | 54141 | 0 | 0 | 1 | PING OK - Packet loss = 0%, RTA = 51.94 ms
 141 | 2004-11-11 06:35:05-08 |    59 | 1 | 0 | 1 | PING WARNING - Packet loss = 0%, RTA = 334.69 ms
 141 | 2004-11-11 06:36:04-08 | 10801 | 0 | 0 | 1 | PING OK - Packet loss = 0%, RTA = 115.25 ms
 141 | 2004-11-11 09:36:05-08 |   123 | 1 | 0 | 2 | PING WARNING - Packet loss = 0%, RTA = 280.24 ms
 141 | 2004-11-11 09:38:08-08 |       | 0 | 0 | 1 | PING OK - Packet loss = 0%, RTA = 51.70 ms

Here's a ping that starts off working, blips with a warning, returns to a 
working state, blips twice with a warning, then returns again to a working 
state before max_attempts=3 is reached. Nothing special here.

Enough of the groundwork. Here's where the confusion starts:


5655 | 2004-11-10 15:03:30-08 | 58563 | 0 | 0 | 1 | HTTP ok: HTTP/1.1 200 Channel Listing -   0.041 second response time
5655 | 2004-11-11 07:19:33-08 |   553 | 2 | 0 | 5 | Socket timeout after 30 seconds
5655 | 2004-11-11 07:28:46-08 |  8322 | 2 | 2 | 5 | Socket timeout after 30 seconds
5655 | 2004-11-11 09:47:28-08 |     0 | 0 | 2 | 5 | HTTP ok: HTTP/1.1 200 Channel Listing -   1.150 second response time
5655 | 2004-11-11 09:47:28-08 |       | 0 | 0 | 1 | HTTP ok: HTTP/1.1 200 Channel Listing -   1.150 second response time

We start off with an http check in a good state. Then it enters a critical 
state (2), and stays in that soft error state for 5 attempts. At that 
point, it enters a hard critical state and last_hard_state also gets set 
to 2. It's still currently having problems too, though, so 
current_state also stays at 2. Then, at 9:47, it recovers, but somehow 
manages to get 5 checks done in 0 seconds. That's my first point of 
confusion. I would have thought that if soft_state was ok (0), then 
regardless of last_hard_state, there would be no more attempts and the 
service would recover. This might be a bug in nagios, where it's sending 
the neb callback the wrong current_attempt number.


 134 | 2004-11-10 15:06:59-08 | 51015 | 0 | 0 | 1 | PING OK - Packet loss = 0%, RTA = 65.07 ms
 134 | 2004-11-11 05:17:14-08 |   132 | 1 | 0 | 2 | PING WARNING - Packet loss = 0%, RTA = 200.57 ms
 134 | 2004-11-11 05:19:26-08 |  3615 | 1 | 1 | 3 | PING WARNING - Packet loss = 0%, RTA = 273.67 ms
 134 | 2004-11-11 06:19:41-08 |       | 0 | 0 | 1 | PING OK - Packet loss = 0%, RTA = 66.00 ms

The ping starts off fine, then enters a warning state for 2 attempts. 
Even though max_attempts=3, it somehow doesn't log the 3rd attempt, and 
instead goes right into a hard warning state (1) after 2 soft errors. I 
don't understand why that would be. Anyway, then it stays in the hard 
warning state for a while, before magically jumping back to an ok state. I 
would have expected the fourth row to have a {soft,hard} state pair of 
{0,1}, but I don't see that. FWIW, the VAST majority of my records over the 
last 2 days are like this.


  94 | 2004-11-10 15:49:35-08 |  7593 | 1 | 1 | 3 | Warning - Non-critical error(s): account
  94 | 2004-11-10 17:56:08-08 |  3595 | 3 | 3 | 3 | Unknown - unknown to redbull
  94 | 2004-11-10 18:56:03-08 |       | 1 | 1 | 3 | Warning - Non-critical error(s): account

Here's a service with max_attempts=3 that's been in a warning state for a
while. Then it switches to an unknown state for an hour before returning
to a warning state. As with the above situation, I don't see any 
transitions - I would have expected {soft,hard} state pairs of {3,1} and 
{1,3} to be sandwiching the middle row.



So, in writing this I think I've boiled my question down to this: Are there 
supposed to be transition states or not? Is last_hard_state the last hard 
state, or the current hard state? Given the name I imagine it's the 
former, but if that's the case I would expect transitions, and most of the 
time I'm not seeing that. 
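
To make the two readings concrete, here is a toy model in C. It is only a
sketch of the question, not nagios's actual logic: check_model,
status_pair, apply_result and the report_previous flag are all
hypothetical, and the attempt counting is a simplification (every OK
result is treated as an immediate recovery). With report_previous=0,
last_hard_state means "the current hard state"; with report_previous=1 it
means "the hard state before this check", which is the reading that
produces transition rows.

```c
#define STATE_OK 0

/* Toy per-check state; not a nagios structure. */
struct check_model {
    int current_state;
    int hard_state;        /* most recently confirmed hard state */
    int current_attempt;
    int max_attempts;
};

/* What a status callback would report under this model. */
struct status_pair {
    int current_state;
    int last_hard_state;
    int current_attempt;
};

/*
 * Feed one check result into the model and return the reported values.
 * report_previous = 0: last_hard_state is the hard state after this check.
 * report_previous = 1: last_hard_state is the hard state before this check.
 */
struct status_pair apply_result(struct check_model *m, int result,
                                int report_previous)
{
    int hard_before = m->hard_state;
    int is_hard;

    if (result == STATE_OK) {
        /* Assumption: any OK result recovers at once and resets attempts. */
        m->current_attempt = 1;
        is_hard = 1;
    } else {
        if (m->current_state == STATE_OK)
            m->current_attempt = 1;          /* first failure after OK */
        else if (m->current_attempt < m->max_attempts)
            m->current_attempt++;            /* still retrying */
        /* Otherwise stay at max_attempts: already in a hard problem. */
        is_hard = (m->current_attempt >= m->max_attempts);
    }

    m->current_state = result;
    if (is_hard)
        m->hard_state = result;

    struct status_pair p;
    p.current_state = result;
    p.last_hard_state = report_previous ? hard_before : m->hard_state;
    p.current_attempt = m->current_attempt;
    return p;
}
```

Feeding OK, WARNING, WARNING, WARNING, OK through this with max_attempts=3
and report_previous=0 gives pairs {0,0}, {1,0}, {1,0}, {1,1}, {0,0}, with
no transition row on recovery. With report_previous=1 the recovery shows up
as {0,1} first, which is what I would have expected to see in my tables.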


