Confusion over current_state and last_hard_state in neb status callback
Ben
bench at silentmedia.com
Thu Nov 11 20:58:49 CET 2004
I'm capturing host status and service status callbacks in a neb module,
and I'm not really clear about the logic of how current_state and
last_hard_state get set. Hopefully somebody else is. Below are some table
snippets to show what I'm seeing.
The columns are, in order:
- the unique id of this service/host check,
- when this state started,
- the number of seconds the states remained unchanged (null while they're still the current values),
- the soft_state (the callback's current_state),
- the last_hard_state,
- the current_attempt,
- the plugin_output.
Note that the current_attempt value gets updated in place when the states
don't change, instead of inserting a new row with the same states but a
different current_attempt, as you might expect. Also, the plugin_output
value is the value at the start of the state, not after the most recent
attempt.
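To make that bookkeeping concrete, here is a rough Python sketch of how I collapse callbacks into rows. It is not the actual module code; StateRow and record are made-up names for illustration, and timestamps are plain epoch seconds.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StateRow:
    # Hypothetical row layout mirroring the columns listed above.
    check_id: int
    started: int               # epoch seconds when this state pair began
    duration: Optional[int]    # None while this is the current row
    soft_state: int
    last_hard_state: int
    current_attempt: int
    plugin_output: str         # output seen when the state pair began


def record(rows: List[StateRow], check_id: int, now: int, soft_state: int,
           last_hard_state: int, current_attempt: int,
           plugin_output: str) -> StateRow:
    """Fold one status callback into the table."""
    # Find the open (current) row for this check, if there is one.
    current = next((r for r in reversed(rows)
                    if r.check_id == check_id and r.duration is None), None)

    same_states = (current is not None
                   and current.soft_state == soft_state
                   and current.last_hard_state == last_hard_state)
    if same_states:
        # States unchanged: only current_attempt is updated in place;
        # started and plugin_output keep their original values.
        current.current_attempt = current_attempt
        return current

    if current is not None:
        # States changed: close the old row by filling in its duration.
        current.duration = now - current.started

    row = StateRow(check_id, now, None, soft_state, last_hard_state,
                   current_attempt, plugin_output)
    rows.append(row)
    return row
```

Feeding it a callback with the same {soft,hard} pair only touches current_attempt; any change in either state closes the old row and opens a new one.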
Clear as mud? Cool, here we go.....
484 | 2004-11-10 15:50:45-08 | | 0 | 0 | 1 | PING OK - Packet loss = 0%, RTA = 88.40 ms
This is pretty obvious and straightforward. A ping check succeeded on its
first try, and so the current_state is 0. It hasn't had any failures,
either, so the last_hard_state is also 0.
113 | 2004-11-10 15:06:59-08 | 29346 | 0 | 0 | 1 | PING OK - Packet loss = 0%, RTA = 32.21 ms
113 | 2004-11-10 23:16:05-08 | 86 | 1 | 0 | 1 | PING WARNING - Packet loss = 0%, RTA = 250.70 ms
113 | 2004-11-10 23:17:31-08 | | 0 | 0 | 1 | PING OK - Packet loss = 0%, RTA = 144.14 ms
Another simple example to verify foundations. We have a ping check that's
working fine for a long time, then blips with a warning, but 86 seconds
and another try later, we return to an ok state. We know there was only 1
try that resulted in a soft error state, because otherwise current_attempt
would have been greater than 1 on that middle row.
141 | 2004-11-10 15:32:44-08 | 54141 | 0 | 0 | 1 | PING OK - Packet loss = 0%, RTA = 51.94 ms
141 | 2004-11-11 06:35:05-08 | 59 | 1 | 0 | 1 | PING WARNING - Packet loss = 0%, RTA = 334.69 ms
141 | 2004-11-11 06:36:04-08 | 10801 | 0 | 0 | 1 | PING OK - Packet loss = 0%, RTA = 115.25 ms
141 | 2004-11-11 09:36:05-08 | 123 | 1 | 0 | 2 | PING WARNING - Packet loss = 0%, RTA = 280.24 ms
141 | 2004-11-11 09:38:08-08 | | 0 | 0 | 1 | PING OK - Packet loss = 0%, RTA = 51.70 ms
Here's a ping that starts off working, blips with a warning, returns to a
working state, blips twice with a warning, then returns again to a working
state before max_attempts=3 is reached. Nothing special here.
Enough of the groundwork. Here's where the confusion starts:
5655 | 2004-11-10 15:03:30-08 | 58563 | 0 | 0 | 1 | HTTP ok: HTTP/1.1 200 Channel Listing - 0.041 second response time
5655 | 2004-11-11 07:19:33-08 | 553 | 2 | 0 | 5 | Socket timeout after 30 seconds
5655 | 2004-11-11 07:28:46-08 | 8322 | 2 | 2 | 5 | Socket timeout after 30 seconds
5655 | 2004-11-11 09:47:28-08 | 0 | 0 | 2 | 5 | HTTP ok: HTTP/1.1 200 Channel Listing - 1.150 second response time
5655 | 2004-11-11 09:47:28-08 | | 0 | 0 | 1 | HTTP ok: HTTP/1.1 200 Channel Listing - 1.150 second response time
We start off with an http check in a good state. Then it enters a critical
state (2), and stays in that soft error state for 5 attempts. At that
point, it enters a hard critical state and last_hard_state also gets set
to 2. It's still currently having problems, though, so
current_state also stays at 2. Then, at 9:47, it recovers, but somehow
manages to get 5 checks done in 0 seconds. That's my first point of
confusion. I would have thought that if soft_state was ok (0), then
regardless of last_hard_state, there would be no more attempts and the
service would recover. This might be a bug in nagios, where it's sending
the neb callback the wrong current_attempt number.
134 | 2004-11-10 15:06:59-08 | 51015 | 0 | 0 | 1 | PING OK - Packet loss = 0%, RTA = 65.07 ms
134 | 2004-11-11 05:17:14-08 | 132 | 1 | 0 | 2 | PING WARNING - Packet loss = 0%, RTA = 200.57 ms
134 | 2004-11-11 05:19:26-08 | 3615 | 1 | 1 | 3 | PING WARNING - Packet loss = 0%, RTA = 273.67 ms
134 | 2004-11-11 06:19:41-08 | | 0 | 0 | 1 | PING OK - Packet loss = 0%, RTA = 66.00 ms
The ping starts off fine, then enters a warning state for 2 attempts.
Even though max_attempts=3, it somehow doesn't log the 3rd attempt, and
instead goes right into a hard warning state (1) after 2 soft errors. I
don't understand why that would be. Anyway, then it stays in the hard
warning state for a while, before magically jumping back to an ok state. I
would have expected the fourth row to have a {soft,hard} state pair of
{0,1}, but I don't see that. FWIW, the VAST majority of my records over the
last 2 days are like this.
94 | 2004-11-10 15:49:35-08 | 7593 | 1 | 1 | 3 | Warning - Non-critical error(s): account
94 | 2004-11-10 17:56:08-08 | 3595 | 3 | 3 | 3 | Unknown - unknown to redbull
94 | 2004-11-10 18:56:03-08 | | 1 | 1 | 3 | Warning - Non-critical error(s): account
Here's a service with max_attempts=3 that's been in a warning state for a
while. Then it switches to an unknown state for an hour before returning
to a warning state. As with the above situation, I don't see any
transitions - I would have expected {soft,hard} state pairs of {3,1} and
{1,3} to be sandwiching the middle row.
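Spelled out as code, the expectation I have in mind looks roughly like the Python sketch below. It only encodes my assumption (that a hard state change should be preceded by a row carrying the new soft state and the old hard state); it says nothing about what nagios actually does, and both function names are made up.

```python
def expected_transition(prev, new):
    """Return the {soft,hard} pair I would expect between two rows.

    prev and new are (soft_state, last_hard_state) tuples. If the hard
    state did not change, no transition row is expected (None).
    """
    prev_soft, prev_hard = prev
    new_soft, new_hard = new
    if prev_hard == new_hard:
        return None
    # Assumption: the new soft state shows up first, while
    # last_hard_state still holds the old hard state.
    return (new_soft, prev_hard)


def missing_transitions(pairs):
    """List the expected transition pairs that never got recorded."""
    missing = []
    for prev, new in zip(pairs, pairs[1:]):
        expected = expected_transition(prev, new)
        # If either neighbouring row already is the expected pair,
        # the transition was recorded and nothing is missing.
        if expected is not None and expected not in (prev, new):
            missing.append(expected)
    return missing


# Service 94 from the rows above: warning, unknown, warning.
print(missing_transitions([(1, 1), (3, 3), (1, 1)]))
```

For service 94 this reports the {3,1} and {1,3} pairs, and for ping 134 it reports {0,1}. The http check 5655 comes out with nothing missing, since its {2,0} and {0,2} rows were recorded.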
So, in writing this I think I've boiled my question down to this: Are there
supposed to be transition states or not? Is last_hard_state the last hard
state, or the current hard state? Given the name I imagine it's the
former, but if that's the case I would expect transitions, and most of the
time I'm not seeing that.