Some hard state changes missing in NDOUtils
Ton Voon
ton.voon at altinity.com
Tue Nov 13 18:32:36 CET 2007
Andreas,
On 13 Nov 2007, at 16:13, Andreas Ericsson wrote:
> It would be best to follow the path of the least surprise. In this case,
> I think this simple rule describes that path rather well:
> "Whenever a state change occurs, the check attempt value is reset."
I think you are confusing this with a state change from OK to a failed
state. What I have described is a state change from one failed state to
a different failed state, at the same time that a host is discovered to
be down/unreachable.
http://nagios.sourceforge.net/docs/2_0/statetypes.html
...says that a hard state change occurs when it goes from a "hard
non-OK state of some kind to a hard non-OK state of another kind (i.e.
from a hard WARNING state to a hard UNKNOWN state)". I assert that
this does not happen in this particular case.
There is nothing on that documentation page about check attempt
values between hard states, but I also suggest that if something is
on check attempt 3 out of max attempts 3 in a warning state, then if
the next result is critical, the check attempt should remain at 3.
I've just tested this using passive checks and this appears to be
true, so your assumption that "whenever a state change occurs, the
check attempt value is reset" is not current Nagios behaviour.
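As an illustration of the behaviour described above, here is a toy model
in Python. It is only a sketch of the rule as stated in this message, not
the Nagios source: the Service class, apply_result() and the state
constants are invented for the example, and soft recoveries are not
modelled.

```python
from dataclasses import dataclass

# Hypothetical state constants for this sketch (not taken from the Nagios headers).
OK, WARNING, CRITICAL, UNKNOWN = range(4)


@dataclass
class Service:
    state: int = OK
    hard: bool = True
    attempt: int = 1
    max_attempts: int = 3


def apply_result(svc: Service, new_state: int) -> Service:
    """Apply one check result to a toy service; illustrative only."""
    if new_state == OK:
        # Recovery: back to attempt 1 (soft recoveries are not modelled).
        svc.state, svc.hard, svc.attempt = OK, True, 1
    elif svc.state == OK:
        # First failure after OK: counting starts at attempt 1.
        svc.state, svc.attempt = new_state, 1
        svc.hard = svc.attempt >= svc.max_attempts
    elif not svc.hard:
        # Still retrying: count up until max_attempts makes the state hard.
        svc.state = new_state
        svc.attempt = min(svc.attempt + 1, svc.max_attempts)
        svc.hard = svc.attempt >= svc.max_attempts
    else:
        # Hard non-OK to another non-OK state: the state changes,
        # but the attempt counter is left alone.
        svc.state = new_state
    return svc


# Hard WARNING at 3/3, then a CRITICAL result arrives.
svc = Service(state=WARNING, hard=True, attempt=3)
apply_result(svc, CRITICAL)
print(svc.attempt, svc.max_attempts, svc.hard)
```

In this model the counter stays at 3 of 3 after the WARNING to CRITICAL
change, which is the behaviour observed with passive checks.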
> Besides that, I'm curious to know how this changes notification
> behaviour.
Notification logic has not been touched. The fix for the hard state
change is the call to handle_service_event() (which in turn calls
event handlers as well as updating NDO).
Normally, every call to log_service_event() (which writes the
nagios.log entry) is followed by a call to handle_service_event().
That second call is missing in this scenario, which is how I reached
this conclusion.
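The pairing can be pictured with a small Python sketch. Only the two
function names come from Nagios; their bodies here are stand-ins that
record a trace, and process_hard_change() and unpaired_logs() are
hypothetical helpers that model the omission described in this thread
rather than the actual check-result code.

```python
from typing import List, Tuple

Call = Tuple[str, str]  # (function name, service description)


def log_service_event(trace: List[Call], svc: str) -> None:
    # Stand-in: the real function writes the nagios.log entry.
    trace.append(("log_service_event", svc))


def handle_service_event(trace: List[Call], svc: str) -> None:
    # Stand-in: the real function runs event handlers and updates NDO.
    trace.append(("handle_service_event", svc))


def process_hard_change(trace: List[Call], svc: str, host_is_up: bool) -> None:
    """Hypothetical result path that models the reported omission."""
    log_service_event(trace, svc)
    if host_is_up:
        handle_service_event(trace, svc)
    # When the host has just failed, the handler call is skipped,
    # so nothing reaches the event handlers or NDO.


def unpaired_logs(trace: List[Call]) -> List[str]:
    """Return services whose log entry is not directly followed by a handler call."""
    missing = []
    for i, (name, svc) in enumerate(trace):
        if name != "log_service_event":
            continue
        nxt = trace[i + 1] if i + 1 < len(trace) else None
        if nxt != ("handle_service_event", svc):
            missing.append(svc)
    return missing


trace: List[Call] = []
process_hard_change(trace, "web/HTTP", host_is_up=True)
process_hard_change(trace, "db/MySQL", host_is_up=False)
print(unpaired_logs(trace))
```

Only the service on the failed host is reported as unpaired, which is
the kind of gap that shows up as a missing row on the NDO side.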
> On a side note, I'm a little unclear about what you're actually reporting
> as the bug. The fact that obj->current_attempt is reset, or the fact that
> state entries are missing from the NDOUtils table. The report seems to
> imply both, while common sense suggests the latter and the patch amends
> the former.
Apologies. Looking back, my summary alluded to two bugs, but I failed
to fully detail them. For the record there are two bugs:
1) check_attempts is reset incorrectly when a service is currently
in HARD state and the host has just failed
2) the event handlers are not called in this scenario, thus a
record is not propagated to NDO
Thinking about (1) a bit more, it is possible that a service in a
soft error state would show a hard error with check attempts 1/3,
which is counter-intuitive. However, this is the case mentioned in
http://nagios.sourceforge.net/docs/2_0/statetypes.html, which says a
hard state change happens "When a service check results in a non-OK
state and its corresponding host is either DOWN or UNREACHABLE. This is
an exception to the general monitoring logic, but makes perfect sense.
If the host isn't up why should we try and recheck the service?".
However, I can see that there may still be problems with my fix to
(1), so peer review is welcome.
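To make the cases in (1) concrete, here is a rough Python sketch. It is
not the patch and not Nagios code: the Service class, the function name
and the keep_attempt_when_hard flag are all hypothetical, and "reset" is
assumed to mean setting the attempt counter back to 1.

```python
from dataclasses import dataclass


@dataclass
class Service:
    state: str        # "OK", "WARNING", "CRITICAL" or "UNKNOWN"
    state_type: str   # "SOFT" or "HARD"
    attempt: int
    max_attempts: int = 3


def non_ok_result_with_host_down(svc: Service, new_state: str,
                                 keep_attempt_when_hard: bool) -> str:
    """Non-OK service result arriving while the host is DOWN/UNREACHABLE."""
    was_hard = svc.state_type == "HARD"
    svc.state = new_state
    svc.state_type = "HARD"  # host is down, so the result is hard at once
    if not (was_hard and keep_attempt_when_hard):
        # Assumed meaning of "reset": start again from attempt 1.
        svc.attempt = 1
    return f"{svc.state} {svc.state_type} {svc.attempt}/{svc.max_attempts}"


# Bug (1) as reported: a hard 3/3 service drops back to 1/3.
reported = non_ok_result_with_host_down(
    Service("WARNING", "HARD", 3), "CRITICAL", keep_attempt_when_hard=False)

# With the counter preserved for services that were already hard.
preserved = non_ok_result_with_host_down(
    Service("WARNING", "HARD", 3), "CRITICAL", keep_attempt_when_hard=True)

# A soft service still ends up hard at 1/3, the counter-intuitive case.
soft_case = non_ok_result_with_host_down(
    Service("WARNING", "SOFT", 2), "CRITICAL", keep_attempt_when_hard=True)

print(reported, preserved, soft_case, sep="\n")
```

The third call is the one that still looks odd, and it is the case the
documentation describes as an intended exception.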
Ton
http://www.altinity.com
T: +44 (0)870 787 9243
F: +44 (0)845 280 1725
Skype: tonvoon