Child host becomes UNREACHABLE when parent changes from UP to a SOFT DOWN state
Aidan Anderson
mail at aidananderson.co.uk
Wed Apr 7 12:16:25 CEST 2010
Hi List!
I am in the process of upgrading from v2.12 to v3.2.1. As well as
upgrading, I am taking the opportunity to move to a new server at the
same time. This has allowed me to run both versions in tandem to
compare the operation of the two versions.
One difference I noticed straight away was downtime duration on certain
hosts. For example, v2 would show a host down for over 2 days yet v3
would show the same host as being down for only a few hours. On
investigation, it turned out that the parent of the host on v3 went into
a soft down state. This changed the host in question to an unreachable
state. The parent host recovered within a minute or so and changed the
host back to a down state, effectively resetting the down duration back
to zero. I would have expected that the child host should only change
state if the parent goes into a hard down state, not a soft down state.
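To make the sequence concrete, here is a minimal Python sketch of the behaviour described above. It is not Nagios code: the `Host` class, its methods and the timeline are all hypothetical, and it assumes only that the displayed duration is measured from the most recent state change.

```python
from dataclasses import dataclass


@dataclass
class Host:
    """Hypothetical stand-in for a monitored host; not a Nagios structure."""
    name: str
    state: str = "UP"
    last_state_change: int = 0  # seconds since an arbitrary start

    def apply_state(self, new_state: str, now: int) -> None:
        # Assumption: the duration clock restarts on every state change.
        if new_state != self.state:
            self.state = new_state
            self.last_state_change = now

    def duration(self, now: int) -> int:
        return now - self.last_state_change


def simulate(react_to_soft_parent_down: bool) -> int:
    """Return the child's displayed duration (seconds) at the end of the timeline."""
    child = Host("child")
    two_days = 2 * 24 * 3600
    child.apply_state("DOWN", now=0)  # the child really goes down here
    # Two days later the parent has a one-minute SOFT DOWN blip.
    if react_to_soft_parent_down:
        child.apply_state("UNREACHABLE", now=two_days)
        child.apply_state("DOWN", now=two_days + 60)  # parent recovers
    # Look at the child three hours after the blip.
    return child.duration(now=two_days + 60 + 3 * 3600)


v3_like = simulate(react_to_soft_parent_down=True)
v2_like = simulate(react_to_soft_parent_down=False)
print(v3_like // 3600, "hours vs", v2_like // 3600, "hours")
```

With the blip applied, the duration counts only from the flip back to DOWN; without it, the duration still counts from the original failure.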
I googled for the issue and found one related post from just over a year
ago:
http://www.mail-archive.com/nagios-users@lists.sourceforge.net/msg25543.html
The poster was given various suggestions to circumvent the problem, e.g.
tweaking flap detection, increasing the time-out on the plugin, etc., but
nothing seemed to resolve his issue.
The poster's main problem with this behaviour was that he was getting
down e-mail alerts for hosts that are already down due to the state
changes. My issue is not with repeated alerts but with the accuracy of
the down duration of the host. When our support department look to
resolve host problems, they will try and resolve the oldest problems
first for obvious reasons of fairness to our customers. This scenario
breaks that ordering. In v3, to get an accurate downtime for a host, you would
now have to trawl through the alert history or run a trend report for
the host to find out when the host really went down.
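As a rough illustration of that manual trawl, the sketch below recovers the real start of an outage by treating DOWN and UNREACHABLE as one unbroken "not UP" run. The `(timestamp, state)` tuples and the `outage_start` function are hypothetical; this is not the Nagios log format, and the events would first have to be extracted from the alert history by other means.

```python
from typing import Iterable, Optional, Tuple

Event = Tuple[int, str]  # (unix timestamp, state); hypothetical layout


def outage_start(events: Iterable[Event]) -> Optional[int]:
    """Return when the current unbroken non-UP run began, or None if the host is UP."""
    start = None
    for timestamp, state in sorted(events):
        if state == "UP":
            start = None       # a recovery ends any earlier outage
        elif start is None:
            start = timestamp  # first non-UP event of a new run
        # DOWN <-> UNREACHABLE flips inside a run leave the start unchanged
    return start


history = [(1000, "UP"), (2000, "DOWN"), (5000, "UNREACHABLE"), (5060, "DOWN")]
print(outage_start(history))
```

For this sample history the outage is dated from the first DOWN event rather than from the later flip back to DOWN.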
Version 2 does not exhibit this problem. I don't think this is by
design but purely down to the way serial host checks work in v2. When a
host goes into a soft down state in v2, Nagios cannot do anything else
until it has completed all the retries or the host recovers so Nagios
never gets the chance to mark the child host unreachable unless it
reaches max_check_attempts and determines that the parent host really is
down.
The original poster of this problem made a good point that Nagios has
all the tolerance built in to avoid false alarms on host checks but
unfortunately this logic doesn't carry on through child hosts.
I can't see the current way v3 deals with parent/child problems as
being desirable for most people, although it seems to have only bothered
2 of us!
Thoughts?
regards,
Aidan