Children "unreachable" on soft down?
Israel Brewster
israel at frontierflying.com
Mon Apr 6 20:55:33 CEST 2009
On Apr 6, 2009, at 9:03 AM, Giorgio Zarrelli wrote:
> Hi,
>
> It's not quite clear to me what is happening on your end,
Thanks for the response. For clarification, the exact sequence of
events is as follows:
1) The link between the Nagios box and one of our routers (which we
will refer to as the parent host) glitches for 30 seconds or so. Due to
the nature of the link (a satellite connection), this is semi-expected
and happens a couple of times a day.
2) Nagios catches this glitch in one of its regularly scheduled host
checks and puts the parent host into a soft down state. Again, normal
and expected - even good.
3) At the same time, Nagios puts the children of the parent host into
an "unreachable" state. That makes sense in itself, but it leads to the
issue described in step 6.
4) The parent host is now in recheck mode (as it is only in a soft
down state and has three rechecks set), so it checks again a minute
later. This check succeeds, as the outage was transitory. The parent
host is put back into an "UP" state. As it never was in a hard "down"
state, no notification is sent. This is good.
5) Since the parent is now up, the child host is now changed to a
(soft, I think) "down" state.
6) Checks continue on a normal schedule. As the link does not glitch
again for several hours, parent remains up and child remains
(correctly) down. Three checks later, child enters a hard "down" state
(since it was unreachable and only just switched back to down). Down
notification is sent for child.
7) Everything remains good for the next several hours until the link
glitches again. Repeat from step one.
The notification in step 6 is the problem here - the child host was
down before the glitch, the child host is still down after. But
because the child host was temporarily put in an unreachable state, we
get notified again that it is down, resulting in a string of "DOWN"
messages with no up or real change in status.
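
To make the sequence above concrete, here is a minimal Python sketch of
the behavior as I have described it. It is only a toy model of what the
logs suggest, not Nagios's actual scheduling or notification code. The
names (ChildHost, simulate) and the value of MAX_CHECK_ATTEMPTS are
assumptions made for illustration. The child host fails every check;
the only thing that varies is whether the parent was reachable at the
time of the check.

```python
from dataclasses import dataclass, field

# Assumed value, matching the "three rechecks" in the description above.
MAX_CHECK_ATTEMPTS = 3


@dataclass
class ChildHost:
    """Toy model of a child host whose own check always fails."""
    state: str = "PENDING"
    attempt: int = 0
    hard: bool = False
    notifications: list = field(default_factory=list)

    def record_failed_check(self, tick, parent_up):
        # A failed child is DOWN if its parent is up, UNREACHABLE otherwise.
        new_state = "DOWN" if parent_up else "UNREACHABLE"
        if new_state != self.state:
            # Any state change restarts the soft-state attempt counter.
            self.state, self.attempt, self.hard = new_state, 1, False
        elif not self.hard:
            self.attempt += 1
        if not self.hard and self.attempt >= MAX_CHECK_ATTEMPTS:
            self.hard = True
            # Only a hard DOWN notifies; UNREACHABLE stays silent here.
            if self.state == "DOWN":
                self.notifications.append(tick)


def simulate(parent_up_per_check):
    """Return the check indexes at which a DOWN notification is sent."""
    child = ChildHost()
    for tick, parent_up in enumerate(parent_up_per_check):
        child.record_failed_check(tick, parent_up)
    return child.notifications


# Two brief link glitches (False) in an otherwise healthy parent link.
timeline = [True] * 5 + [False] + [True] * 8 + [False] + [True] * 5
print(simulate(timeline))  # [2, 8, 17]
```

In this model the first notification (check 2) is the legitimate one.
The other two come purely from the parent's one-check glitches, even
though the child never changed its real status.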
> but one thing I have in mind is try
>
> soft_state_dependencies=0
>
> Besides that, the problem seems to be at the root of the check. It's
> not healthy to have a ping check failing every 2 strikes. Try changing
> the host alive check to use an ssh check instead.
The check is not failing every 2 strikes. It's failing once, briefly,
every few hours - just barely long enough to make one check fail and
throw the parent host into a soft down state. The first recheck (one
minute later) works fine, bringing the parent back to an up state. The
next several hundred or more checks also work fine (as the problem was
transitory and brief). For this reason, changing the check wouldn't
help - for the duration of that single check, the host really is down
(or more precisely, unreachable, as it is a link issue), and any check
I used would say so.
> Another approach, not so useful, would be to increase the timeout for
> the ping (-W) so it will have less chance of failing.
Except that it's not a timeout issue. It is a very real, albeit brief
(around 30 seconds or so), outage. Not long enough or frequent enough
to really impact productivity or anything, but long enough for Nagios
to catch it (for a single check).
-----------------------------------------------
Israel Brewster
Computer Support Technician II
Frontier Flying Service Inc.
5245 Airport Industrial Rd
Fairbanks, AK 99709
(907) 450-7250 x293
-----------------------------------------------
>
> Giorgio
>
> Israel Brewster (israel at frontierflying.com) wrote:
>>
>> So does anyone have any ideas as to how I can resolve this situation?
>> It continues to be an annoyance. Thanks.
>>
>> On Mar 31, 2009, at 8:17 AM, Israel Brewster wrote:
>>
>>> On Mar 31, 2009, at 1:09 AM, Andreas Ericsson wrote:
>>>
>>>> Israel Brewster wrote:
>>>>> Does Nagios (3.0.3) mark a child host as unreachable when its
>>>>> parent enters a soft down state? I am finding myself getting
>>>>> repeated down messages for a host (which is, in fact, down), even
>>>>> though I have notifications set to only send a single message.
>>>>> Looking at the logs, it would appear that what is happening is
>>>>> that the host is flipping between "down" (which notifies me) and
>>>>> "unreachable" (which does not). The parent host, however, never
>>>>> enters a hard down state. Looking at the logs, what I see is that
>>>>> one ICMP check fails, throwing the host into a soft down state,
>>>>> but the next one works just fine, bringing it back to an up
>>>>> state.
>>>>> The logic works fine for the parent host- since it never hits a
>>>>> hard down state, it doesn't alert, and everyone is happy. But
>>>>> apparently with the child host every time this happens, it
>>>>> switches from critical to unreachable and back again, triggering
>>>>> a notification. Is there any way to keep this from happening?
>>>>> Thanks.
>>>>
>>>> Doesn't flapping detection do what you want? You'd get a few
>>>> notifications, but they'd stop after the 3rd flip or something, I
>>>> think.
>>>
>>> Flapping detection helps, but doesn't solve the problem. For one
>>> thing, as you mentioned, you still get at least a couple of
>>> notifications before it kicks in. For another, this happens with a
>>> frequency of something like once an hour or so (not consistently),
>>> so the host will flip from down to unreachable and back again,
>>> triggering an e-mail, perhaps do it a second time, and then it will
>>> sit in the correct "down" state for the next 50 checks or so (thus
>>> canceling any flapping detection) before repeating the process.
>>> It's not like I'm getting messages every five minutes or anything,
>>> it's just that I'm getting repeated down messages every hour or two
>>> for hosts that have been down and haven't actually changed state.
>>>
>>> I could, of course, schedule down time, except that I want to be
>>> notified if/when the people in the remote station get their act
>>> together and get the machine(s) in question back online. Also, that
>>> is only partially effective for machines that have been sent in for
>>> repair, because I don't really know when the scheduled down time
>>> will be over. They are down, I know they are down, I just don't
>>> want to be told about it every few hours :-)
>>>
>>>>
>>>>
>>>> --
>>>> Andreas Ericsson andreas.ericsson at op5.se
>>>> OP5 AB www.op5.se
>>>> Tel: +46 8-230225 Fax: +46 8-230231
>>>>
>>>> Considering the successes of the wars on alcohol, poverty, drugs
>>>> and terror, I think we should give some serious thought to
>>>> declaring war on peace.
>>>
>>>
>>> ------------------------------------------------------------------------------
>>> _______________________________________________
>>> Nagios-users mailing list
>>> Nagios-users at lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/nagios-users
>>> ::: Please include Nagios version, plugin version (-v) and OS when
>>> reporting any issue.
>>> ::: Messages without supporting info will risk being sent to /dev/null
>>
>>
>>
>
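
On the flapping-detection point quoted above, a rough Python sketch of
why it never stays engaged for these hosts. As I understand the
documentation, Nagios keeps the last 21 check results for a host and
compares the percentage of state changes among them against low and
high thresholds, giving recent changes more weight. The sketch below
ignores that weighting, and the function name and the sample windows
are made up for illustration.

```python
def percent_state_change(states):
    """Unweighted share of check-to-check transitions that changed state."""
    if len(states) < 2:
        return 0.0
    changes = sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)
    return 100.0 * changes / (len(states) - 1)


# One glitch inside a 21-check window: DOWN -> UNREACHABLE -> DOWN.
one_glitch = ["DOWN"] * 10 + ["UNREACHABLE"] + ["DOWN"] * 10

# Fifty or so quiet checks later, the window holds nothing but DOWN.
quiet = ["DOWN"] * 21

print(percent_state_change(one_glitch))  # 10.0
print(percent_state_change(quiet))       # 0.0
```

A single glitch contributes only two state changes out of twenty
possible, and by the time the next glitch arrives the window is back to
zero, so any flapping state that did get set would have cleared again.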