Children "unreachable" on soft down?

Israel Brewster israel at frontierflying.com
Wed Apr 8 18:44:05 CEST 2009


So is this just something I'll have to live with? I don't seem to be  
getting much feedback on the subject. :(
-----------------------------------------------
Israel Brewster
Computer Support Technician II
Frontier Flying Service Inc.
5245 Airport Industrial Rd
Fairbanks, AK 99709
(907) 450-7250 x293
-----------------------------------------------



On Apr 6, 2009, at 10:55 AM, Israel Brewster wrote:

> On Apr 6, 2009, at 9:03 AM, Giorgio Zarrelli wrote:
>
>> Hi,
>>
>> It's not quite clear to me what is happening on your end,
>
> Thanks for the response. For clarification, the exact sequence of
> events is as follows:
>
> 1) The link between the nagios box and one of our routers (which we
> will refer to as the parent host) glitches for 30 seconds or so. Due to
> the nature of the link (satellite connection) this is semi-expected,
> and happens a couple of times a day.
>
> 2) Nagios catches this glitch in one of its regularly scheduled host
> checks, and puts the parent host into a soft down state. Again, normal
> and expected - even good.
>
> 3) At the same time, Nagios puts the children of the parent host into
> an "unreachable" state. This makes sense, at least, but it leads to the
> issue described below.
>
> 4) The parent host is now in recheck mode (as it is only in a soft
> down state and has three rechecks set), so it checks again a minute
> later. This check succeeds, as the outage was transitory. The parent
> host is put back into an "UP" state. As it never was in a hard "down"
> state, no notification is sent. This is good.
>
> 5) Since the parent is now up, the child host is now changed to a
> (soft, I think) "down" state.
>
> 6) Checks continue on a normal schedule. As the link does not glitch
> again for several hours, the parent remains up and the child remains
> (correctly) down. Three checks later, the child enters a hard "down"
> state (since it was unreachable and only just switched back to down).
> A down notification is sent for the child.
>
> 7) Everything remains good for the next several hours until the link
> glitches again. Repeat from step one.
>
> The notification in step 6 is the problem here - the child host was
> down before the glitch, the child host is still down after. But
> because the child host was temporarily put in an unreachable state, we
> get notified again that it is down, resulting in a string of "DOWN"
> messages with no up or real change in status.
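
The sequence in steps 1 through 7 can be reproduced with a small model. The Python sketch below is not Nagios code: it assumes one check per tick, a three-attempt retry counter that restarts whenever a host moves between problem states, and a notification only on entering a hard "down" state. The names Host, simulate and MAX_CHECK_ATTEMPTS are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass, field

MAX_CHECK_ATTEMPTS = 3  # assumption: the post mentions three rechecks


@dataclass
class Host:
    name: str
    state: str = "UP"      # UP, DOWN or UNREACHABLE
    hard: bool = True      # True once the current state is confirmed
    attempt: int = 1
    notifications: list = field(default_factory=list)

    def apply(self, tick: int, new_state: str) -> None:
        """Record one check result; notify on entering hard DOWN."""
        if new_state == "UP":
            self.state, self.hard, self.attempt = "UP", True, 1
            return
        if new_state != self.state:
            # Moving to a different problem state restarts the retries.
            self.state, self.hard, self.attempt = new_state, False, 1
            return
        if not self.hard:
            self.attempt += 1
            if self.attempt >= MAX_CHECK_ATTEMPTS:
                self.hard = True
                if self.state == "DOWN":
                    self.notifications.append(tick)


def simulate(link_ok):
    """Run one check per tick; link_ok[i] says whether the link is up."""
    parent = Host("parent")
    child = Host("child")  # the child machine itself is down throughout
    for tick, ok in enumerate(link_ok):
        parent.apply(tick, "UP" if ok else "DOWN")
        # A failing child behind a non-UP parent is reported UNREACHABLE.
        child.apply(tick, "DOWN" if parent.state == "UP" else "UNREACHABLE")
    return parent, child
```

Calling simulate([True]*5 + [False] + [True]*10 + [False] + [True]*10) leaves the parent with no notifications, while the child collects one after its initial failure and another after each single-tick glitch, which matches the repeated "DOWN" messages described above.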
>
>> but one thing I have in mind is try
>>
>> soft_state_dependencies=0
>>
>> Besides that, the problem seems to be at the root of the check.
>> It's not healthy to have a ping check failing every 2 strikes. Try
>> changing the host alive check to an ssh check instead.
>
> The check is not failing every 2 strikes. It's failing once, briefly,
> every few hours - just barely long enough to make one check fail and
> throw the parent host into a soft down state. The first recheck (one
> minute later) works fine, bringing the parent back to an up state. The
> next several hundred or more checks also work fine (as the problem was
> transitory and brief). For this reason, changing the check wouldn't
> help - for the duration of that single check, the host really is down
> (or more precisely, unreachable, as it is a link issue), and any check
> I used would say so.
>
>> Another approach, not so useful, would be to increase the timeout
>> for the ping (-W) so it has fewer chances to fail.
>
> Except that it's not a timeout issue. It is a very real, albeit brief
> (around 30 seconds or so), outage. Not long enough or frequent enough
> to really impact productivity or anything, but long enough for nagios
> to catch it (for a single check).
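
A rough calculation illustrates why one scheduled check can land inside the outage while the recheck a minute later cannot. The sketch assumes each check is an instantaneous probe at a random phase relative to the outage. The 300-second check interval used in the usage note is a made-up value, since the thread does not state it; only the 30-second outage and the one-minute recheck come from the description above. Both function names are hypothetical.

```python
def catch_probability(outage_s: float, check_interval_s: float) -> float:
    """Chance that a periodic, instantaneous check falls inside one outage."""
    if outage_s <= 0 or check_interval_s <= 0:
        raise ValueError("durations must be positive")
    return min(1.0, outage_s / check_interval_s)


def retry_can_hit_same_outage(outage_s: float, retry_interval_s: float) -> bool:
    """True only if the outage lasts at least as long as the retry gap."""
    return outage_s >= retry_interval_s
```

With an assumed 300-second interval, catch_probability(30, 300) comes to about 0.1 per outage, and retry_can_hit_same_outage(30, 60) is False, so under these assumptions the parent can reach a soft down state but never a hard one.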
>
>>
>> Giorgio
>>
>> Israel Brewster (israel at frontierflying.com) wrote:
>>>
>>> So does anyone have any ideas as to how I can resolve this
>>> situation? It continues to be an annoyance. Thanks.
>>>
>>>
>>>
>>>
>>> On Mar 31, 2009, at 8:17 AM, Israel Brewster wrote:
>>>
>>>> On Mar 31, 2009, at 1:09 AM, Andreas Ericsson wrote:
>>>>
>>>>> Israel Brewster wrote:
>>>>>> Does nagios (3.0.3) mark a child host as unreachable when its
>>>>>> parent enters a soft down state? I am finding myself getting
>>>>>> repeated down messages for a host (which is, in fact, down), even
>>>>>> though I have notifications set to only send a single message.
>>>>>> Looking at the logs, it would appear that what is happening is
>>>>>> that the host is flipping between "down" (which notifies me) and
>>>>>> "unreachable" (which does not). The parent host, however, never
>>>>>> enters a hard down state. Looking at the logs, what I see is that
>>>>>> one ICMP check fails, throwing the host into a soft down state,
>>>>>> but the next one works just fine, bringing it back to an up state.
>>>>>> The logic works fine for the parent host - since it never hits a
>>>>>> hard down state, it doesn't alert, and everyone is happy. But
>>>>>> apparently with the child host every time this happens, it
>>>>>> switches from critical to unreachable and back again, triggering a
>>>>>> notification. Is there any way to keep this from happening?
>>>>>> Thanks.
>>>>>
>>>>> Doesn't flapping detection do what you want? You'd get a few
>>>>> notifications, but they'd stop after the 3rd flip or something, I
>>>>> think.
>>>>
>>>> Flapping detection helps, but doesn't solve the problem. For one
>>>> thing, as you mentioned, you still get at least a couple of
>>>> notifications before it kicks in. For another thing, this happens
>>>> with a frequency of something like once an hour or so (not
>>>> consistently), so the host will flip from down to unreachable and
>>>> back again, triggering an e-mail, perhaps do it a second time, and
>>>> then it will sit in the correct "down" state for the next 50 checks
>>>> or so (thus canceling any flapping detection) before repeating the
>>>> process. It's not like I'm getting messages every five minutes or
>>>> anything, it's just that I'm getting repeated down messages every
>>>> hour or two for hosts that have been down and haven't actually
>>>> changed state.
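
To see why flap detection rarely engages in this pattern, the state-change percentage can be estimated by hand. Nagios documents flap detection as examining the last 21 check results, which allows up to 20 state changes. The sketch below is a simplification that counts every change equally, whereas the real implementation weights recent changes more heavily. The function name and HISTORY_LEN are hypothetical.

```python
HISTORY_LEN = 21  # documented size of the Nagios flap-detection window


def percent_state_change(states):
    """Unweighted share of state changes across the most recent checks."""
    window = list(states)[-HISTORY_LEN:]
    if len(window) < 2:
        return 0.0
    changes = sum(1 for a, b in zip(window, window[1:]) if a != b)
    return 100.0 * changes / (HISTORY_LEN - 1)
```

A single down, unreachable, down excursion inside the window gives 2 changes out of 20, or 10 percent in this unweighted model, and the figure falls back to 0 once 21 consecutive "down" results have accumulated. Whether that ever crosses the configured flap thresholds depends on values the thread does not give.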
>>>>
>>>> I could, of course, schedule down time, except that I want to be
>>>> notified if/when the people in the remote station get their act
>>>> together and get the machine(s) in question back online. Also that is
>>>> only partially effective for machines that have been sent in for
>>>> repair, because I don't really know when the scheduled down time will
>>>> be over. They are down, I know they are down, I just don't want to be
>>>> told about it every few hours :-)
>>>>
>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Andreas Ericsson                   andreas.ericsson at op5.se
>>>>> OP5 AB                             www.op5.se
>>>>> Tel: +46 8-230225                  Fax: +46 8-230231
>>>>>
>>>>> Considering the successes of the wars on alcohol, poverty, drugs
>>>>> and
>>>>> terror, I think we should give some serious thought to declaring
>>>>> war
>>>>> on peace.
>>>>
>>>>
>>>> ------------------------------------------------------------------------------
>>>> _______________________________________________
>>>> Nagios-users mailing list
>>>> Nagios-users at lists.sourceforge.net
>>>> https://lists.sourceforge.net/lists/listinfo/nagios-users
>>>> ::: Please include Nagios version, plugin version (-v) and OS when
>>>> reporting any issue.
>>>> ::: Messages without supporting info will risk being sent to /dev/
>>>> null
>>>
>>>
>>>
>>
>
>






