passive host checks status instability, bug or configuration error?
Artur D'Assumpção
artur.dassumpcao at di.com.pt
Sun Apr 10 10:29:51 CEST 2005
I guess I was wrong... I left this workaround running through the night
with notifications on, and today I've received a few status change
notifications. So this doesn't work properly either.
AD
Artur D'Assumpção wrote:
> I am still having the same problem under the same conditions, but I think
> I've found a workaround that may help you debug this issue:
>
> I've replaced the service-is-stale command with another plugin that returns
> one of two possible states, depending on the $HOSTSTATE$ macro:
>
> sr-0 plugins-di # cat service_is_stale
> #!/bin/sh
> # $1 is the host state passed in via the $HOSTSTATE$ macro
> if [ "$1" = "DOWN" ]; then
>     /usr/nagios/libexec/check_dummy 2 "Service results not received"
> elif [ "$1" = "UP" ]; then
>     /usr/nagios/libexec/check_dummy 3 "Service results not received"
> fi
>
>
> --
>
> # default command used when NSCA results for a given host's service weren't received
> define command {
> command_name service-is-stale
> command_line $USER2$/service_is_stale $HOSTSTATE$
> }
>
> Now, when the service checks go stale, the status returned depends on the
> $HOSTSTATE$ macro. In this specific case, while the host status is DOWN,
> the value returned for the stale services is CRITICAL, so the host
> status does not change.
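
The wrapper above falls through without any output when the argument is neither DOWN nor UP (for example UNREACHABLE). A minimal sketch of the same mapping with an explicit fallback could look like the following. The function name `stale_result` and the fallback branch are assumptions for illustration, not part of the original script, and the sketch uses plain `echo`/`return` instead of check_dummy so that it does not depend on an install path.

```shell
#!/bin/sh
# Sketch only: maps a host state (as passed via $HOSTSTATE$) to a plugin
# result for a stale service. Exit codes follow the plugin convention:
# 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN.
stale_result() {
    state=${1:-}
    case "$state" in
        DOWN)
            # Same as the original: CRITICAL while the host is DOWN.
            echo "Service results not received"
            return 2
            ;;
        UP)
            # Same as the original: UNKNOWN while the host is UP.
            echo "Service results not received"
            return 3
            ;;
        *)
            # Assumption, not in the original script: treat UNREACHABLE,
            # an empty argument, or anything unexpected as UNKNOWN.
            echo "Service results not received (host state: ${state:-none})"
            return 3
            ;;
    esac
}

# As a standalone plugin the script would end with:
#   stale_result "$1"; exit $?
```

Whether UNKNOWN is the right fallback is a judgment call; the point is only that every input produces some plugin output and a defined exit code.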
>
> Anyway, I still have the same question: isn't Nagios supposed to ignore all
> the service checks if the host is in a DOWN state?
>
> AD
>
>
>
> Artur D'Assumpção wrote:
>
>> Hi ppl,
>>
>> I'm getting very strange, unstable results with passive host checks. I
>> don't know if I've found a bug or if I am actually doing something
>> wrong here.
>>
>> I'll try to describe the network first, before getting to the actual
>> problem:
>>
>> Well, I have some hosts that are behind firewalled networks, so
>> service and host checks have to be submitted passively using send_nsca.
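
For readers unfamiliar with passive submission, here is a rough sketch of the tab-separated lines that send_nsca reads on standard input. The helper names `nsca_host_line` and `nsca_service_line` are made up for illustration, and the server name and config path in the usage comment are placeholders, not values from this setup.

```shell
#!/bin/sh
# Sketch: build the tab-separated lines that send_nsca reads on stdin.
# Host check:    host<TAB>return_code<TAB>plugin_output
# Service check: host<TAB>service<TAB>return_code<TAB>plugin_output
# Passive host return codes: 0=UP, 1=DOWN, 2=UNREACHABLE.

nsca_host_line() {
    printf '%s\t%s\t%s\n' "$1" "$2" "$3"
}

nsca_service_line() {
    printf '%s\t%s\t%s\t%s\n' "$1" "$2" "$3" "$4"
}

# Hypothetical usage (server name and config path are placeholders):
#   nsca_host_line 'domain.pt_sfci-dr-0' 0 'Host alive' |
#       send_nsca -H nagios.example.org -c /etc/nagios/send_nsca.cfg
```

Keeping the line formatting in small functions makes it easy to inspect exactly what would be sent before piping it to send_nsca.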
>>
>> In the main config I have these freshness options:
>>
>> check_service_freshness=1
>> check_host_freshness=1
>>
>> service_freshness_check_interval=300
>> host_freshness_check_interval=60
>>
>> The retain status options are also disabled.
>>
>> A generic host in these conditions uses this template configuration:
>>
>> define host {
>> name generic-passive-unreachable-host
>>
>> active_checks_enabled 0
>> passive_checks_enabled 1
>>
>> obsess_over_host 1
>> event_handler_enabled 0
>> flap_detection_enabled 0
>> process_perf_data 0
>> retain_status_information 0
>> retain_nonstatus_information 0
>>
>> check_command host-is-stale
>> check_freshness 1
>> freshness_threshold 120
>> max_check_attempts 1
>>
>> notifications_enabled 1
>> notification_interval 60
>> notification_period 24x7
>> notification_options d,u,r
>>
>> contact_groups dummy-contacts
>>
>> register 0
>> }
>>
>> And analogously for services:
>>
>> define service {
>> name generic-passive-service
>>
>> active_checks_enabled 0
>> passive_checks_enabled 1
>>
>> obsess_over_service 1
>> event_handler_enabled 0
>> flap_detection_enabled 0
>> process_perf_data 1
>> retain_status_information 0
>> retain_nonstatus_information 0
>> is_volatile 0
>>
>> check_command service-is-stale
>> check_freshness 1
>> freshness_threshold 300
>> parallelize_check 1
>> check_period 24x7
>> max_check_attempts 2
>> normal_check_interval 5
>> retry_check_interval 5
>>
>> notifications_enabled 1
>> notification_interval 60
>> notification_period 24x7
>> notification_options c,r
>>
>> contact_groups dummy-contacts
>>
>> register 0
>> }
>>
>>
>> Now to the real problem. I'm having problems with the host status
>> flapping from UP to DOWN constantly. In my tests I have only the
>> monitoring server up; the other clients/servers are down. Every time
>> the host freshness threshold expires, the 'host-is-stale' command gets
>> run, always returning a DOWN state:
>>
>> Apr 9 17:52:09 sr-0 nagios: Warning: The results of host
>> 'domain.pt_sfci-dr-0' are stale by 60 seconds (threshold=120
>> seconds). I'm forcing an immediate check of the host.
>>
>> This is the expected behavior, so far so good...
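
To read the "stale by N seconds" figure in the log line above: as far as I understand it, a result counts as stale once its age exceeds freshness_threshold, and N is the amount by which the age exceeds the threshold. The sketch below only illustrates that arithmetic with made-up timestamps; `stale_by` is a hypothetical helper, not Nagios code, and the real scheduler may apply extra rules.

```shell
#!/bin/sh
# Sketch of the freshness arithmetic only; this is not Nagios code.
# stale_by LAST_RESULT NOW THRESHOLD
#   prints how many seconds the result is past its threshold,
#   or 0 if it is still fresh.
stale_by() {
    age=$(( $2 - $1 ))
    over=$(( age - $3 ))
    if [ "$over" -gt 0 ]; then
        echo "$over"
    else
        echo 0
    fi
}

# Example with made-up timestamps: last result 180 seconds ago,
# freshness_threshold 120.
stale_by 1000 1180 120   # prints 60
```

With host_freshness_check_interval=60, a host that never reports would be caught by the first freshness pass that runs after the 120-second threshold has elapsed.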
>>
>> The problem starts when I see that the freshness threshold of this
>> host's passive services is also expiring, even while the host is in
>> status DOWN:
>>
>> Apr 9 17:52:17 sr-0 nagios: Warning: The results of service '[SYS]
>> Swap Usage' on host 'domain.pt_sfci-dr-0' are stale by 40 seconds
>> (threshold=500 seconds). I'm forcing an immediate check of the service.
>>
>> Well, when this happens the command 'service-is-stale' gets executed,
>> placing the service in an UNKNOWN status, and consequently the host
>> status changes to UP.
>>
>> Now, let me shoot my question: aren't the service checks for a host
>> in a DOWN state supposed to be ignored? This is causing the UP/DOWN
>> flapping instability. Remember that there aren't any other
>> distributed servers or clients submitting results; NSCA isn't even
>> running at this time. Any clues?
>>
>> I'm running Nagios version 2.0b2. (I know there is 2.0b3, but since I
>> haven't found any changelog references on this subject, I am assuming
>> a configuration problem.)
>>
>> Thanks very much,
>>
>> AD
>>
>> -------------------------------------------------------
>> SF email is sponsored by - The IT Product Guide
>> Read honest & candid reviews on hundreds of IT Products from real users.
>> Discover which products truly live up to the hype. Start reading now.
>> http://ads.osdn.com/?ad_id=6595&alloc_id=14396&op=click
>> _______________________________________________
>> Nagios-users mailing list
>> Nagios-users at lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/nagios-users
>> ::: Please include Nagios version, plugin version (-v) and OS when
>> reporting any issue. ::: Messages without supporting info will risk
>> being sent to /dev/null