Distributed configuration issue with staleness (thresholds?)

Greg Cockburn gergnz at gmail.com
Wed Jun 29 00:14:30 CEST 2005
Hi Bob,

I know this isn't going to help you much, but I am having the same
problem.  I have about 50 machines/network devices with about 500
services over 3 sites connected via VPNs.  I was running all these
sites with separate instances of Nagios, but am now trying to get the
distributed monitoring working.

For the most part it seems to be good, except that service checks go
stale before the remote server has a chance to send a passive check
result.
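
For what it's worth, here is a rough sketch of giving the master more
headroom, so that a passive result has time to arrive before the
service is declared stale.  The value 420 is purely illustrative (an
assumption, not a recommendation); the idea is only that the threshold
should comfortably exceed the distributed node's check interval (3
minutes in the config quoted below) plus whatever delay NSCA and the
VPN add.  As far as I understand the documentation, a
freshness_threshold of 0 makes Nagios work out a threshold on its own,
so it may also be worth confirming which value the running master
actually loaded: the log line quoted below reports threshold=200 while
the template sets 300.

```
# master node, services.cfg: passive-only service with a freshness check
define service{
        use                     nagios-host
        host_name               server1
        service_description     time
        # illustrative only: 180 s check interval plus generous headroom
        freshness_threshold     420
        }
```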

The problem with this is that I get a few false-positive notifications
(not good at 3am).  The other weird thing is that the master host tries
to do an 'active' check of the service, even when active checks are
disabled for that service on the master.  Why?
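
As far as I can tell from the freshness-checking documentation, this
is by design: when a result goes stale, Nagios forces a run of the
service's check_command even if active checks are disabled for it.
That is why the master template quoted below points check_command at
'service-is-stale'.  A minimal sketch of such a command follows,
assuming the check_dummy plugin from the standard plugins package is
installed and that $USER1$ points at the plugin directory (both are
assumptions about the local setup, and the message text is made up):

```
# master node: run only when a passive result has gone stale
define command{
        command_name    service-is-stale
        # check_dummy exits with the given state (2 = CRITICAL)
        command_line    $USER1$/check_dummy 2 "CRITICAL: service results are stale"
        }
```

If the forced check returns CRITICAL, the master then alerts on the
stale service, which would explain the notifications.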

I have been playing around with a lot of the timings for various
things, and scouring the mailing lists for other people's adventures,
trying to piece it all together, with only limited success.

I think the key is to carefully record and try out different values for
the various timeouts until you have a set that works for your
environment, but YMMV, as I have found.
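
To make that concrete, these are the nagios.cfg settings that seem
most relevant to the timing.  This is a sketch rather than a tested
configuration: the values are illustrative, the directive names are
the ones I know from the 2.x sample config (older versions may name
some of them differently), and 'submit_check_result' is a
hypothetical command that would have to be defined locally.

```
# master node, nagios.cfg
# enable freshness checking for services
check_service_freshness=1
# how often (in seconds) Nagios looks for stale results
service_freshness_check_interval=60
# seconds per time unit used by normal_check_interval and friends
interval_length=60

# distributed node, nagios.cfg
# run ocsp_command after every service check
obsess_over_services=1
# hypothetical command that hands the result to send_nsca
ocsp_command=submit_check_result
# seconds allowed for ocsp_command before it is killed
ocsp_timeout=5
```

Changing one value at a time and noting the effect in nagios.log on
the master makes it easier to see which setting matters.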

Good luck, and keep us posted.

Greg.

On 6/29/05, Bob Johnson <bobjohnson at nexus9000.com> wrote:
> Greetings to all,
> 
> In my test configuration, I have one server as the distributed node and
> the other as the master node.  The distributed node does all of the
> checking and sends its check results to the master node via NSCA.  The
> checks are sent (and received) in a normal fashion to the master node, but
> for some reason I am having issues with the freshness threshold on the
> master server.  The nagios.log excerpt below states that the check is
> stale by "7" seconds even though there is a threshold of "200" seconds.
> Therefore, I believe that I must be overlooking something in the
> configuration and would appreciate any advice.  (As a side note, quite a
> few services go into the stale mode at once, and not just this single
> check.  However, not *every* service immediately goes stale.)
> 
> 
> From nagios.log on the master server:
> 
> "Warning: The results of service 'time' on host 'server1' are stale by 7
> seconds (threshold=200 seconds).  I'm forcing an immediate check of the
> service."
> 
> 
> Alas, I have a few questions:
> 
> a. Where exactly is this "7" coming from (or calculated) in the
> configuration?
> 
> b. Where exactly is this "200" coming from (or calculated) in the
> configuration?
> 
> c. Is there a recommended complete (yet barebones) master and distributed
> node configuration reference for nagios.cfg, hosts.cfg, and services.cfg?
> 
> d. Are there any other logs or additional debugging details which would be
> useful for this distributed staleness issue?
> 
> 
> ---------------------------------------------------------
> 
> [distributed node: hosts.cfg]
> 
> # Generic host definition template
> define host{
>         name                            nagios-host
>         notifications_enabled           0
>         event_handler_enabled           1
>         flap_detection_enabled          1
>         process_perf_data               1
>         retain_status_information       1
>         retain_nonstatus_information    1
>         register                        0
>         }
> 
> # 'server1' host definition
> define host{
>         use                     nagios-host
>         host_name               server1
>         alias                   server1
>         address                 10.10.10.10
>         check_command           check-host-alive
>         contact_groups          nagios-admins
>         max_check_attempts      10
>         notification_interval   120
>         notification_period     24x7
>         notification_options    d,u,r
>         }
> 
> ---------------------------------------------------------
> 
> [distributed node: services.cfg]
> 
> define service{
>         name                            nagios-host
>         active_checks_enabled           1
>         passive_checks_enabled          0
>         parallelize_check               1
>         obsess_over_service             1
>         notifications_enabled           0
>         notification_interval           60
>         notification_period             24x7
>         notification_options            w,u,c,r
>         event_handler_enabled           1
>         flap_detection_enabled          1
>         contact_groups                  nagios-admins
>         process_perf_data               1
>         retain_status_information       1
>         retain_nonstatus_information    1
>         is_volatile                     0
>         max_check_attempts              3
>         check_period                    24x7
>         normal_check_interval           3
>         retry_check_interval            1
>         register                        0
>         }
> 
> 
> define service{
>         use                     nagios-host
>         host_name               server1
>         service_description     time
>         check_command           check_ntp!3!10
>         }
> 
> ---------------------------------------------------------
> 
> [master node: hosts.cfg]
> 
> # Generic host definition template
> define host{
>         name                            nagios-host
>         notifications_enabled           1
>         event_handler_enabled           1
>         flap_detection_enabled          1
>         process_perf_data               1
>         retain_status_information       1
>         retain_nonstatus_information    1
>         register                        0
>         }
> 
> # 'server1' host definition
> define host{
>         use                     nagios-host
>         host_name               server1
>         alias                   server1
>         address                 10.10.10.10
>         max_check_attempts      3
>         contact_groups          nagios-admins
>         notification_interval   120
>         notification_period     24x7
>         notification_options    d,u,r
>         }
> 
> ---------------------------------------------------------
> 
> [master node: services.cfg]
> 
> # infrastructure host template
> define service{
>         name                            nagios-host
>         active_checks_enabled           0
>         passive_checks_enabled          1
>         parallelize_check               1
>         obsess_over_service             1
>         check_freshness                 1
>         freshness_threshold             300
>         check_command                   service-is-stale
>         notifications_enabled           1
>         notification_interval           60
>         notification_period             24x7
>         notification_options            w,u,c,r
>         event_handler_enabled           1
>         flap_detection_enabled          1
>         contact_groups                  nagios-admins
>         process_perf_data               1
>         retain_status_information       1
>         retain_nonstatus_information    1
>         is_volatile                     0
>         max_check_attempts              3
>         check_period                    24x7
>         normal_check_interval           3
>         retry_check_interval            1
>         register                        0
>         }
> 
> define service{
>         use                     nagios-host
>         host_name               server1
>         service_description     time
>         }
> 
> ---------------------------------------------------------
> 
> 
> Thank you for any guidance or assistance with troubleshooting.
> 
> Cheers, Bob
> 
> 
> 
> -------------------------------------------------------
> SF.Net email is sponsored by: Discover Easy Linux Migration Strategies
> from IBM. Find simple to follow Roadmaps, straightforward articles,
> informative Webcasts and more! Get everything you need to get up to
> speed, fast. http://ads.osdn.com/?ad_id=7477&alloc_id=16492&op=click
> _______________________________________________
> Nagios-users mailing list
> Nagios-users at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-users
> ::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
> ::: Messages without supporting info will risk being sent to /dev/null
>

