Distributed configuration issue with staleness (thresholds?)
Greg Cockburn
gergnz at gmail.com
Wed Jun 29 00:14:30 CEST 2005
Hi Bob,
I know this isn't going to help you much, but I am having the same
problem. I have about 50 machines/network devices with about 500
services over 3 sites connected via VPNs. I was running all these
sites with separate instances of Nagios, but am now trying to get the
distributed monitoring working.
For the most part it seems to be good, except that service checks go
stale before the remote server has a chance to send a passive check
result.
The problem with this is that I get a few false positive notifications
(not good at 3am). The other weird thing is that the master host tries
to do an 'active' check of the service, even when active checks are
disabled for that service on the master. Why?
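For reference, the arithmetic behind the log line quoted further down can
be sketched roughly as follows. This is only a sketch of how I understand
Nagios decides a result is stale, not the actual source: the function
names are made up, and the 15-second pad applied when freshness_threshold
is 0 is an assumption that should be checked against your version. The
Nagios documentation on freshness checking also says that a stale result
makes Nagios run the service's check_command even when active checks are
disabled, which would explain the 'active' check on the master.

```python
def effective_threshold(configured, check_interval, interval_length=60,
                        latency=0, pad=15):
    """Freshness threshold in seconds.

    If no threshold is configured (0), derive one from the check interval.
    The derivation and the pad value are assumptions, not confirmed here.
    """
    if configured > 0:
        return configured
    return check_interval * interval_length + latency + pad


def stale_by(now, last_check, threshold):
    """Seconds past the expiry time, or 0 if the result is still fresh."""
    expiration = last_check + threshold
    return max(0, now - expiration)


# Timestamps chosen to reproduce "stale by 7 seconds (threshold=200 seconds)".
threshold = effective_threshold(configured=200, check_interval=3)
print(stale_by(now=1207, last_check=1000, threshold=threshold))
```

Under that reading, the "7" is how far past last_check + threshold the
master was when it looked. With normal_check_interval 3, an auto-derived
threshold plus about 5 seconds of latency would also come out at 200,
which may mean the configured 300 is not what the running master is
using, though that is a guess.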
I have been playing around with a lot of the timings for various
things, and scouring the mailing lists for other people's adventures
trying to piece it all together, with only so much luck.
I think the key is to carefully record and trial different values for
various timeouts until you have a set that is working for your
environment, but YMMV, as I have found.
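As a starting point for those trials, one rule of thumb (my own
assumption, not something taken from the Nagios docs) is to make the
master's freshness_threshold cover the remote check interval plus the
worst-case check runtime and NSCA delivery delay, with some margin on
top. The helper and its parameter names below are invented for
illustration only.

```python
def suggested_threshold(check_interval_min, max_runtime_s, max_transfer_s,
                        interval_length=60, margin=1.5):
    """Rule-of-thumb freshness_threshold in seconds (hypothetical helper)."""
    # Longest expected gap between two passive results reaching the master.
    worst_gap = (check_interval_min * interval_length
                 + max_runtime_s + max_transfer_s)
    return int(worst_gap * margin)


# 3-minute checks, up to 10 s to run, up to 20 s to arrive over the VPN.
print(suggested_threshold(3, 10, 20))
```

Writing down each value tried alongside the number of stale warnings it
produced makes it easier to see which timeout actually matters.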
Good luck, and keep us posted.
Greg.
On 6/29/05, Bob Johnson <bobjohnson at nexus9000.com> wrote:
> Greetings to all,
>
> In my test configuration, I have one server as the distributed node and
> the other as the master node. The distributed node does all of the
> checking and sends its check results to the master node via NSCA. The
> checks are sent (and received) in a normal fashion to the master node, but
> for some reason I am having issues with the freshness threshold on the
> master server. The nagios.log excerpt below states that the check is
> stale by "7" seconds even though there is a threshold of "200" seconds.
> Therefore, I believe that I must be overlooking something in the
> configuration and would appreciate any advice. (As a side note, quite a
> few services go into the stale mode at once, and not just this single
> check. However, not *every* service immediately goes stale.)
>
>
> From nagios.log on the master server:
>
> "Warning: The results of service 'time' on host 'server1' are stale by 7
> seconds (threshold=200 seconds). I'm forcing an immediate check of the
> service."
>
>
> Alas, I have a few questions:
>
> a. Where exactly is this "7" coming from (or calculated) in the
> configuration?
>
> b. Where exactly is this "200" coming from (or calculated) in the
> configuration?
>
> c. Is there a recommended complete (yet barebones) master and distributed
> node configuration reference for nagios.cfg, hosts.cfg, and services.cfg?
>
> d. Are there any other logs or additional debugging details which would be
> useful for this distributed staleness issue?
>
>
> ---------------------------------------------------------
>
> [distributed node: hosts.cfg]
>
> # Generic host definition template
> define host{
>     name                            nagios-host
>     notifications_enabled           0
>     event_handler_enabled           1
>     flap_detection_enabled          1
>     process_perf_data               1
>     retain_status_information       1
>     retain_nonstatus_information    1
>     register                        0
> }
>
> # 'server1' host definition
> define host{
>     use                             nagios-host
>     host_name                       server1
>     alias                           server1
>     address                         10.10.10.10
>     check_command                   check-host-alive
>     contact_groups                  nagios-admins
>     max_check_attempts              10
>     notification_interval           120
>     notification_period             24x7
>     notification_options            d,u,r
> }
>
> ---------------------------------------------------------
>
> [distributed node: services.cfg]
>
> define service{
>     name                            nagios-host
>     active_checks_enabled           1
>     passive_checks_enabled          0
>     parallelize_check               1
>     obsess_over_service             1
>     notifications_enabled           0
>     notification_interval           60
>     notification_period             24x7
>     notification_options            w,u,c,r
>     event_handler_enabled           1
>     flap_detection_enabled          1
>     contact_groups                  nagios-admins
>     process_perf_data               1
>     retain_status_information       1
>     retain_nonstatus_information    1
>     is_volatile                     0
>     max_check_attempts              3
>     check_period                    24x7
>     normal_check_interval           3
>     retry_check_interval            1
>     register                        0
> }
>
>
> define service{
>     use                             nagios-host
>     host_name                       server1
>     service_description             time
>     check_command                   check_ntp!3!10
> }
>
> ---------------------------------------------------------
>
> [master node: hosts.cfg]
>
> # Generic host definition template
> define host{
>     name                            nagios-host
>     notifications_enabled           1
>     event_handler_enabled           1
>     flap_detection_enabled          1
>     process_perf_data               1
>     retain_status_information       1
>     retain_nonstatus_information    1
>     register                        0
> }
>
> # 'server1' host definition
> define host{
>     use                             nagios-host
>     host_name                       server1
>     alias                           server1
>     address                         10.10.10.10
>     max_check_attempts              3
>     contact_groups                  nagios-admins
>     notification_interval           120
>     notification_period             24x7
>     notification_options            d,u,r
> }
>
> ---------------------------------------------------------
>
> [master node: services.cfg]
>
> # infrastructure host template
> define service{
>     name                            nagios-host
>     active_checks_enabled           0
>     passive_checks_enabled          1
>     parallelize_check               1
>     obsess_over_service             1
>     check_freshness                 1
>     freshness_threshold             300
>     check_command                   service-is-stale
>     notifications_enabled           1
>     notification_interval           60
>     notification_period             24x7
>     notification_options            w,u,c,r
>     event_handler_enabled           1
>     flap_detection_enabled          1
>     contact_groups                  nagios-admins
>     process_perf_data               1
>     retain_status_information       1
>     retain_nonstatus_information    1
>     is_volatile                     0
>     max_check_attempts              3
>     check_period                    24x7
>     normal_check_interval           3
>     retry_check_interval            1
>     register                        0
> }
>
> define service{
>     use                             nagios-host
>     host_name                       server1
>     service_description             time
> }
>
> ---------------------------------------------------------
>
>
> Thank you for any guidance or assistance with troubleshooting.
>
> Cheers, Bob
>
> _______________________________________________
> Nagios-users mailing list
> Nagios-users at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-users
> ::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
> ::: Messages without supporting info will risk being sent to /dev/null
>