Service checks pending forever in distributed monitoring configuration
Fred
f1216 at yahoo.com
Thu Sep 1 19:53:03 CEST 2005
I have a 1000+ node system, plus a number of switches etc., that are all
monitored by Nagios. I'm running 2.0b3.
Our configuration is generated automatically from the cluster's
configuration, and smaller configurations show no issues.
Recently, Nagios started delaying execution of active service checks. I
have 5 Nagios monitors reporting via NSCA to a 6th Nagios master (which also
monitors 1/6th of the cluster). I removed all the retention caches for all
the monitor nodes and restarted. Nagios then reports that the next service
check is scheduled for hours later (when it should be fairly close). Output
from nagiostats is included below. There are quite a few services. Almost
all are passive checks; each monitor node runs some active checks that
push data to the FIFO, where it is picked up and reported on a
per-node/service basis. The pending checks do not execute even when the
scheduled time passes. The monitor nodes are working just fine. On the
master node, obsessing is disabled and freshness checking is enabled.
There is nothing in nagios.log other than stale check messages.
Following is an example service description from a service that
is not getting scheduled:
define service{
    use                     nagios
    host_name               nh
    name                    slurmMonitor
    service_description     Slurm Monitor
    active_checks_enabled   1
    check_command           check_slurm
    register                1
}
and the template:
# Generic template for services
define service{
    use                     generic-service ; default service
    name                    nagios
    normal_check_interval   5
    retry_check_interval    2
    check_period            24x7
    is_volatile             0
    max_check_attempts      3
    notification_interval   240
    notification_period     24x7
    notification_options    w,u,c,r
    contact_groups          admins
    register                0
}
and finally, the generic-service template:
# Generic service definition template
define service{
    name                            generic-service ; The 'name' of this service template, referenced in other service definitions
    active_checks_enabled           1       ; Active service checks are enabled
    passive_checks_enabled          1       ; Passive service checks are enabled/accepted
    parallelize_check               1       ; Active service checks should be parallelized (disabling this can lead to major performance problems)
    obsess_over_service             1       ; We should obsess over this service (if necessary)
    check_freshness                 0       ; Default is to NOT check service 'freshness'
    notifications_enabled           1       ; Service notifications are enabled
    event_handler_enabled           1       ; Service event handler is enabled
    flap_detection_enabled          1       ; Flap detection is enabled
    process_perf_data               1       ; Process performance data
    retain_status_information       1       ; Retain status information across program restarts
    retain_nonstatus_information    1       ; Retain non-status information across program restarts
    register                        0       ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
}
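To make the inheritance across these three definitions easier to follow, here is a minimal sketch in Python (not part of Nagios) that resolves the `use` chain and prints a few of the directives the Slurm Monitor service ends up with. The names `parse_services`, `effective_directives` and `NOT_INHERITED` are made up for illustration. The sketch assumes a single template per `use` line and assumes that `name`, `use` and `register` are not passed down from a template. It is not how Nagios itself parses its configuration.

```python
import re

# Trimmed copies of the three definitions quoted in this post
# (only a handful of directives are kept to keep the sketch short).
CONFIG = """
define service{
    use                     nagios
    host_name               nh
    name                    slurmMonitor
    service_description     Slurm Monitor
    active_checks_enabled   1
    check_command           check_slurm
    register                1
}
define service{
    use                     generic-service ; default service
    name                    nagios
    normal_check_interval   5
    retry_check_interval    2
    max_check_attempts      3
    register                0
}
define service{
    name                    generic-service
    active_checks_enabled   1
    passive_checks_enabled  1
    check_freshness         0
    register                0
}
"""

# Assumption: these directives control templating and are not inherited.
NOT_INHERITED = {"name", "use", "register"}


def parse_services(text):
    """Return one dict of directives per 'define service{...}' block."""
    services = []
    for body in re.findall(r"define\s+service\s*\{(.*?)\}", text, re.S):
        directives = {}
        for raw in body.splitlines():
            line = raw.split(";", 1)[0].strip()  # drop trailing ';' comments
            if not line or line.startswith("#"):
                continue
            parts = line.split(None, 1)
            if len(parts) == 2:
                directives[parts[0]] = parts[1]
        services.append(directives)
    return services


def effective_directives(service, by_name):
    """Merge directives along the 'use' chain; the most specific value wins."""
    chain = []
    seen = set()
    parent = service.get("use")
    while parent is not None and parent not in seen and parent in by_name:
        seen.add(parent)  # guard against circular 'use' references
        chain.append(by_name[parent])
        parent = by_name[parent].get("use")
    merged = {}
    for template in reversed(chain):  # least specific template first
        merged.update(
            {k: v for k, v in template.items() if k not in NOT_INHERITED}
        )
    merged.update(service)  # the service's own directives override templates
    return merged


services = parse_services(CONFIG)
by_name = {s["name"]: s for s in services if "name" in s}
effective = effective_directives(by_name["slurmMonitor"], by_name)
for key in ("active_checks_enabled", "check_freshness", "normal_check_interval"):
    print(key, effective[key])
```

Under those assumptions, the resolved service has active checks enabled, takes its normal check interval of 5 from the nagios template, and takes check_freshness 0 from generic-service.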
Clocks are correct and synchronized on the system.
Nagios Stats 2.0b3
Copyright (c) 2003-2005 Ethan Galstad (www.nagios.org)
Last Modified: 04-03-2005
License: GPL
CURRENT STATUS DATA
----------------------------------------------------
Status File: /opt/hptc/nagios/var/status.log
Status File Age: 0d 0h 0m 1s
Status File Version: 2.0b3
Program Running Time: 0d 48h 0m 56s
Total Services: 10388
Services Checked: 8472
Services Scheduled: 246
Active Service Checks: 4774
Passive Service Checks: 5614
Total Service State Change: 0.000 / 63.550 / 2.210 %
Active Service Latency: 0.000 / 2714.925 / 1220.973 %
Active Service Execution Time: 0.000 / 180.065 / 0.119 sec
Active Service State Change: 0.000 / 17.830 / 1.222 %
Active Services Last 1/5/15/60 min: 0 / 0 / 0 / 4
Passive Service State Change: 0.000 / 63.550 / 3.050 %
Passive Services Last 1/5/15/60 min: 0 / 440 / 2566 / 4724
Services Ok/Warn/Unk/Crit: 7420 / 2866 / 0 / 102
Services Flapping: 0
Services In Downtime: 0
Total Hosts: 1094
Hosts Checked: 1030
Hosts Scheduled: 0
Active Host Checks: 1094
Passive Host Checks: 0
Total Host State Change: 0.000 / 0.000 / 0.000 %
Active Host Latency: 0.000 / 0.000 / 0.000 %
Active Host Execution Time: 0.000 / 0.000 / 0.000 sec
Active Host State Change: 0.000 / 0.000 / 0.000 %
Active Hosts Last 1/5/15/60 min: 0 / 0 / 0 / 0
Passive Host State Change: 0.000 / 0.000 / 0.000 %
Passive Hosts Last 1/5/15/60 min: 0 / 0 / 0 / 0
Hosts Up/Down/Unreach: 1094 / 0 / 0
Hosts Flapping: 0
Hosts In Downtime: 0
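To pull the relevant numbers out of the nagiostats output above, here is a minimal sketch in Python (not part of Nagios). The function name `parse_nagiostats` is made up for illustration, and the sample text is a subset of the output above. It assumes that "Services Scheduled" counts services currently in the scheduling queue, so comparing it with "Active Service Checks" is only a rough indicator.

```python
import re

# A subset of the nagiostats output quoted in this post.
SAMPLE = """\
Total Services: 10388
Services Checked: 8472
Services Scheduled: 246
Active Service Checks: 4774
Passive Service Checks: 5614
Active Service Latency: 0.000 / 2714.925 / 1220.973 %
Active Services Last 1/5/15/60 min: 0 / 0 / 0 / 4
Passive Services Last 1/5/15/60 min: 0 / 440 / 2566 / 4724
"""


def parse_nagiostats(text):
    """Map each 'Label: values' line to a list of the numbers it contains."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'Label: value' line
        numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", value)]
        if numbers:
            stats[key.strip()] = numbers
    return stats


stats = parse_nagiostats(SAMPLE)
active = stats["Active Service Checks"][0]
scheduled = stats["Services Scheduled"][0]
scheduled_fraction = scheduled / active
min_lat, max_lat, avg_lat = stats["Active Service Latency"]
ran_last_hour = stats["Active Services Last 1/5/15/60 min"][3]

print(f"scheduled: {scheduled:.0f} of {active:.0f} active checks "
      f"({scheduled_fraction:.1%})")
print(f"active checks run in the last 60 min: {ran_last_hour:.0f}")
print(f"active latency min/max/avg: {min_lat} / {max_lat} / {avg_lat}")
```

On these figures only about 5% of the active service checks are scheduled, and just 4 active checks ran in the last hour.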
Does anyone have suggestions as to what to look for next?
If I force the scheduling of a service, it eventually gets scheduled
and runs, and the pending time in the web display is updated right away.
Thanks in advance for any insight.
-FredC