Nagios failing to check services

Matt Pounsett matt.pounsett at cira.ca
Tue Sep 16 19:49:31 CEST 2003


On Tue, 16 Sep 2003, Jason Lancaster wrote:

> Matt,
> You might want to log into the nagios web interface and load up all 
> services. Then sort by "Last Check" with the newest at the top. Watch it 
> for a few hours and see if the checks are never getting updated (or very 
> very slow) Also check the performance info button in the web interface 
> for average check latency/execution time. Maybe you have a pretty hefty 
> latency time?

Right this minute, the oldest 'Last Check' stat is about 13 minutes old... not impressive.

Something interesting about this stat: I noted earlier that when I restarted
Nagios it updated the Last Check (and logged a successful check!) for the SSH
and NTP services on the server I had powered off.

As for latency, nothing seriously out of whack is being reported there right
now.  
                        Min    Max    Avg
Check Execution Time:   <1sec  9sec   1.448sec
Check Latency:          <1sec  <1sec  0.000sec
Percent State Change:   0.00%  0.00%  0.00%

The active checks breakdown is a bit disconcerting:

Time Frame            Checks Completed
<= 1 minute           6 (9.0%)         -- not a big deal
<= 5 minutes          49 (73.1%)       -- this is an issue
<= 15 minutes         62 (92.5%)       -- really bugs me
<= 1 hour             67 (100.0%)
Since program start:  67 (100.0%)
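
For anyone who wants to re-derive the percentages in the table above, here is a minimal Python sketch. The counts are copied from the table; the names `cumulative_share` and `not_yet_checked` are made up for illustration and are not part of Nagios.

```python
# Counts copied from the "Checks Completed" table above (67 active checks).
COMPLETED = {
    "<= 1 minute": 6,
    "<= 5 minutes": 49,
    "<= 15 minutes": 62,
    "<= 1 hour": 67,
}
TOTAL_CHECKS = 67


def cumulative_share(count, total):
    """Return the percentage of checks completed within a time frame."""
    return round(100.0 * count / total, 1)


def not_yet_checked(count, total):
    """Return how many services were NOT checked within a time frame."""
    return total - count


if __name__ == "__main__":
    for frame, count in COMPLETED.items():
        share = cumulative_share(count, TOTAL_CHECKS)
        stale = not_yet_checked(count, TOTAL_CHECKS)
        print(f"{frame:<14} {count:>3} ({share}%)  not checked: {stale}")
```

Read the other way around, 18 of the 67 services had gone more than five minutes without a check, and 5 of them more than fifteen minutes.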

There are no passive checks running... so all zeros there.

> If you are indeed hitting a bottleneck in performance, it would most 
> likely be due to the core nagios.cfg options. I don't see a copy of your 
> config files. If you'd like, you can forward me a copy of your nagios.cfg.

I included all the apparently relevant nagios.cfg entries in my first email on
the subject, but I'm attaching the whole file here.  Hopefully someone has
some ideas.
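
A small sketch for pulling the scheduling-related directives out of a nagios.cfg like the one attached, so they can be compared at a glance. This is not a Nagios tool: `parse_main_config`, `scheduling_summary` and the choice of `SCHEDULING_KEYS` are assumptions made for illustration, and the sample text is an excerpt of the attached file.

```python
def parse_main_config(text):
    """Parse nagios.cfg-style text into {directive: [values, ...]}.

    Blank lines and lines starting with '#' are skipped. Values are kept
    in lists because directives such as cfg_dir may appear more than once.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, since values may contain '=' too.
        name, sep, value = line.partition("=")
        if not sep:
            continue  # not a name=value line; ignored in this sketch
        settings.setdefault(name.strip(), []).append(value.strip())
    return settings


# Directives that plausibly affect check scheduling (a hand-picked
# selection for this sketch, not an official list).
SCHEDULING_KEYS = (
    "inter_check_delay_method",
    "service_interleave_factor",
    "max_concurrent_checks",
    "service_reaper_frequency",
    "sleep_time",
    "service_check_timeout",
    "interval_length",
)


def scheduling_summary(settings, keys=SCHEDULING_KEYS):
    """Return {key: last value seen} for the keys present in settings."""
    return {k: settings[k][-1] for k in keys if k in settings}


SAMPLE = """\
# Excerpt copied from the attached nagios.cfg
cfg_dir=/usr/local/nagios/etc/hosts-primary/
cfg_dir=/usr/local/nagios/etc/hosts-backup/
inter_check_delay_method=s
service_interleave_factor=s
max_concurrent_checks=0
service_reaper_frequency=10
sleep_time=1
illegal_object_name_chars=`~!$%^&*|'"<>?,()=
"""

if __name__ == "__main__":
    for key, value in scheduling_summary(parse_main_config(SAMPLE)).items():
        print(f"{key}={value}")
```

To run it against a real file, read the file's contents into a string and pass that to `parse_main_config` instead of `SAMPLE`.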

Thanks again, Jason
   Matt

-- 
Matt Pounsett                 CIRA - Canadian Internet Registration Authority
Technical Support Programmer                    350 Sparks Street, Suite 1110
matt.pounsett at cira.ca                                 Ottawa, Ontario, Canada
613.237.5335 ext. 231                                      http://www.cira.ca
-------------- next part --------------
##############################################################################
#
# NAGIOS.CFG - Sample Main Config File for Nagios 
#
# Read the documentation for more information on this configuration
# file.  I've provided some comments here, but things may not be so
# clear without further explanation.
#
# Last Modified: 07-04-2002
#
##############################################################################


# LOG FILE
# This is the main log file where service and host events are logged
# for historical purposes.  This should be the first option specified 
# in the config file!!!

log_file=/usr/local/nagios/var/nagios.log



# OBJECT CONFIGURATION FILE(S)
# This is the configuration file in which you define hosts, host
# groups, contacts, contact groups, services, etc.  I guess it would
# be better called an object definition file, but for historical
# reasons it isn't.  You can split object definitions into several
# different config files by using multiple cfg_file statements here.
# Nagios will read and process all the config files you define.
# This can be very useful if you want to keep command definitions 
# separate from host and contact definitions...

# Plugin commands (service and host check commands)
# Arguments are likely to change between different releases of the
# plugins, so you should use the same config file provided with the
# plugin release rather than the one provided with Nagios.
#cfg_file=/usr/local/nagios/etc/checkcommands.cfg

# Misc commands (notification and event handler commands, etc)
#cfg_file=/usr/local/nagios/etc/misccommands.cfg

# You can split other types of object definitions across several
# config files if you wish (as done here), or keep them all in a
# single config file.

#cfg_file=/usr/local/nagios/etc/contactgroups.cfg
#cfg_file=/usr/local/nagios/etc/contacts.cfg
#cfg_file=/usr/local/nagios/etc/dependencies.cfg
#cfg_file=/usr/local/nagios/etc/escalations.cfg
#cfg_file=/usr/local/nagios/etc/hostgroups.cfg
#cfg_file=/usr/local/nagios/etc/hosts.cfg
#cfg_file=/usr/local/nagios/etc/services.cfg
#cfg_file=/usr/local/nagios/etc/timeperiods.cfg

cfg_file=/usr/local/nagios/etc/commands.cfg
cfg_file=/usr/local/nagios/etc/contacts.cfg
cfg_file=/usr/local/nagios/etc/object-templates.cfg
cfg_file=/usr/local/nagios/etc/timeperiods.cfg

cfg_file=/usr/local/nagios/etc/hostgroups.cfg
cfg_dir=/usr/local/nagios/etc/hosts-primary/
cfg_dir=/usr/local/nagios/etc/hosts-backup/
cfg_dir=/usr/local/nagios/etc/hosts-sparks/

# RESOURCE FILE
# This is an optional resource file that contains $USERx$ macro
# definitions. Multiple resource files can be specified by using
# multiple resource_file definitions.  The CGIs will not attempt to
# read the contents of resource files, so information that is
# considered to be sensitive (usernames, passwords, etc) can be
# defined as macros in this file and restrictive permissions (600)
# can be placed on this file.

resource_file=/usr/local/nagios/etc/resources.cfg



# STATUS FILE
# This is where the current status of all monitored services and
# hosts is stored.  Its contents are read and processed by the CGIs.
# The contents of the status file are deleted every time Nagios
# restarts.

status_file=/usr/local/nagios/var/status.log



# NAGIOS USER
# This determines the effective user that Nagios should run as.  
# You can either supply a username or a UID.

nagios_user=nagios



# NAGIOS GROUP
# This determines the effective group that Nagios should run as.  
# You can either supply a group name or a GID.

nagios_group=daemon



# EXTERNAL COMMAND OPTION
# This option allows you to specify whether or not Nagios should check
# for external commands (in the command file defined below).  By default
# Nagios will *not* check for external commands, just to be on the
# cautious side.  If you want to be able to use the CGI command interface
# you will have to enable this.  Setting this value to 0 disables command
# checking (the default), other values enable it.

check_external_commands=1



# EXTERNAL COMMAND CHECK INTERVAL
# This is the interval at which Nagios should check for external commands.
# This value works off the interval_length you specify later.  If you leave
# that at its default value of 60 (seconds), a value of 1 here will cause
# Nagios to check for external commands every minute.  If you specify a
# number followed by an "s" (e.g. 15s), this will be interpreted to mean
# actual seconds rather than a multiple of the interval_length variable.
# Note: In addition to reading the external command file at regularly 
# scheduled intervals, Nagios will also check for external commands after
# event handlers are executed.
# NOTE: Setting this value to -1 causes Nagios to check the external
# command file as often as possible.

#command_check_interval=1
#command_check_interval=15s
command_check_interval=-1



# EXTERNAL COMMAND FILE
# This is the file that Nagios checks for external command requests.
# It is also where the command CGI will write commands that are submitted
# by users, so it must be writeable by the user that the web server
# is running as (usually 'nobody').  Permissions should be set at the 
# directory level instead of on the file, as the file is deleted every
# time its contents are processed.

command_file=/usr/local/nagios/var/rw/nagios.cmd



# COMMENT FILE
# This is the file that Nagios will use for storing host and service
# comments.

comment_file=/usr/local/nagios/var/comment.log



# DOWNTIME FILE
# This is the file that Nagios will use for storing host and service
# downtime data.

downtime_file=/usr/local/nagios/var/downtime.log



# LOCK FILE
# This is the lockfile that Nagios will use to store its PID number
# in when it is running in daemon mode.

lock_file=/usr/local/nagios/var/nagios.lock



# TEMP FILE
# This is a temporary file that is used as scratch space when Nagios
# updates the status log, cleans the comment file, etc.  This file
# is created, used, and deleted throughout the time that Nagios is
# running.

temp_file=/usr/local/nagios/var/nagios.tmp



# LOG ROTATION METHOD
# This is the log rotation method that Nagios should use to rotate
# the main log file. Values are as follows..
#	n	= None - don't rotate the log
#	h	= Hourly rotation (top of the hour)
#	d	= Daily rotation (midnight every day)
#	w	= Weekly rotation (midnight on Saturday evening)
#	m	= Monthly rotation (midnight last day of month)

log_rotation_method=d



# LOG ARCHIVE PATH
# This is the directory where archived (rotated) log files should be 
# placed (assuming you've chosen to do log rotation).

log_archive_path=/usr/local/nagios/var/archives



# LOGGING OPTIONS
# If you want messages logged to the syslog facility, as well as the
# Nagios log file, set this option to 1.  If not, set it to 0.

use_syslog=0



# NOTIFICATION LOGGING OPTION
# If you don't want notifications to be logged, set this value to 0.
# If notifications should be logged, set the value to 1.

log_notifications=1



# SERVICE RETRY LOGGING OPTION
# If you don't want service check retries to be logged, set this value
# to 0.  If retries should be logged, set the value to 1.

log_service_retries=1



# HOST RETRY LOGGING OPTION
# If you don't want host check retries to be logged, set this value to
# 0.  If retries should be logged, set the value to 1.

log_host_retries=1



# EVENT HANDLER LOGGING OPTION
# If you don't want host and service event handlers to be logged, set
# this value to 0.  If event handlers should be logged, set the value
# to 1.

log_event_handlers=1



# INITIAL STATES LOGGING OPTION
# If you want Nagios to log all initial host and service states to
# the main log file (the first time the service or host is checked)
# you can enable this option by setting this value to 1.  If you
# are not using an external application that does long term state
# statistics reporting, you do not need to enable this option.  In
# this case, set the value to 0.

log_initial_states=1



# EXTERNAL COMMANDS LOGGING OPTION
# If you don't want Nagios to log external commands, set this value
# to 0.  If external commands should be logged, set this value to 1.
# Note: This option does not include logging of passive service
# checks - see the option below for controlling whether or not
# passive checks are logged.

log_external_commands=1



# PASSIVE SERVICE CHECKS LOGGING OPTION
# If you don't want Nagios to log passive service checks, set this
# value to 0.  If passive service checks should be logged, set this
# value to 1.

log_passive_service_checks=1



# GLOBAL HOST AND SERVICE EVENT HANDLERS
# These options allow you to specify a host and service event handler
# command that is to be run for every host or service state change.
# The global event handler is executed immediately prior to the event
# handler that you have optionally specified in each host or
# service definition. The command argument is the short name of a
# command definition that you define in your host configuration file.
# Read the HTML docs for more information.

#global_host_event_handler=somecommand
#global_service_event_handler=somecommand



# INTER-CHECK DELAY METHOD
# This is the method that Nagios should use when initially
# "spreading out" service checks when it starts monitoring.  The
# default is to use smart delay calculation, which will try to
# space all service checks out evenly to minimize CPU load.
# Using no delay (n) will cause all checks to be scheduled
# at the same time (with no delay between them)!  This is not a
# good thing for production, but is useful when testing the
# parallelization functionality.
#	n	= None - don't use any delay between checks
#	d	= Use a "dumb" delay of 1 second between checks
#	s	= Use "smart" inter-check delay calculation
#       x.xx    = Use an inter-check delay of x.xx seconds

inter_check_delay_method=s



# SERVICE CHECK INTERLEAVE FACTOR
# This variable determines how service checks are interleaved.
# Interleaving the service checks allows for a more even
# distribution of service checks and reduced load on remote
# hosts.  Setting this value to 1 is equivalent to how versions
# of Nagios previous to 0.0.5 did service checks.  Set this
# value to s (smart) for automatic calculation of the interleave
# factor unless you have a specific reason to change it.
#       s       = Use "smart" interleave factor calculation
#       x       = Use an interleave factor of x, where x is a
#                 number greater than or equal to 1.

service_interleave_factor=s



# MAXIMUM CONCURRENT SERVICE CHECKS
# This option allows you to specify the maximum number of 
# service checks that can be run in parallel at any given time.
# Specifying a value of 1 for this variable essentially prevents
# any service checks from being parallelized.  A value of 0
# will not restrict the number of concurrent checks that are
# being executed.

max_concurrent_checks=0



# SERVICE CHECK REAPER FREQUENCY
# This is the frequency (in seconds!) that Nagios will process
# the results of services that have been checked.

service_reaper_frequency=10



# SLEEP TIME
# This is the number of seconds to sleep between checking for system
# events and service checks that need to be run.  I would recommend
# *not* changing this from its default value of 1 second.

sleep_time=1



# TIMEOUT VALUES
# These options control how much time Nagios will allow various
# types of commands to execute before killing them off.  Options
# are available for controlling maximum time allotted for
# service checks, host checks, event handlers, notifications, the
# ocsp command, and performance data commands.  All values are in
# seconds.

service_check_timeout=60
host_check_timeout=30
event_handler_timeout=30
notification_timeout=30
ocsp_timeout=5
perfdata_timeout=5



# RETAIN STATE INFORMATION
# This setting determines whether or not Nagios will save state
# information for services and hosts before it shuts down.  Upon
# startup Nagios will reload all saved service and host state
# information before starting to monitor.  This is useful for 
# maintaining long-term data on state statistics, etc, but will
# slow Nagios down a bit when it (re)starts.  Since it's only
# a one-time penalty, I think it's well worth the additional
# startup delay.

retain_state_information=1



# STATE RETENTION FILE
# This is the file that Nagios should use to store host and
# service state information before it shuts down.  The state 
# information in this file is also read immediately prior to
# starting to monitor the network when Nagios is restarted.
# This file is used only if the retain_state_information
# variable is set to 1.

state_retention_file=/usr/local/nagios/var/status.sav



# RETENTION DATA UPDATE INTERVAL
# This setting determines how often (in minutes) that Nagios
# will automatically save retention data during normal operation.
# If you set this value to 0, Nagios will not save retention
# data at regular intervals, but it will still save retention
# data before shutting down or restarting.  If you have disabled
# state retention, this option has no effect.

retention_update_interval=60



# USE RETAINED PROGRAM STATE
# This setting determines whether or not Nagios will set 
# program status variables based on the values saved in the
# retention file.  If you want to use retained program status
# information, set this value to 1.  If not, set this value
# to 0.

use_retained_program_state=0



# INTERVAL LENGTH
# This is the seconds per unit interval as used in the
# host/contact/service configuration files.  Setting this to 60 means
# that each interval is one minute long (60 seconds).  Other settings
# have not been tested much, so your mileage is likely to vary...

interval_length=60



# AGGRESSIVE HOST CHECKING OPTION
# If you don't want to turn on aggressive host checking features, set
# this value to 0 (the default).  Otherwise set this value to 1 to
# enable the aggressive check option.  Read the docs for more info
# on what aggressive host check is or check out the source code in
# base/checks.c

use_agressive_host_checking=0



# SERVICE CHECK EXECUTION OPTION
# This determines whether or not Nagios will actively execute
# service checks when it initially starts.  If this option is 
# disabled, checks are not actively made, but Nagios can still
# receive and process passive check results that come in.  Unless
# you're implementing redundant hosts or have a special need for
# disabling the execution of service checks, leave this enabled!
# Values: 1 = enable checks, 0 = disable checks

execute_service_checks=1



# PASSIVE CHECK ACCEPTANCE OPTION
# This determines whether or not Nagios will accept passive
# service checks results when it initially (re)starts.
# Values: 1 = accept passive checks, 0 = reject passive checks

accept_passive_service_checks=1



# NOTIFICATIONS OPTION
# This determines whether or not Nagios will send out any host or
# service notifications when it is initially (re)started.
# Values: 1 = enable notifications, 0 = disable notifications

enable_notifications=1



# EVENT HANDLER USE OPTION
# This determines whether or not Nagios will run any host or
# service event handlers when it is initially (re)started.  Unless
# you're implementing redundant hosts, leave this option enabled.
# Values: 1 = enable event handlers, 0 = disable event handlers

enable_event_handlers=1



# PROCESS PERFORMANCE DATA OPTION
# This determines whether or not Nagios will process performance
# data returned from service and host checks.  If this option is
# enabled, host performance data will be processed using the
# host_perfdata_command (defined below) and service performance
# data will be processed using the service_perfdata_command (also
# defined below).  Read the HTML docs for more information on
# performance data.
# Values: 1 = process performance data, 0 = do not process performance data

process_performance_data=0



# HOST AND SERVICE PERFORMANCE DATA PROCESSING COMMANDS
# These commands are run after every host and service check is
# performed.  These commands are executed only if the
# process_performance_data option (above) is set to 1.  The command
# argument is the short name of a command definition that you 
# define in your host configuration file.  Read the HTML docs for
# more information on performance data.

#host_perfdata_command=process-host-perfdata
#service_perfdata_command=process-service-perfdata



# OBSESS OVER SERVICE CHECKS OPTION
# This determines whether or not Nagios will obsess over service
# checks and run the ocsp_command defined below.  Unless you're
# planning on implementing distributed monitoring, do not enable
# this option.  Read the HTML docs for more information on
# implementing distributed monitoring.
# Values: 1 = obsess over services, 0 = do not obsess (default)

obsess_over_services=0



# OBSESSIVE COMPULSIVE SERVICE PROCESSOR COMMAND
# This is the command that is run for every service check that is
# processed by Nagios.  This command is executed only if the
# obsess_over_services option (above) is set to 1.  The command
# argument is the short name of a command definition that you
# define in your host configuration file. Read the HTML docs for
# more information on implementing distributed monitoring.

#ocsp_command=somecommand



# ORPHANED SERVICE CHECK OPTION
# This determines whether or not Nagios will periodically 
# check for orphaned services.  Since service checks are not
# rescheduled until the results of their previous execution 
# instance are processed, there exists a possibility that some
# checks may never get rescheduled.  This seems to be a rare
# problem and should not happen under normal circumstances.
# If you have problems with service checks never getting
# rescheduled, you might want to try enabling this option.
# Values: 1 = enable checks, 0 = disable checks

check_for_orphaned_services=1



# SERVICE FRESHNESS CHECK OPTION
# This option determines whether or not Nagios will periodically
# check the "freshness" of service results.  Enabling this option
# is useful for ensuring passive checks are received in a timely
# manner.
# Values: 1 = enabled freshness checking, 0 = disable freshness checking

check_service_freshness=1



# FRESHNESS CHECK INTERVAL
# This setting determines how often (in seconds) Nagios will
# check the "freshness" of service check results.  If you have
# disabled service freshness checking, this option has no effect.

freshness_check_interval=60



# AGGREGATED STATUS UPDATES
# This option determines whether or not Nagios will 
# aggregate updates of host, service, and program status
# data.  Normally, status data is updated immediately when
# a change occurs.  This can result in high CPU loads if
# you are monitoring a lot of services.  If you want Nagios
# to only refresh status data every few seconds, enable
# this option.
# Values: 1 = enable aggregate updates, 0 = disable aggregate updates

aggregate_status_updates=1



# AGGREGATED STATUS UPDATE INTERVAL
# Combined with the aggregate_status_updates option,
# this option determines the frequency (in seconds!) that
# Nagios will periodically dump program, host, and 
# service status data.  If you are not using aggregated
# status data updates, this option has no effect.

status_update_interval=15



# FLAP DETECTION OPTION
# This option determines whether or not Nagios will try
# and detect hosts and services that are "flapping".  
# Flapping occurs when a host or service changes between
# states too frequently.  When Nagios detects that a 
# host or service is flapping, it will temporarily suppress
# notifications for that host/service until it stops
# flapping.  Flap detection is very experimental, so read
# the HTML documentation before enabling this feature!
# Values: 1 = enable flap detection
#         0 = disable flap detection (default)

enable_flap_detection=1



# FLAP DETECTION THRESHOLDS FOR HOSTS AND SERVICES
# Read the HTML documentation on flap detection for
# an explanation of what this option does.  This option
# has no effect if flap detection is disabled.

low_service_flap_threshold=5.0
high_service_flap_threshold=20.0
low_host_flap_threshold=5.0
high_host_flap_threshold=20.0



# DATE FORMAT OPTION
# This option determines how short dates are displayed. Valid options
# include:
#	us		(MM-DD-YYYY HH:MM:SS)
#	euro    	(DD-MM-YYYY HH:MM:SS)
#	iso8601		(YYYY-MM-DD HH:MM:SS)
#	strict-iso8601	(YYYY-MM-DDTHH:MM:SS)
#

date_format=iso8601



# ILLEGAL OBJECT NAME CHARACTERS
# This option allows you to specify illegal characters that cannot
# be used in host names, service descriptions, or names of other
# object types.

illegal_object_name_chars=`~!$%^&*|'"<>?,()=



# ILLEGAL MACRO OUTPUT CHARACTERS
# This option allows you to specify illegal characters that are
# stripped from macros before being used in notifications, event
# handlers, etc.  This DOES NOT affect macros used in service or
# host check commands.
# The following macros are stripped of the characters you specify:
# 	$OUTPUT$, $PERFDATA$

illegal_macro_output_chars=`~$&|'"<>



# ADMINISTRATOR EMAIL ADDRESS
# The email address of the administrator of *this* machine (the one
# doing the monitoring).  Nagios never uses this value itself, but
# you can access this value by using the $ADMINEMAIL$ macro in your
# notification commands.

admin_email=nagios at cira.ca



# ADMINISTRATOR PAGER NUMBER/ADDRESS
# The pager number/address for the administrator of *this* machine.
# Nagios never uses this value itself, but you can access this
# value by using the $ADMINPAGER$ macro in your notification
# commands.

admin_pager=pagenagios



# EOF (End of file)
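
The EXTERNAL COMMAND CHECK INTERVAL comments in the file above describe three forms for `command_check_interval`: a plain number (a multiple of `interval_length`), a number with an "s" suffix (actual seconds), and -1 (as often as possible). The sketch below restates that rule in Python. It is based only on those comments, not on the Nagios source, and the function name `command_check_seconds` is hypothetical.

```python
def command_check_seconds(value, interval_length=60):
    """Interpret a command_check_interval value per the config comments.

    Returns the interval in seconds, or None when the value is -1,
    which the comments describe as "as often as possible".
    """
    value = value.strip()
    if value == "-1":
        return None
    if value.endswith("s"):
        # "15s" means 15 actual seconds, independent of interval_length.
        return int(value[:-1])
    # A bare number is a multiple of interval_length.
    return int(value) * interval_length


if __name__ == "__main__":
    # The three variants that appear in the attached file.
    for setting in ("1", "15s", "-1"):
        print(setting, "->", command_check_seconds(setting))
```

With the attached settings (`command_check_interval=-1`, `interval_length=60`) the function returns None, meaning there is no fixed interval.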


