Service status not resetting
Brett Stevens
brett.stevens at hubbub.com.au
Thu Feb 17 05:50:18 CET 2005
Hi. I've only just started to use Nagios and so far it has been a great
system. I've configured a plugin, check_rrd_data, to check RRDs created by
Cacti. This seems to be working well, but if a service goes critical it
never returns to OK. For example, when CPU0 utilisation on server x goes to
99.9999 for a few minutes, I get a critical in the GUI and an email. This is
exactly what I would expect:
server x    Down    Date    duration    message    CPU Util CRITICAL: 99.9999
However, when it comes back online I get the same line, except that the
message shows a good value such as CPU OK: 25.99, while the GUI still shows
the host as down.
I think this shows that the plugin is working, but I may have screwed up the
server config or the service config.
The same behaviour occurs if a server is not contactable, and it also shows
up in the host detail CGI.
Service config for the example (sanitised):
define service{
        use                     generic-service
        host_name               problem_server
        service_description     CPU0 Utilization
        check_command           check_rrd_data!$USER3$/grandma_cpudpc_71.rrd!cpuProcessor!50!70!CPU
        }
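For reference, my understanding is that Nagios takes the service state from the plugin's exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and not from the text it prints, so a plugin that prints "CPU OK" but exits non-zero would still show as a problem. A minimal sketch of that mapping in Python follows. The helper names classify and run_check are hypothetical, the 50 and 70 are assumed to be the warning and critical thresholds from the check_command above, and the real check_rrd_data is a different script that may behave differently.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(value, warn, crit):
    """Map a reading to an exit code (hypothetical helper, not check_rrd_data)."""
    if value is None:
        # No data could be read from the RRD.
        return UNKNOWN
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def run_check(value, warn=50.0, crit=70.0, label="CPU"):
    """Build the status line and the exit code from the same decision,
    so the printed text and the returned state cannot disagree."""
    code = classify(value, warn, crit)
    shown = "no data" if value is None else f"{value:.2f}"
    return code, f"{label} {STATE_NAMES[code]}: {shown}"
```

A real plugin would print the message and then call sys.exit() with the code. Running the plugin by hand and checking the exit status with echo $? after a recovery would show whether the two agree.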
Generic service def:
define service{
        name                            generic-service
        active_checks_enabled           1       ; Active service checks are enabled
        passive_checks_enabled          1       ; Passive service checks are enabled/accepted
        parallelize_check               1       ; Active service checks should be parallelized
        obsess_over_service             1       ; We should obsess over this service (if necessary)
        check_freshness                 0       ; Default is to NOT check service 'freshness'
        check_period                    24x7
        notifications_enabled           1       ; Service notifications are enabled
        event_handler_enabled           1       ; Service event handler is enabled
        flap_detection_enabled          1       ; Flap detection is enabled
        process_perf_data               1       ; Process performance data
        retain_status_information       1       ; Retain status information across program restarts
        retain_nonstatus_information    1       ; Retain non-status information across program restarts
        max_check_attempts              3       ; Number of times to check before sending an alert
        normal_check_interval           5       ; Check the service every 5 mins
        retry_check_interval            1       ; Time to wait before scheduling a re-check of a service
        notification_interval           5       ; The number of "time units" to wait before re-notifying a contact that this service is still in a non-OK state. Time units are minutes
        notification_period             24x7
        notifications_enabled           1       ; Enable notifications
        contact_groups                  Win32-Admins
        register                        0
        }
Host def:
define host{
        host_name       problem_server
        alias           problem_server
        address         www.xxx.yyy.zzz
        use             generic-win32
        parents         vLan3,vLan4
        }
Generic host def:
define host{
        name                            generic-win32
        check_command                   check_http
        max_check_attempts              5
        process_perf_data               0
        retain_nonstatus_information    0
        notification_interval           30
        notification_period             24x7
        notification_options            d,u,r
        contact_groups                  Win32-Admins
        notifications_enabled           1       ; Host notifications are enabled
        event_handler_enabled           1       ; Host event handler is enabled
        flap_detection_enabled          1       ; Flap detection is enabled
        process_perf_data               1       ; Process performance data
        retain_status_information       1       ; Retain status information across program restarts
        retain_nonstatus_information    1       ; Retain non-status information across program restarts
        register                        0       ; DON'T REGISTER THIS DEFINITION - IT'S NOT A REAL HOST, JUST A TEMPLATE!
        }
I've probably screwed up a definition somewhere, as I have been mucking around
a bit to try different config layouts. We have quite a few servers and gear
to monitor.
Any help would be greatly appreciated.
Thanks in advance,
Brett Stevens