freshness check on passive service fails
Antoine Reid
areid at logient.com
Thu May 27 23:13:16 CEST 2004
--On Monday, May 24, 2004 12:48 PM +0200 Michael Huettig
<Michael.Huettig at Medien-Systempartner.de> wrote:
> Hi all,
> I'm using nagios 1.2 with nsca/send-nsca 2.4 to submit passive
> check-results from some services. It worked fine for more than 6 months,
> but last week it started driving me crazy.
>
> nagios doesn't accept any value for freshness-threshold; every 5
> minutes it runs the script, which notifies me.
For what it's worth, I'm having similar issues myself. My setup is a bit
different, so I'll post it below. I have two Nagios processes running on
two different hosts, in different subnets. The one doing the actual checks
is obsessing over services and sends the results through nsca to the main
nagios host. The main host seems to decide my service results aren't fresh
enough, then runs the check_command, which is a dummy script returning
WARNING (originally CRITICAL, but it generated too many notifications).
A few seconds or minutes later, a new passive check comes in, which brings
the service(s) back to OK; a couple of minutes later, it switches back to
WARNING, and so on.
Both hosts are running FreeBSD, one is on 4.9 (the main host) while the one
performing the actual checks is running 5.2.1. All on i386.
Complete configs can be made available upon request (sent out-of-band to
save list bandwidth) if I didn't provide enough details.
I'm sure I'm either not using the software the way it's supposed to be
used, or I have a configuration glitch, but I can't seem to find it. I find
it odd that the main nagios process would run the service check only a
couple of *seconds* after it has received an "OK" passive check. This type
of service is set with "active_checks_enabled 0" and "check_freshness 1",
and I understood it would only run the service check IF the results aren't
fresh enough.
Can anyone shed some light on this?
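
To make my understanding explicit, here is a minimal sketch of the rule I
expected. It is not Nagios code: the is_stale helper and the example
timestamps are made up for illustration, and I am assuming the threshold is
simply compared against the age of the last result in seconds.

```sh
#!/bin/sh
# Hypothetical helper, NOT taken from Nagios: says whether a result is
# older than the freshness threshold.
#   $1 = time of the last check result (epoch seconds)
#   $2 = freshness threshold (seconds)
#   $3 = current time (epoch seconds)
is_stale() {
    age=$(( $3 - $1 ))
    if [ "$age" -gt "$2" ]; then
        echo "stale"
    else
        echo "fresh"
    fi
}

# A result that is 12 seconds old, with a 600 second threshold
is_stale 1000 600 1012    # prints "fresh"
# A result that is 700 seconds old, with the same threshold
is_stale 1000 600 1700    # prints "stale"
```

Under that reading, the check_command should only ever run in the second
case.
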
Here are excerpts from my configs:
On the MAIN nagios machine (the one that receives the passive checks
and does notifications):
nagios.cfg: (not sure what is relevant here..)
ocsp_timeout=5
interval_length=60
execute_service_checks=1
accept_passive_service_checks=1
obsess_over_services=0
check_service_freshness=1
freshness_check_interval=600
and from service.cfg:
define service{
name passive-service
active_checks_enabled 0
passive_checks_enabled 1
parallelize_check 1
obsess_over_service 1
check_freshness 1
notifications_enabled 1
event_handler_enabled 1
flap_detection_enabled 1
process_perf_data 1
retain_status_information 1
retain_nonstatus_information 1
is_volatile 0
check_period 24x7
max_check_attempts 3
normal_check_interval 5
retry_check_interval 1
notification_interval 120
notification_period 24x7
notification_options w,u,c,r
check_command service-is-stale
freshness_threshold 600
register 0
}
define service{
use passive-service
service_description PING
host_name bloodymary.domain.logient.com
contact_groups unix-admins
}
define service{
use passive-service
service_description DNS
host_name bloodymary.domain.logient.com
contact_groups unix-admins
}
(I have a bunch of services with "use passive-service", all configured this
way, and they all produce the same behaviour.)
Here is the "service-is-stale" command:
define command{
command_name service-is-stale
command_line $USER1$/staleservice.sh
}
And the staleservice.sh script:
#!/bin/sh
/bin/echo "WARNING: Service results are stale!"
exit 1
--------------------------------------------------------------------------
On the *other* machine, also running Nagios, here are the config excerpts:
ocsp_timeout=5
interval_length=60
execute_service_checks=1
accept_passive_service_checks=1
enable_notifications=0
enable_event_handlers=1
obsess_over_services=1
ocsp_command=submit_check_result
define service{
name generic-service
active_checks_enabled 1
passive_checks_enabled 1
parallelize_check 1
obsess_over_service 1
check_freshness 0
notifications_enabled 1
event_handler_enabled 1
flap_detection_enabled 1
process_perf_data 1
retain_status_information 1
retain_nonstatus_information 1
register 0
}
and in services.cfg:
define service{
use generic-service
host_name bloodymary.domain.logient.com
service_description PING
is_volatile 0
check_period 24x7
max_check_attempts 3
normal_check_interval 1
retry_check_interval 1
contact_groups contactgroup
notification_interval 120
notification_period 24x7
notification_options w,u,c,r
check_command check_fping!2000,80%!5000,100%
}
define service{
use generic-service
host_name bloodymary.domain.logient.com
service_description DNS
is_volatile 0
check_period 24x7
max_check_attempts 3
normal_check_interval 1
retry_check_interval 1
contact_groups contactgroup
notification_interval 120
notification_period 24x7
notification_options w,u,c,r
check_command check_dig!dev.domain.logient.com!10
}
My submit_check_result looks like this:
define command{
command_name submit_check_result
command_line /usr/local/libexec/nagios/eventhandlers/submit_check_result $HOSTNAME$ '$SERVICEDESC$' $SERVICESTATE$ '$OUTPUT$'
}
and the script itself contains:
-----
#!/bin/sh
# Arguments:
# $1 = host_name (Short name of host that the service is
# associated with)
# $2 = svc_description (Description of the service)
# $3 = state_string (A string representing the status of
# the given service - "OK", "WARNING", "CRITICAL"
# or "UNKNOWN")
# $4 = plugin_output (A text string that should be used
# as the plugin output for the service checks)
#
# Convert the state string to the corresponding return code
return_code=-1
case "$3" in
OK)
return_code=0
;;
WARNING)
return_code=1
;;
CRITICAL)
return_code=2
;;
UNKNOWN)
return_code=-1
;;
esac
# pipe the service check info into the send_nsca program, which
# in turn transmits the data to the nsca daemon on the central
# monitoring server
# Used for debugging only..
#/usr/bin/printf "%s\t%s\t%s\t%s\n" "$1" "$2" "$return_code" "$4" >> /tmp/send_nsca.log
/usr/bin/printf "%s\t%s\t%s\t%s\n" "$1" "$2" "$return_code" "$4" |
/usr/local/libexec/nagios/send_nsca 192.168.10.138 -c /usr/local/etc/nagios/send_nsca.cfg
-----
I'm using printf instead of echo; otherwise I had problems with some plugin
outputs, which didn't work because they contained "%" signs.
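
For illustration, here is a small sketch of why the fixed format string is
safe. The format_result function name is made up for this example; the
sample output is one of the PING results from my log. Because the plugin
output is passed as an argument to "%s" and is never part of the format
string itself, printf does not interpret any "%" it contains.

```sh
#!/bin/sh
# Hypothetical wrapper around the same printf call used in the script above.
# Builds one tab-separated line: host, service, return code, plugin output.
format_result() {
    printf "%s\t%s\t%s\t%s\n" "$1" "$2" "$3" "$4"
}

# The "%" in the plugin output is passed through untouched.
format_result "bloodymary.domain.logient.com" "PING" 0 \
    "FPING OK - 192.168.0.200 (loss=0.000000%, rta=0.260000 ms)"
```
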
------------------------------------------------------
Now, here is what I get on the main machine's log:
[1085691545] SERVICE ALERT: bloodymary.domain.logient.com;DNS;WARNING;SOFT;1;WARNING: Service results are stale!
[1085691545] SERVICE ALERT: bloodymary.domain.logient.com;PING;OK;SOFT;2;FPING OK - 192.168.0.200 (loss=0.000000%, rta=0.260000 ms)
[1085691593] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;bloodymary.domain.logient.com;DNS;0;DNS ok - 0 seconds response time (dev.domain.logient.com. 1H IN A 192.168.0.201)
[1085691595] SERVICE ALERT: bloodymary.domain.logient.com;DNS;OK;SOFT;2;DNS ok - 0 seconds response time (dev.domain.logient.com. 1H IN A 192.168.0.201)
[1085691595] SERVICE ALERT: bloodymary.domain.logient.com;PING;WARNING;SOFT;1;WARNING: Service results are stale!
[1085691602] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;bloodymary.domain.logient.com;PING;0;FPING OK - 192.168.0.200 (loss=0.000000%, rta=0.310000 ms)
[1085691605] SERVICE ALERT: bloodymary.domain.logient.com;DNS;WARNING;SOFT;1;WARNING: Service results are stale!
[1085691605] SERVICE ALERT: bloodymary.domain.logient.com;PING;OK;SOFT;2;FPING OK - 192.168.0.200 (loss=0.000000%, rta=0.310000 ms)
[1085691652] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;bloodymary.domain.logient.com;DNS;0;DNS ok - 0 seconds response time (dev.domain.logient.com. 1H IN A 192.168.0.201)
[1085691655] SERVICE ALERT: bloodymary.domain.logient.com;DNS;OK;SOFT;2;DNS ok - 0 seconds response time (dev.domain.logient.com. 1H IN A 192.168.0.201)
[1085691656] SERVICE ALERT: bloodymary.domain.logient.com;PING;WARNING;SOFT;1;WARNING: Service results are stale!
[1085691663] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;bloodymary.domain.logient.com;PING;0;FPING OK - 192.168.0.200 (loss=0.000000%, rta=0.330000 ms)
[1085691665] SERVICE ALERT: bloodymary.domain.logient.com;DNS;WARNING;SOFT;1;WARNING: Service results are stale!
[1085691665] SERVICE ALERT: bloodymary.domain.logient.com;PING;OK;SOFT;2;FPING OK - 192.168.0.200 (loss=0.000000%, rta=0.330000 ms)
[1085691713] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;bloodymary.domain.logient.com;DNS;0;DNS ok - 0 seconds response time (dev.domain.logient.com. 1H IN A 192.168.0.201)
[1085691715] SERVICE ALERT: bloodymary.domain.logient.com;DNS;OK;SOFT;2;DNS ok - 0 seconds response time (dev.domain.logient.com. 1H IN A 192.168.0.201)
[1085691715] SERVICE ALERT: bloodymary.domain.logient.com;PING;WARNING;SOFT;1;WARNING: Service results are stale!
[1085691723] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;bloodymary.domain.logient.com;PING;0;FPING OK - 192.168.0.200 (loss=0.000000%, rta=0.370000 ms)
[1085691725] SERVICE ALERT: bloodymary.domain.logient.com;DNS;WARNING;SOFT;1;WARNING: Service results are stale!
[1085691725] SERVICE ALERT: bloodymary.domain.logient.com;PING;OK;SOFT;2;FPING OK - 192.168.0.200 (loss=0.000000%, rta=0.370000 ms)
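
To put numbers on it, here is a quick bit of arithmetic on timestamps read
off the log above. The elapsed helper is just something made up for this
example; the timestamps are the EXTERNAL COMMAND entries and the next "stale"
alert for the same service.

```sh
#!/bin/sh
# Hypothetical helper: seconds between two epoch timestamps from the log.
#   $1 = earlier timestamp, $2 = later timestamp
elapsed() {
    echo $(( $2 - $1 ))
}

# DNS: passive result received at 1085691593, stale alert at 1085691605
elapsed 1085691593 1085691605    # prints 12
# PING: passive result received at 1085691602, stale alert at 1085691656
elapsed 1085691602 1085691656    # prints 54
```

Both gaps are far below the freshness_threshold of 600 configured on the
main host, which is why this behaviour looks wrong to me.
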
> On host mh2 runs a cron script which submits the passive check result
> via send-nsca every hour. So I receive every hour that
> test-service-passive is o.k., but after 5 minutes nagios wants to check
> freshness of this service.
As you can see above, I'm using another nagios process instead of cron, but
the result should be the same.
> Any suggestions, ideas, why nagios doesn't accept the
> check-freshness-period of 4000 seconds?
>
> Regards,
>
> Michael
Thanks to anyone who read this far :)
antoine
--
Antoine Reid
Administrateur Système - System Administrator
__________________________________________________
Logient Inc.
Solutions de logiciels Internet - Internet Software Solutions
417 St-Pierre, Suite #700
Montréal (Qc) Canada H2Y 2M4
T. 514-282-4118 ext.32
F. 514-288-0033
www.logient.com