More info/leads: Huge delay in scheduling freshness service check after 3rd try
Erik Larkin
erik.larkin at nuasis.com
Wed Mar 26 19:18:45 CET 2003
FYI, after a little more research, I think I've narrowed it down to an issue
with the scheduling queue. I tossed a debug option in my stale-service
script that logs the time it's called. Then I cross-referenced those times
with the times that nagios logged a failed freshness check, and the times
that nagios received the response from the stale-service script. The time
difference between when the script is actually called and when nagios logs
the script response is maybe a few seconds, leading me to believe that the
service reaper is ok. However, the delay between when nagios says it failed
a freshness check and is forcing a service check, and when the stale-service
script is actually called, was over 32000 seconds at last failure. So, I'm
now focusing on problems with the scheduling queue. Any ideas, anyone?
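For reference, the debug logging amounts to something like this (the log
paths here are just illustrative):

# line added near the top of the stale-service script: record the epoch
# time the script was actually invoked
echo "$(date +%s) stale-service invoked" >> /usr/local/nagios/var/stale-debug.log

# then pull the epoch timestamps of the forced checks out of nagios.log
# to compare against the script's own log
grep "forcing an immediate check" /usr/local/nagios/var/nagios.log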
-----Original Message-----
From: Erik Larkin
Sent: Tuesday, March 25, 2003 2:44 PM
To: 'nagios-users at lists.sourceforge.net'
Subject: Huge delay in scheduling freshness service check after 3rd try
Alright, I'm finally admitting that I can't figure this one out myself.
Trust me, it's a difficult admission, and has involved much tinkering,
hair-pulling, and searching of mailing lists (although searching doesn't
seem to be working right now for the sourceforge lists?).
Anyway, I've got a nagios architecture with multiple distributed servers
sending check results to a central nagios server via nsca. The central
server doesn't perform any active checks (no network access to the
distributed network), but is configured to perform a freshness check for a
service called 'Heartbeat' for each distributed instance. The heartbeat is
just a ping of the loopback performed every minute, although I've since
discovered I could have used check_dummy. Seems to be a pretty common
setup, and for the most part it works very well.
Except for the freshness checks. They work fine up until about the 3rd
failed freshness check, at which point latency skyrockets: from 99 to 280
to 749 seconds, on up to thousands and thousands of seconds of latency. The
log reflects a failed freshness check and a message about forcing the
service check (the check command is the typical echo and exit 2). But the
service alert response is delayed more and more. I've tried everything I
can think of, and learned a
great deal in my searching and tweaking, but I can't change this behavior.
Here's what I've tried:
- changed the service_reaper_frequency to 3. I saw a reference to this on
the list for something else and thought it might help; I still suspect some
problem with the service reaper.
- added a 1-second sleep to the script (thought maybe it was returning its
status too quickly)
- futzed with the normal_check_interval for the heartbeat service on the
central server, giving it values between 1 minute and 15 minutes
- enabled check_for_orphaned_services
- tossed a debug option in my stale-service script that sent a line of
output to a log, to make sure that the script itself was being run (it
was); the sketch after this list shows where this and the sleep slot in
- set is_volatile (just to check)
- other things I can't think of right now.
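For concreteness, here's roughly what the stale-service script boils down
to, with the debug line and the (since removed) 1-second sleep folded in;
the log path is illustrative:

#!/bin/sh
# stale-service: run by nagios when a freshness check fails
# debug: record the epoch time we were actually invoked
echo "$(date +%s) stale-service invoked" >> /usr/local/nagios/var/stale-debug.log
# the 1-second sleep I tried, in case the status was returned too quickly:
# sleep 1
# the typical "echo and exit 2" for a stale service:
echo "CRITICAL: Heartbeat check is stale!"
exit 2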
And here's the service entry:
define service{
        use                     qab24x7-service
        service_description     Heartbeat
        hostgroup_name          qabdbfohub
        normal_check_interval   15
        is_volatile             1
        max_check_attempts      1
        check_freshness         1
        notification_interval   15
        freshness_threshold     180
        check_command           stale-service
        }
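(The matching command definition is nothing special; the script path is
illustrative:)

define command{
        command_name    stale-service
        command_line    /usr/local/nagios/libexec/stale-service.sh
        }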
And here are some relevant log snippets:
[1048628465] Warning: The results of service 'Heartbeat' on host
'sj-qab-db01' are stale by 57 seconds (threshold=180 seconds). I'm forcing
an immediate check of the service.
[1048628471] SERVICE ALERT: sj-qab-db01;Heartbeat;CRITICAL;HARD;1;CRITICAL:
Heartbeat check is stale!
[1048628705] Warning: The results of service 'Heartbeat' on host
'sj-qab-db01' are stale by 58 seconds (threshold=180 seconds). I'm forcing
an immediate check of the service.
[1048628711] SERVICE ALERT: sj-qab-db01;Heartbeat;CRITICAL;HARD;1;CRITICAL:
Heartbeat check is stale!
[1048628945] Warning: The results of service 'Heartbeat' on host
'sj-qab-db01' are stale by 57 seconds (threshold=180 seconds). I'm forcing
an immediate check of the service.
[1048628966] SERVICE ALERT: sj-qab-db01;Heartbeat;CRITICAL;HARD;1;CRITICAL:
Heartbeat check is stale!
[1048629185] Warning: The results of service 'Heartbeat' on host
'sj-qab-db01' are stale by 42 seconds (threshold=180 seconds). I'm forcing
an immediate check of the service.
[1048629287] SERVICE ALERT: sj-qab-db01;Heartbeat;CRITICAL;HARD;1;CRITICAL:
Heartbeat check is stale!
[1048629485] Warning: The results of service 'Heartbeat' on host
'sj-qab-db01' are stale by 21 seconds (threshold=180 seconds). I'm forcing
an immediate check of the service.
[1048629770] SERVICE ALERT: sj-qab-db01;Heartbeat;CRITICAL;HARD;1;CRITICAL:
Heartbeat check is stale!
[1048629965] Warning: The results of service 'Heartbeat' on host
'sj-qab-db01' are stale by 20 seconds (threshold=180 seconds). I'm forcing
an immediate check of the service.
[1048630715] SERVICE ALERT: sj-qab-db01;Heartbeat;CRITICAL;HARD;1;CRITICAL:
Heartbeat check is stale!
[1048630925] Warning: The results of service 'Heartbeat' on host
'sj-qab-db01' are stale by 31 seconds (threshold=180 seconds). I'm forcing
an immediate check of the service.
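If you line up the warning/alert pairs above, the gap between "forcing an
immediate check" and the resulting SERVICE ALERT grows 6, 6, 21, 102, 285,
750 seconds. A quick awk one-liner pulls those gaps out (assuming each
entry is a single line in the real nagios.log, unlike the wrapping above):

awk -F'[][]' '/forcing an immediate check/{w=$2} /Heartbeat check is stale/{print $2-w}' nagios.log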
Sorry for the long email/spam, but please oh please: does anyone have
any info regarding this problem?
Many thanks,
Erik Larkin
elarkin at nuasis.com
p.s. Just to go on record, I do think Nagios rocks. Hard. But this itty
bitty problem is driving me nuts! ;)