More info/leads: Huge delay in scheduling freshness service check after 3rd try
Erik Larkin
erik.larkin at nuasis.com
Thu Mar 27 18:34:38 CET 2003
Wow, thanks for the reply. I'm pretty impressed that you're responding to
questions from the users list.
However, I don't think I provided enough information to give the full
picture for that service check. I have active checks disabled server-wide,
so nagios only actively checks that service when a freshness check finds
the result stale. I do want it to perform the freshness check every 3
minutes (and I
do have the interval_length at its default), since the distributed servers
are supposed to send a heartbeat every minute. I originally had the
normal_check_interval set to 1 minute, and if I remember correctly, that
caused nagios to schedule an active check every minute in response to a
failed freshness check. Thinking that perhaps this was messing up the
scheduling, I've changed the normal_check_interval a few times; its
current value is 15. This seems to mean that the freshness-failure
forced check will always kick in before the 'regularly' scheduled check
(3 minutes vs. 15 minutes).
At any rate, the real problem is this: the freshness check correctly
recognizes the service check as stale after four minutes or so, and says
it's forcing a check of the service. Since that check only calls the
ubiquitous stale-service script, the response should only take a few
seconds. The first few log entries below reflect this, with a SERVICE ALERT
in response to the stale-service script initially coming about 6 seconds
after the freshness check warning. However, when the service remains stale,
subsequent forced checks (i.e. executions of the stale-service script)
begin taking longer and longer. After the third check or so, latency
increases to 99, then 280, then 749, on up to over 32,000 seconds at the
last run (seems to be a rough multiplier of 3; I don't know if that's
significant). The latency reported for the service does match the time
delay between the "Warning" message from the freshness check and the
"SERVICE ALERT" response from the stale-service script. I've also
verified that this really is latency in executing the stale-service
script: I added a debug line that logs the script's execution time to a
heartbeat.log, and those times match the 'SERVICE ALERT' log entries
(with a few seconds' delay).
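For reference, the stale-service check command is essentially just the
usual echo-and-exit-2, plus the debug line described above (the log path
here is illustrative, not verbatim from my setup):

#!/bin/sh
# stale-service: run by the central server whenever the passive heartbeat
# result goes stale. The date line is the debug addition; the rest is the
# stock stale-check idiom.
date '+%s' >> /usr/local/nagios/var/heartbeat.log
echo "CRITICAL: Heartbeat check is stale!"
exit 2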
Is this exponentially increasing latency caused by a conflict between my
freshness_threshold and the normal_check_interval? I thought it might be,
and tried a few different normal_check_interval settings, but none of them
changed this problem behavior.
Thanks again for your reply.
Erik
From: "Ethan Galstad" <nagios at nagios.org>
To: nagios-users at lists.sourceforge.net
Date: Wed, 26 Mar 2003 20:50:57 -0600
Subject: Re: [Nagios-users] More info/leads: Huge delay in scheduling
freshness service check after 3rd try
If you haven't changed the interval_length directive from its default
value of 60, the problem lies in your config. In your service
definition you are telling Nagios to actively check the service every
15 minutes. However, the freshness threshold for that service is set
at 3 minutes (180 seconds). The freshness checks get performed about
once every minute, so that's why you're seeing a staleness warning
approximately every 4 minutes.
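In config terms, the arithmetic works out like this (a sketch assuming the
stock defaults named above):

# nagios.cfg
interval_length=60              # *_interval values below are in minutes
# service definition
normal_check_interval   15      # active check every 15 * 60 = 900 seconds
freshness_threshold     180     # always in seconds: stale after 180 s
# 180 s is well under 900 s, so staleness is always detected long before
# the next regularly scheduled check would run.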
On 26 Mar 2003 at 10:18, Erik Larkin wrote:
>
> FYI, after a little more research, I think I've narrowed it down to an
> issue with the scheduling queue. I tossed a debug option in my
> stale-service script that logs the time it's called. Then I
> cross-referenced those times with the times that nagios logged a failed
> freshness check, and the times that nagios received the response from
> the stale-service script. The time difference between when the script
> is actually called and when nagios logs the script response is maybe a
> few seconds, leading me to believe that the service reaper is ok.
> However, the delay between when nagios says it failed a freshness check
> and is forcing a service check, and when the stale-service script is
> actually called, was over 32000 seconds at last failure. So, I'm now
> focusing on problems with the scheduling queue. Any ideas, anyone?
>
> -----Original Message-----
> From: Erik Larkin
> Sent: Tuesday, March 25, 2003 2:44 PM
> To: 'nagios-users at lists.sourceforge.net'
> Subject: Huge delay in scheduling freshness service check after 3rd try
>
>
>
> All right, I'm finally admitting that I can't figure this one out myself.
> Trust me, it's a difficult admission, and has involved much tinkering,
> hair-pulling, and searching of mailing lists (although searching doesn't
> seem to be working right now for the sourceforge lists?).
>
> Anyway, I've got a nagios architecture with multiple distributed servers
> sending check results to a central nagios server via NSCA. The central
> server doesn't perform any active checks (no network access to the
> distributed network), but is configured to perform a freshness check for
> a service called 'Heartbeat' for each distributed instance. The
> heartbeat is just a ping of the loopback performed every minute,
> although I've since discovered I could have used check_dummy. Seems to
> be a pretty common setup, and for the most part it works very well.
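>
> Functionally, what each distributed server ends up submitting once a
> minute is equivalent to this (paths and hostname here are illustrative,
> not verbatim from my setup):
>
> # fields are host, service, return code, and plugin output, tab-separated
> printf 'sj-qab-db01\tHeartbeat\t0\tOK: heartbeat\n' | \
>     /usr/local/nagios/bin/send_nsca -H central-nagios \
>     -c /usr/local/nagios/etc/send_nsca.cfg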
>
> Except for the freshness checks. They work fine up until the 3rd failed
> freshness check or so, at which point latency skyrockets: from 99 to 280
> to 749, on up to thousands and thousands of seconds of latency. The log
> shows a failed freshness check and a message about forcing a check of
> the service (the check command is the typical echo-and-exit-2), but the
> service alert response is delayed more and more. I've tried everything
> I can think of, and learned a great deal in my searching and tweaking,
> but I can't change this behavior.
>
> Here's what I've tried:
>
> - changed the service_reaper_frequency to 3. I saw a reference to this
> in the list for something else and thought it might help; I still
> suspect some problem with the service reaper.
> - added a 1 second sleep to the script (thought maybe it was returning
> its status too quickly)
> - futzed with the normal_check_interval for the heartbeat service on the
> central server, giving it values between 1 minute and 15 minutes
> - enabled check_for_orphaned_services
> - tossed a debug option in my stale-service script that sent a line of
> output to a log, to make sure that the script itself was being run (it
> was)
> - set is_volatile (just to check)
> - other things I can't think of right now (the relevant nagios.cfg lines
> are sketched after this list)
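>
> For reference, the nagios.cfg lines touched above currently read like
> this (a sketch; interval_length is still the stock default):
>
> # nagios.cfg (excerpt)
> interval_length=60
> service_reaper_frequency=3
> check_for_orphaned_services=1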
>
> And here's the service entry:
>
> define service{
>         use                     qab24x7-service
>         service_description     Heartbeat
>         hostgroup_name          qabdbfohub
>         normal_check_interval   15
>         is_volatile             1
>         max_check_attempts      1
>         check_freshness         1
>         notification_interval   15
>         freshness_threshold     180
>         check_command           stale-service
>         }
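>
> (Active checks are disabled server-wide in my setup; the per-service
> equivalent, if anyone wants to reproduce this without touching
> nagios.cfg, would be something like the following line in the
> definition above:)
>
> active_checks_enabled   0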
>
>
> And here are some relevant log snippets:
>
>
> [1048628465] Warning: The results of service 'Heartbeat' on host 'sj-qab-db01' are stale by 57 seconds (threshold=180 seconds). I'm forcing an immediate check of the service.
> [1048628471] SERVICE ALERT: sj-qab-db01;Heartbeat;CRITICAL;HARD;1;CRITICAL: Heartbeat check is stale!
> [1048628705] Warning: The results of service 'Heartbeat' on host 'sj-qab-db01' are stale by 58 seconds (threshold=180 seconds). I'm forcing an immediate check of the service.
> [1048628711] SERVICE ALERT: sj-qab-db01;Heartbeat;CRITICAL;HARD;1;CRITICAL: Heartbeat check is stale!
> [1048628945] Warning: The results of service 'Heartbeat' on host 'sj-qab-db01' are stale by 57 seconds (threshold=180 seconds). I'm forcing an immediate check of the service.
> [1048628966] SERVICE ALERT: sj-qab-db01;Heartbeat;CRITICAL;HARD;1;CRITICAL: Heartbeat check is stale!
> [1048629185] Warning: The results of service 'Heartbeat' on host 'sj-qab-db01' are stale by 42 seconds (threshold=180 seconds). I'm forcing an immediate check of the service.
> [1048629287] SERVICE ALERT: sj-qab-db01;Heartbeat;CRITICAL;HARD;1;CRITICAL: Heartbeat check is stale!
> [1048629485] Warning: The results of service 'Heartbeat' on host 'sj-qab-db01' are stale by 21 seconds (threshold=180 seconds). I'm forcing an immediate check of the service.
> [1048629770] SERVICE ALERT: sj-qab-db01;Heartbeat;CRITICAL;HARD;1;CRITICAL: Heartbeat check is stale!
> [1048629965] Warning: The results of service 'Heartbeat' on host 'sj-qab-db01' are stale by 20 seconds (threshold=180 seconds). I'm forcing an immediate check of the service.
> [1048630715] SERVICE ALERT: sj-qab-db01;Heartbeat;CRITICAL;HARD;1;CRITICAL: Heartbeat check is stale!
> [1048630925] Warning: The results of service 'Heartbeat' on host 'sj-qab-db01' are stale by 31 seconds (threshold=180 seconds). I'm forcing an immediate check of the service.
>
>
>
> Sorry for the long email/spam, but please oh please: does anyone have
> any info regarding this problem?
>
>
> Many thanks,
>
> Erik Larkin
> elarkin at nuasis.com
>
> p.s. Just to go on record, I do think Nagios rocks. Hard. But this itty
> bitty problem is driving me nuts! ;)