freshness_threshold bug - big problem
Rodney Ramos
rodneyra at gmail.com
Fri Dec 17 12:10:32 CET 2010
Dear Jochen,
So, as I understand it, you confirm the problem, since your configuration was:
check_interval 15, retry_interval 2 and max_check_attempts 4.
And from your log we have:
18:39:55 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 12s
(threshold=0d 0h 15m 16s). I'm forcing an immediate check of the host.
18:40:05 HOST ALERT: Unfresh;DOWN;SOFT;1;(null)
18:56:13 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 59s
(threshold=0d 0h 15m 17s). I'm forcing an immediate check of the host.
18:56:23 HOST ALERT: Unfresh;DOWN;SOFT;2;(null)
--> This is wrong. It should be at about 18:42:05, 2 minutes after SOFT 1,
since your retry_interval is 2 minutes.
19:28:13 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 39s
(threshold=0d 0h 15m 18s). I'm forcing an immediate check of the host.
19:28:23 HOST ALERT: Unfresh;DOWN;SOFT;3;CRITICAL: All life functions
terminated
--> This is wrong. It should be at about 18:58:23, 2 minutes after SOFT 2,
since your retry_interval is 2 minutes.
19:44:13 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 39s
(threshold=0d 0h 15m 18s). I'm forcing an immediate check of the host.
19:44:23 HOST ALERT: Unfresh;DOWN;HARD;4;CRITICAL: All life functions
terminated
--> This is wrong. It should be at about 19:30:23, 2 minutes after SOFT 3,
since your retry_interval is 2 minutes.
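To make the expected timing concrete, here is a minimal sketch in POSIX shell. It is not taken from Nagios itself; the function names (to_secs, to_hms, expected_retry, observed_gap) are made up for illustration, and it assumes all timestamps fall on the same day (no wrap over midnight).

```shell
#!/bin/sh
# Convert HH:MM:SS to seconds since midnight.
# ${x#0} strips one leading zero so "08" is not read as an octal number.
to_secs() {
    h=${1%%:*}; rest=${1#*:}; m=${rest%%:*}; s=${rest#*:}
    echo $(( ${h#0} * 3600 + ${m#0} * 60 + ${s#0} ))
}

# Convert seconds since midnight back to HH:MM:SS.
to_hms() {
    printf '%02d:%02d:%02d\n' $(( $1 / 3600 )) $(( $1 % 3600 / 60 )) $(( $1 % 60 ))
}

# $1 = time of the previous check, $2 = retry_interval in minutes.
expected_retry() {
    to_hms $(( $(to_secs "$1") + $2 * 60 ))
}

# Seconds elapsed between two timestamps ($1 earlier, $2 later).
observed_gap() {
    echo $(( $(to_secs "$2") - $(to_secs "$1") ))
}

# The three SOFT alerts from the log above, with retry_interval 2.
for t in 18:40:05 18:56:23 19:28:23; do
    echo "check at $t -> next check expected at about $(expected_retry "$t" 2)"
done
```

For comparison, observed_gap 18:40:05 18:56:23 gives 978 seconds (16m 18s), which is close to the freshness threshold and nowhere near the 120 seconds that retry_interval 2 would suggest.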
I'd like to know whether the Nagios Core developers are already aware of this
problem, and whether they intend to fix it in the next release or provide a
patch.
Thanks,
Rodney
On Thu, Dec 16, 2010 at 6:59 PM, Jochen Bern <Jochen.Bern at linworks.de> wrote:
> On 12/16/2010 12:03 PM, Rodney Ramos wrote:
> > As I've said before, I think that it is a Nagios Core bug. I've tested it
> > with Nagios 3.2.1 and I found the same problem.
> > I think it's a serious problem.
>
>
> Oh, wow. 8-O I can confirm the effect on my 3.2.3, but there seems to be
> *more* of a problem with host freshness checks. Test run with
> check_interval 15, retry_interval 2, max_check_attempts 4; log excerpt:
>
>
> 18:23:55 Warning: Host 'Unfresh' has no services associated with it!
> 18:24:28 EXTERNAL COMMAND: PROCESS_HOST_CHECK_RESULT;Unfresh;0;Manual
> Init to UP|
> 18:24:35 PASSIVE HOST CHECK: Unfresh;0;Manual Init to UP
>
> 18:39:55 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 12s
> (threshold=0d 0h 15m 16s). I'm forcing an immediate check of the host.
> 18:40:05 HOST ALERT: Unfresh;DOWN;SOFT;1;(null)
>
> 18:51:12 Warning: Host 'Unfresh' has no services associated with it!
>
> 18:56:13 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 59s
> (threshold=0d 0h 15m 17s). I'm forcing an immediate check of the host.
> 18:56:23 HOST ALERT: Unfresh;DOWN;SOFT;2;(null)
> 19:00:12 Warning: Host 'Unfresh' has no services associated with it!
> 19:12:13 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 45s
> (threshold=0d 0h 15m 15s). I'm forcing an immediate check of the host.
> 19:12:23 HOST ALERT: Unfresh;DOWN;SOFT;2;CRITICAL: All life functions
> terminated
> 19:28:13 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 39s
> (threshold=0d 0h 15m 18s). I'm forcing an immediate check of the host.
> 19:28:23 HOST ALERT: Unfresh;DOWN;SOFT;3;CRITICAL: All life functions
> terminated
> 19:44:13 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 39s
> (threshold=0d 0h 15m 18s). I'm forcing an immediate check of the host.
> 19:44:23 HOST ALERT: Unfresh;DOWN;HARD;4;CRITICAL: All life functions
> terminated
> 20:00:13 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 39s
> (threshold=0d 0h 15m 18s). I'm forcing an immediate check of the host.
> 20:16:13 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 41s
> (threshold=0d 0h 15m 17s). I'm forcing an immediate check of the host.
> 20:32:13 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 39s
> (threshold=0d 0h 15m 18s). I'm forcing an immediate check of the host.
> 20:48:13 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 45s
> (threshold=0d 0h 15m 15s). I'm forcing an immediate check of the host.
> 21:04:13 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 45s
> (threshold=0d 0h 15m 15s). I'm forcing an immediate check of the host.
>
>
> (The additional "no services" crud stems from my not getting the check
> command right the first time 'round, and having to re-reload the config.)
>
>
> I took excerpts of status.dat and retention.dat initially and after the
> first nine active checks, look at these current_attempt numbers:
>
>
> # for FIL in *.dat* ; do echo -n "${FIL}: " | \
> > sed -e 's/_[a-z]*-/-/' -e 's/\.[a-z]*: */:/' ; \
> > egrep '(current_attempt|state_type|(current|last_hard)_state=)' \
> > $FIL | sed -e 's/\([a-z][a-z][a-z]\)[a-z]*\([_=]\)/\1\2/g' | \
> > tr '\n\t' ' ' ; echo "" ; done
> retention.dat-OK: cur_sta=0 las_har_sta=0 cur_att=1 sta_typ=1
> retention.dat-1: cur_sta=0 las_har_sta=0 cur_att=1 sta_typ=1
> retention.dat-2: cur_sta=1 las_har_sta=0 cur_att=1 sta_typ=0
> retention.dat-3: cur_sta=1 las_har_sta=0 cur_att=2 sta_typ=0
> retention.dat-4: cur_sta=1 las_har_sta=0 cur_att=2 sta_typ=0
> retention.dat-5: cur_sta=1 las_har_sta=0 cur_att=2 sta_typ=0
> retention.dat-6: cur_sta=1 las_har_sta=0 cur_att=4 sta_typ=1
> retention.dat-7: cur_sta=1 las_har_sta=0 cur_att=4 sta_typ=1
> retention.dat-8: cur_sta=1 las_har_sta=0 cur_att=4 sta_typ=1
> retention.dat-9: cur_sta=1 las_har_sta=0 cur_att=4 sta_typ=1
> status.dat-OK: cur_sta=0 las_har_sta=0 cur_att=1 sta_typ=1
> status.dat-1: cur_sta=1 las_har_sta=0 cur_att=1 sta_typ=0
> status.dat-2: cur_sta=1 las_har_sta=0 cur_att=2 sta_typ=0
> status.dat-3: cur_sta=1 las_har_sta=0 cur_att=2 sta_typ=0
> status.dat-4: cur_sta=1 las_har_sta=0 cur_att=3 sta_typ=0
> status.dat-5: cur_sta=1 las_har_sta=0 cur_att=4 sta_typ=1
> status.dat-6: cur_sta=1 las_har_sta=1 cur_att=1 sta_typ=1
> status.dat-7: cur_sta=1 las_har_sta=1 cur_att=1 sta_typ=1
> status.dat-8: cur_sta=1 las_har_sta=1 cur_att=1 sta_typ=1
> status.dat-9: cur_sta=1 las_har_sta=1 cur_att=1 sta_typ=1
>
>
> extinfo.cgi told me "1/4 (SOFT state)" at 19:03 (after the *2nd* active
> check, i.e., matching the data in retention.dat) but tells me "1/4 (HARD
> state)" right now (matching status.dat instead) ...
>
>
> Kind regards,
> J. Bern
> --
> Jochen Bern, Systemingenieur --- LINworks GmbH <http://www.LINworks.de/>
> Postfach 100121, 64201 Darmstadt | Robert-Koch-Str. 9, 64331 Weiterstadt
> PGP (1024D/4096g) FP = D18B 41B1 16C0 11BA 7F8C DCF7 E1D5 FAF4 444E 1C27
> Tel. +49 6151 9067-231, Zentr. -0, Fax -299 - Amtsg. Darmstadt HRB 85202
> Unternehmenssitz Weiterstadt, Geschäftsführer Metin Dogan, Oliver Michel
>
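For anyone who wants to repeat the current_attempt comparison from the quoted message on their own snapshots, here is a more readable sketch of the same extraction. The function name summarize_state is made up for illustration, and it assumes each snapshot file holds the block of a single host with key=value lines, as in the status.dat and retention.dat excerpts above.

```shell
#!/bin/sh
# Print the four state fields of a status.dat/retention.dat excerpt
# on one line, in the order in which they appear in the file.
# $1 = path to a snapshot file (assumed to cover one host only).
summarize_state() {
    awk -F= '
        # Drop the leading indentation from the key.
        { sub(/^[ \t]+/, "", $1) }
        $1 == "current_state" || $1 == "last_hard_state" ||
        $1 == "current_attempt" || $1 == "state_type" {
            printf "%s%s=%s", sep, $1, $2
            sep = " "
        }
        END { print "" }
    ' "$1"
}

# Example use on a set of snapshots (file names are placeholders):
#   for f in status.dat-1 status.dat-2; do
#       printf '%s: ' "$f"; summarize_state "$f"
#   done
```

Unlike the one-liner in the quoted message, this keeps the full field names instead of abbreviating them, which makes the output longer but easier to read.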
_______________________________________________
Nagios-devel mailing list
Nagios-devel at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-devel