Could someone explain the correct behaviour for freshness checking, please?
Artur D'Assumpção
artur.dassumpcao at di.com.pt
Sun Apr 10 17:38:42 CEST 2005
Could someone explain the correct behavior for freshness
checking, please? It's driving me crazy.
The main configuration has:
service_freshness_check_interval=60
So I suppose this defines the rate at which the service
freshness checks are run.
Then, for every service I use the same template, where I have the
following configurations:
check_command service-is-stale
check_freshness 1
freshness_threshold 300
parallelize_check 1
max_check_attempts 2
normal_check_interval 2
retry_check_interval 2
So the logical behavior, to me, is that every time Nagios triggers a
freshness check (every 60s in this case), if the last submitted check
result for a given service is more than 300s old, it will declare that
service stale and run service-is-stale. Now, I'm pretty sure that
results are being fed in roughly every 120s, and yet I'm seeing a lot of
status changes from OK to UNKNOWN (returned by service-is-stale)! Here
are some interesting logs:
Apr 10 16:26:17 sr-0 nsca[13018]: SERVICE CHECK -> Host Name:
'compal.pt_sfci-dr-1', Service Description: '[SYS] System Load', Return
Code: '1', Output: 'WARNING - load average: 1.00, 1.00, 1.00'
Apr 10 16:26:47 sr-0 nsca[30732]: SERVICE CHECK -> Host Name:
'compal.pt_sfci-dr-1', Service Description: '[SYS] Disk Usage', Return
Code: '0', Output: 'DISK OK - free space: / 3692 MB (64%):'
Apr 10 16:27:07 sr-0 nsca[23332]: SERVICE CHECK -> Host Name:
'compal.pt_sfci-dr-1', Service Description: '[SYS] Swap Usage', Return
Code: '0', Output: 'SWAP OK: 100% free (494 MB out of 494 MB)'
Apr 10 16:27:10 sr-0 nagios: EXTERNAL COMMAND:
PROCESS_SERVICE_CHECK_RESULT;compal.pt_sfci-dr-1;[SYS] System
Load;1;WARNING - load average: 1.00, 1.00, 1.00
Apr 10 16:27:10 sr-0 nagios: EXTERNAL COMMAND:
PROCESS_SERVICE_CHECK_RESULT;compal.pt_sfci-dr-1;[SYS] Disk Usage;0;DISK
OK - free space: / 3692 MB (64%):
Apr 10 16:27:10 sr-0 nagios: EXTERNAL COMMAND:
PROCESS_SERVICE_CHECK_RESULT;compal.pt_sfci-dr-1;[SYS] Swap Usage;0;SWAP
OK: 100% free (494 MB out of 494 MB)
Apr 10 16:27:37 sr-0 nsca[29813]: SERVICE CHECK -> Host Name:
'compal.pt_sfci-dr-1', Service Description: '[SRV] SSH', Return Code:
'0', Output: 'SSH OK - OpenSSH_3.9p1 (protocol 2.0)'
Apr 10 16:28:08 sr-0 nsca[17504]: SERVICE CHECK -> Host Name:
'compal.pt_sfci-dr-1', Service Description: '[SYS] Interfaces', Return
Code: '0', Output: 'OK - interfaces lo eth0 tun0 are up'
Apr 10 16:28:10 sr-0 nagios: EXTERNAL COMMAND:
PROCESS_SERVICE_CHECK_RESULT;compal.pt_sfci-dr-1;[SRV] SSH;0;SSH OK -
OpenSSH_3.9p1 (protocol 2.0)
Apr 10 16:28:10 sr-0 nagios: EXTERNAL COMMAND:
PROCESS_SERVICE_CHECK_RESULT;compal.pt_sfci-dr-1;[SYS] Interfaces;0;OK -
interfaces lo eth0 tun0 are up
Apr 10 16:28:17 sr-0 nsca[13184]: SERVICE CHECK -> Host Name:
'compal.pt_sfci-dr-1', Service Description: '[SYS] System Load', Return
Code: '1', Output: 'WARNING - load average: 1.00, 1.00, 1.00'
---- SERVICES WERE OK UP TO THIS POINT ----
---- SERVICES CHANGED TO UNKNOWN AFTER THE NEXT BLOCK ----
Apr 10 16:28:17 sr-0 nagios: Warning: The results of service '[SYS] Disk
Usage' on host 'compal.pt_sfci-dr-1' are stale by 40 seconds
(threshold=500 seconds). I'm forcing an immediate check of the service.
Apr 10 16:28:17 sr-0 nagios: Warning: The results of service '[SYS] Swap
Usage' on host 'compal.pt_sfci-dr-1' are stale by 40 seconds
(threshold=500 seconds). I'm forcing an immediate check of the service.
Apr 10 16:28:47 sr-0 nsca[10633]: SERVICE CHECK -> Host Name:
'compal.pt_sfci-dr-1', Service Description: '[SYS] Disk Usage', Return
Code: '0', Output: 'DISK OK - free space: / 3692 MB (64%):'
Apr 10 16:29:07 sr-0 nsca[6978]: SERVICE CHECK -> Host Name:
'compal.pt_sfci-dr-1', Service Description: '[SYS] Swap Usage', Return
Code: '0', Output: 'SWAP OK: 100% free (494 MB out of 494 MB)'
Apr 10 16:29:10 sr-0 nagios: EXTERNAL COMMAND:
PROCESS_SERVICE_CHECK_RESULT;compal.pt_sfci-dr-1;[SYS] System
Load;1;WARNING - load average: 1.00, 1.00, 1.00
Apr 10 16:29:10 sr-0 nagios: EXTERNAL COMMAND:
PROCESS_SERVICE_CHECK_RESULT;compal.pt_sfci-dr-1;[SYS] Disk Usage;0;DISK
OK - free space: / 3692 MB (64%):
Apr 10 16:29:10 sr-0 nagios: EXTERNAL COMMAND:
PROCESS_SERVICE_CHECK_RESULT;compal.pt_sfci-dr-1;[SYS] Swap Usage;0;SWAP
OK: 100% free (494 MB out of 494 MB)
The second block of logs, and correct me if I'm wrong, shows that
something is not right here. First of all, services are being considered
stale within 2 minutes of a submitted check result:
Apr 10 16:27:07 sr-0 nsca[23332]: SERVICE CHECK -> Host Name:
'compal.pt_sfci-dr-1', Service Description: '[SYS] Swap Usage', Return
Code: '0', Output: 'SWAP OK: 100% free (494 MB out of 494 MB)'
Apr 10 16:28:17 sr-0 nagios: Warning: The results of service '[SYS] Swap
Usage' on host 'compal.pt_sfci-dr-1' are stale by 40 seconds
(threshold=500 seconds). I'm forcing an immediate check of the service.
Then I'm looking at a 500s threshold and a 40s staleness that I've never
defined, and I'm sure of this, because all my objects (and there are
very few in this testing environment) use the same template that I
showed above. Could this be some default value that is not being
overridden? If it is, I can't find any reference to it in the documentation.
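To make the arithmetic concrete, here is a minimal Python sketch of the rule as I understand it, applied to the two '[SYS] Swap Usage' timestamps quoted above. The function name staleness and the formula (age of the last result minus the threshold) are my own assumptions for illustration. They are not taken from the Nagios source, so the real calculation may differ.

```python
from datetime import datetime, timedelta


def staleness(last_result, now, threshold):
    """Seconds by which a result is stale; zero or negative means still fresh.

    Assumed rule (not confirmed against Nagios): a result expires
    `threshold` seconds after it was received.
    """
    age = (now - last_result).total_seconds()
    return age - threshold


fmt = "%H:%M:%S"
last_result = datetime.strptime("16:27:07", fmt)  # nsca receives Swap Usage
check_time = datetime.strptime("16:28:17", fmt)   # Nagios logs the stale warning

# What I expect with the template value freshness_threshold=300:
# the result is 70 s old, so it should stay fresh for another 230 s.
expected = staleness(last_result, check_time, 300)
print(expected)

# What the warning reports: threshold=500 and "stale by 40 seconds".
# Under the same assumed rule, the last result would have to be 540 s old.
implied_age = 500 + 40
implied_last = check_time - timedelta(seconds=implied_age)
print(implied_last.strftime(fmt))
```

If that rule is right, the warning only makes sense if Nagios is measuring from a time around 16:19:17 rather than from the 16:27:07 submission, and is using a threshold other than the 300s in my template.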
I'm using Nagios 2.0b.
I'd be very thankful for some help on this subject, please.
AD