Could someone explain the correct behaviour for freshness checking, please?

Andreas Ericsson ae at op5.se
Mon Apr 11 15:13:33 CEST 2005
Artur D'Assumpção wrote:
> Hi,
> 
> Could anyone explain one thing related to this thread? I've been 
> trying a few different options over the weekend, and I've tried this:
> 
> both the central monitoring server and the distributed server have
> 
> service_freshness_check_interval=60
> 
> and all services defined in the main configuration have a threshold of 
> 999999 (the idea is "near" infinite). What I was expecting was that
> each time the freshness check was triggered on the central server (every 
> 60s), the threshold would always validate the last service status 
> received, because it never expires. What I am seeing is a little 
> different: each time the 60s check is triggered, an UNKNOWN state is set 
> (the check_command does this because the result is considered stale), and 
> later the services return to their real status when the distributed 
> server submits results at its 2-minute rate. This behaviour repeats at 
> every freshness check.
> 
> Maybe I am interpreting the freshness behaviour wrong, or maybe I'm 
> configuring it wrong... can anyone give me a tip here, please?
> 

RTFM. You should have done so from the beginning and saved yourself a 
lot of anguish.

> I've already upgraded to the latest version, 2.0b3.
> 
> Thanks,
> 
> AD
> 
> 
> Artur D'Assumpção wrote:
> 
>> Could someone explain the correct behavior for freshness 
>> checking, please? It's driving me crazy.
>>
>> The main configuration has:
>>
>> service_freshness_check_interval=60
>>
>> So I suppose that this defines the check rate for the service 
>> freshness check.
>>
>> Then, for every service I use the same template, where I have the 
>> following configurations:
>>
>>
>>    check_command           service-is-stale
>>    check_freshness         1
>>    freshness_threshold     300
>>    parallelize_check       1
>>    max_check_attempts      2
>>    normal_check_interval   2
>>    retry_check_interval    2
>>
>> So the logical behavior, for me, is that every time Nagios triggers 
>> a freshness check (every 60s in this case), if the last submitted check 
>> result for a given service is more than 300s old, it will declare 
>> that service stale and run service-is-stale. Now, I'm pretty sure 
>> that results are being fed in roughly every 120s, and I'm seeing a lot 
>> of status changes from OK to UNKNOWN (returned by 
>> service-is-stale)! Here are some interesting logs:
>>
>> Apr 10 16:26:17 sr-0 nsca[13018]: SERVICE CHECK -> Host Name: 
>> 'compal.pt_sfci-dr-1', Service Description: '[SYS] System Load', 
>> Return Code: '1', Output: 'WARNING - load average: 1.00, 1.00, 1.00'
>> Apr 10 16:26:47 sr-0 nsca[30732]: SERVICE CHECK -> Host Name: 
>> 'compal.pt_sfci-dr-1', Service Description: '[SYS] Disk Usage', Return 
>> Code: '0', Output: 'DISK OK - free space: / 3692 MB (64%):'
>> Apr 10 16:27:07 sr-0 nsca[23332]: SERVICE CHECK -> Host Name: 
>> 'compal.pt_sfci-dr-1', Service Description: '[SYS] Swap Usage', Return 
>> Code: '0', Output: 'SWAP OK: 100% free (494 MB out of 494 MB)'
>> Apr 10 16:27:10 sr-0 nagios: EXTERNAL COMMAND: 
>> PROCESS_SERVICE_CHECK_RESULT;compal.pt_sfci-dr-1;[SYS] System 
>> Load;1;WARNING - load average: 1.00, 1.00, 1.00
>> Apr 10 16:27:10 sr-0 nagios: EXTERNAL COMMAND: 
>> PROCESS_SERVICE_CHECK_RESULT;compal.pt_sfci-dr-1;[SYS] Disk 
>> Usage;0;DISK OK - free space: / 3692 MB (64%):
>> Apr 10 16:27:10 sr-0 nagios: EXTERNAL COMMAND: 
>> PROCESS_SERVICE_CHECK_RESULT;compal.pt_sfci-dr-1;[SYS] Swap 
>> Usage;0;SWAP OK: 100% free (494 MB out of 494 MB)
>> Apr 10 16:27:37 sr-0 nsca[29813]: SERVICE CHECK -> Host Name: 
>> 'compal.pt_sfci-dr-1', Service Description: '[SRV] SSH', Return Code: 
>> '0', Output: 'SSH OK - OpenSSH_3.9p1 (protocol 2.0)'
>> Apr 10 16:28:08 sr-0 nsca[17504]: SERVICE CHECK -> Host Name: 
>> 'compal.pt_sfci-dr-1', Service Description: '[SYS] Interfaces', Return 
>> Code: '0', Output: 'OK - interfaces lo eth0 tun0 are up'
>> Apr 10 16:28:10 sr-0 nagios: EXTERNAL COMMAND: 
>> PROCESS_SERVICE_CHECK_RESULT;compal.pt_sfci-dr-1;[SRV] SSH;0;SSH OK - 
>> OpenSSH_3.9p1 (protocol 2.0)
>> Apr 10 16:28:10 sr-0 nagios: EXTERNAL COMMAND: 
>> PROCESS_SERVICE_CHECK_RESULT;compal.pt_sfci-dr-1;[SYS] Interfaces;0;OK 
>> - interfaces lo eth0 tun0 are up
>> Apr 10 16:28:17 sr-0 nsca[13184]: SERVICE CHECK -> Host Name: 
>> 'compal.pt_sfci-dr-1', Service Description: '[SYS] System Load', 
>> Return Code: '1', Output: 'WARNING - load average: 1.00, 1.00, 1.00'
>>
>> ---- SERVICES WERE OK UP TO THIS POINT ----
>> ---- SERVICES CHANGED TO UNKNOWN AFTER THIS NEXT BLOCK ----
>>
>> Apr 10 16:28:17 sr-0 nagios: Warning: The results of service '[SYS] 
>> Disk Usage' on host 'compal.pt_sfci-dr-1' are stale by 40 seconds 
>> (threshold=500 seconds).  I'm forcing an immediate check of the service.
>> Apr 10 16:28:17 sr-0 nagios: Warning: The results of service '[SYS] 
>> Swap Usage' on host 'compal.pt_sfci-dr-1' are stale by 40 seconds 
>> (threshold=500 seconds).  I'm forcing an immediate check of the service.
>> Apr 10 16:28:47 sr-0 nsca[10633]: SERVICE CHECK -> Host Name: 
>> 'compal.pt_sfci-dr-1', Service Description: '[SYS] Disk Usage', Return 
>> Code: '0', Output: 'DISK OK - free space: / 3692 MB (64%):'
>> Apr 10 16:29:07 sr-0 nsca[6978]: SERVICE CHECK -> Host Name: 
>> 'compal.pt_sfci-dr-1', Service Description: '[SYS] Swap Usage', Return 
>> Code: '0', Output: 'SWAP OK: 100% free (494 MB out of 494 MB)'
>> Apr 10 16:29:10 sr-0 nagios: EXTERNAL COMMAND: 
>> PROCESS_SERVICE_CHECK_RESULT;compal.pt_sfci-dr-1;[SYS] System 
>> Load;1;WARNING - load average: 1.00, 1.00, 1.00
>> Apr 10 16:29:10 sr-0 nagios: EXTERNAL COMMAND: 
>> PROCESS_SERVICE_CHECK_RESULT;compal.pt_sfci-dr-1;[SYS] Disk 
>> Usage;0;DISK OK - free space: / 3692 MB (64%):
>> Apr 10 16:29:10 sr-0 nagios: EXTERNAL COMMAND: 
>> PROCESS_SERVICE_CHECK_RESULT;compal.pt_sfci-dr-1;[SYS] Swap 
>> Usage;0;SWAP OK: 100% free (494 MB out of 494 MB)
>>
>> The second block of logs, and correct me if I'm wrong, shows me that 
>> something is not OK here. First of all, services are being considered 
>> stale less than 2 minutes after a submitted check:
>>
>> Apr 10 16:27:07 sr-0 nsca[23332]: SERVICE CHECK -> Host Name: 
>> 'compal.pt_sfci-dr-1', Service Description: '[SYS] Swap Usage', Return 
>> Code: '0', Output: 'SWAP OK: 100% free (494 MB out of 494 MB)'
>>
>> Apr 10 16:28:17 sr-0 nagios: Warning: The results of service '[SYS] 
>> Swap Usage' on host 'compal.pt_sfci-dr-1' are stale by 40 seconds 
>> (threshold=500 seconds).  I'm forcing an immediate check of the service.
>>
>> Then I'm looking at a 500s threshold and a 40s staleness that I never 
>> defined, and I'm sure of this, because all my objects (and there are 
>> very few in this testing environment) use the same template that 
>> I showed above. Could this be a default value that is not being 
>> overridden? If it is, I can't find any reference to it in the 
>> documentation.
>>
>> I'm using Nagios 2.0b.
>>
>> I'd be very thankful for some help on this subject, please.
>>
>> AD
>>
>>
>>
>>
>>
>> -------------------------------------------------------
>> SF email is sponsored by - The IT Product Guide
>> Read honest & candid reviews on hundreds of IT Products from real users.
>> Discover which products truly live up to the hype. Start reading now.
>> http://ads.osdn.com/?ad_id=6595&alloc_id=14396&op=click
>> _______________________________________________
>> Nagios-users mailing list
>> Nagios-users at lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/nagios-users
>> ::: Please include Nagios version, plugin version (-v) and OS when 
>> reporting any issue. ::: Messages without supporting info will risk 
>> being sent to /dev/null
> 
> 
> 
> 
> 
> 

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Lead Developer
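
Editorial sketch (not part of the original message): the rule the
quoted poster expects can be written down and checked against the two
timestamps cited in the quoted logs (NSCA result at 16:27:07, staleness
warning at 16:28:17). The function names age_seconds and is_stale are
hypothetical, and the rule "stale only when the last result is older
than the threshold" is the poster's mental model, not a statement of
how Nagios actually computes freshness.

```python
from datetime import datetime


def age_seconds(last_result: str, now: str) -> int:
    """Seconds between two same-day HH:MM:SS timestamps taken from the logs."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(now, fmt) - datetime.strptime(last_result, fmt)
    return int(delta.total_seconds())


def is_stale(age: int, threshold: int) -> bool:
    """Poster's assumed rule: stale only if the result is older than the threshold."""
    return age > threshold


# Timestamps quoted for '[SYS] Swap Usage' in the message above.
age = age_seconds("16:27:07", "16:28:17")
print(age)  # 70

# Neither the configured 300s nor the logged 500s threshold is exceeded.
for threshold in (300, 500):
    print(threshold, is_stale(age, threshold))
```

Under this model a 70-second-old result is not stale for either
threshold, which is why the "stale by 40 seconds (threshold=500
seconds)" warning looks inconsistent to the poster.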

