Could someone explain the correct behaviour for freshness checking, please?
Andreas Ericsson
ae at op5.se
Mon Apr 11 15:13:33 CEST 2005
Artur D'Assumpção wrote:
> Hi,
>
> Could anyone explain one thing related to this thread? I've been
> trying a few different options through the weekend, and I've tried this:
>
> Both the central monitoring server and the distributed server have
>
> service_freshness_check_interval=60
>
> and all services defined in the main configuration have a threshold of
> 999999 (the idea being "near" infinite). What I was expecting was that
> each time the freshness check was triggered on the central server (every
> 60s), the threshold would always validate the last service status
> received, because it never expires. What I am seeing is a little
> different: each time the 60s check is triggered, an UNKNOWN state is set
> (the check_command does this because the result is considered stale),
> and later the services return to their real status when the distributed
> server submits its results at the 2-minute rate. This behaviour repeats
> at every freshness check.
>
> Maybe I am interpreting the freshness behaviour wrong, or maybe I'm
> configuring it wrong... Can anyone give me a tip here, please?
>
RTFM. You should have done so from the beginning and saved yourself a
lot of anguish.
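
For reference, the expectation described above can be written down as a
small simulation. This is only a sketch of the rule as the question states
it (a result is stale once it is older than the threshold), not Nagios's
actual scheduler. The function name stale_checks and its parameters are
made up for illustration.

```python
def stale_checks(result_interval, freshness_interval, threshold, duration):
    """Return the times (in seconds) at which a freshness check would
    consider the service stale, under the simple rule from the question:
    stale when (now - time of last passive result) > threshold.

    Hypothetical model: passive results arrive every `result_interval`
    seconds starting at t=0, and freshness checks run every
    `freshness_interval` seconds starting at t=freshness_interval.
    """
    stale_at = []
    for now in range(freshness_interval, duration + 1, freshness_interval):
        # Most recent passive result received at or before `now`.
        last_result = (now // result_interval) * result_interval
        if now - last_result > threshold:
            stale_at.append(now)
    return stale_at


# Results every 120 s, freshness check every 60 s, threshold 999999 s.
print(stale_checks(120, 60, 999999, 3600))  # -> []

# Same rates with a 300 s threshold.
print(stale_checks(120, 60, 300, 3600))     # -> []
```

Under this simplified model neither threshold is ever exceeded, since a
result is at most 60 s old when a freshness check runs. If Nagios reports
stale results anyway, the difference has to come from how it actually
computes the threshold, which the documentation covers.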
> I've already upgraded to the latest version, 2.0b3.
>
> Thanks,
>
> AD
>
>
> Artur D'Assumpção wrote:
>
>> Could someone explain the correct behavior for freshness
>> checking, please? It's driving me crazy.
>>
>> The main configuration has:
>>
>> service_freshness_check_interval=60
>>
>> So I suppose that this defines the check rate for the service
>> freshness checks.
>>
>> Then, for every service I use the same template, where I have the
>> following configurations:
>>
>>
>> check_command service-is-stale
>> check_freshness 1
>> freshness_threshold 300
>> parallelize_check 1
>> max_check_attempts 2
>> normal_check_interval 2
>> retry_check_interval 2
>>
>> So the logical behavior, to me, is that every time nagios triggers
>> a freshness check (every 60s in this case), if the last submitted check
>> sample for a given service is more than 300s old, it will declare
>> that service stale and run service-is-stale. Now, I'm pretty sure
>> that samples are being fed at a rate of about 120s, and I'm seeing a
>> lot of status changes from OK to UNKNOWN (returned by
>> service-is-stale)! Here are some interesting logs:
>>
>> Apr 10 16:26:17 sr-0 nsca[13018]: SERVICE CHECK -> Host Name:
>> 'compal.pt_sfci-dr-1', Service Description: '[SYS] System Load',
>> Return Code: '1', Output: 'WARNING - load average: 1.00, 1.00, 1.00'
>> Apr 10 16:26:47 sr-0 nsca[30732]: SERVICE CHECK -> Host Name:
>> 'compal.pt_sfci-dr-1', Service Description: '[SYS] Disk Usage', Return
>> Code: '0', Output: 'DISK OK - free space: / 3692 MB (64%):'
>> Apr 10 16:27:07 sr-0 nsca[23332]: SERVICE CHECK -> Host Name:
>> 'compal.pt_sfci-dr-1', Service Description: '[SYS] Swap Usage', Return
>> Code: '0', Output: 'SWAP OK: 100% free (494 MB out of 494 MB)'
>> Apr 10 16:27:10 sr-0 nagios: EXTERNAL COMMAND:
>> PROCESS_SERVICE_CHECK_RESULT;compal.pt_sfci-dr-1;[SYS] System
>> Load;1;WARNING - load average: 1.00, 1.00, 1.00
>> Apr 10 16:27:10 sr-0 nagios: EXTERNAL COMMAND:
>> PROCESS_SERVICE_CHECK_RESULT;compal.pt_sfci-dr-1;[SYS] Disk
>> Usage;0;DISK OK - free space: / 3692 MB (64%):
>> Apr 10 16:27:10 sr-0 nagios: EXTERNAL COMMAND:
>> PROCESS_SERVICE_CHECK_RESULT;compal.pt_sfci-dr-1;[SYS] Swap
>> Usage;0;SWAP OK: 100% free (494 MB out of 494 MB)
>> Apr 10 16:27:37 sr-0 nsca[29813]: SERVICE CHECK -> Host Name:
>> 'compal.pt_sfci-dr-1', Service Description: '[SRV] SSH', Return Code:
>> '0', Output: 'SSH OK - OpenSSH_3.9p1 (protocol 2.0)'
>> Apr 10 16:28:08 sr-0 nsca[17504]: SERVICE CHECK -> Host Name:
>> 'compal.pt_sfci-dr-1', Service Description: '[SYS] Interfaces', Return
>> Code: '0', Output: 'OK - interfaces lo eth0 tun0 are up'
>> Apr 10 16:28:10 sr-0 nagios: EXTERNAL COMMAND:
>> PROCESS_SERVICE_CHECK_RESULT;compal.pt_sfci-dr-1;[SRV] SSH;0;SSH OK -
>> OpenSSH_3.9p1 (protocol 2.0)
>> Apr 10 16:28:10 sr-0 nagios: EXTERNAL COMMAND:
>> PROCESS_SERVICE_CHECK_RESULT;compal.pt_sfci-dr-1;[SYS] Interfaces;0;OK
>> - interfaces lo eth0 tun0 are up
>> Apr 10 16:28:17 sr-0 nsca[13184]: SERVICE CHECK -> Host Name:
>> 'compal.pt_sfci-dr-1', Service Description: '[SYS] System Load',
>> Return Code: '1', Output: 'WARNING - load average: 1.00, 1.00, 1.00'
>>
>> ---- SERVICES WERE OK WHEN REACHED HERE ----
>> ---- SERVICES CHANGED TO UNKNOWN AFTER THIS NEXT BLOCK ----
>>
>> Apr 10 16:28:17 sr-0 nagios: Warning: The results of service '[SYS]
>> Disk Usage' on host 'compal.pt_sfci-dr-1' are stale by 40 seconds
>> (threshold=500 seconds). I'm forcing an immediate check of the service.
>> Apr 10 16:28:17 sr-0 nagios: Warning: The results of service '[SYS]
>> Swap Usage' on host 'compal.pt_sfci-dr-1' are stale by 40 seconds
>> (threshold=500 seconds). I'm forcing an immediate check of the service.
>> Apr 10 16:28:47 sr-0 nsca[10633]: SERVICE CHECK -> Host Name:
>> 'compal.pt_sfci-dr-1', Service Description: '[SYS] Disk Usage', Return
>> Code: '0', Output: 'DISK OK - free space: / 3692 MB (64%):'
>> Apr 10 16:29:07 sr-0 nsca[6978]: SERVICE CHECK -> Host Name:
>> 'compal.pt_sfci-dr-1', Service Description: '[SYS] Swap Usage', Return
>> Code: '0', Output: 'SWAP OK: 100% free (494 MB out of 494 MB)'
>> Apr 10 16:29:10 sr-0 nagios: EXTERNAL COMMAND:
>> PROCESS_SERVICE_CHECK_RESULT;compal.pt_sfci-dr-1;[SYS] System
>> Load;1;WARNING - load average: 1.00, 1.00, 1.00
>> Apr 10 16:29:10 sr-0 nagios: EXTERNAL COMMAND:
>> PROCESS_SERVICE_CHECK_RESULT;compal.pt_sfci-dr-1;[SYS] Disk
>> Usage;0;DISK OK - free space: / 3692 MB (64%):
>> Apr 10 16:29:10 sr-0 nagios: EXTERNAL COMMAND:
>> PROCESS_SERVICE_CHECK_RESULT;compal.pt_sfci-dr-1;[SYS] Swap
>> Usage;0;SWAP OK: 100% free (494 MB out of 494 MB)
>>
>> The second block of logs, and correct me if I'm wrong, shows me that
>> something is not OK here. First of all, services are being considered
>> stale less than 2 minutes after a submitted check:
>>
>> Apr 10 16:27:07 sr-0 nsca[23332]: SERVICE CHECK -> Host Name:
>> 'compal.pt_sfci-dr-1', Service Description: '[SYS] Swap Usage', Return
>> Code: '0', Output: 'SWAP OK: 100% free (494 MB out of 494 MB)'
>>
>> Apr 10 16:28:17 sr-0 nagios: Warning: The results of service '[SYS]
>> Swap Usage' on host 'compal.pt_sfci-dr-1' are stale by 40 seconds
>> (threshold=500 seconds). I'm forcing an immediate check of the service.
>>
>> Then I'm looking at a 500s threshold and a 40s stale time that I've
>> never defined, and I'm sure of this, because all my objects, and there
>> are very few in this testing environment, use the same template that
>> I've shown above. Could this be some default value that is not being
>> overridden? If it is, I can't find any reference to it in the
>> documentation.
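
The numbers in that warning can be checked against the log timestamps with
a few lines of Python. This is a rough sketch: the regular expression
assumes the wrapped log lines have been rejoined into one line, and the
helper names parse_stale_warning and age_seconds are invented for this
example. It only compares the age of the last nsca result with the
thresholds; it does not reproduce how Nagios arrives at "stale by 40
seconds".

```python
import re
from datetime import datetime

# Matches the Nagios warning quoted above, once the wrapped lines are rejoined.
STALE_RE = re.compile(
    r"are stale by (?P<stale_by>\d+) seconds "
    r"\(threshold=(?P<threshold>\d+) seconds\)"
)


def parse_stale_warning(line):
    """Return (stale_by, threshold) in seconds, or None if there is no match."""
    match = STALE_RE.search(line)
    if match is None:
        return None
    return int(match.group("stale_by")), int(match.group("threshold"))


def age_seconds(last_result, check_time, fmt="%H:%M:%S"):
    """Seconds between two same-day syslog timestamps."""
    delta = datetime.strptime(check_time, fmt) - datetime.strptime(last_result, fmt)
    return int(delta.total_seconds())


warning = (
    "Warning: The results of service '[SYS] Swap Usage' on host "
    "'compal.pt_sfci-dr-1' are stale by 40 seconds (threshold=500 seconds). "
    "I'm forcing an immediate check of the service."
)
stale_by, threshold = parse_stale_warning(warning)

# Swap Usage was received by nsca at 16:27:07; the warning came at 16:28:17.
age = age_seconds("16:27:07", "16:28:17")

print(stale_by, threshold, age)     # 40 500 70
print(age > threshold, age > 300)   # False False
```

So the last result was 70 seconds old, which is below both the logged 500s
threshold and the configured 300s one. Under the simple "older than the
threshold" rule the service would not be stale at that point.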
>>
>> I'm using nagios 2.0b.
>>
>> I'd be very thankful for some help with this, please.
>>
>> AD
>>
>>
>>
>>
>>
>> -------------------------------------------------------
>> SF email is sponsored by - The IT Product Guide
>> Read honest & candid reviews on hundreds of IT Products from real users.
>> Discover which products truly live up to the hype. Start reading now.
>> http://ads.osdn.com/?ad_id=6595&alloc_id=14396&op=click
>> _______________________________________________
>> Nagios-users mailing list
>> Nagios-users at lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/nagios-users
>> ::: Please include Nagios version, plugin version (-v) and OS when
>> reporting any issue. ::: Messages without supporting info will risk
>> being sent to /dev/null
>
--
Andreas Ericsson andreas.ericsson at op5.se
OP5 AB www.op5.se
Lead Developer