passive service checks with 1 second interval
Risto Vaarandi
risto.vaarandi at seb.ee
Fri Aug 10 13:43:25 CEST 2007
hi all,
yesterday I attempted to implement passive checks for a volatile service
with a 1-second interval (i.e., once a second, the status of the service is
written to the Nagios command file), but I am experiencing some problems
with how the service status is displayed and how notifications are sent.
Since I haven't implemented such checks before, I'd like to ask more
experienced users whether Nagios alone is suitable for monitoring externally
submitted checks with such a short interval.
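For readers unfamiliar with the setup, a submitter like the one described might look roughly like the sketch below. It writes one line per second in the documented external command format `[time] PROCESS_SERVICE_CHECK_RESULT;<host_name>;<service_description>;<return_code>;<plugin_output>`. Assumptions: the command file path is hypothetical (the real one is whatever `command_file` points to in nagios.cfg), `get_state` is a hypothetical probe you would supply, and the "up/DOWN at <epoch>" output text only mirrors the convention visible in the logs below.

```python
import time

# Hypothetical path; the real one is set by command_file in nagios.cfg.
COMMAND_FILE = "/usr/local/nagios/var/rw/nagios.cmd"


def format_check_result(timestamp, host, service, return_code, output):
    """Build one PROCESS_SERVICE_CHECK_RESULT external command line."""
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n" % (
        int(timestamp), host, service, return_code, output)


def submit(line, command_file=COMMAND_FILE):
    """Append a single command line to the command file (normally a FIFO)."""
    with open(command_file, "a") as f:
        f.write(line)


def submit_once_per_second(host, service, get_state, iterations,
                           command_file=COMMAND_FILE):
    """Submit a passive result every second for `iterations` seconds.

    get_state is a hypothetical callable taking the current epoch time and
    returning (return_code, plugin_output), e.g. (2, "node03 DOWN at ...").
    """
    for _ in range(iterations):
        now = int(time.time())
        code, output = get_state(now)
        submit(format_check_result(now, host, service, code, output),
               command_file)
        time.sleep(1)
```

For example, `submit_once_per_second("node03", "NodeState", probe, 60)` would feed one minute of results, where `probe` is your own state check. Opening and closing the pipe per line keeps each command atomic from Nagios' point of view, at the cost of a syscall or two per second.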
While the service is up, the Nagios log shows that it reads each status
from its command file without any delay:
[1186719368] EXTERNAL COMMAND:
PROCESS_SERVICE_CHECK_RESULT;node03;NodeState;0;node03 up at 1186719368
[1186719369] EXTERNAL COMMAND:
PROCESS_SERVICE_CHECK_RESULT;node03;NodeState;0;node03 up at 1186719369
[1186719370] EXTERNAL COMMAND:
PROCESS_SERVICE_CHECK_RESULT;node03;NodeState;0;node03 up at 1186719370
[1186719371] EXTERNAL COMMAND:
PROCESS_SERVICE_CHECK_RESULT;node03;NodeState;0;node03 up at 1186719371
[1186719372] EXTERNAL COMMAND:
PROCESS_SERVICE_CHECK_RESULT;node03;NodeState;0;node03 up at 1186719372
Then the service goes to a critical state:
[1186719373] EXTERNAL COMMAND:
PROCESS_SERVICE_CHECK_RESULT;node03;NodeState;2;node03 DOWN at 1186719373
and from this moment on, external check results are read from the command
file at 9-10 second intervals, with a SERVICE ALERT and a notification
at the end of each burst of activity:
[1186719384] EXTERNAL COMMAND:
PROCESS_SERVICE_CHECK_RESULT;node03;NodeState;2;node03 DOWN at 1186719374
[1186719384] EXTERNAL COMMAND:
PROCESS_SERVICE_CHECK_RESULT;node03;NodeState;2;node03 DOWN at 1186719375
[1186719384] EXTERNAL COMMAND:
PROCESS_SERVICE_CHECK_RESULT;node03;NodeState;2;node03 DOWN at 1186719376
[1186719384] EXTERNAL COMMAND:
PROCESS_SERVICE_CHECK_RESULT;node03;NodeState;2;node03 DOWN at 1186719377
[1186719384] SERVICE ALERT:
node03;NodeState;CRITICAL;HARD;1;node03 DOWN at 1186719373
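The delay can be quantified from the log itself, because each plugin output carries its own submission time. The sketch below compares that embedded epoch with the bracketed timestamp Nagios logged when it processed the command. Assumptions: each entry is on a single line as in nagios.log itself (the mail archive wrapped them after "EXTERNAL COMMAND:"), and the trailing "at <epoch>" in the output is just this submitter's convention, not something Nagios adds.

```python
import re

# One nagios.log line for a passive service result whose plugin output
# ends in "at <epoch>" (the submitter's own convention in this thread).
ENTRY = re.compile(
    r"^\[(\d+)\] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;"
    r"([^;]*);([^;]*);(\d+);.* at (\d+)\s*$")


def processing_lags(log_lines):
    """Return (submitted_at, processed_at, lag_seconds) for each match."""
    results = []
    for line in log_lines:
        m = ENTRY.match(line)
        if m:
            processed = int(m.group(1))
            submitted = int(m.group(5))
            results.append((submitted, processed, processed - submitted))
    return results


log = [
    "[1186719384] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;"
    "node03;NodeState;2;node03 DOWN at %d" % t
    for t in range(1186719374, 1186719378)
]
lags = [lag for _, _, lag in processing_lags(log)]
```

Applied to the four DOWN results above, which were all processed at 1186719384, this gives lags of 10, 9, 8 and 7 seconds, while the earlier "up" entries show a lag of 0.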
Then the service comes back up, and after a while I see the
following log entries:
[1186719447] EXTERNAL COMMAND:
PROCESS_SERVICE_CHECK_RESULT;node03;NodeState;node03 up at 1186719447
[1186719447] Warning: The results of service 'NodeState' on host
'node03' are stale by 11 seconds (threshold=60 seconds). I'm forcing an
immediate check of the service.
I have freshness checks enabled, and the service status is reported as
stale, although it had been reported as normal shortly before.
As a result, I am seeing service notifications with wrong timestamps:
the notifications appear at 18-second intervals, although the DOWN
results are submitted at 1-second intervals. In addition, the
service status is reported as stale after it has gone back up.
Is there a way to speed up the processing of CRITICAL service checks?
I'd like to get a notification within the same second.
br,
risto
_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
::: Messages without supporting info will risk being sent to /dev/null