eventhandlers running when a dependent service dependency is not satisfied
Eli Stair
estair at ilm.com
Fri Dec 9 22:57:04 CET 2005
Thanks a million for pointing out SCHEDULE_FORCED_SVC_CHECK; I'm now
rewriting and testing the event handlers to take care of this. If only
there were a macro/variable for the master service... I'm looking for a
lightweight way to determine the <service_description> of the direct
parent of the check that just failed, so I can pass it to that command.
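One lightweight option for the parent lookup is to read the expanded servicedependency objects out of objects.cache, as suggested in the quoted message further down. The following is only a sketch: it assumes the cache holds "define servicedependency { ... }" blocks with one whitespace-separated directive per line, and the function names (parse_dependencies, master_services) are made up for illustration.

```python
import re

# Assumed layout: "define servicedependency { <directive> <value> ... }".
BLOCK_RE = re.compile(r"define\s+servicedependency\s*\{(.*?)\}", re.DOTALL)


def parse_dependencies(cache_text):
    """Return one dict of directive -> value per servicedependency block."""
    deps = []
    for body in BLOCK_RE.findall(cache_text):
        entry = {}
        for line in body.splitlines():
            parts = line.strip().split(None, 1)  # directive name, then value
            if len(parts) == 2:
                entry[parts[0]] = parts[1].strip()
        deps.append(entry)
    return deps


def master_services(cache_text, host, service):
    """Return the (host, service) pairs the given service directly depends on."""
    masters = []
    for dep in parse_dependencies(cache_text):
        if (dep.get("dependent_host_name") == host
                and dep.get("dependent_service_description") == service):
            pair = (dep.get("host_name"), dep.get("service_description"))
            if pair not in masters:  # drop duplicate blocks, keep order
                masters.append(pair)
    return masters
```

An event handler could call master_services() with its own $HOSTNAME$ and $SERVICEDESC$ values and use the result as the target of the forced check.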
WRT the SSH/SNMP dependency issue, I have a feeling that I'm missing
something here altogether, or didn't include enough info in my initial
report, as both you and Hugo mentioned a possible issue with this.
To be clear, I'm doing this only so that if a dependent service IS down
(Ganglia) and SNMP has been shown to be up (after
SCHEDULE_FORCED_SVC_CHECK), I can make sure that SSH is running before
attempting to connect. Enough failure modes kill SSH at the same time as
other services that I want to avoid running a bunch of
high-latency/timeout/CPU-heavy event handlers that are bound to fail.
Thanks for the accurate pointer to that external command,
Cheers,
/eli
Here's the output of view config showing that it is configured the way I
think... just not sure if that is something I don't want to do :)
Dependent Host  Dependent Service                       Master Host    Master Service  Dependency Type  Dependency Failure Options
deathstar1001   SNMP-- Ganglia running                  deathstar1001  SNMP            Notification     Warning, Unknown, Critical, Pending
deathstar1001   SNMP-- Ganglia running                  deathstar1001  SNMP            Check Execution  Warning, Unknown, Critical, Pending
deathstar1001   SNMP-- NTP running                      deathstar1001  SNMP            Notification     Warning, Unknown, Critical, Pending
deathstar1001   SNMP-- NTP running                      deathstar1001  SNMP            Check Execution  Warning, Unknown, Critical, Pending
deathstar1001   SNMP-- cron running                     deathstar1001  SNMP            Notification     Warning, Unknown, Critical, Pending
deathstar1001   SNMP-- cron running                     deathstar1001  SNMP            Check Execution  Warning, Unknown, Critical, Pending
deathstar1001   SNMP-- automounter running 4 instances  deathstar1001  SNMP            Notification     Warning, Unknown, Critical, Pending
deathstar1001   SNMP-- automounter running 4 instances  deathstar1001  SNMP            Check Execution  Warning, Unknown, Critical, Pending
deathstar1001   SNMP-- load -lt 4                       deathstar1001  SNMP            Notification     Warning, Unknown, Critical, Pending
deathstar1001   SNMP-- load -lt 4                       deathstar1001  SNMP            Check Execution  Warning, Unknown, Critical, Pending
deathstar1001   SNMP                                    deathstar1001  SSH             Notification     Warning, Unknown, Critical, Pending
deathstar1001   SNMP                                    deathstar1001  SSH             Check Execution  Warning, Unknown, Critical, Pending
John P. Rouillard wrote:
> Hi Eli:
>
> You didn't say what version of nagios you are running so I'll assume
> 2.0.
>
> In message <439912BC.5020000 at ilm.com>,
> Eli Stair writes:
>
>>The question comes down to this:
>>
>> Should a failed service check for a dependent trigger a check of its
>>parent before continuing?
>
>
> IIRC from the code it does not force a check of the parent service. I
> can see arguments for and against forcing a poll of the parent. Also
> the documentation:
>
> http://nagios.sourceforge.net/docs/2_0/dependencies.html
>
> in the "How Service Dependencies Are Tested" section, says:
>
> Nagios gets the current status of the service that is being depended upon.
>
> not "Nagios repolls the service being depended upon." A footnote
> says:
>
> by default, Nagios will use the most current hard state of the
> service(s) that is/are being depended upon
>
> an option in the config file will allow it to use the current soft
> state instead. I use the soft state of the service being depended upon
> myself.
>
>
>>If this is not the case, or default, is there _ANY_ way to implement this?
>
>
> Sort of. The event handler for the child can send a
> SCHEDULE_FORCED_SVC_CHECK external command for the parent specifying
> the current time in seconds. See
>
> http://www.nagios.org/developerinfo/externalcommands/commandinfo.php?command_id=129
>
> for details. The command will be acted upon immediately since nagios
> reads the external command file after an event handler runs. Use this
> to force an update of the current service status for the parent. Parse
> through the objects.cache (probably in /var/log/nagios/objects.cache)
> file for the expanded servicedependency objects to find the service
> dependencies that match your host/service.
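As a rough sketch of the external command described above, an event handler could build the command line and write it to the Nagios command file. The line format is "[time] SCHEDULE_FORCED_SVC_CHECK;host;service;check_time"; the helper names and the default command-file path below are assumptions and need to match the local command_file setting.

```python
import time


def forced_check_command(host, service, now=None):
    """Build one external-command line forcing a check at the current time."""
    ts = int(time.time() if now is None else now)
    # Format: [entry_time] COMMAND;host_name;service_description;check_time
    return "[%d] SCHEDULE_FORCED_SVC_CHECK;%s;%s;%d\n" % (ts, host, service, ts)


def submit(command_line, command_file="/usr/local/nagios/var/rw/nagios.cmd"):
    """Write the line to the Nagios command file (the path is an assumption)."""
    with open(command_file, "w") as pipe:
        pipe.write(command_line)
```

For the setup in this thread, the handler for a child such as Ganglia would call submit(forced_check_command("deathstar1001", "SNMP")) to refresh the parent's status.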
>
> I set my nagios options so that:
>
> max_check_attempts(dependent)*retry_check_interval(dependent) >
> normal_check_interval(parent)
>
> This way the parent service will be checked at least once during the
> soft error interval of the dependent service.
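The inequality above is easy to check mechanically when tuning intervals. This is a small sketch; the function name and the example numbers are illustrative and not taken from any config in this thread.

```python
def parent_checked_in_soft_window(max_check_attempts, retry_check_interval,
                                  parent_normal_check_interval):
    """True if the dependent's soft-error window outlasts one parent check cycle.

    All intervals must use the same unit (for example, minutes).
    """
    soft_window = max_check_attempts * retry_check_interval
    return soft_window > parent_normal_check_interval


# Hypothetical values: 4 attempts, 2 minutes apart, parent polled every 5 minutes.
print(parent_checked_in_soft_window(4, 2, 5))
# Hypothetical values: 3 attempts, 1 minute apart, parent polled every 5 minutes.
print(parent_checked_in_soft_window(3, 1, 5))
```

In the first case the window is 8 minutes, longer than the parent's 5-minute cycle, so the condition holds. In the second it is 3 minutes and the condition fails.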
>
>
>>I want to avoid at all costs having an every-minute check of the parent
>>processes on many thousand hosts just to keep from having the child
>>process checks and event handlers going haywire.
>
>
> You need to use the max_check_attempts to provide a buffer in which
> the parent service will be checked. You can have your event handler
> submit an external command on the first soft error and try to fix the
> problem on a subsequent soft, or hard error. You don't have any of
> those directives in your sample config.
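The staged approach described above could be sketched as a small decision function inside the event handler. It assumes the handler receives $SERVICESTATE$, $SERVICESTATETYPE$ and $SERVICEATTEMPT$ as arguments; the function name and the returned action labels are made up for illustration.

```python
def handler_action(state, state_type, attempt):
    """Decide what the event handler should do for this state change.

    state      -- value of $SERVICESTATE$ (e.g. "OK", "CRITICAL")
    state_type -- value of $SERVICESTATETYPE$ ("SOFT" or "HARD")
    attempt    -- value of $SERVICEATTEMPT$ as an integer
    """
    if state != "CRITICAL":
        return "nothing"
    if state_type == "SOFT" and attempt == 1:
        # First soft error: only ask Nagios to re-check the parent service.
        return "force_parent_check"
    # Later soft errors and hard errors: the parent status should be fresh
    # by now, so a repair can be attempted if the parent is OK.
    return "attempt_fix"
```

The repair step itself would still need to confirm that the parent is OK before connecting to the host; this sketch only separates the two stages.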
>
>
>>I want a dependency chain like this:
>>
>>   SSH -- SNMP --\
>>                  - Ganglia
>>                  - NTP
>
>
> Just a note, I wouldn't have ssh in the dependency chain unless you
> are accessing snmp over ssh (e.g. running check_snmp via
> check_by_ssh). I can't tell if that is the case or not. Just because
> your event handler runs over ssh doesn't add it to the dependency
> chain IMO. If ssh is down, it means none of the other services will be
> checked and you won't recognize them as down.
>
>
>>I believe I have this set up so that a service check for SNMP is
>>dependent on the SSH service running.
>
>
> Did you verify in the web interface or objects.cache?
>
>
>>In turn, the service checks for
>>other processes that use SNMP are dependent on SNMP running. My intent
>>is that service checks for NTP,etc will not be attempted if its parent
>>SNMP process is not in an OK state (as I have an event handler that will
>>restart SNMP if it is dead). If the parent SNMP _IS_ running, then the
>>child process checks (Ganglia, NTP, etc) will be checked and if dead
>>their own event handler will activate.
>
>
> It looks like the config is ok on that score with one possible
> exception noted below.
>
>
>>The problem is that in this case, if I kill off SNMP the child process
>>checks STILL execute and return a CRITICAL. As a result, nagios fires
>>off the event handler for all these checks which results in an SSH out
>>to the nodes in question and restarting a bunch of services that are
>>probably still running. It SHOULD NOT schedule the child checks and
>>thus not run their event handlers until AFTER a new parent check has
>>executed and returned successfully, correct?
>
>
> Nope, nagios doesn't re-run the parent or parents. If you are in a
> soft failure mode, you can write your event handler to wait until you
> are in a hard failure mode.
>
>
>>I've included a dependency example below, and a snip from the nagios log
>>showing it sequentially hammering out checks of all the child processes
>>at the same time it already knows the parent is dead.
>>[...]
>>###################################################
>>### snip of this host/group definition include:
>>define host{
>>    use                             linux-node-production
>>    host_name                       HOSTNAME1
>>    address                         IP
>>}
>>
>>define servicedependency{
>>    host_name                       HOSTNAME1
>>    service_description             SSH
>>    dependent_host_name             HOSTNAME1
>>    dependent_service_description   SNMP
>>    execution_failure_criteria      w,p,u,c
>>    notification_failure_criteria   w,p,u,c
>>    inherits_parent                 1
>>}
>>
>>define servicedependency{
>>    host_name                       HOSTNAME1
>>    service_description             SNMP
>>    dependent_host_name             HOSTNAME1
>>    dependent_service_description   SNMP--*
>
>
> Not sure if SNMP--* does what you think (and I hope) it does. Have you
> looked at the view config web page and verified that nagios is seeing
> the appropriate service dependencies?
>
> -- rouilj
> John Rouillard
> ===========================================================================
> My employers don't acknowledge my existence much less my opinions.
>
>
> _______________________________________________
> Nagios-users mailing list
> Nagios-users at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-users
> ::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
> ::: Messages without supporting info will risk being sent to /dev/null
>