[patch] Workaround for 'Host DOWN' false-positives
Bruce Campbell
nagios-devel at vicious.dropbear.id.au
Sun May 21 13:26:49 CEST 2006
On Sat, 20 May 2006, Jan Kratochvil wrote:
> script for using service "Connectivity" to detect HOST-DOWN/UP states:
>
> Attached script delegates the host-alive checking to the standard Nagios
> services checking and if the service check will detected after some time
Great idea. Some serious scaling issues in the way you've gone about it
though (which, for those who didn't follow the explanation, involves the
host check command scanning the status.dat file for the output of the
'Connectivity' or 'SSH' service check to return).
To be more precise, you are reading in the complete status.dat and
objects.cache files each time this script is being run. Some
installations have these files over the 10 meg mark, and I suspect that
reading in the file each time a host check is run might well be a little
bit noticeable, particularly when using an interpreted language and a lot
of hosts.
Rather than having your host check command do a lot of work and possibly
hit the memory, disk and CPU too hard (Nagios has a periodic obsession
with repeatedly checking the status of a host), get the service check
command to do just a little extra work and submit the host check result
to Nagios when it runs, leaving the host check to simply do a lookup
inside Nagios.
( Note, all of the following has been quickly typed up after the influence
of a rather late night. I could be completely and utterly wrong )
For instance, try this script as a service check:
#!/bin/sh
# host_check_wrapper.sh
# Call with: host_check_wrapper.sh $HOSTNAME$ $COMMANDFILE$ $USER1$/normal_check_command args
shost=$1
shift
ncmdf=$1
shift
# Run the remaining command and record the output text.
result=`"$@"`
# Record the exit code.
state=$?
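# Submit the result to the Nagios (external) command file. An external
# command must fit on one line, so send only the first line of output.
if [ -p "$ncmdf" ] && [ -w "$ncmdf" ] ; then
first=`echo "$result" | head -n 1`
echo "[`date +%s`] PROCESS_HOST_CHECK_RESULT;$shost;$state;$first" > "$ncmdf"
fi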
# Return the result to Nagios.
echo "$result"
exit $state
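One thing to watch: the wrapper forwards the service plugin's exit code
unchanged. As far as I recall, passive host results use 0 = UP, 1 = DOWN
and 2 = UNREACHABLE, while service plugins return 0 = OK, 1 = WARNING,
2 = CRITICAL and 3 = UNKNOWN, so a WARNING from check_ping would be
submitted as DOWN. A minimal sketch of a mapping follows. The function name
host_state_from_plugin is hypothetical, and treating WARNING as UP and
everything else as DOWN is my assumption, not anything Nagios mandates:

```shell
#!/bin/sh
# Hypothetical helper (not part of Nagios): translate a service plugin
# exit code into a passive host state.
# Assumption: OK and WARNING mean the host answered, so report UP (0);
# CRITICAL, UNKNOWN and anything else are reported as DOWN (1).
host_state_from_plugin() {
    case "$1" in
        0|1) echo 0 ;;
        *)   echo 1 ;;
    esac
}
```

In the wrapper you would then compute hstate=`host_state_from_plugin "$state"`
and put $hstate in the PROCESS_HOST_CHECK_RESULT line, while still exiting
with $state so the service itself keeps its real status.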
And the Nagios definitions would be:
define command {
command_name host_check_wrapper
command_line $USER1$/host_check_wrapper.sh $HOSTNAME$ $COMMANDFILE$ $ARG1$ $ARG2$ $ARG3$ $ARG4$ $ARG5$ $ARG6$
}
# Run the service check fairly frequently.
define service {
host_name some_host
service_description Connectivity
check_command host_check_wrapper!$USER1$/check_ping!-w!100.0,20%!-c!500.0,60%
normal_check_interval 2
etc...
}
And finally, define your host as follows: do not perform active checking,
accept passive results, check the freshness of results such that anything
within the last 20 minutes is valid, and define a fallback command:
define host {
host_name foo.example.com
address 1.2.3.4
active_checks_enabled 0
passive_checks_enabled 1
check_freshness 1
max_check_attempts 5
check_interval 2
freshness_threshold 1200
check_command check_dummy!2!Host assumed unreachable
}
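To make the freshness arithmetic concrete: freshness_threshold is in
seconds, so 1200 is the 20 minutes mentioned above, which at a 2 minute
service interval allows roughly ten missed submissions before the fallback
command fires. Nagios does this comparison internally; the sketch below
only illustrates the idea. The function name is_stale is hypothetical, and
whether Nagios compares with "greater than" or "greater or equal" is an
assumption on my part:

```shell
#!/bin/sh
# Illustration only: Nagios performs the freshness check itself.
# Usage: is_stale LAST_RESULT_EPOCH NOW_EPOCH THRESHOLD_SECONDS
# Succeeds (exit 0) when the last result is older than the threshold.
is_stale() {
    last_result=$1
    now=$2
    threshold=$3
    age=$((now - last_result))
    [ "$age" -gt "$threshold" ]
}

# Example: a result that is 1500 seconds old against a 1200 second threshold.
if is_stale 1000 2500 1200 ; then
    echo "stale: the fallback check_command would be run"
fi
```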
Define the check_dummy command. This is a plugin that comes
standard with Nagios. This simply returns the integer given as the first
argument, and the reason given as the second argument. In this set-up,
we're using it to issue an alert if the host's passive check result has
not been received for 20 minutes.
define command {
command_name check_dummy
command_line $USER1$/check_dummy $ARG1$ "$ARG2$"
}
--
Bruce Campbell