Trouble with Nag-1.0/ePN/check_by_ssh: check returns UNKNOWN status _in_ Nagios.
Stanley Hopcroft
Stanley.Hopcroft at IPAustralia.Gov.AU
Sat Feb 22 10:00:05 CET 2003
Dear Ladies and Gentlemen,
I am writing to seek advice about how to deal with an intermittent
problem with check_by_ssh (a CVS version no more than 10 days old:
check_by_ssh (nagios-plugins 1.3.0-beta2) 1.9).
The context:
Nagios 1.0/ePN/Perl 5.005_03
FreeBSD 4.7_RELEASE
PIII 850 + 256 MB. 192 hosts + 309 active checks. Load average <=
0.20. Latency <= 16 secs.
The problem:
check_by_ssh connects to an AIX v4 host to run a privileged
/bin/sh script that checks Oracle database 'connectivity' (probably
using sql+). The check is coded according to the Nagios plugin
guidelines, and its author assures me it exits only with 0 or 2 (no
WARNING, no UNKNOWN).
This check is run by sudo on AIX, so the complete check_by_ssh command
is
% /usr/local/nagios/libexec/check_by_ssh -t 60 -H oradev -C '/usr/local/bin/sudo -u netstmq /home/local/netsaint/db_check/db_check 2>/dev/null'
all databases ok
%echo $?
0
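Because the problem is intermittent, a single clean run from the CLI proves little. A minimal sketch of a loop that repeats a command and reports any exit status other than 0 or 2 follows. The function name run_repeatedly and the run count are hypothetical and not part of Nagios or the plugins; the real command to pass in would be the check_by_ssh invocation shown above.

```shell
#!/bin/sh
# run_repeatedly COUNT CMD [ARGS...]   (hypothetical helper, for illustration)
# Runs CMD COUNT times and prints every run whose exit status is not 0 or 2,
# the only two statuses the check's author says the script produces.
run_repeatedly() {
    count=$1
    shift
    i=1
    unexpected=0
    while [ "$i" -le "$count" ]; do
        # Capture output and exit status without aborting on failure.
        if output=$("$@" 2>&1); then
            rc=0
        else
            rc=$?
        fi
        case "$rc" in
            0|2) ;;
            *)  unexpected=$((unexpected + 1))
                echo "run $i: exit $rc, output: $output" ;;
        esac
        i=$((i + 1))
    done
    echo "unexpected=$unexpected of $count"
}

# Example (not executed here):
#   run_repeatedly 100 /usr/local/nagios/libexec/check_by_ssh -t 60 -H oradev \
#       -C '/usr/local/bin/sudo -u netstmq /home/local/netsaint/db_check/db_check 2>/dev/null'
```

An interactive shell differs from the Nagios daemon in environment and controlling terminal, so a clean result here would not rule out a problem that only appears under Nagios; that difference is an assumption worth testing, not an established cause.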
services.cfg:
define service{
        use                     generic-service
        host_name               oradev
        service_description     DB Connectivity
        contact_groups          oracle-admins
        normal_check_interval   30
        check_command           check_by_ssh4!60!/usr/local/bin/sudo -u netstmq /home/local/netsaint/db_check/db_check 2>/dev/null
        }
checkcommands.cfg:
# 'check_by_ssh4' command definition
define command{
        command_name    check_by_ssh4
        command_line    $USER1$/check_by_ssh -t $ARG1$ -H $HOSTNAME$ -C '$ARG2$'
#       command_line    $USER1$/check_by_ssh -t $ARG1$ -H pc09011 -C '$ARG2$'
        }
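One way to see what exit status the plugin produces when Nagios itself runs it is to place a small logging wrapper in front of it. The sketch below is only an illustration: the function name log_and_pass and the log path in the usage comment are hypothetical, and the wrapper is not part of Nagios or the plugins distribution. It records the time, exit status and output of each run, then hands the same output and status back to the caller.

```shell
#!/bin/sh
# log_and_pass LOGFILE CMD [ARGS...]   (hypothetical helper, for illustration)
# Runs CMD, appends a timestamped record of its exit status and output to
# LOGFILE, prints the output unchanged and returns CMD's exit status.
log_and_pass() {
    logfile=$1
    shift
    if output=$("$@"); then
        rc=0
    else
        rc=$?
    fi
    printf '%s rc=%s output=%s\n' \
        "$(date '+%Y-%m-%d %H:%M:%S')" "$rc" "$output" >> "$logfile"
    printf '%s\n' "$output"
    return "$rc"
}

# Example as a standalone wrapper script (assumed path /tmp/check_by_ssh.log):
#   log_and_pass /tmp/check_by_ssh.log /usr/local/nagios/libexec/check_by_ssh "$@"
#   exit $?
```

If the log shows rc=0 for a run that Nagios recorded as UNKNOWN, the status is being changed after the plugin exits; if it shows rc=3, the status comes from check_by_ssh or from something it ran on the remote host.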
Running the command above from the CLI on the Nagios host, logged in as
the Nagios user, I have never seen it return anything other than OK or
CRITICAL. Yet as I write, Nagios reports that the return code is
UNKNOWN, completely contradicting what I see from the CLI (above).
In addition, sending Nagios a HUP signal usually triggers an UNKNOWN
state, and a Nagios stop/start cycle is the only way to clear it.
A debugging build of Nagios (configured with --enable-DEBUG3 plus the
other usual configure options) behaves somewhat differently: the rate of
UNKNOWN results is much lower, and a HUP only produces an UNKNOWN from
the first check (it recovers on the first retry).
It also shows
Found check result for service 'DB Connectivity' on host 'oradev'
Check Type: ACTIVE
Parallelized?: Yes
Exited OK?: Yes
Return Status: 3
Plugin Output: 'all databases ok'
Service 'DB Connectivity' on host 'oradev' has changed state since last check!
Raw Command: check-host-alive
Processed Command: /usr/local/nagios/libexec/check_ping 10.0.100.10 100 100 5000.0 5000.0 -p 1
Host Check Result: Host 'oradev' is UP
Host Check Result: Host 'oradev' is UP
Raw global service event handler command line: $USER1$/global_svc_handler $TIMET$ $HOSTNAME$ '$SERVICEDESC$' $SERVICESTATE$ $STATETYPE$ '$OUTPUT$'
Processed global service event handler command line: /usr/local/nagios/libexec/global_svc_handler 1045809853 oradev 'DB Connectivity' UNKNOWN SOFT 'all databases ok'
It seems, then, that the check is occasionally returning an UNKNOWN
state without terminating abnormally (an abnormal exit would have given
Nagios an opportunity to set a default value).
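For reference, the "Return Status: 3" in the debug output and the UNKNOWN passed to the event handler are the same thing under the standard plugin exit-code convention. A minimal sketch of that mapping, with a hypothetical function name state_name:

```shell
#!/bin/sh
# state_name RC   (hypothetical helper, for illustration)
# Prints the Nagios state name for a plugin exit status, following the
# plugin convention: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.
state_name() {
    case "$1" in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        3) echo UNKNOWN ;;
        *) echo "outside the plugin convention ($1)" ;;
    esac
}

# Example: state_name 3 prints UNKNOWN
```

So a plugin that printed 'all databases ok' was nevertheless recorded with exit status 3 rather than the 0 seen from the CLI.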
Since check_by_ssh is used by this Nagios successfully for other
services, the only explanation I can think of is that 'sudo' is
malfunctioning in some way.
Your comments about how to proceed are most welcome: gdb, truss?
Yours sincerely.
--
------------------------------------------------------------------------
Stanley Hopcroft
------------------------------------------------------------------------
'...No man is an island, entire of itself; every man is a piece of the
continent, a part of the main. If a clod be washed away by the sea,
Europe is the less, as well as if a promontory were, as well as if a
manor of thy friend's or of thine own were. Any man's death diminishes
me, because I am involved in mankind; and therefore never send to know
for whom the bell tolls; it tolls for thee...'
from Meditation 17, J Donne.
_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users