Event handler OK from command line, not from nagios
Glenn A. Meisenheimer
gmeisenheimer at itgroundwork.com
Wed Jul 7 23:31:44 CEST 2004
Friends,
My problem is this: the print spooler on a Win 2k machine keeps hanging.
Remedial action typically requires a sysadmin to stop and restart the
print spooler on the Windows box (there are multiple boxes, btw).
So I created this .BAT file on the Windows host:
ECHO. >> c:\nrpe-nt\rspooler.log
ECHO |DATE |find "current" >> c:\nrpe-nt\rspooler.log
ECHO |TIME |find "current" >> c:\nrpe-nt\rspooler.log
ECHO Resetting the Print Spooler >> c:\nrpe-nt\rspooler.log
NET STOP SPOOLER >> c:\nrpe-nt\rspooler.log
NET START SPOOLER >> c:\nrpe-nt\rspooler.log
The script appends to rspooler.log (creating it if it does not already
exist): it writes a date/time stamp, echoes that it is resetting the
spooler, then redirects the output from the NET STOP and NET START
commands into that log file as well.
This gives us a record of restarts.
I have installed nrpe-nt on the Windows box and configured nrpe.cfg as follows:
command[check_rspooler]=C:\nrpe-nt\rspooler.bat
So if check_nrpe on the Nagios server calls check_rspooler on the Win 2k
box, it should run the rspooler.bat script listed above.
On the Nagios server check_nrpe is configured like this:
#
# NRPE Command
define command{
command_name check_nrpe
command_line /usr/local/nagios/libexec/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
}
And indeed, if I type this from the command line:
/usr/local/nagios/libexec/check_nrpe -H 192.168.1.31 -c check_rspooler
The rspooler.bat file on the Win 2k box does run, and does log the event.
So what I need to do is write an event handler that calls rspooler.bat
using the check_nrpe command above. That event handler is located in
/usr/local/nagios/eventhandler and is named reset_spooler_04. It is
owned by nagios/nagios.
The reset_spooler_04 event handler script follows:
cut here
****************************************************************
#!/bin/bash
#
# Event handler script for executing nrpe recovery scripts on a remote machine
#
# Note: This script will only execute a recovery script if the service is
# retried 4 times (in a "soft" state) or if the associated monitor somehow
# manages to fall into a "hard" error state.
#
# Arguments: $1 = service state, $2 = state type, $3 = check attempt number

# What state is the service in?
case "$1" in
OK)
    # The service just came back up, so don't do anything...
    ;;
WARNING)
    # We don't really care about warning states, since the service is
    # probably still running...
    ;;
UNKNOWN)
    # We don't know what might be causing an unknown error, so don't do
    # anything...
    ;;
CRITICAL)
    # Aha! The service appears to have a problem - perhaps we should
    # run the recovery script...
    # Is this a "soft" or a "hard" state?
    case "$2" in
    # We're in a "soft" state, meaning that Nagios is in the middle of
    # retrying the check before it turns into a "hard" state and contacts
    # get notified...
    SOFT)
        # What check attempt are we on? We don't want to restart the
        # print spooler on the first check, because it may just be a fluke!
        case "$3" in
        # Wait until the check has been tried 4 times before running the
        # recovery script. If the check fails on the 5th time (after
        # recovery), the state type will turn to "hard" and contacts will
        # be notified of the problem. Hopefully this will restart things
        # successfully, so the 5th check will result in a "soft" recovery.
        # If that happens no one gets notified because we fixed the problem!
        4)
            echo -n "Restarting print spooler service (4th soft critical state)..."
            # Call the check_nrpe plugin to execute the recovery script.
            /usr/local/nagios/libexec/check_nrpe -H 192.168.1.31 -c check_rspooler
            ;;
        esac
        ;;
    # The monitor somehow managed to turn into a hard error without
    # getting fixed. It should have been restarted by the code above, but
    # for some reason it didn't. Let's give it one last try, shall we?
    # Note: Contacts have already been notified of a problem with the
    # service at this point (unless you disabled notifications for this
    # service)
    HARD)
        echo -n "Restarting service..."
        /usr/local/nagios/libexec/check_nrpe -H 192.168.1.31 -c check_rspooler
        ;;
    esac
    ;;
esac
exit 0
****************************************************************
cut here
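The branching in the script above can be exercised without touching the remote host. What follows is only a minimal sketch: `should_restart` is a hypothetical helper, not part of the original script, that mirrors the case statement and returns 0 whenever the recovery command would be run.

```shell
#!/bin/bash
# Hypothetical helper (not in the original script) mirroring the handler's
# case statement. Returns 0 when the recovery command would run, 1 otherwise.
should_restart() {
    local state="$1" state_type="$2" attempt="$3"
    case "$state" in
    CRITICAL)
        case "$state_type" in
        SOFT)
            # Only the 4th soft attempt triggers the recovery script.
            if [ "$attempt" = "4" ]; then
                return 0
            fi
            ;;
        HARD)
            # Any hard critical state triggers the recovery script.
            return 0
            ;;
        esac
        ;;
    esac
    # OK, WARNING, UNKNOWN, or missing arguments: nothing to do.
    return 1
}

# Try a few argument combinations, including none at all.
for args in "CRITICAL SOFT 3" "CRITICAL SOFT 4" "CRITICAL HARD 5" "OK HARD 1" ""; do
    # Word splitting of $args is intentional here.
    if should_restart $args; then
        echo "restart:   [$args]"
    else
        echo "no action: [$args]"
    fi
done
```

With no arguments at all the helper reports no action. The handler takes that same path if whatever invokes it does not supply a state, a state type, and an attempt number.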
Now if I cd to /usr/local/nagios/eventhandler and enter this command:
./reset_spooler_04 CRITICAL HARD 4
The rspooler.bat file on the Win 2k box actually runs and logs the reset.
So far so good. Now I need to set up a command definition for the event
handler:
#
# reset_spooler_04 Command
define command{
command_name reset_spooler_04
command_line /usr/local/nagios/eventhandler/reset_spooler_04
}
And I need to include the event handler in the service definition:
define service{
host_name fc-ctx-04
use rs-windows-service
service_description printq_service
max_check_attempts 5
event_handler reset_spooler_04
normal_check_interval 10
retry_check_interval 1
check_command check_nt_perf!"\\Print Queue(_Total)\\Jobs"!4!5
}
So now the big picture. We have a working monitor on the Nagios server
that watches the number of print jobs in the Win 2k print queue. We
generate an alarm when the number of jobs in the queue exceeds 5. I can
test this by pulling the paper tray on the printer and queueing up print
jobs. The monitor works fine. It goes into alarm when the print queue
gets up to 5 jobs.
But the event handler doesn't work properly. Here's the log:
[07-02-2004 13:51:41] SERVICE EVENT HANDLER: fc-ctx-04;printq_service;CRITICAL;HARD;5;reset_spooler_04
[07-02-2004 13:51:41] SERVICE ALERT: fc-ctx-04;printq_service;CRITICAL;HARD;5;6
[07-02-2004 13:50:41] SERVICE EVENT HANDLER: fc-ctx-04;printq_service;CRITICAL;SOFT;4;reset_spooler_04
[07-02-2004 13:50:41] SERVICE ALERT: fc-ctx-04;printq_service;CRITICAL;SOFT;4;6
[07-02-2004 13:49:42] SERVICE EVENT HANDLER: fc-ctx-04;printq_service;CRITICAL;SOFT;3;reset_spooler_04
[07-02-2004 13:49:42] SERVICE ALERT: fc-ctx-04;printq_service;CRITICAL;SOFT;3;6
[07-02-2004 13:48:36] SERVICE EVENT HANDLER: fc-ctx-04;printq_service;CRITICAL;SOFT;2;reset_spooler_04
[07-02-2004 13:48:36] SERVICE ALERT: fc-ctx-04;printq_service;CRITICAL;SOFT;2;7
[07-02-2004 13:47:37] SERVICE EVENT HANDLER: fc-ctx-04;printq_service;CRITICAL;SOFT;1;reset_spooler_04
[07-02-2004 13:47:37] SERVICE ALERT: fc-ctx-04;printq_service;CRITICAL;SOFT;1;7
So, it seems to me that Nagios is calling the event handler, and is
calling it in a CRITICAL HARD state. Why, oh why, I wonder, isn't the
rest of the system working? The above event did NOT result in an entry
in the rspooler.log on the Win 2k machine.
As I said earlier, if I call the event handler as user nagios from the
command line, it does run rspooler.bat and makes the log entries on the
Win 2k machine.
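To compare a manual run with a run started by Nagios, it can help to record who ran the handler and which arguments it actually received. This is a minimal sketch, not part of the original script: `log_invocation` is a hypothetical helper and `DEBUG_LOG` is an assumed path. The call would go near the top of reset_spooler_04.

```shell
#!/bin/bash
# Hypothetical debug helper (not in the original script): append the calling
# user, the argument count, and each argument to a log file.
# DEBUG_LOG is an assumed location; change it to any path nagios can write.
DEBUG_LOG="${DEBUG_LOG:-/tmp/reset_spooler_04.debug}"

log_invocation() {
    local i=1 arg
    {
        echo "---- $(date)"
        echo "user=$(id -un) argc=$#"
        for arg in "$@"; do
            echo "arg${i}=${arg}"
            i=$((i + 1))
        done
    } >> "$DEBUG_LOG"
}

# Record exactly what this invocation was given.
log_invocation "$@"
```

If the entry written for a run started by Nagios shows a different user or a different argument list than the entry for a manual run, that difference is the first place to look.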
Any help would be appreciated...
Glenn Meisenheimer