SEC correlation patches for nagios

John P. Rouillard rouilj at cs.umb.edu
Sat Aug 16 07:57:49 CEST 2003
Hello all:

Sorry this took so long, but I have been very busy.  Here are my
patches to nagios to support SEC, the simple event correlator engine.

With these changes you can define rules that:

  ignore high bandwidth usage while a system jumpstart is occurring.

  ignore warning level of http slowdowns while an indexing process
  is running.

  ignore interface status changes on certain ports of a switch until
  three changes have occurred in 5 minutes.

The patch below adds two more exit values for the plugins to use:

    "OK_IGNORE" value 5 with no output from the plugin
    "ERROR_IGNORE" value 6 with no output from the plugin

I have chosen exit status of 5 for "OK_IGNORE" and 6 for
"ERROR_IGNORE". (It looked like code 4 was used internally for pending
states, and I didn't want to use that number, hence my choice of 5 and
6. Upon looking at the code again, it appears that I was mistaken
about the use of 4.)

If nagios receives one of these new exit codes, it should not change
the current state of the polled service based on the poll. The new
status will be sent to it by a passive check command generated from
sec.

I want nagios to be an (almost) dumb poller and to let sec filter all
the data. Using sec provides much better control over flap detection,
and multiple service correlation. I said I wanted nagios to be an
almost dumb poller. This is because I want nagios to poll at the
retry_interval if there is a problem found by the plugin, and the
regular check_interval otherwise. If sec_filter (or other plugin)
exits with status 6, then nagios should poll at the faster retry
interval. This allows sec to better determine the trouble the system
is in, or more easily determine when the system recovers.
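
As a rough sketch of the scheduling behaviour described above (this is
not the patch itself: the function names is_ignore_code and
ignore_next_check and their parameters are made up for illustration,
only the constant values come from the patch), the intent is roughly:

```c
#include <time.h>

/* Values taken from the patch to nagios.h.in. */
#define STATE_OK_IGNORE     5
#define STATE_ERROR_IGNORE  6

/*
 * Hypothetical helper, not part of nagios: returns 1 when the plugin
 * exit code asks nagios to leave the current service state untouched.
 */
int is_ignore_code(int return_code)
{
    return return_code == STATE_OK_IGNORE ||
           return_code == STATE_ERROR_IGNORE;
}

/*
 * Hypothetical helper: pick the next poll time for an ignore code.
 * OK_IGNORE keeps the regular check_interval, ERROR_IGNORE drops to the
 * faster retry_interval. Both intervals are counted in units of
 * interval_length seconds. Only meant to be called when
 * is_ignore_code(return_code) is true.
 */
time_t ignore_next_check(int return_code, time_t last_check,
                         int check_interval, int retry_interval,
                         int interval_length)
{
    int units = (return_code == STATE_OK_IGNORE) ? check_interval
                                                 : retry_interval;
    return last_check + (time_t)units * interval_length;
}
```

With check_interval=5, retry_interval=1 and interval_length=60, an exit
code of 5 schedules the next poll 300 seconds after the last check, and
an exit code of 6 schedules it 60 seconds after.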

I was having a problem with the services being polled occasionally
going into an unknown state when sec was actively suppressing a warning
or error state. I am not sure why this was occurring, and the cause has
so far evaded my efforts at identifying it. This problem occurred
when I was (mistakenly) running an earlier version of the patch that
used value 4 for OK_IGNORE and value 5 for ERROR_IGNORE. The
current patch uses 5 and 6, but I don't expect that change to resolve
the problem. If anybody who knows the internals of nagios better than I
do can improve on the patch, please feel free.

To the nagios developer(s): are you interested in this patch at all?

The patch and sec_filter files are attached. The sec_filter file
discusses how to set up sec and nagios to monitor the sec process and
report if it goes down.

You should be able to apply the patch to nagios-1.0 and 1.1: cd to the
top of the nagios tree and run "patch -p1 < patch". sec_filter isn't
perfect with respect to the default arguments. However, I wanted to get
this in circulation.

If there are any questions, I will do my best to answer them.

				-- rouilj
John Rouillard
===========================================================================
My employers don't acknowledge my existence much less my opinions.

-------------- next part --------------
--- ./base/checks.c.orig	2003-05-21 15:30:43.000000000 -0400
+++ ./base/checks.c	2003-06-05 20:45:10.000000000 -0400
@@ -665,7 +665,7 @@
                         }
 
 		/* make sure the return code is within bounds */
-		else if(queued_svc_msg.return_code<0 || queued_svc_msg.return_code>3){
+		else if(queued_svc_msg.return_code<0 || queued_svc_msg.return_code>STATE_MAXIMUM_VALUE){
 
 			snprintf(temp_buffer,sizeof(temp_buffer),"Warning: Return code of %d for check of service '%s' on host '%s' was out of bounds.%s\n",queued_svc_msg.return_code,temp_service->description,temp_service->host_name,(queued_svc_msg.return_code==126 || queued_svc_msg.return_code==127)?" Make sure the plugin you're trying to run actually exists.":"");
 			temp_buffer[sizeof(temp_buffer)-1]='\x0';
@@ -891,6 +891,16 @@
 
 		/* hey, something's not working quite like it should... */
 		else{
+		  if ( temp_service->current_state>=STATE_OK_IGNORE &&
+		       temp_service->current_state<=STATE_ERROR_IGNORE ){
+			if(temp_service->check_type==SERVICE_CHECK_ACTIVE ) {
+			  if(temp_service->current_state == STATE_OK_IGNORE){
+			    temp_service->next_check=(time_t)(temp_service->last_check+(temp_service->check_interval*interval_length));
+			  } else {
+			    temp_service->next_check=(time_t)(temp_service->last_check+(temp_service->retry_interval*interval_length));
+			  }
+			}
+		  } else {
 
 			/* reset the recovery notification flag (it may get set again though) */
 			temp_service->no_recovery_notification=FALSE;
@@ -1084,6 +1094,7 @@
 			if(obsess_over_services==TRUE)
 				obsessive_compulsive_service_check_processor(temp_service,temp_service->state_type);
 		        }
+		}
 
 		/* reschedule the next service check ONLY for active checks */
 		if(temp_service->check_type==SERVICE_CHECK_ACTIVE){
@@ -1101,6 +1112,8 @@
 			schedule_service_check(temp_service,temp_service->next_check,FALSE);
 		        }
 
+		if (temp_service->current_state < STATE_OK_IGNORE || 
+		    temp_service->current_state > STATE_ERROR_IGNORE ){
 		/* if we're stalking this state type and state was not already logged AND the plugin output changed since last check, log it now.. */
 		if(temp_service->state_type==HARD_STATE && state_change==FALSE && state_was_logged==FALSE && strcmp(old_plugin_output,temp_service->plugin_output)){
 
@@ -1131,6 +1144,7 @@
 
 		/* update service performance info */
 		update_service_performance_data(temp_service);
+		} /* current_state < STATE_OK_IGNORE */
 
 		/* break out if we've been here too long (max_check_reaper_time seconds) */
 		time(&current_time);
--- ./base/nagios.h.in.orig	2003-06-05 19:17:47.000000000 -0400
+++ ./base/nagios.h.in	2003-08-15 16:38:23.000000000 -0400
@@ -148,10 +148,14 @@
 
 /****************** SERVICE STATES ********************/
 
+#define STATE_MINIMUM_VALUE		-1
 #define STATE_OK			0
 #define STATE_WARNING			1
 #define STATE_CRITICAL			2
 #define STATE_UNKNOWN			3       /* changed from -1 on 02/24/2001 */
+#define STATE_OK_IGNORE			5       /* do not submit active result for processing, but state ok for scheduling purposes */
+#define STATE_ERROR_IGNORE		6       /* do not submit active result for processing, but state is not ok for scheduling purposes */
+#define STATE_MAXIMUM_VALUE             6
 
 
 
--- ./base/utils.c.orig	2003-06-05 19:18:34.000000000 -0400
+++ ./base/utils.c	2003-06-05 19:18:58.000000000 -0400
@@ -1172,7 +1172,7 @@
 			result=STATE_UNKNOWN;
 
 		/* check bounds on the return value */
-		if(result<-1 || result>3)
+		if(result<STATE_MINIMUM_VALUE || result>STATE_MAXIMUM_VALUE)
 			result=STATE_UNKNOWN;
 
 		/* try and read the results from the command output (retry if we encountered a signal) */
-------------- next part --------------
#! /usr/bin/perl -w

# NOTE: this is not embedded perl safe.
# search for CHECKME to find things you may have to change.

use strict;

use Fcntl ':flock'; # import LOCK_* constants
use Getopt::Long; # import argument parsing code

# CHECKME - set your path to the nagios plugins libexec directory here.
our $LIBEXECDIR = '/tools/nagios-1.0/libexec';

# import nagios support utilities.
# CHECKME change to path to the directory containing your copy of utils.pm.
use lib "/tools/nagios-1.0/libexec";
use lib "./libexec";
use utils qw(%ERRORS &print_revision &support);

# CHECKME - include your path to the sec file here, or use the -O flag.
# location of the file that sec is watching.
our $opt_O='/var/run/nagios_to_sec';
our $passthrough = 1 ; # if 1 pass the exit status and data to nagios
                       # if 0 don't pass data or exit status to nagios

# CHECKME - include your path to libexec here.
# set the path explicitly even though the command names should
# be absolute path names.
$ENV{'PATH'} = "$LIBEXECDIR:$ENV{'PATH'}";

sub help {
    my ($short, $exitcode) = @_;

    $exitcode = 10 if ! defined $exitcode;

    print_revision('sec_filter', '$Rev: 1.0$') if ! defined $short;
    print STDERR << "EOH";

Usage: $0 [-p|-i] [-D] [-O <output file>] 
             [-t <timeout>]  -H <hostname> -s <service> <nagios_cmd>
       $0 -h|-V <any arguments>

  -H (--hostname) <hostname> - the name of the device being monitored.
  -s (--service) <service> - the name of the service being monitored.
  -p (--pass) - pass output and exit status of nagios_cmd to nagios (default).
  -i (--ignore) - ignore output/exit status of nagios_cmd.
  -O (--output) - output file to write entry to. This is sec's input file.
  -t (--timeout) - timeout period in seconds. (not yet implemented)
  -D (--debug) - don't execute nagios cmd, just echo command line parsing.
  -h (--help) - present help message and exit.
  -V (--version) - print version of plugin and nagios and exit.

  <nagios_cmd> a nagios command line.
EOH

    exit $exitcode if defined $short;

    print STDERR << "EOH";

This command allows you to poll for the status of various items and
have that data filtered through the sec (simple event
correlation <http://kodu.neti.ee/~risto/sec/>) tool. It does this by
sending a properly formatted nagios external command file entry
including the standard output and exit status of the nagios_cmd to the
file $opt_O or specified by -O. This file should be monitored by sec,
and sec can then send a passive event to nagios by writing to the
external command file.

If the -p flag or --pass (or neither of the -i or -p variants) is
specified, this program will print the output from nagios_cmd and exit
with the same exit code as nagios_cmd.

If the -i flag, --ignore or --nopass options are specified, this
program will exit with code 5 (OK_IGNORE) or 6 (ERROR_IGNORE),
preserving the current status of the monitored entity (note this
requires a patched nagios, see below). It is then up to sec to generate
a passive status message for the device and service of interest.

A sample nagios command definition entry in checkcommands.cfg would be:
define command{
   command_name check_http_correlation
   command_line \$USER1\$/sec_filter -H \$HOSTNAME\$ -s \$ARG1\$ \
                  \$USER1\$/check_http -H \$HOSTADDRESS\$ \$ARG2\$
}
(Note lines wrapped to stay under 80 characters.)

With an entry in the services.cfg file like:
define service{
        use                   generic-service
        host_name             hosta,hostb,hostc
        service_description   HTTP
        check_command         check_http_correlation!HTTP --nopass!-c 8 -w 4 \
                              -u http://127.0.0.1:80/index.html
        }
(Note lines wrapped to stay under 80 characters.)

This will divert the output from the check_http to sec. Then sec can
be used to:

    filter out warnings based on time of day.
    filter out warnings based on other operations occurring (e.g.
       suppress the warning if sec has been told (say by arrival of a trap)
       that an index operation is occurring).
    provide more extensive flap detection including time based flap
       detection, e.g. disable flap detection on a host if it is
       its normal software load/reboot time.
    reset an error state (if used with -p) if it is later determined to
       not be an error.
    raise an alert if any two (three ..) events occur in a specific
       order, or if they occur in any order.
    implement flap detection according to user/device assigned rules.

The format of the messages given to sec allows them to be put directly
in the external command file if they should not be suppressed.

Note that this command needs a patched version of nagios to be of any
use.  The exit code of 5 (OK_IGNORE) or 6 (ERROR_IGNORE) must prevent
nagios from changing the state of the polled service.

You should also configure a service on your nagios host called sec
with the following options:

    define service{ 
	host_name [nagioshost]
        service_description sec
	check_command ping_sec
	max_check_attempts 1
	normal_check_interval [#]
	retry_check_interval [#]
        active_checks_enabled 0 
        passive_checks_enabled 1
	check_period  [timeperiod_name]
        check_freshness 1 
        freshness_threshold [#]
	notification_interval [#]
	notification_period [timeperiod_name]
	notification_options [w,u,c,r] 
	contact_groups contact_groups 
    } 

Anything in [] needs to be filled in according to your site's
configuration. I suggest a 5 minute (300 second) freshness_threshold.

Define the ping_sec command in the checkcommands.cfg file
using:

    define command {
	command_name ping_sec
	command_line \$USER1\$/ping_sec \$HOSTNAME\$ \$SERVICESTATE\$
    }

and ping_sec is the following shell script:

  #! /bin/sh
  PATH=/bin:/usr/bin:/usr/ucb

  echo "[`date +%s`] PROCESS_SERVICE_CHECK_RESULT;\$1;sec;0;sec is running" \
          > /path/to/nagios_to_sec
  if [ "\$2" -ne 3 ]; then
     echo "Submitting passive check to sec"
     exit 3
  else
     echo "Sec failed to respond"    
     exit 2
  fi

This is what I would like to do. Your nagios may not support
substituting \$SERVICESTATE\$ in service check definitions.  If not,
then you may have to keep some state by touching a file or something
to see if you have been called twice in a small time period.

If sec is working, it should submit a passive ok result shortly
after this. That resets the state to ok, so the next time the script
is called it starts from the ok state again. If the script is called
while the service is in an unknown state, it reports sec down.

Then add a rule to sec that will submit this check line directly back
to nagios. This rule reads:

  type=single
  continue=dontcont
  ptype=substr
  pattern=PROCESS_SERVICE_CHECK_RESULT;nagioshost;sec;0;sec is running
  desc=Check message from nagios to see if I am running
  action=write /nagios/external/command/file \$0

This will have sec submit a passive check result saying that it is ok.

Make sure sec has a timed rule that will trigger one minute
before the freshness threshold by using the calendar rule (assuming a
5 minute freshness interval in nagios):

  type=calendar
  time= 1,5,9,13,17,21,25,29,33,37,41,45,49,53,57 * * * * *
  desc=Trigger keepalive message to nagios
  action=shell /bin/echo "[`date +%s`] PROCESS_SERVICE_CHECK_RESULT;nagioshost;sec;0;sec is running" >> /nagios/external/command/file

EOH

    support() if ! defined $short;
    exit $exitcode;
}
sub lock {
     flock(SECINPUT, LOCK_EX);
     # and, in case someone appended
     # while we were waiting...
     seek(SECINPUT, 0, 2);
}

sub unlock {
    flock(SECINPUT,LOCK_UN);
}

our ($cmd, $cmd_exit, $msg);
our ($datetime);

# parse command options
our ($opt_D, $opt_H, $opt_h, $opt_i, $opt_p, $opt_s, $opt_V, $opt_t);
($opt_D, $opt_H, $opt_h, $opt_i, $opt_p, $opt_s, $opt_V, $opt_t) = undef;

Getopt::Long::Configure('bundling', 'require_order', 'pass_through');
GetOptions
        ("H=s" => \$opt_H, "hostname=s"  => \$opt_H,
         "s=s" => \$opt_s, "service=s"   => \$opt_s,
	 "p"  => \$opt_p,  "pass!"       => \$opt_p,
	 "i"   => \$opt_i, "ignore"      => \$opt_i,
	 "O=s" => \$opt_O, "output=s"    => \$opt_O,
         "h"   => \$opt_h, "help"        => \$opt_h,
         "t=i" => \$opt_t, "timeout=i"   => \$opt_t,
         "D"   => \$opt_D, "debug"       => \$opt_D,
	 "V"   => \$opt_V, "version"     => \$opt_V,
	 );

# handle information requests
help() if $opt_h;
print_revision('sec_filter', '$Rev: 1.0$'), exit $ERRORS{'OK'} if $opt_V;

# sanity check arguments
  # do we have any command arguments?
print(STDERR "\nMissing <nagios_cmd>.\n"), help('short') if $#ARGV == -1;

  # If we have an option first in @ARGV, we have
  # an error by definition.
print(STDERR "\nUnrecognized option $ARGV[0].\n"), help('short')
    if $ARGV[0] =~ /^-/;

  # Check to see if the argument starts with a / and we can stat the
  # first non gobbled argument. If not then we have an error in
  # parsing.
print(STDERR "\nCommand path $ARGV[0] is not absolute.\n"), help('short')
    if $ARGV[0] !~ m#^/#;
print(STDERR "\nUnable to find command $ARGV[0].\n"), help('short') 
    if ! -f $ARGV[0];

print(STDERR "\nMissing -H hostname.\n"), help('short') if ! $opt_H;
print(STDERR "\nMissing -s service.\n"), help('short') if ! $opt_s;
print(STDERR "\nOnly one of -p or -i can be specified.\n"), help('short')
    if $opt_i && $opt_p;

  # check for access to sec input (our output) file.
print(STDERR "Unable to write to or find output file $opt_O.\n"),
    help('short', $ERRORS{'CRITICAL'}) if ! -w $opt_O ;

# process options
# error is already generated if both $opt_i and $opt_p are set.
$passthrough = $opt_p if defined $opt_p;
$passthrough = 0 if $opt_i;

$opt_t = $utils::TIMEOUT if ! defined $opt_t; # default timeout

$SIG{'ALRM'} = sub {
    print "sec_filter svc=$opt_s timeout: No response from command $ARGV[0]\n";
    exit $ERRORS{'UNKNOWN'};
};
alarm($opt_t);

$cmd = join(' ', @ARGV);
if (! $opt_D) {
    $msg = `$cmd 2>&1`;
    $cmd_exit = $? >> 8;
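    # Out-of-range exit codes are reported to sec as UNKNOWN (3). The
    # message ends with a newline so the entry written to the sec input
    # file stays on a line of its own.
    if ($cmd_exit < 0 || $cmd_exit > 3) {
        $msg = "Command $cmd returned out of bound exit code $cmd_exit.\n";
        $cmd_exit = 3;
    }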
}

$datetime=time();

$SIG{'ALRM'} = sub {
    print "sec_filter svc=$opt_s timeout: while accessing output file\n";
    exit $ERRORS{'UNKNOWN'};
};
alarm($opt_t);

if ( ! $opt_D ) {
    open(SECINPUT, '>>', $opt_O) or die "Can't open $opt_O: $!";
    lock();
    print SECINPUT "[$datetime] PROCESS_SERVICE_CHECK_RESULT;$opt_H;$opt_s;$cmd_exit;$msg";
    unlock();
    close(SECINPUT);
} else {
    $msg = "host=$opt_H, svc=$opt_s passdata=$passthrough nagios_cmd=$cmd\n";
    $cmd_exit = $ERRORS{'UNKNOWN'};
    print STDOUT "[$datetime] PROCESS_SERVICE_CHECK_RESULT;$opt_H;$opt_s;$cmd_exit;$msg";
}

if ($passthrough) {
    print STDOUT $msg;
    exit $cmd_exit;
} else {
    if ($cmd_exit == 0) {
        exit $ERRORS{'IGNORE_OK'} if defined $ERRORS{'IGNORE_OK'};
        exit 5;
    } else {
        exit $ERRORS{'IGNORE_ERROR'} if defined $ERRORS{'IGNORE_ERROR'};
        exit 6;
    }
}