Adding more advanced correlation to nagios with sec (any interest?)

John P. Rouillard rouilj at cs.umb.edu
Sat Jul 19 04:25:55 CEST 2003


Hello.

I apologize for taking so long to respond to you. Things at work have
been really hectic.

In message <20030711150412.B84683 at IPAustralia.Gov.AU>,
Stanley Hopcroft writes:

>On Sat, Jun 28, 2003 at 03:48:16PM -0400, John P. Rouillard wrote:
>> However, I have some things that I want to do that are not easily
>> done within nagios. E.G.
>> 
>>    If a system jumpstart is in progress, ignore warnings about high
>>      interface usage (on one interface), or dropped packets (on the
>>      hub).
>> 
>>    If an index operation of the HTTP server is in progress, ignore
>>      warnings about the http interface being slow.
>> 
>>    I also want to show a host/service down if a given system went down,
>>      (as determined by a syslog message) but I want the report done
>>      ONLY if the system isn't back up in 5 minutes.
>> 
>> Note that none of the rebooting, indexing, or jumpstarting operations
>> occur at fixed times, so I can't schedule these in advance.
>
[...]
>However, please would you spell what events and their origin are 
>correlated by Sec to avoid spurious alarms in these cases - especially 
>the first two. Is Sec correlating plugin failures with syslog messages ?

For the first, I have a tftp daemon that logs via syslog. A child sec
(invoked via the spawn action) watches the /var/adm/messages file and
sends a message "jumpstart_in_progress_on_16_net" to the parent sec
that is handling the nagios alerts. This suppresses the warnings until
a target file is retrieved near the end of the jumpstart, which resets
the context and allows the events to pass again.

The last problem is handled by generating a 5 minute
HOSTNAME_REBOOTING context/flag when the reboot syslog message is
received.  The existence of this context enables a suppress rule that
gobbles all of the events for the rebooting device. After 5 minutes or
the arrival of a "system up" syslog entry, the context is
destroyed. If the host is still down, nagios's next poll of the device
will cause sec to pass the events to nagios for reporting.
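
To illustrate the time-limited suppression described above, here is a
minimal sketch in Python rather than in sec rule syntax. The class,
the function and the host name "web1" are hypothetical and are not
part of sec; the sketch only mimics a context that is created on the
reboot message and disappears after its lifetime or on an explicit
delete.

```python
import time


class ContextStore:
    """Toy stand-in for sec contexts: named flags with a lifetime."""

    def __init__(self, clock=time.time):
        self._clock = clock      # injectable clock so the logic is testable
        self._expires = {}       # context name -> expiry time (epoch seconds)

    def create(self, name, lifetime):
        """Create (or refresh) a context lasting `lifetime` seconds."""
        self._expires[name] = self._clock() + lifetime

    def delete(self, name):
        """Destroy a context early, e.g. on a "system up" message."""
        self._expires.pop(name, None)

    def exists(self, name):
        """True while the context is alive; expired contexts are dropped."""
        expiry = self._expires.get(name)
        if expiry is None:
            return False
        if self._clock() >= expiry:
            del self._expires[name]
            return False
        return True


def should_suppress(store, host):
    """Swallow events for `host` while its rebooting context exists."""
    return store.exists(host + "_REBOOTING")
```

With this sketch, store.create("web1_REBOOTING", 300) on the reboot
message makes should_suppress(store, "web1") true for five minutes,
and store.delete("web1_REBOOTING") on "system up" ends it early.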

The HTTP index program is started from a shell script that sends a
trap when the indexing operation starts and stops.

>> I have a method of integrating sec <http://www.estpak.ee/~risto/sec/>
>> into nagios to handle these issues and more.
>> 
>> Using sec to process traps (or other passive checks) is straight
>> forward. The trap collector running from snmptrapd just dumps the trap
>> report (formatted as a nagios passive service check) into sec's input
>> fifo and then sec processes it, and reports it (if needed) into the
>> nagios.cmd pipe.
[...]
>Sec has become for me, the standard way of providing event and trap 
>handlers.
>
>For example, I have a general host and service handler that updates a 
>MySQL DB with the outage interval. To do this it must retain state (and 
>does so with a Perl hash tied to a DB file) so it can determine if there 
>has been a transition and if so, how long it was.
>
>This would probably be easier to do with Sec contexts.

One way to handle it is to store the start and stop times for the
event in a context's event store using the add command with %u (the
current time as number of seconds since Jan 1 1970).  Then report
these to an external shell script, and have it subtract them to get
duration of the outage. With the ability to trigger perl programs, you
could probably do it all within sec and remove the need for an
external program.
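
The subtraction step is small enough to sketch. The following Python
is only an illustration of what the external program would do; the
"DOWN <epoch>" / "UP <epoch>" line layout of the reported event store
is an assumption made up for this example, not a sec format.

```python
def outage_duration(event_store):
    """Return the outage length in seconds from stored epoch timestamps.

    `event_store` is assumed to hold lines such as "DOWN 1058581555" and
    "UP 1058581855"; this layout is hypothetical.
    """
    times = {}
    for line in event_store:
        state, stamp = line.split()
        times[state] = int(stamp)
    # Stop time minus start time gives the duration of the outage.
    return times["UP"] - times["DOWN"]
```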

>> However for polled items, it more difficult. I don't want to have a
>> flapping service where the plugin determines that there is a problem,
>> nagios reacts to that, and then sec reacts to that (being fed its info
>> by an event handler) by clearing the service because sec determines
>> that there is not yet a problem. This leads to a flapping service as
>> nagios and sec disagree on what is a true problem, and leads to
>> spurious notifications because I can't put in a high
>> max_check_attempts and have nagios respond to sec when it has a real
>> problem (unless I define yet another service yech).
>> 
>> What I did was write a plugin in perl (sec_filter) that runs the
>> nagios command (sort of like check_ssh). It always passes the output
>> of the plugin to sec's input pipe.  However, depending on the flags
>> given to the sec_filter script, it will exit:
>> 
>>     with an "ignore OK" code, and no output
>>     with an "ignore ERROR" code, and no output
>>     with the exit code and output of the plugin
>> 
>> I have chosen exit status of 5 for "ignore OK" and 6 for "ignore
>> ERROR". (It looks like code 4 is used internally for pending states,
>> and I didn't want to use that number hence my choice of 5 and 6.)
>> 
>> The reason for these new codes is to make nagios not change any status
>> for the polled service based on the poll. The new status will be sent
>> to it by a passive check command generated from sec.
>> 
>> That is I want nagios to be a (almost) dumb poller and to let sec
>> filter all the data. 
>
>If I understand correctly, the proposal is
>
>1 When Nag schedules a service check, of any and all service checks, it
>  in fact execs sec_filter with the real plugin name and flags that
>  determine sec_filters behaviour by allowing it to either

Correct.

> 1.1 treat the service as a normal Nag service (a 'polled' service, for 
>     which no event correlation by Sec is necessary)

Almost. It may or may not be correlated by Sec. It's just that you
want the initial report to be recorded in/acted on by Nag. Sec may
still clear the event.

> 1.2 treat the service as requiring Sec processing to accurately
>     determine the service state. Sec will get the plugin output and
>     use this with other Sec inputs and Sec context to determine the 
>     service state

Correct.

>2 Sec_filter writes
>
> 2.1 For those services requiring Sec,

I would say: for those services (service events) being reported only
via Sec,

>   2.1.1 An event to Sec
>
>   2.1.2 One of the new status codes to Nagios
>
> 2.2 Otherwise, in the case of 'polled' services, the usual Nag status 
>     codes and plugin output are written to Nags input queue

Correct.
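
The flow agreed on in points 1 and 2 can be sketched as follows. This
is Python, not the actual perl sec_filter, and it is a sketch under
assumptions: the mode names, the function names and the line format
written to sec are hypothetical. Only the exit codes 5 and 6 come
from the description above.

```python
import subprocess

IGNORE_OK = 5      # "ignore OK" exit status chosen in this thread
IGNORE_ERROR = 6   # "ignore ERROR" exit status chosen in this thread


def choose_exit(mode, plugin_code, plugin_output):
    """Map the wrapper's mode to the (exit code, output) given to nagios."""
    if mode == "ignore-ok":
        return IGNORE_OK, ""
    if mode == "ignore-error":
        return IGNORE_ERROR, ""
    if mode == "passthrough":
        return plugin_code, plugin_output
    raise ValueError("unknown mode: " + mode)


def run_filter(mode, plugin_argv, sec_input):
    """Run the real plugin, always copy its result to sec, pick the exit."""
    proc = subprocess.run(plugin_argv, stdout=subprocess.PIPE,
                          universal_newlines=True)
    output = proc.stdout.strip()
    # sec always sees the real status and output (line format is assumed).
    sec_input.write("%s %d %s\n" % (plugin_argv[0], proc.returncode, output))
    sec_input.flush()
    return choose_exit(mode, proc.returncode, output)
```

A caller would print the returned output and exit with the returned
code, so nagios sees either the plugin's real result or one of the
two "ignore" codes.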

>3 Nag processes former status codes with no changes (i.e. CRITICAL leads
>to the check being repeated retry_interval and if the state persists to
>Notification), but those with the new code of IGNORE_ERROR are
>recognised as requiring retry at the retry_interval but _no_ other
>processing.

Exactly. However, it looks like I don't have that quite down in my
code yet. I sometimes have services dropping into an unknown state
when sec is suppressing a report, but I am not sure why.

>4 Sec will eventually submit a PROCESS_SERVICE_CHECK_RESULT to the Nag 
>input queue (for the services that have formerly been reported as 
>IGNORE_\w+.

Yes. It will usually submit the OK state, but there is no requirement
for that. Maybe this is where the unknown state is coming from? Is
there a default "freshness" on polled items that results in an unknown
state?
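
For reference, the passive result that sec submits is a single line
in the nagios external command format. A minimal Python sketch of
building that line follows; the function name and the example host
and service are made up, and writing the line to the nagios.cmd pipe
is left out.

```python
import time


def passive_result(host, service, return_code, output, now=None):
    """Format one PROCESS_SERVICE_CHECK_RESULT line for nagios.cmd."""
    # The leading timestamp is the epoch time of the submission.
    stamp = int(time.time() if now is None else now)
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n" % (
        stamp, host, service, return_code, output)
```

For example, passive_result("web1", "HTTP", 0, "OK - cleared by sec")
yields a line that marks the HTTP service on web1 as OK.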

>My remarks are
>
>1 This _may_ be better done in the Nag core. Nag could be equipped with
>  configuration directives for Sec processing so that Nag itself could
>  submit the event to Sec (rather than the plugin sec_filter). This 
>  saves an extra fork.

I agree with this. It could be generalized to allow diversion of the
plugin's report to an arbitrary file/pipe/program in addition to or
instead of sending it to nagios.

Another network monitoring package that I use has no method of
intercepting the events between the time they are generated and the
time they are acted upon by the core.  This leads to a lot of useless
event traffic running through the system.

>2 I am not sure how your proposal relates to the embedded Perl stuff 
>  (where each plugin is called as a function from the Nagios address 
>  space).
>

I currently use a subroutine call in sec_filter to lock the sec input
file so I don't screw up the data. This is probably unnecessary since
the size of the data is small enough that it should be an atomic
write, but I prefer to be safe. However, sec_filter would probably
have to be modified to be safe under embedded perl.
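
The locking idea can be sketched like this. It is Python, not the
perl subroutine used in sec_filter, and the function name is
hypothetical. It takes an exclusive advisory lock and then issues a
single unbuffered write. (For a pipe or FIFO, POSIX already makes
writes of at most PIPE_BUF bytes atomic, which is why the lock is
probably redundant for short lines.)

```python
import fcntl


def locked_append(path, line):
    """Append one line to `path` while holding an exclusive advisory lock."""
    data = line.encode()
    with open(path, "ab", buffering=0) as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)   # wait for other writers
        try:
            handle.write(data)               # unbuffered: one write() call
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)
```

The lock only helps if every writer to the file uses the same
locking scheme, since flock locks are advisory.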

>  This is probably trivial since sec_filter simply becomes another Perl
>  plugin that Nag calls (and sec_filter 'requires' the real Perl plugin so
>  that re-compilation of the real plugin is avoided

Hmm, will that work? Does require keep it in the same name/function
space? Also, sec_filter would have to be rewritten to detect that
it is running a perl script, and require its argument. Currently
sec_filter can run any nagios plugin. I think this is an argument for
putting the diversion mechanism in the core.

>3 I like the bit about making Sec processing optional (depending on the 
>  options specified to sec_filter)

I see two uses for optional processing:

  You may want to use the plugin output to affect correlation of other
  services and not have itself processed/correlated by sec (as you
  mention above). This allows service, cluster and other dependencies
  to be implemented in sec rather than in nagios.

  Triggering Nag's event handlers (especially in soft state). While
  you can run commands from Sec as well, the soft/hard state and
  number of calls is handled better from Nagios. These event handlers
  can be written to provide additional info to sec. It will result in
  flapping (flipping) service states as nagios and sec disagree about
  the state of the service, but with properly set retries, and
  nagios's soft and hard states it may be useful.

>For me, I am quite happy with Nags processing of most services. I can't 
>say that the scenarios you mention are problematic for me. However, I 
>would very much like the option of event correlation when required.
> 
>> I have set it up so that sec itself is a passive nagios service, and
>> automatically sends notifications to nagios, as well as nagios being
>> able to poll the sec service if its data gets stale.
>> 
>> So is anybody interested in my mods (about 30 lines) to nagios to
>> support this, and my plugin?
>
>This needs the comment of the Nagios developer. It sounds attractive to 
>me however.

I haven't seen any signs of interest from the Nagios developer(s). I'm
not even sure if they are interested in/know of this patch. As I said,
I still have a few issues to work out, and I think the developer(s)
could do a cleaner implementation of my patch to add the ignore OK and
ignore ERROR functionality. Obviously, implementing this in the core
would be something that the developers would need to be involved in,
since it is a larger job than I have done.

>I am sorry if these remarks are stupid or based on misunderstanding. I 
>think I would need to see the mods for a better (marginally) response.
>
>It may simply be worth posting them to Nagios-Devel. AFAIK this is not 
>on the Nag road map so it simply may be a golden opportunity for a big 
>benefit.

Your remarks are correct. I'll try to pull the patches and things
together in the next couple of weeks. It's still got that annoying
unknown state issue, but it doesn't look like I will be able to do
more work on it.

>Finally, you have identified a good area for future development. Root 
>cause analysis and event correlation is one area that commercial 
>products can claim superiority. 

The funny part is that sec was written as a lower cost alternative to
HPOV's correlation engine. I have had at least one report that it is
easier to use than HPOV's commercial tool.

				-- rouilj
John Rouillard
===========================================================================
My employers don't acknowledge my existence much less my opinions.

