Another new nagios based project - enhanced correlation for nagios
John P. Rouillard
rouilj+nagiosdev at cs.umb.edu
Fri Dec 22 15:59:16 CET 2006
In message <200612201251.05730.p.millar at physics.gla.ac.uk>,
Paul Millar writes:
>I've been lurking on this list for a few weeks and would like to introduce
>myself and the project I'm working on ("MonAMI").
>
>I've been working on a project to implementing a "universal" sensor.
My project isn't nearly as ambitious as Paul's, but it scratches my
itch.
If you have dealt with HP OpenView/NNM you know that Nagios can
monitor everything NNM can, plus more (and at a much lower price
8-)). However, Nagios still has some issues in the correlation/root
cause analysis area.
To improve this capability, I have implemented a patch and event
broker module for Nagios 2.x. It is much cleaner and faster than my
original set of patches to do this for Nagios 1.0. My patch and module
let you intercept the results from an active event and pass them
through an external correlation engine (I use SEC - the simple event
correlator) before Nagios acts on the results (sort of).
I said "sort of" above because Nagios does get the results of the
active event, but the NEB plugin can modify the state information
associated with that event so that Nagios doesn't act on the results
until the correlation engine has had a chance to look at them.
I think this breaks some new ground for the Nagios event broker (NEB)
in that the module can modify Nagios state/work flow. Until now, NEB
modules have been supposed to be read only.
The slides (and notes) for a Work In Progress talk that I gave at LISA
2006 are available at:
http://www.usenix.org/events/lisa06/wips/rouillard.pdf
or from my homepage:
http://www.cs.umb.edu/~rouilj/#secnagios
I have been running it on a test Nagios installation at my employer,
Renesys Corporation, who are sponsoring this work. It has been working
as expected for the past couple of months. It is currently scheduled
to go into beta test with three testers in a couple of weeks, and I
hope to have a final release version done by mid March. Just to give
you an idea of what I can do with this, or want to do with it:
  - Wait for 4 consecutive OK polls before clearing the service
    (like a max_check_attempts for the OK state). Inspired by the
    rearm threshold in HPOV NNM.

  - Analyze the output of the plugin, or the severity of the problem,
    and change:
      - polling intervals on a per problem (not per service) basis
      - max_check_attempts on a per problem (not per service) basis

  - Correlate using historical information. E.g., service A went
    critical 5 minutes ago but is now clear. However, service B is
    now critical because of the service A failure. The patch allows
    you to write external correlation rules to detect this issue and
    modify the service B state accordingly.

  - Use different thresholds for a single service. E.g., between 7AM
    and 6PM allow only one process to run, while outside that range
    allow two processes to run.
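
To make the first item concrete, here is a minimal sketch of the
"rearm" idea in Python. It is not the code in ext_corr.c or an SEC
rule; the class name RearmFilter and its fields are made up for
illustration, and it ignores soft/hard states and max_check_attempts
on the problem side. It only shows the counting logic: a problem is
reported at once, but the service is not cleared until a run of
consecutive OK polls has been seen.

```python
from dataclasses import dataclass

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL = 0, 1, 2


@dataclass
class RearmFilter:
    """Hypothetical helper: hold a problem state until `rearm` OK polls in a row."""
    rearm: int = 4        # consecutive OK polls required before clearing
    reported: int = OK    # state that would be passed on to Nagios
    ok_streak: int = 0    # OK polls seen in a row while a problem is held

    def feed(self, polled: int) -> int:
        """Take one poll result and return the state to report."""
        if polled != OK:
            # Any non-OK poll is reported immediately and resets the streak.
            self.reported = polled
            self.ok_streak = 0
        elif self.reported != OK:
            # OK poll while a problem is held: count it, clear only at the threshold.
            self.ok_streak += 1
            if self.ok_streak >= self.rearm:
                self.reported = OK
                self.ok_streak = 0
        return self.reported
```

Feeding it CRITICAL followed by four OK polls keeps the reported state
at CRITICAL until the fourth OK arrives; a non-OK poll in the middle
restarts the count.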
More uses are discussed in the slides mentioned above.
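
The historical correlation case (service A failed recently, service B
is critical now) can be sketched the same way. Again this is only an
illustration in Python, not an SEC rule and not part of the patch. The
names HistoryCorrelator, parent, child and window are hypothetical,
and downgrading B to WARNING is an assumed policy; what "modify the
service B state accordingly" should mean is up to whoever writes the
rules.

```python
from dataclasses import dataclass
from typing import Optional

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL = 0, 1, 2


@dataclass
class HistoryCorrelator:
    """Hypothetical helper: remember when `parent` was last critical."""
    parent: str                     # service whose failure explains the other
    child: str                      # service affected by the parent failure
    window: float = 600.0           # seconds during which the parent failure counts
    last_parent_critical: Optional[float] = None

    def feed(self, service: str, state: int, now: float) -> int:
        """Take one poll result and return the state to report for it."""
        if service == self.parent and state == CRITICAL:
            self.last_parent_critical = now
        if (service == self.child and state == CRITICAL
                and self.last_parent_critical is not None
                and now - self.last_parent_critical <= self.window):
            # Assumed policy: a recent parent failure explains this one,
            # so report it at a lower severity.
            return WARNING
        return state
```

With a 600 second window, a critical result for B five minutes after A
went critical is reported as WARNING, even if A has cleared in the
meantime; once the window has passed, B is reported as CRITICAL again.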
There are three components associated with my work:
1) Patch to nagios to add a new NEB callback type and the
associated infrastructure.
2) The file for the Nagios Event Broker module ext_corr.c.
3) Patch to nagios to add two new attributes to the
"service" object to control the NEB plugin's actions.
I will be releasing all of them under the GPL and would like to get at
minimum the first patch if not all three components integrated into
the nagios core.
-- rouilj
John Rouillard
===========================================================================
My employers don't acknowledge my existence much less my opinions.