Notification configuration (Was RFC/RFP: Service parents)

Andreas Ericsson ae at op5.se
Wed May 18 16:23:25 CEST 2011
On 05/18/2011 02:12 PM, Max Schubert wrote:
> Andreas,
> 
> On Tue, May 17, 2011 at 7:57 AM, Andreas Ericsson<ae at op5.se>  wrote:
>>> Any plans to detach notification attributes from service / host
>>> definitions in 4.x and make them their own top-level configuration
>>> class like escalations to make it easier to scale notification
>>> definitions for large projects?
>>>
>>
>> Not really. What would such an object look like? How would it add
>> additional benefit compared to using templates for hosts and services?
>> I think if I could just see some sort of example definition of it I'd
>> get an inkling of why some seem to think it's such a great idea. Right
>> now, I see no additional benefit to it.
> 
> It would look just like an escalation.

So why not just use normal escalations then? The normal case is not
the global superlarge company, but a team of admins all sharing
responsibility for a limited number of hosts.

>  What doesn't work well for
> large configurations with notification policies being stuck into host
> and service objects is this scenario (which is the one we are in at
> work by design):
> * Multiple configuration editors who own various parts of the Nagios
> configuration tree - in our case this used to be one big tree, now we
> have set up separate trees for separate projects - we have about 20-30
> people who can edit their project-specific configurations.

Again not the normal case.

> * A set of services that are global in nature - service ->  hostgroup
> ->  host -  baseline monitoring required by all projects using
> standards established by multiple organizations in our company - for
> our example, base host monitoring with an SNMP agent (6 services
> across every host) - we have other global services as well and a core
> team who develop, maintain and augment both our distributed Nagios
> software and these global services and configurations
> * A set of services that are specific to each project using our
> distributed variant of Nagios - managed by subject matter experts on
> each team.
> 



> With this scenario, how do we let each group that is responsible for
> hosts that have these global services on them create individually
> tailored notification policies since there is one notification policy
> per service?
> * We configure our base service and host to 'notify' on every state
> change using the command name do_nothing
> * We created a custom patch so that when the string 'do_nothing' is
> seen in the command name this state change only increments the
> notification count - it does not trigger any external command to run

Good example of "making the unusual possible". Would it suffice to
add an internal command in Nagios so that some magic marker, such as
':' (without the quotes) causes no command to be run? The nifty part
of using the common colon as a magic thing for this is that it's sort
of backwards compatible, as it's been a builtin version of "/bin/true"
in shells since forever.
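
A minimal sketch of that idea, in Python rather than the C of the Nagios
core: the dispatcher treats a command line of ':' as a builtin no-op, so
the notification counter still advances but nothing is executed. The names
NOOP_MARKER, Notifier and send are hypothetical and do not correspond to
anything in Nagios itself.

```python
import shlex
import subprocess

NOOP_MARKER = ":"  # hypothetical magic marker, mirroring the shell builtin


class Notifier:
    """Toy notification dispatcher; not the Nagios implementation."""

    def __init__(self, runner=subprocess.run):
        self.runner = runner          # injectable so the sketch is testable
        self.notification_count = 0   # incremented on every state change

    def send(self, command_line):
        """Count the notification; run the command unless it is the no-op."""
        self.notification_count += 1
        if command_line.strip() == NOOP_MARKER:
            return False              # nothing executed
        self.runner(shlex.split(command_line), check=False)
        return True
```

With this shape, a base service could keep "notifying" on every state
change while escalations carry the real commands.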

> * We created a patch (partial - no serialization to disk) for
> escalation logic that tracks in memory when a fault escalation was
> sent so that OK escalations are only sent in response to something
> that was in a fault state.  We are working on completing this patch so
> that across restarts the state is saved.

Nice!

I'd implement this as an external list of contacts that have been
notified of the problem state and therefore should be notified of
the recovery. Make the list accessible through a hash table with
the object name as the key and just walk the (sorted) list of
contacts to be notified when the problem goes away and you'll have
the complete list of contacts to notify. Unfortunately, adding
additional pointers in the object is a no-go due to ABI compatibility.

I'd happily accept such a patch in a heartbeat, as it'd remove a
bit of complexity in the current code without altering or removing
any APIs that broker modules might use.
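
Roughly what I have in mind, sketched in Python for brevity (the real
thing would be C inside the core). The class and method names are made up
for illustration, and the "host;service" key format is only an assumption:

```python
import bisect
from collections import defaultdict


class NotifiedContacts:
    """Tracks, per object name, which contacts were told about a problem."""

    def __init__(self):
        # object name -> sorted list of contact names, no duplicates
        self._by_object = defaultdict(list)

    def record_problem(self, object_name, contact):
        """Remember that this contact got a problem notification."""
        contacts = self._by_object[object_name]
        pos = bisect.bisect_left(contacts, contact)
        if pos == len(contacts) or contacts[pos] != contact:
            contacts.insert(pos, contact)

    def pop_recovery_recipients(self, object_name):
        """Return the contacts to notify of recovery and forget them."""
        return self._by_object.pop(object_name, [])
```

Because the table lives outside the host and service structures, nothing
in the object layout changes. Persisting it across restarts is not covered
by this sketch.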


> * We have all groups use escalations to define their notification
> policies - the service and host notification commands then trigger our
> distributed pollers to send escalation requests to a network-based
> notification service we have that then lets the notification requests
> trigger email, SMS, SNMP traps, etc without having to re-configure
> Nagios for every notification transport / method change.
> 
> Yeah, it is very ugly, and why?  Because 1 notification policy per
> service, that doesn't scale well when taking advantage of service ->
> hostgroup ->  host mappings, which is a critical pattern to use when
> scaling a configuration.
> 

In your case, I'd probably implement the notification logic outside
of Nagios. That would give you all the flexibility you need, and it
certainly seems like you have the manpower and experience to hack up
such a NagINot (Nagios Intelligent Notifications) addon. I could
imagine it being extremely useful as well, in particular if it uses
either an external daemon or an sqlite database so it's easy to use.
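
As a rough illustration of the sqlite variant, assuming a single table
that maps hostgroup and service to contact and transport. The schema and
the function names are invented for this sketch and are not part of any
existing addon:

```python
import sqlite3


def open_policy_db(path=":memory:"):
    """Create the (hypothetical) policy table if it does not exist."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS policy ("
        " hostgroup TEXT NOT NULL,"
        " service   TEXT NOT NULL,"   # '*' matches any service
        " contact   TEXT NOT NULL,"
        " transport TEXT NOT NULL)"   # e.g. email, sms, snmptrap
    )
    return db


def add_policy(db, hostgroup, service, contact, transport):
    """Store one policy row."""
    db.execute("INSERT INTO policy VALUES (?, ?, ?, ?)",
               (hostgroup, service, contact, transport))


def recipients(db, hostgroup, service):
    """Return sorted (contact, transport) pairs for one notification."""
    rows = db.execute(
        "SELECT DISTINCT contact, transport FROM policy"
        " WHERE hostgroup = ? AND service IN (?, '*')"
        " ORDER BY contact, transport",
        (hostgroup, service),
    )
    return rows.fetchall()
```

Each project team could then edit its own rows without touching the
Nagios object configuration at all.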

> We have over 9000 hosts being monitored by our distributed framework
> (and growing) with around 30 configuration editors and 120+ users.
> Our distributed framework was centralized and "one project for all"
> but now is a cluster of distributed setups, one distributed setup
> per project, which is scaling nicely.  Our largest distributed
> installations have 3900 and 5100 hosts in them respectively - we have
> 4 other distributed instances that are just getting ramped up and only
> have a few dozen hosts apiece at this point.
> 

A "bit" larger than the average network then ;)

> So while this is ugly, it works!  All editors can define escalation
> objects that take into account both their individual needs for global
> service notifications as well as any project-specific notifications -
> and by putting project-specific hosts in project-specific host groups,
> for most groups, two escalation policy definitions are all that are
> needed per project - one for hosts, one for services.
> 
> If all notifications were just done through an escalation-like
> configuration object, life for a big project would be much easier.

Yes, but big projects are not the norm, and big projects usually can
afford to get the know-how of how to work around the glitches. Making
the learning curve of Nagios less steep would make more companies and
home users start to use it, which in the long term means that someone
will come up with a stellar way of managing notifications.

> 1) Having notifications clearly separated as their own configuration
> template in the Nagios DSL makes it much less confusing for people new
> to Nagios to understand 'where to configure notifications'

I doubt it. Keeping the different types of objects and their various
interdependencies to a minimum would be a far better way of achieving
that goal imo. The most frequent question on nagios-users from (really)
new users seems to be "how do I configure a host to be monitored?".
Making that first step easy so new users see good use in Nagios would
make it a lot more alluring for them to learn more and get more
advanced as time goes by.

> 2) The configuration flexibility of the escalation template makes it
> very easy to work with for a large configuration.
> 

True, but with flexibility comes complexity. I certainly don't want
to reduce flexibility, but removing simplicity to extend the current
flexibility isn't really an option.

> Our global and project-specific scenario and all the notification
> changes we made are also serving us very well as we grow.
> 

That's good to hear.

> Notifications as separate objects would let us back out a number of
> patches and would really simplify our configuration and let our
> pollers run hotter.
> 

What patches are those? I'm sure some of them could be of use in the
Nagios core, or perhaps as a separate notification module. Also, if
you were to explain the needs in terms of use-cases, perhaps I can
devise some way for you to patch the core which is acceptable for the
mainline code so you won't have to maintain a separate shallow fork
of Nagios just because you're a (very) large corporation with all of
a large corporation's special needs.

I'd be happy to help with design decisions and patch review even if
I'm reluctant to commit to any coding myself on something that neither
I nor my employer really needs right now.

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.
