[Nagios-users] [PATCH] reduce notification load; fix $NOTIFICATIONRECIPIENTS$ macro #98
Michael Friedrich
michael.friedrich at univie.ac.at
Tue Nov 22 12:58:53 CET 2011
On 22.11.2011 10:31, Andreas Ericsson wrote:
> On 11/22/2011 01:02 AM, Michael Friedrich wrote:
>> On 21.11.2011 20:56, Andreas Ericsson wrote:
>>> On 11/01/2011 02:05 PM, Michael Friedrich wrote:
>>>> hi,
>>>>
>>>> recently we've been debugging on team icinga in the middle of
>>>> notifications and macros, and while investigating on a users problem,
>>>> we've digged a bit deeper into the notification viability checks,
>>>> resulting in deeper analysis of an Opsview patch to reduce the
>>>> notification load significantly by moving the viability checks from
>>>> the actual notification into the creation of the contacts notified,
>>>> passing only a list of 'qualified' contacts to the actual
>>>> notification logic. the only thing to remark over here is that the
>>>> checks against the valid notification_period now happen sooner, and
>>>> not actually when the notification is sent to each contact.
>>>>
>>>> while implementing that patch into current code (needs some macro
>>>> passing with current code), we did remember nagios bug #98 where the
>>>> $NOTIFICATIONRECEIPIENTS$ macro is demanded to be only populated with
>>>> the actual contacts to be notified, but not all of those assigned to
>>>> the host/service. while this is considered to be a real bug, further
>>>> investigation showed that thanks to the viability checks before
>>>> calling add_notification(), contacts won't be added to that macro as
>>>> the macro logic happens within that function too.
>>>>
>>>> so by applying the attached git patch, you will a. reduce
>>>> notification load and b. fix the $NOTIFICATIONRECEIPIENTS$ macro
>>>> holding all contacts, but not the viable contacts.
>>>>
>>>> since the code remains actually the same on icinga and nagios in this
>>>> stage, the tests can be found at the icinga dev tracker as usual.
>>>> https://dev.icinga.org/issues/1744
>>>> https://dev.icinga.org/issues/2023
>>>>
>>>
>>> I've started looking into this patch right now. It's good to get that
>>> issue (#98) fixed, but I fail to see any noticeable performance
>>> improvement. All contacts potentionally viable for being contacted are
>>> still looked at, but the difference with this patch is that it checks
>>> the viability before shipping it off to add_notification(), which does
>>> fix issue 98 but at the expense of quite a lot of code duplication.
>>
>> normally, all contacts would have been added to the notification_list in
>> memory, even those not actually passing the viability checks. but at
>> this stage of the code, nobody is aware of that so the list gets
>> populated either way by calling add_notification().
>>
>> /* add all individual contacts for this host */
>> ^^^
>>
>> having that notification_list created, this remains fully linked in
>> memory. let's say, you have a bunch of some 1k contacts for that
>> service, and actually the alarm would hit only those in the nonworkhours
>> or workhours timeperiod and only on critical, for the ops team e.g.
>> so by looping through the notification_list, you will encounter *all*
>> contacts for that host, only the duplicates have been removed.
>>
>> /* notify each contact (duplicates have been removed) */
>>
>> then you'll fire up the actual notification with calling
>> notify_contact_of_host - and actually in there, the current core checks
>> the viability for the contact to be notified.
>>
>> you are right, if each contact gets notified 24x7 on all
>> notification_options, the algorithm stays the same. but if you happen to
>> have a lot of different contacts assigned to hosts and services, not
>> getting notified each time a notification is triggered, the overall
>> amount of looping through notification_list will be shorter and save
>> some cpu cycles, and probably on larger systems, a bit more than just
>> some as this means a reduction of the looping for each contact to be
>> checked to be notified on the actual end-of-the-line.
>>
>
> Right, but all contacts are still checked for viability, so the amount
> of looping is reduced once for all those who aren't viable, while the
> number of viability checks (which I presume is the expensive part) will
> remain the same.
from that point of view you are absolutely right. thanks for clarification.
>
>> furthermore, where do you get the idea of code duplication from? the
>> only changes made by this patch is actually moving the viability checks
>> and therefore passing an additional function parameter which makes the
>> diff a bit more bloated than it should be.
>>
>
> The fact that the patch introduces eight locations with identical code
> headed with "check now if contact can be notified".
>
> The proper way to do this would be to introduce create_recipient_list(),
> passing it all the variables it needs to produce a list of recipients
> that have duplicates removed *and* are viable for being contacted. If a
> lot of code still has to be duplicated (as in the patch), more helpers
> in the form of add_recipient_for_service(&mac, srv, cntct) would be
> nifty so the viability check can be moved there without breaking the
> abi for create_notification_list_from_{host,service}().
i see. sounds like some hacky hours ;)
>
> I'm in the middle of a release at $dayjob so I've had to postpone that
> until next week or so. I wouldn't mind if you beat me to it ;)
hehe. you are not alone. i'm in the middle of release ($icinga) and
preparations ($osmc) for next week, so maybe i'd just get onto some devs
in nuremberg and we'll try hack together ;-)
--
DI (FH) Michael Friedrich
Vienna University Computer Center
Universitaetsstrasse 7 A-1010 Vienna, Austria
email: michael.friedrich at univie.ac.at
phone: +43 1 4277 14359
mobile: +43 664 60277 14359
fax: +43 1 4277 14338
web: http://www.univie.ac.at/zid
http://www.aco.net
Lead Icinga Core Developer
http://www.icinga.org
------------------------------------------------------------------------------
All the data continuously generated in your IT infrastructure
contains a definitive record of customers, application performance,
security threats, fraudulent activity, and more. Splunk takes this
data and makes sense of it. IT sense. And common sense.
http://p.sf.net/sfu/splunk-novd2d
More information about the Developers
mailing list