nagios writing escalation rules multiple times to objects.cache
Andreas Ericsson
ae at op5.se
Tue Oct 2 11:06:53 CEST 2012
On 10/01/2012 09:37 PM, Chris Baldwin wrote:
> Short version:
>
> I have an ever-growing Nagios install for monitoring a bunch of linux
> hosts (currently 99 hosts & 2322 services, I plan on adding 115 more
> hosts & 1500+ services). I've noticed something odd with my escalation
> rules - they're being repeated multiple times in my objects.cache file.
> This has started to affect performance for parts of my nagios install, to
> the point where it's painfully slow to use the web interface.
>
> My google-fu is weak today, so I was hoping someone here could point me
> in the right direction.
>
> Longer version:
>
> I have 4 escalation rules:
> -Our helpdesk gets notification #1 for critical issues.
> -Our on-call person gets notifications 1 -> 12 @ 5 minute intervals 24x7.
> -The relevant IT-group leader(s) get notifications 5->12 @ 5 minute
> intervals during on call periods.
> -Our CIO gets notification 12 -> infinity at 60 minute intervals during
> on call periods.
>
> We use puppet to control our environment, and it's amazing for deploying
> servers and adding them to nagios. Once I'm able to bring in other
> aspects of our environment under puppet control (firewall, sudo, yum
> repos), it will be trivial to set up a server from scratch and monitor it.
>
> In order to create a new set of escalation rules, we use a custom class
> on the puppet server and a small bit of code to be executed from the
> client-side (of puppet) to make this work. An example:
>
> # Escalate to the_boss. He, in turn, will call people. I imagine this
> # to be along the lines of Hulk nudging Thor playfully in The
> # Avengers. And sending him flying through a few bulkheads.
> nagios::server::escalations { "Boss-critical":
>   contact_groups        => "the_boss",
>   escalation_options    => "c,r",
>   escalation_period     => "oncall_hours",
>   first_notification    => "12",
>   last_notification     => "0",
>   notification_interval => "60",
>   servicegroup_name     => "Disk,Ping,HTTP,Load,MySQL,Ping,Procs,SSH,Swap,Users,Zombie",
> }
>
> I know this portion works correctly - it's producing my desired result,
> which is 1 file per (set) of escalation rules specified. I have 1722
> escalation cfg files.
>
> The cfg files look something like this:
>
> define serviceescalation{
>   contact_groups         the_boss
>   escalation_options     c,r
>   escalation_period      oncall_hours
>   first_notification     12
>   host_name              my.hostname.xyz
>   last_notification      0
>   notification_interval  60
>   #service_description   Disk,Ping,HTTP,Load,MySQL,Ping,Procs,SSH,Swap,Users,Zombie
>   servicegroup_name      Disk,Ping,HTTP,Load,MySQL,Ping,Procs,SSH,Swap,Users,Zombie
> }
>
So you're assigning it to a host_name along with a set of servicegroups.
I'm not entirely sure that makes 100% sense, since servicegroup members
already have a host_name.
It might work better with Nagios 4, but I'm not sure. If it doesn't,
I'll fix it so 'service_description' is required when 'host_name' or
'hostgroup_name' is set, as I don't see how one makes sense without
the other.
>
> My questions to you guys:
> - Am I crazy to think that it's reading every rule once for *each*
> server?
It seems as if it's reading the rule once for each host mentioned in
host_name and then assigning it to each member of the servicegroups
listed, so if you have identical escalations assigned to the same set
of servicegroups then this is really how you're configuring your
Nagios.
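To make the multiplication concrete, here is a toy Python model of that
reading of the behaviour. It is not Nagios code: the host names, the
servicegroup members and the expand_escalations() helper are all made up
for illustration, and it assumes that host_name does not narrow the
servicegroup expansion at all.

```python
from collections import Counter

def expand_escalations(definitions, servicegroups):
    """Toy model (assumption, not Nagios source): every definition is
    applied to every member of every servicegroup it lists, and its
    host_name does not restrict which members are picked."""
    expanded = []
    for definition in definitions:
        for group in definition["servicegroup_name"]:
            for host, service in servicegroups[group]:
                expanded.append(
                    (host, service,
                     definition["contact_groups"],
                     definition["first_notification"])
                )
    return expanded

# Hypothetical servicegroups: three hosts, each with a Ping and an SSH service.
hosts = ("web1", "web2", "db1")
servicegroups = {
    "Ping": [(h, "Ping") for h in hosts],
    "SSH": [(h, "SSH") for h in hosts],
}

# One generated cfg file per host, identical apart from host_name.
definitions = [
    {
        "host_name": h,
        "servicegroup_name": ["Ping", "SSH"],
        "contact_groups": "the_boss",
        "first_notification": 12,
    }
    for h in hosts
]

expanded = expand_escalations(definitions, servicegroups)
counts = Counter(expanded)

print(len(expanded), "escalations written,", len(counts), "distinct")
for escalation, n in sorted(counts.items()):
    print(n, "x", escalation)
```

In this model each of the 6 services ends up with 3 identical
escalations, one per generated file, so the total grows with
(number of files) x (number of servicegroup members).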
Nagios 4 has provisions to compare slave objects and avoid adding
multiple ones, which would hide a potential bug in your config. It's
currently only used for dependencies, but making it work with
escalations too would be the final fallback to fix this.
However, I urge you to look over your configuration first to make
sure you don't really have multiple escalations assigned to the
same set of servicegroups.
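One way to do that check is to count identical serviceescalation blocks,
either in the generated cfg files or in objects.cache. The Python below
is only a sketch: parse_escalations() and find_duplicates() are
hypothetical helpers, the sample text is invented, and the parser
assumes simple "key value" lines with no '}' inside a value.

```python
import re
from collections import Counter

# Matches the body of each "define serviceescalation { ... }" block.
BLOCK_RE = re.compile(r"define\s+serviceescalation\s*\{(.*?)\}", re.DOTALL)

def parse_escalations(text):
    """Return one hashable, order-independent tuple of (key, value)
    pairs per serviceescalation block found in text."""
    blocks = []
    for body in BLOCK_RE.findall(text):
        attrs = {}
        for line in body.splitlines():
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and commented-out directives
            parts = line.split(None, 1)
            attrs[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
        blocks.append(tuple(sorted(attrs.items())))
    return blocks

def find_duplicates(text):
    """Map each block that occurs more than once to its occurrence count."""
    counts = Counter(parse_escalations(text))
    return {block: n for block, n in counts.items() if n > 1}

# Invented sample: the Ping escalation appears twice, the SSH one once.
sample = """
define serviceescalation {
    host_name            my.hostname.xyz
    service_description  Ping
    first_notification   12
    contact_groups       the_boss
}
define serviceescalation {
    host_name            my.hostname.xyz
    service_description  Ping
    first_notification   12
    contact_groups       the_boss
}
define serviceescalation {
    host_name            my.hostname.xyz
    service_description  SSH
    first_notification   12
    contact_groups       the_boss
}
"""

duplicates = find_duplicates(sample)
for block, n in duplicates.items():
    print(n, "copies of", dict(block))
```

To run it against a real install, read the file contents into a string
and pass that to find_duplicates(); the location of objects.cache
depends on your installation.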
>
> I tried using the precache, it didn't help. Both files were created by
> my nagios install.
That's not surprising, as precaching and caching use the exact same
code.
--
Andreas Ericsson andreas.ericsson at op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231
Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.