nagios blocking on notifications?
Mike Lindsey
mike-nagios at 5dninja.net
Thu Jan 14 22:56:38 CET 2010
I've got a high-volume site. Everything keeps up reasonably well unless
there are a good number of state changes. Once services start changing
state and notifications start going out, Nagios falls behind.
I did some digging in the logs, and it looks like while a batch of
notifications is being sent out, Nagios is rate limiting them to about
one every five seconds. Also, from the first notification for a service
to the last notification for that service, nothing else is written to
the logs. Since a typical notification goes out to 15+ people, that's
over a minute with no service check handling.
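The arithmetic behind that estimate, as a small Python sketch. It assumes the notifications really are sent strictly one after another, which is what the logs suggest but not something I've confirmed; the function name is just for illustration.

```python
def notification_stall_seconds(recipients: int, seconds_per_notification: float) -> float:
    """Total wall-clock time if every notification is sent serially."""
    return recipients * seconds_per_notification


# Figures from the logs: 15 contacts, roughly 5 seconds per notification.
stall = notification_stall_seconds(recipients=15, seconds_per_notification=5.0)
print(f"{stall:.0f} seconds")  # 75 seconds
```

At 6 seconds per notification the same 15 contacts would take 90 seconds.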
Is there something going on under the hood that I'm not aware of? For
example, is Nagios just not writing to the log while still handling
passive service checks, with something else causing my latency?
Is that delay configurable? I don't see anything in the docs for that.
I've even set my notification script to just call and background a
secondary script, to rule out a delay in the notification script
itself, but that seemed to make no difference. Should I be forking the
notification script instead?
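For comparison, here is a minimal sketch of a fully detached launch, assuming a POSIX host with Python 3 available. In general, a backgrounded child that inherits the caller's stdout or stderr can keep a parent that reads the command's output waiting until the child exits, so this version closes all three standard streams and starts the child in its own session. The name `spawn_detached` and the script path in the usage comment are hypothetical, and whether this changes Nagios's behavior here is exactly the open question.

```python
import subprocess
import sys


def spawn_detached(argv):
    """Start argv without waiting for it and without sharing stdio.

    The child gets its own session (setsid) and /dev/null for
    stdin/stdout/stderr, so the caller can return immediately.
    """
    return subprocess.Popen(
        argv,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # POSIX only: detach from the caller's session
        close_fds=True,
    )


# Hypothetical usage from a wrapper that Nagios calls as the notification command:
#   spawn_detached(["/path/to/real_notify_script", "arg1", "arg2"])
```

The wrapper then exits right away while the real notifier keeps running on its own.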
Here's a log snippet:
[1263505735] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;<redacted>;System Check;0;OK load mem ntp swap cfengine disk|
[1263505735] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;<redacted>;System Check;0;OK load mem ntp swap cfengine disk|
[1263505735] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;<redacted>;System Check;1;WARNING [swap utilization 25%] [/data/ at 77% (inodes 0%)]|
[1263505735] PASSIVE SERVICE CHECK: <redacted>;check_mtime-redlist.txt;0;OK - redlist.txt 102 seconds old
[1263505735] PASSIVE SERVICE CHECK: <redacted>;pre_queuedepth;2;CRITICAL - <redacted> pre_queuedepth status: 2159 > 500
<There's close to 50 line entries with that time stamp>
[1263505735] SERVICE NOTIFICATION: <redacted>;<redacted>;pre_queuedepth;CRITICAL;notify-by-email;CRITICAL - <redacted> pre_queuedepth status: 2159 500
[1263505741] SERVICE NOTIFICATION: <redacted>;<redacted>;pre_queuedepth;CRITICAL;notify-by-email;CRITICAL - <redacted> pre_queuedepth status: 2159 500
The SERVICE NOTIFICATION entries keep rolling in every 5-6 seconds for
the next minute or more, then it goes back to its usual happy speed.
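A quick way to measure those gaps, sketched in Python. It assumes each entry sits on a single line in the actual log file, in the "[epoch] TYPE: ..." form shown above; the function name is made up for illustration, and the sample lines are the two notification entries from the snippet.

```python
import re

# "[<epoch seconds>] <ENTRY TYPE>: ..." as seen in the snippet above.
LINE_RE = re.compile(r"^\[(\d+)\] ([A-Z ]+):")


def notification_gaps(lines):
    """Return the seconds between consecutive SERVICE NOTIFICATION entries."""
    stamps = []
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group(2) == "SERVICE NOTIFICATION":
            stamps.append(int(match.group(1)))
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]


sample = [
    "[1263505735] SERVICE NOTIFICATION: <redacted>;<redacted>;pre_queuedepth;"
    "CRITICAL;notify-by-email;CRITICAL - <redacted> pre_queuedepth status: 2159 500",
    "[1263505741] SERVICE NOTIFICATION: <redacted>;<redacted>;pre_queuedepth;"
    "CRITICAL;notify-by-email;CRITICAL - <redacted> pre_queuedepth status: 2159 500",
]
print(notification_gaps(sample))  # [6]
```

Run over the full log, this would show whether the 5-6 second spacing holds for every notification in a batch.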
Is this an artifact of the way it logs, or is the whole system choking
while it sends email? I've searched the list archives and not found
anything on this.
--
Mike Lindsey