heads-up: trap avalanche depletes swap and leads to killing Nag, Apache, named, ...

Tom DE BLENDE Tom.DeBlende at dhl.com
Thu May 15 09:33:57 CEST 2003


Dear Stanley,

Thanks for this interesting read. It should be of value to all of us, in the
sense that it reminds us that "no news" isn't always "good news".

Was flap detection enabled for that passive service? Shouldn't flap 
detection prevent that notification flood?
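
If it wasn't, something along these lines ought to at least throttle the
notifications once Nagios flags the service as flapping (the threshold values
below are only examples, not a recommendation):

    # nagios.cfg
    enable_flap_detection=1
    low_service_flap_threshold=5.0
    high_service_flap_threshold=20.0

    # in the definition of the trap-fed service
    define service{
            host_name               ServerIron
            service_description     SLB castor port reachability trap
            flap_detection_enabled  1
            # check_command, contacts etc. as you already have them
            }

That only suppresses the notifications, though; the passive results (and the
trap handler runs behind them) still happen for every trap.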

Kind regards,
Tom


Stanley Hopcroft wrote:
> Dear Ladies and Gentlemen,
> 
> This site's Nag shares a host with snmptrapd, bind, apache and the usual
> suspects.
> 
> Nag is an ePN that can use up to half of the host's 256 MB of RAM (it is
> usually cycled each month).
> 
> This evening a load balancer fired off ~ 120 traps in 20 minutes after
> an iPlanet directory server (apparently) started 'running out of file
> descriptors' and therefore repeatedly binding to and unbinding from the
> listening socket.
> 
> tsitc> tail -800 nagios.log | grep -i process_ser | ./ns_log_localtime |
> head
> Wed May 14 19:13:09 EXTERNAL
> COMMAND: PROCESS_SERVICE_CHECK_RESULT;ServerIron;SLB castor port
> reachability trap;2;Failed. SLB cannot reach port 389 on real server
> (server failure) castor (10.0.100.11).
>  ...
> Wed May 14 19:32:29 EXTERNAL
> COMMAND: PROCESS_SERVICE_CHECK_RESULT;ServerIron;SLB castor port
> reachability trap;0;Ok. SLB can reach port 389 on real server castor
> (10.0.100.11).
> Wed May 14 19:33:09 EXTERNAL
> COMMAND: PROCESS_SERVICE_CHECK_RESULT;ServerIron;SLB castor port
> reachability trap;2;Failed. SLB cannot reach port 389 on real server
> (server failure) castor (10.0.100.11).
> Wed May 14 19:33:09 EXTERNAL
> COMMAND: PROCESS_SERVICE_CHECK_RESULT;ServerIron;SLB castor port
> reachability trap;0;Ok. SLB can reach port 389 on real server castor
> (10.0.100.11).
> Wed May 14 19:33:09 EXTERNAL
> COMMAND: PROCESS_SERVICE_CHECK_RESULT;ServerIron;SLB castor port
> reachability trap;2;Failed. SLB cannot reach port 389 on real server
> (server failure) castor (10.0.100.11).
> Wed May 14 19:33:13 EXTERNAL
> COMMAND: PROCESS_SERVICE_CHECK_RESULT;ServerIron;SLB castor port
> reachability trap;0;Ok. SLB can reach port 389 on real server castor
> (10.0.100.11).
> tsitc> 
> 
> snmptrapd is configured to run a /bin/sh script that interprets the trap
> and injects the process_service_check_result command into the command
> queue.
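
For list members who have not wired this up before, the mechanics go roughly
as below. This is only a sketch, not the actual script: the snmptrapd.conf
OID, the paths and the string matching are guesses.

    # snmptrapd.conf -- hand matching traps to a shell script
    traphandle .1.3.6.1.4.1.1991 /usr/local/libexec/nagios/slb_trap.sh

    #!/bin/sh
    # slb_trap.sh: turn an SLB trap into a Nagios passive check result.
    # snmptrapd writes the agent hostname, its address and then one
    # "OID value" pair per line to the handler's stdin.
    CMDFILE=/usr/local/nagios/var/rw/nagios.cmd

    read host
    read addr
    TEXT=`cat | tr '\n' ' '`        # flatten the varbinds into one status line

    case "$TEXT" in
        *"cannot reach"*) STATE=2 ;;   # CRITICAL
        *)                STATE=0 ;;   # OK
    esac

    NOW=`date +%s`
    echo "[$NOW] PROCESS_SERVICE_CHECK_RESULT;ServerIron;SLB castor port reachability trap;$STATE;$TEXT" >> $CMDFILE
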
> 
> On this occasion, however, the swap apparently became overcommitted, because
> 
> May 14 19:30:18 tsitc /kernel: swap_pager: out of swap space
> May 14 19:30:18 tsitc /kernel: swap_pager_getswapspace: failed
> May 14 19:30:18 tsitc /kernel: pid 143 (httpd), uid 0, was killed: out
> of swap space
> May 14 19:30:18 tsitc /kernel: pid 58178 (httpd), uid 80, was
> killed: out of swap space
> May 14 19:32:12 tsitc /kernel: swap_pager_getswapspace: failed
> May 14 19:32:41 tsitc /kernel: pid 81284 (nagios), uid 1000, was
> killed: out of swap space
> May 14 19:33:09 tsitc /kernel: swap_pager_getswapspace: failed
> May 14 19:33:11 tsitc last message repeated 112 times
> May 14 19:33:11 tsitc /kernel: pid 78804 (nagios), uid 1000, was
> killed: out of swap space
> May 14 19:33:11 tsitc last message repeated 2 times
> May 14 19:33:13 tsitc /kernel: pid 78074 (nagios), uid 1000, was
> killed: out of swap space
> May 14 19:33:15 tsitc /kernel: pid 54997 (nagios), uid 1000, was
> killed: out of swap space
> May 14 19:33:15 tsitc /kernel: pid 91002 (nagios), uid 1000, was
> killed: out of swap space
> May 14 19:42:12 tsitc /kernel: pid 75 (named), uid 53, was killed: out
> of swap space
> 
> and I eventually realised that things were strangely quiet.
> 
> Part of the collateral damage included 2 or 3 copies of a shell listening on
> port 162. Perhaps these were forked copies of snmptrapd on which the execve
> had not yet completed. These processes had to be killed manually before
> snmptrapd could be restarted (and bind to that port).
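
On FreeBSD something like the following turns up the stragglers so that
snmptrapd can rebind (the PID being whatever sockstat reports):

    sockstat | grep ':162'    # show which PIDs still hold udp/162 open
    kill <pid>                # kill them, then restart snmptrapd
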
> 
> Unfortunately, while I was using the host at the time, I failed to
> notice the impact until I became aware of 120 Perl processes waiting on
> the SMS lockfile.
> 
> Obviously these were the cause of the memory overcommit: even though they
> were all asleep, each still occupied about 5 MB, so 120 of them tied up
> roughly 600 MB, enough to exhaust the host's 256 MB of RAM and its swap.
> 
> There seems to be a need for me to rethink my notification tactics, but
> in any case, a high rate or large number of service criticals is going
> to make life hard for the Nag host.
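
One cheap way to stop that particular pile-up is to make the notification
command give up at once when it cannot take the SMS lock, instead of sleeping
on it. A sketch using FreeBSD's lockf(1); send_sms, the lock path and the
NAGIOS_* environment macros stand in for whatever the real notification
command uses:

    #!/bin/sh
    # notify_by_sms.sh -- hypothetical notification wrapper
    LOCK=/var/run/sms.lock

    # "-t 0" makes lockf fail straight away if another sender already holds
    # the lock, so a storm of CRITICALs cannot stack up dozens of sleeping
    # notifier processes; the dropped page is at least noted in syslog.
    lockf -t 0 $LOCK /usr/local/bin/send_sms \
        "$NAGIOS_CONTACTPAGER" "$NAGIOS_SERVICEOUTPUT" ||
      logger -t nagios-sms "SMS gateway busy, dropped page for $NAGIOS_SERVICEDESC"
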
> 
> 
> Yours sincerely,
> 
> 
> 
> 


