Severe performance issue during major network outage
Aidan Anderson
mail at aidananderson.co.uk
Fri May 11 13:02:10 CEST 2007
Hi,
I have recently set up Nagios 2.8 and am monitoring 1623 hosts and 1946
services. Performance under normal circumstances is fine. Typical
check and latency times are as follows:
Monitoring Performance
Service Check Execution Time: 0.03 / 11.04 / 3.418 sec
Service Check Latency: 0.00 / 1.87 / 0.479 sec
Host Check Execution Time: 0.03 / 10.04 / 0.843 sec
Host Check Latency: 0.00 / 0.00 / 0.000 sec
# Active Host / Service Checks: 1623 / 1946
# Passive Host / Service Checks: 0 / 0
The vast majority of these hosts are spread over 320 geographic
locations throughout the UK. These locations are connected to our data
centre via a hardware VPN device with the majority (about 270) using a
private ADSL circuit to facilitate the VPN connection.
Yesterday, we had a major outage caused by the failure of one of the
ADSL central routers at our ISP. This took out a third of our ADSL
sites (roughly 90) for 16 minutes. Each of these sites has about 4
devices monitored by Nagios so in effect about 360 devices (hosts) went
down in an instant.
As you can imagine, we were aware of the problem almost immediately due
to the barrage of phone calls from our clients, but unfortunately Nagios
didn't even remotely reflect the situation. I have used parent/child
relationships to the full, so I was expecting a good portion of the
VPN devices to show as down, with all other devices behind each VPN
device showing as unreachable. This was not the case. It took half
an hour for Nagios to find only 20 of these VPN devices down, and another
half an hour to notice that they were back up again, having only ever
noticed 20 of the 90. During the outage, the service check
latency was increasing exponentially and the performance stats half an
hour after the start of the problem were as follows:
Monitoring Performance
Service Check Execution Time: 0.03 / 11.04 / 3.646 sec
Service Check Latency: 947.84 / 2080.05 / 1467.274 sec
Host Check Execution Time: 0.03 / 10.04 / 0.968 sec
Host Check Latency: 0.00 / 0.00 / 0.000 sec
# Active Host / Service Checks: 1623 / 1946
# Passive Host / Service Checks: 0 / 0
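As a quick check on these figures, the sketch below parses the two
service check latency lines quoted in this message and compares them. It
assumes the three numbers on each line are minimum / maximum / average in
seconds; parse_stats is a helper name made up for this illustration.

```python
# Assumption: each stats line reads "min / max / average" in seconds.
def parse_stats(line):
    """Split 'Label: a / b / c sec' into a (min, max, avg) tuple of floats."""
    values = line.split(":", 1)[1].replace("sec", "").split("/")
    low, high, avg = (float(v) for v in values)
    return low, high, avg

# Lines copied from the two "Monitoring Performance" blocks.
normal = parse_stats("Service Check Latency: 0.00 / 1.87 / 0.479 sec")
outage = parse_stats("Service Check Latency: 947.84 / 2080.05 / 1467.274 sec")

avg_minutes = outage[2] / 60      # average latency during the outage, in minutes
growth = outage[2] / normal[2]    # ratio of outage average to normal average

print(f"average latency during outage: {avg_minutes:.1f} min")
print(f"increase over normal average: about {growth:.0f}x")
```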
As you can see, the average service check latency time has jumped to
1467 seconds (24 mins). On all of these hosts there is only one service
which is a ping (check_ping -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -p 5).
The host check is also a ping (check_ping -H $HOSTADDRESS$ -w 3000.0,80%
-c 5000.0,100% -p 1) but much faster with only 1 ping being sent out.
The normal_check_interval on services is 5 mins with 2
max_check_attempts and a retry_interval of 1. The host also has a
max_check_attempts of 2.
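To put rough numbers on this, here is a back-of-envelope sketch. It
assumes, without my having confirmed it, that checks of the failed hosts
run one after another and that every attempt takes the 10.04 sec maximum
host check execution time shown above. If either assumption is wrong, the
estimate means little.

```python
# Figures taken from this message.
hosts_down = 360             # roughly 90 sites x 4 devices
max_check_attempts = 2       # host max_check_attempts
worst_check_seconds = 10.04  # maximum host check execution time reported

# Assumption (unconfirmed): failing host checks run one at a time,
# and every attempt takes the worst-case time.
total_seconds = hosts_down * max_check_attempts * worst_check_seconds

print(f"worst case: {total_seconds:.0f} s, about {total_seconds / 60:.0f} min")
```

The outage itself lasted only 16 minutes, so under these assumptions
Nagios could not have worked through every down host before the sites
recovered.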
A lot of people have mentioned using fping to speed things up but if my
average service latency is only 0.479 seconds in normal circumstances, I
can't see how tweaking this will help in a major outage situation.
I have also read through the section on tweaking performance which seems
to be geared toward protecting the machine Nagios is running on. I want
to do the opposite and give Nagios a lot more work to do. The machine
is dedicated to Nagios and is quite high spec. It's an IBM xSeries 336
with 2 Dual Core processors and 4GB of RAM so it should be able to take
a much bigger hit. I have been monitoring CPU performance with MRTG and
the CPU performance never goes lower than 90% idle. Ironically, during
the problem the machine's idle time jumped to 95%, when I would have
expected it to drop rather than increase.
The only performance tweak I could see that would affect the performance
in this situation is max_concurrent_checks but this is already set to 0.
I am fairly new to Nagios (2 months) so I apologise if I have missed
something obvious but any pointers to a solution to this problem would
be greatly appreciated. I have run a nagios -s (attached below) which
seems to indicate that everything is set up OK. Let me know if you
require any more information from my config that would help diagnose the
problem.
regards,
Aidan
Nagios 2.8
Copyright (c) 1999-2007 Ethan Galstad (http://www.nagios.org)
Last Modified: 03-08-2007
License: GPL
Projected scheduling information for host and service
checks is listed below. This information assumes that
you are going to start running Nagios with your current
config files.
HOST SCHEDULING INFORMATION
---------------------------
Total hosts: 1624
Total scheduled hosts: 0
Host inter-check delay method: SMART
Average host check interval: 0.00 sec
Host inter-check delay: 0.00 sec
Max host check spread: 30 min
First scheduled check: N/A
Last scheduled check: N/A
SERVICE SCHEDULING INFORMATION
-------------------------------
Total services: 1947
Total scheduled services: 1947
Service inter-check delay method: SMART
Average service check interval: 300.00 sec
Inter-check delay: 0.15 sec
Interleave factor method: SMART
Average services per host: 1.20
Service interleave factor: 2
Max service check spread: 30 min
First scheduled check: Fri May 11 11:56:03 2007
Last scheduled check: Fri May 11 12:01:02 2007
CHECK PROCESSING INFORMATION
----------------------------
Service check reaper interval: 10 sec
Max concurrent service checks: Unlimited
PERFORMANCE SUGGESTIONS
-----------------------
I have no suggestions - things look okay.
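The service scheduling figures in this output can be cross-checked with a
short sketch. It assumes the SMART method sets the inter-check delay to
the average check interval divided by the number of scheduled services,
which is my reading of the documentation rather than something taken from
the Nagios source.

```python
from datetime import datetime

average_check_interval = 300.0  # "Average service check interval", in seconds
scheduled_services = 1947       # "Total scheduled services"

# Assumption: SMART delay = average check interval / scheduled services.
inter_check_delay = average_check_interval / scheduled_services
print(f"inter-check delay: {inter_check_delay:.2f} sec")

# Spread between the first and last scheduled checks listed in the output.
first = datetime(2007, 5, 11, 11, 56, 3)
last = datetime(2007, 5, 11, 12, 1, 2)
spread = (last - first).total_seconds()

# With N services there are N - 1 gaps between consecutive checks.
expected = (scheduled_services - 1) * inter_check_delay
print(f"reported spread: {spread:.0f} sec, expected: {expected:.1f} sec")
```

Both results agree with the output above: a delay of about 0.15 sec and
an initial spread of just under 300 sec.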