Nagios 'Out Of Memory' Problems
Armistead, Raffy
rarmistead at datanamicsinc.com
Thu Mar 23 19:23:13 CET 2006
My Nagios server keeps crashing. It prints Out of Memory errors on the
console, after which I lose access to the server: I can still ping the
box, but I cannot SSH or browse to it to see what is going on. This has
been happening more and more often, and it now occurs about every 2-3
days. We keep adding devices to the servers, and the problem has grown
worse as we do. This is how I have it set up.
I have a main Nagios server that is running the latest 2.0 (stable)
Nagios release. It is monitoring about 6800 devices, but it is not
actively checking them. Its main role is to provide a web interface and
to receive passive check results from three other servers, which do the
polling. The main server also sends email notifications when a device
goes down, about 30-40 emails a day. I am using NSCA 2.5 between the
main server and the client Nagios servers. I am monitoring only one
service for each device, which is either TCP or ping depending on the
device. Almost all devices are monitored with TCP (roughly 6000); the
rest are monitored with ping. The devices are spread fairly evenly
across the individual servers, about 2000-2500 each.
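For a sense of scale, the passive result load on the main server can be estimated from the numbers above. This is only a rough sketch: the check interval of 20 comes from the normal_check_interval in the service template quoted later in this message, interval_length is assumed to be the Nagios default of 60 seconds, and the function name is made up for illustration. It ignores retries and freshness checks, so the real rate may be somewhat higher.

```python
# Rough estimate of the passive check results arriving at the main server.
# Assumption: interval_length is the Nagios default of 60 seconds, so
# normal_check_interval 20 means one check per service every 20 minutes.

def passive_results_per_second(services, check_interval_units, interval_length_s=60):
    """Average passive results per second if checks are spread evenly."""
    period_s = check_interval_units * interval_length_s
    return services / period_s

if __name__ == "__main__":
    rate = passive_results_per_second(services=6800, check_interval_units=20)
    # 6800 services / 1200 s
    print(f"{rate:.2f} results/s, {rate * 60:.0f} results/min")
```

Under these assumptions the main server handles roughly 340 results per minute on average, which is a steady but modest stream.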
All the servers are basic desktop computers, Dell Dimension 2400s with
base hardware. The main server was upgraded to 2GB RAM, while the other
servers have 512MB each. They all have Celeron 2.4 GHz processors. The
individual servers are not having out of memory problems, and they are
running the latest 2.0 (stable) release as well. All of them run RedHat
9.0 with all packages installed.
Can someone please help me in resolving this problem? Thanks.
The output of top does not look like the server is running out of
memory. This is the normal output after the server has been running for
a few hours.
57 processes: 54 sleeping, 3 running, 0 zombie, 0 stopped
CPU states: 41.1% user 58.8% system 0.0% nice 0.0% iowait 0.0% idle
Mem: 2063556k av, 285940k used, 1777616k free, 0k shrd, 41056k buff
177644k actv, 51688k in_d, 10892k in_c
Swap: 1044184k av, 0k used, 1044184k free 114208k cached
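To put a number on that snapshot, here is a minimal sketch that pulls the kilobyte fields out of the Mem line and computes the share of RAM in use. The helper name and the regular expression are illustrative assumptions, and they only cover the "<number>k <label>" layout shown above.

```python
import re

def parse_top_kb(line):
    """Return {label: kilobytes} for fields such as '2063556k av' in a top summary line."""
    return {label: int(kb) for kb, label in re.findall(r"(\d+)k\s+(\w+)", line)}

# The Mem line from the top output above
mem = parse_top_kb("Mem: 2063556k av, 285940k used, 1777616k free, 0k shrd, 41056k buff")
used_pct = 100.0 * mem["used"] / mem["av"]
print(f"{used_pct:.1f}% of RAM in use")
```

This is a single snapshot taken a few hours after a restart, so it says nothing about how memory use grows over the 2-3 days before a crash. Logging the same figures at regular intervals would show that.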
Here is a sample of the configuration for the devices on the main
server:
hosts.cfg
define host {
    name                            generic-host    ; The name of this host template - referenced in other host definitions, used for template recursion/resolution
    notifications_enabled           1               ; Host notifications are enabled
    event_handler_enabled           0               ; Host event handler is disabled
    flap_detection_enabled          1               ; Flap detection is enabled
    process_perf_data               1               ; Process performance data
    retain_status_information       1               ; Retain status information across program restarts
    retain_nonstatus_information    1               ; Retain non-status information across program restarts
    max_check_attempts              10
    notification_interval           720
    notification_period             24x7
    obsess_over_host                0
    notification_options            d,u,r,f
    register                        0               ; DON'T REGISTER THIS DEFINITION - IT'S NOT A REAL HOST, JUST A TEMPLATE!
}
define host {
    use                             generic-host    ; Name of host template to use
    host_name                       DETAH-R1
    alias                           DETAH-R1
    address                         x.x.x.x
    check_command                   check_ping!200,40%!10000,100%
    contact_groups                  device-admins,DETAH-admins,router-admins
}
services.cfg
define service {
    name                            generic-service ; The 'name' of this service template, referenced in other service definitions
    active_checks_enabled           0               ; Active service checks are disabled
    passive_checks_enabled          1               ; Passive service checks are enabled/accepted
    parallelize_check               1               ; Active service checks should be parallelized (disabling this can lead to major performance problems)
    obsess_over_service             0               ; Do not obsess over this service
    check_freshness                 1               ; Default is to NOT check service 'freshness'
    freshness_threshold             1800
    notifications_enabled           1               ; Service notifications are enabled
    event_handler_enabled           0               ; Service event handler is disabled
    flap_detection_enabled          1               ; Flap detection is enabled
    process_perf_data               1               ; Process performance data
    retain_status_information       1               ; Retain status information across program restarts
    retain_nonstatus_information    1               ; Retain non-status information across program restarts
    is_volatile                     0
    check_period                    24x7
    max_check_attempts              6
    normal_check_interval           20
    retry_check_interval            5
    notification_interval           720
    notification_period             24x7
    notification_options            n
    register                        0               ; DON'T REGISTER THIS DEFINITION - IT'S NOT A REAL SERVICE, JUST A TEMPLATE!
}
define service {
    use                             generic-service ; Name of service template to use
    host_name                       DETAH-R1
    service_description             PING
    contact_groups                  device-admins,DETAH-admins,router-admins
    check_command                   check_ping!200,40%!1000,100%
}
Here is a sample config from one of the individual servers.
hosts.cfg
define host {
    name                            generic-host    ; The name of this host template - referenced in other host definitions, used for template recursion/resolution
    notifications_enabled           1               ; Host notifications are enabled
    event_handler_enabled           0               ; Host event handler is disabled
    flap_detection_enabled          1               ; Flap detection is enabled
    process_perf_data               1               ; Process performance data
    retain_status_information       1               ; Retain status information across program restarts
    retain_nonstatus_information    1               ; Retain non-status information across program restarts
    max_check_attempts              10
    notification_interval           720
    notification_period             24x7
    obsess_over_host                0
    notification_options            d,u,r,f
    register                        0               ; DON'T REGISTER THIS DEFINITION - IT'S NOT A REAL HOST, JUST A TEMPLATE!
}
define host {
    use                             generic-host    ; Name of host template to use
    host_name                       DETAH-R1
    alias                           DETAH-R1
    address                         x.x.x.x
    check_command                   check_ping!200,40%!10000,100%
    contact_groups                  device-admins,DETAH-admins,router-admins
}
services.cfg
define service {
    name                            generic-service ; The 'name' of this service template, referenced in other service definitions
    active_checks_enabled           1               ; Active service checks are enabled
    passive_checks_enabled          1               ; Passive service checks are enabled/accepted
    parallelize_check               1               ; Active service checks should be parallelized (disabling this can lead to major performance problems)
    obsess_over_service             1               ; We should obsess over this service (if necessary)
    check_freshness                 1               ; Default is to NOT check service 'freshness'
    freshness_threshold             1800
    notifications_enabled           1               ; Service notifications are enabled
    event_handler_enabled           0               ; Service event handler is disabled
    flap_detection_enabled          1               ; Flap detection is enabled
    process_perf_data               1               ; Process performance data
    retain_status_information       1               ; Retain status information across program restarts
    retain_nonstatus_information    1               ; Retain non-status information across program restarts
    is_volatile                     0
    check_period                    24x7
    max_check_attempts              6
    normal_check_interval           20
    retry_check_interval            5
    notification_interval           720
    notification_period             24x7
    notification_options            n
    register                        0               ; DON'T REGISTER THIS DEFINITION - IT'S NOT A REAL SERVICE, JUST A TEMPLATE!
}
define service {
    use                             generic-service ; Name of service template to use
    host_name                       DETAH-R1
    service_description             PING
    contact_groups                  device-admins,DETAH-admins,router-admins
    check_command                   check_ping!200,40%!1000,100%
}
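With obsess_over_service set to 1 on the individual servers, each service result is handed to an OCSP command that forwards it to the main server through NSCA. That command is not shown in this message, so the following is only a sketch of the tab-separated line that send_nsca reads on standard input for a service result (host, service description, return code, plugin output). The function name and the sample plugin output text are made up for illustration.

```python
def nsca_service_line(host, service, return_code, output):
    """Build one tab-separated service result line in the layout send_nsca reads on stdin."""
    if return_code not in (0, 1, 2, 3):
        raise ValueError("return code must be 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN)")
    # The format is line-oriented and tab-delimited, so collapse any
    # tabs or newlines inside the plugin output into single spaces.
    clean = " ".join(output.split())
    return f"{host}\t{service}\t{return_code}\t{clean}\n"

# Illustrative values only: the host and service names come from the config
# above, the plugin output text is invented.
line = nsca_service_line("DETAH-R1", "PING", 0, "PING OK - Packet loss = 0%")
print(repr(line))
```

The host name and service description in the line have to match the host_name and service_description defined on the main server, otherwise the passive result cannot be matched to a service there.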
Raffy