Problems with nagios
Wheeler, JF (Jonathan)
J.F.Wheeler at rl.ac.uk
Fri Mar 14 11:52:00 CET 2008
In the past I have reported problems with our master server failing with
"Out of memory" errors, caused by all of the server's memory and swap
space being used up. I have largely (but not completely) solved these by
increasing the number of "Command" and "Check result" buffers. However,
I would like some explanation of the following problems (note that I run
1 master and 5 slave servers, shortly to become 6 slaves; the master
server runs the nagios, nsca and ndo2db daemons):
1. When I arrived this morning, there were 27000+ nsca processes waiting
to run. Counting the processes repeatedly showed that their number was
increasing by at least 10 per second.
2. Recently, a restart of the nagios daemon (on the master server) hung
after 27 seconds and never completed.
3. For some restarts of the nagios daemon (for example, after a
configuration change), the command pipe cannot be created because there
is a regular file in its place. Is this regular file created by an nsca
process? Can I stop this from happening?
4. After a reboot of the master server to try to fix problems 1 and 2
above (I had already tried restarting nsca and nagios, and killing many
of the nsca processes), the nagios daemon did not update any of its log
files. See the following outputs, starting with the output from the
command "nagiostats -c /etc/nagios/nagios.cfg":
Nagios Stats 2.10
Copyright (c) 2003-2007 Ethan Galstad (www.nagios.org)
Last Modified: 10-21-2007
License: GPL
CURRENT STATUS DATA
----------------------------------------------------
Status File: /var/log/nagios/tmpfs/status.dat
Status File Age: 0d 0h 56m 56s
Status File Version: 2.10
Program Running Time: 0d 0h 57m 34s
Nagios PID: 3229
Used/High/Total Command Buffers: 0 / 0 / 40960
Used/High/Total Check Result Buffers: 0 / 0 / 61440
Total Services: 18688
Services Checked: 18688
Services Scheduled: 26
Active Service Checks: 4882
Passive Service Checks: 13806
Total Service State Change: 0.000 / 94.540 / 0.082 %
Active Service Latency: 0.207 / 94495564.236 / 19643.884 sec
Active Service Execution Time: 0.116 / 31.104 / 0.612 sec
Active Service State Change: 0.000 / 94.540 / 0.105 %
Active Services Last 1/5/15/60 min: 0 / 0 / 0 / 0
Passive Service State Change: 0.000 / 76.250 / 0.074 %
Passive Services Last 1/5/15/60 min: 0 / 0 / 0 / 0
Services Ok/Warn/Unk/Crit: 17257 / 210 / 174 / 1047
Services Flapping: 0
Services In Downtime: 0
Total Hosts: 907
Hosts Checked: 901
Hosts Scheduled: 0
Active Host Checks: 907
Passive Host Checks: 0
Total Host State Change: 0.000 / 20.000 / 0.162 %
Active Host Latency: 0.000 / 235.096 / 4.491 sec
Active Host Execution Time: 0.000 / 10.127 / 0.358 sec
Active Host State Change: 0.000 / 20.000 / 0.162 %
Active Hosts Last 1/5/15/60 min: 0 / 0 / 0 / 0
Passive Host State Change: 0.000 / 0.000 / 0.000 %
Passive Hosts Last 1/5/15/60 min: 0 / 0 / 0 / 0
Hosts Up/Down/Unreach: 859 / 48 / 0
Hosts Flapping: 0
Hosts In Downtime: 0
Output from command "nagios -s /etc/nagios/nagios.cfg":
Nagios 2.10
Copyright (c) 1999-2007 Ethan Galstad (http://www.nagios.org)
Last Modified: 10-21-2007
License: GPL
Warning: Host 'Dont know 1 on 184' has no services associated with it!
Warning: Host 'Dont know 2 on 184' has no services associated with it!
Warning: Host 'babarams1' has no services associated with it!
Warning: Host 'babarams1-2' has no services associated with it!
Warning: Host 'babarams1-3' has no services associated with it!
Warning: Host 'babarams1-4' has no services associated with it!
Warning: Host 'babarams2' has no services associated with it!
Warning: Host 'babarams2-2' has no services associated with it!
Warning: Host 'babarams2-3' has no services associated with it!
Warning: Host 'babarams2-4' has no services associated with it!
Warning: Host 'c2certdb' has no services associated with it!
Warning: Host 'c2certdlf' has no services associated with it!
Warning: Host 'c2certlsf' has no services associated with it!
Warning: Host 'c2certns' has no services associated with it!
Warning: Host 'c2certstager' has no services associated with it!
Warning: Host 'ctsc18' has no services associated with it!
Warning: Host 'jra1dch01' has no services associated with it!
Warning: Host 'jra1dcp01' has no services associated with it!
Warning: Host 'swt-4400-1' has no services associated with it!
Warning: Host 'swt-5510-1' has no services associated with it!
Warning: Host 'swt-5510-2' has no services associated with it!
Warning: Host 'swt-5510-3' has no services associated with it!
Warning: Host 'swt-5530-0' has no services associated with it!
Warning: Host 'swt-55xx-ads' has no services associated with it!
Warning: Host 'swt001' has no services associated with it!
Warning: Host 'swt002' has no services associated with it!
Warning: Host 'swt003' has no services associated with it!
Warning: Host 'swt004' has no services associated with it!
Warning: Host 'swt005' has no services associated with it!
Warning: Host 'swt006' has no services associated with it!
Warning: Host 'swt007' has no services associated with it!
Warning: Host 'swt008' has no services associated with it!
Warning: Host 'swt010' has no services associated with it!
Warning: Contact 'guyDaytime' is not a member of any contact groups!
Warning: Contact group 'aix-ads-contacts-callout' is not used in any host/service definitions or host/service escalations!
Warning: Contact group 'castor-contacts-build' is not used in any host/service definitions or host/service escalations!
Warning: Contact group 'castor-contacts-preprod' is not used in any host/service definitions or host/service escalations!
Warning: Contact group 'castor-contacts-srmV2' is not used in any host/service definitions or host/service escalations!
Warning: Contact group 'corew' is not used in any host/service definitions or host/service escalations!
Warning: Contact group 'tape-robot-contacts-callout' is not used in any host/service definitions or host/service escalations!
Projected scheduling information for host and service
checks is listed below. This information assumes that
you are going to start running Nagios with your current
config files.
HOST SCHEDULING INFORMATION
---------------------------
Total hosts: 907
Total scheduled hosts: 0
Host inter-check delay method: SMART
Average host check interval: 0.00 sec
Host inter-check delay: 0.00 sec
Max host check spread: 30 min
First scheduled check: N/A
Last scheduled check: N/A
SERVICE SCHEDULING INFORMATION
-------------------------------
Total services: 18688
Total scheduled services: 21
Service inter-check delay method: SMART
Average service check interval: 11742.86 sec
Inter-check delay: 85.71 sec
Interleave factor method: SMART
Average services per host: 20.60
Service interleave factor: 1
Max service check spread: 30 min
First scheduled check: Wed Mar 12 10:10:11 2008
Last scheduled check: Thu Mar 13 04:00:00 2008
CHECK PROCESSING INFORMATION
----------------------------
Service check reaper interval: 4 sec
Max concurrent service checks: Unlimited
PERFORMANCE SUGGESTIONS
-----------------------
I have no suggestions - things look okay.
Output from command "cd /var/log/nagios; ls -ltr . rw tmpfs":
rw:
total 0
prw-rw---- 1 nagios apache 0 Mar 12 09:41 nagios.cmd
.:
total 59264
-rw-rw-r-- 1 nagios nagios 2483 Mar 5 08:39 downtime.log
drwxr-xr-x 2 nagios nagios 12288 Mar 12 00:00 archives
-rw------- 1 nagios nagios 22832729 Mar 12 08:41 retention.dat
-rw-r--r-- 1 nagios nagios 15081485 Mar 12 08:42 objects.cache
drwxr-sr-x 2 nagios apache 4096 Mar 12 08:42 rw
-rw-rw-r-- 1 nagios nagios 96471 Mar 12 08:42 comment.log
drwxrwxrwt 2 root root 60 Mar 12 08:42 tmpfs
-rw-rw-r-- 1 nagios nagios 22564667 Mar 12 08:42 nagios.log
tmpfs/:
total 20864
-rw-r--r-- 1 nagios nagios 21333927 Mar 12 08:42 status.dat
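A minimal sketch of how the growth described in problem 1 could be measured, assuming a Linux-style /proc where the second field of /proc/<pid>/stat is the command name in parentheses. The helper names count_processes and growth_rate are hypothetical; the sketch only counts processes and does not explain why nsca processes accumulate.

```python
import os
import time


def count_processes(name, proc_root="/proc"):
    """Count processes whose command name equals `name`.

    Assumes a Linux-style /proc: /proc/<pid>/stat starts with
    "<pid> (<command name>) <state> ...".
    """
    count = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-PID entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                stat_line = f.read()
        except OSError:
            continue  # process exited between listdir() and open()
        # The command name may contain spaces or parentheses, so take the
        # text between the first "(" and the last ")".
        comm = stat_line[stat_line.find("(") + 1 : stat_line.rfind(")")]
        if comm == name:
            count += 1
    return count


def growth_rate(name, interval=10.0, proc_root="/proc"):
    """Return the change in the process count per second over `interval` seconds."""
    before = count_processes(name, proc_root)
    time.sleep(interval)
    after = count_processes(name, proc_root)
    return (after - before) / interval
```

If the backlog is still growing at the rate described in problem 1, growth_rate("nsca", interval=10.0) run on the master server should report roughly 10 or more per second; a value near zero would suggest the backlog has stopped growing.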
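For problem 3, a quick way to see what currently occupies the command file path is to lstat() it and test for a FIFO. The sketch below assumes the command_file directive in nagios.cfg names the pipe, falling back to /var/log/nagios/rw/nagios.cmd from the listing above; the function names are hypothetical. One possible cause of a regular file appearing there, which is an assumption not confirmed by these outputs, is that some process opened the path for appending while the pipe was absent, since appending to a nonexistent path creates a regular file.

```python
import os
import stat


def command_file_from_cfg(cfg_path):
    """Return the value of the command_file directive in nagios.cfg, or None."""
    with open(cfg_path) as f:
        for line in f:
            line = line.strip()
            if line.startswith("#") or "=" not in line:
                continue  # skip comments and lines without a key=value pair
            key, value = line.split("=", 1)
            if key.strip() == "command_file":
                return value.strip()
    return None


def classify_path(path):
    """Describe what exists at `path`, without following symlinks."""
    try:
        mode = os.lstat(path).st_mode
    except FileNotFoundError:
        return "missing"
    if stat.S_ISFIFO(mode):
        return "fifo"  # the expected state while nagios is running
    if stat.S_ISREG(mode):
        return "regular file"  # would block nagios from creating the pipe
    return "other"


if __name__ == "__main__":
    cfg = "/etc/nagios/nagios.cfg"
    cmd = command_file_from_cfg(cfg) if os.path.exists(cfg) else None
    # Fallback path taken from the "ls -ltr" output in this message.
    cmd = cmd or "/var/log/nagios/rw/nagios.cmd"
    print(cmd + ": " + classify_path(cmd))
```

Running this before a restart would show whether a regular file is already sitting where the pipe should be.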
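For problem 4, the nagiostats output already hints at the symptom: Status File Age (56m 56s) is only 38 seconds less than Program Running Time (57m 34s), which suggests status.dat was written once shortly after startup and not since. The sketch below does that arithmetic and checks a file's modification time against a threshold. The helper names are hypothetical, and the 300-second max_age is an arbitrary assumption; a real threshold would more sensibly be derived from the status_update_interval setting in nagios.cfg.

```python
import os
import re
import time


def parse_duration(text):
    """Convert a nagiostats-style duration such as "0d 0h 56m 56s" to seconds."""
    units = {"d": 86400, "h": 3600, "m": 60, "s": 1}
    return sum(int(n) * units[u] for n, u in re.findall(r"(\d+)([dhms])", text))


def status_file_age(path, now=None):
    """Seconds since `path` was last modified."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(path)


def is_stale(path, max_age, now=None):
    """True if the file has not been rewritten within `max_age` seconds."""
    return status_file_age(path, now) > max_age


if __name__ == "__main__":
    age = parse_duration("0d 0h 56m 56s")      # Status File Age
    running = parse_duration("0d 0h 57m 34s")  # Program Running Time
    print("status.dat last written", running - age, "s after startup")
    path = "/var/log/nagios/tmpfs/status.dat"
    if os.path.exists(path):
        # max_age=300 is an illustrative threshold, not a Nagios default.
        print(path, "stale:", is_stale(path, max_age=300))
```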
Any comments, advice, etc. would be most appreciated, as it is getting
rather frustrating when nagios does not perform reliably.
Jonathan Wheeler
e-Science Centre
Rutherford Appleton Laboratory
_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users