nagios stops to check & orphans
Samuel Bancal
sam.bancal at gmail.com
Mon Jun 8 12:51:47 CEST 2009
Hi,
I'm a new Nagios administrator (since Feb 09).
Until now, everything was fine and working smoothly.
This morning I saw that during the weekend, the Nagios daemon stopped
running checks.
After some research (on the server and on the web), here is what I've found.
Can someone explain what is happening, and how to avoid this problem in the
future?
OS : Ubuntu server 8.04.2 LTS
Versions : nagios-3.0.6 & nagios-plugins-1.4.13
Hardware : virtual machine on VMware Server infrastructure.
NTP is not set up yet (I don't know whether that has a side effect in my
case, since time may be involved in the problem).
We are currently monitoring 12 hosts and 64 services.
What I can see on the web interface (In scheduling Queue) :
Host        Service      Last check           Next check           Type    Active checks
server_xxx               2009-06-07 03:52:35  2009-06-07 09:19:45  Orphan  ENABLED
server_yyy  service_zzz  2009-06-07 03:50:31  2009-06-07 09:19:45  Orphan  ENABLED
All hosts and services except 2 are "orphan"...
Both "last check" and "next check" are from yesterday morning!
On the server:
$ ps auxft | grep nagios\.cfg | grep -v grep
nagios 20578 0.4 72.9 2969592 1505772 ? Ssl Apr30 275:20 /usr/local/nagios/bin/nagios -d /etc/nagios/nagios.cfg
-> Wow ... nagios uses 72.9% of the server's memory!
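Memory growth like this could in principle be flagged before it exhausts the machine. The following is only a minimal sketch of such a guard, not something Nagios provides: the function name `rss_over_limit` and any threshold passed to it are hypothetical, and it assumes a Linux `/proc` filesystem as found on the Ubuntu server described here.

```shell
# Hypothetical helper, not part of Nagios: succeed (exit status 0) when the
# resident memory of process PID exceeds LIMIT_KB.
# Assumes a Linux /proc filesystem.
rss_over_limit() {
    pid=$1
    limit_kb=$2
    # VmRSS in /proc/PID/status is the resident set size in kB.
    # If the file cannot be read, the process does not exist (anymore).
    rss_kb=$(awk '/^VmRSS:/ { print $2 }' "/proc/$pid/status" 2>/dev/null) || return 2
    # No VmRSS line: nothing to compare against.
    [ -n "$rss_kb" ] || return 2
    [ "$rss_kb" -gt "$limit_kb" ]
}
```

Called from cron as, for example, `rss_over_limit 20578 1000000 && echo "nagios RSS above 1000000 kB"`, it would have reported the daemon long before it reached 1505772 kB. The 1000000 kB limit is an arbitrary illustration, not a recommended value.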
$ free
             total       used       free     shared    buffers     cached
Mem:       2062920    1636656     426264          0       4404      24532
-/+ buffers/cache:    1607720     455200
Swap:      1951888    1450744     501144
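To make these figures easier to read, the arithmetic behind them can be redone by hand. The sketch below only reuses the numbers printed by `ps` and `free` above (all in kB); the variable names are chosen for illustration.

```shell
# Figures copied from the ps and free output above (all in kB).
total=2062920; used=1636656; buffers=4404; cached=24532
swap_total=1951888; swap_used=1450744
nagios_rss=1505772

# Memory held by applications: "used" minus buffers and page cache.
# This reproduces the "used" value on the "-/+ buffers/cache" line.
app_used=$(( used - buffers - cached ))
echo "used by applications: ${app_used} kB"

# nagios RSS as a share of total RAM, in tenths of a percent
# (integer division truncates, giving the 72.9 shown by ps).
pct10=$(( nagios_rss * 1000 / total ))
echo "nagios RSS / RAM: $(( pct10 / 10 )).$(( pct10 % 10 ))%"

# Share of application memory resident in nagios, in whole percent.
echo "nagios RSS / application memory: $(( nagios_rss * 100 / app_used ))%"

# Swap occupancy in whole percent.
echo "swap in use: $(( swap_used * 100 / swap_total ))%"
```

The first result is 1607720 kB, which matches the `free` output. The nagios process alone accounts for roughly 93% of the memory used by applications, and about 74% of the swap is in use.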
What about forks?
$ pstree -aclpn
init,1
#snip
├─nagios,20578 -d /etc/nagios/nagios.cfg
│ └─{nagios},20579
#snap
What about the log ?
In /var/nagios/archives/nagios-06-08-2009-00.log
...
thousands of :
[1244325825] Warning: The check of service 'Partition /' on host
'server_xxx' looks like it was orphaned (results never came back). I'm
scheduling an immediate check of the service...
and later, thousands of :
[1244355705] Warning: The check of service 'HTTP' on host 'server_xxx' could
not be performed due to a fork() error: 'Cannot allocate memory'. The check
will be rescheduled.
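Since timing may matter here, it helps to turn the bracketed epoch values into readable dates and to count each kind of warning. The helpers below are a sketch under assumptions: the names `log_time` and `count_warnings` are made up for illustration, `date -d @SECONDS` relies on GNU date (as shipped with Ubuntu), and the patterns match only the two messages quoted above.

```shell
# Convert the leading "[epoch]" of a Nagios log line into a UTC timestamp.
# Relies on GNU date's "-d @SECONDS" syntax.
log_time() {
    epoch=$(printf '%s\n' "$1" | sed -n 's/^\[\([0-9][0-9]*\)].*/\1/p')
    # No leading [epoch] found: report failure.
    [ -n "$epoch" ] || return 1
    date -u -d "@$epoch" '+%Y-%m-%d %H:%M:%S UTC'
}

# Count orphaned-check and fork() failure warnings read from standard input.
count_warnings() {
    awk '
        /looks like it was orphaned/ { orphaned++ }
        /fork\(\) error/             { fork_err++ }
        END { printf "orphaned=%d fork_errors=%d\n", orphaned, fork_err }
    '
}
```

For example, `count_warnings < /var/nagios/archives/nagios-06-08-2009-00.log` summarizes the archive mentioned above. The two sample entries, 1244325825 and 1244355705, correspond to 2009-06-06 22:03:45 UTC and 2009-06-07 06:21:45 UTC, that is 00:03:45 and 08:21:45 CEST on 2009-06-07.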
If I run strace on process 20578, it loops on:
nanosleep({0, 250000000}, NULL) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=1892, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=1892, ...}) = 0
And strace on process 20579 (its thread) loops on:
poll([{fd=5, events=POLLIN}], 1, 500) = 0
Part of the config:
$ egrep 'status_update|reaper|orphan' /etc/nagios/nagios.cfg
status_update_interval=10
check_result_reaper_frequency=10
max_check_result_reaper_time=30
check_for_orphaned_services=1
check_for_orphaned_hosts=1
Thanks for any reply,
Best regards,
Samuel Bancal
--
Samuel Bancal - CH