Nagios Hang?
Mike Koponick
mkoponick at redhawk.info
Wed Feb 15 17:32:35 CET 2006
Marc,
I double-checked the disk space last night, thinking that might be the
issue, but I have plenty of space:
Filesystem  Size  Used  Avail  Use%  Mounted on
/dev/hda3   109G   70G    33G   68%  /
/dev/hda1    99M   28M    66M   30%  /boot
As for the processes, I also thought of that scenario. All were killed
prior to restarting. I'm going to build a version of nagios with
debugging turned on this morning and run it.
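
A minimal sketch of how the "all were killed" check could be scripted, assuming a Linux procps `ps` and that the daemon's command name is exactly `nagios` (both are assumptions, and the function names are made up for illustration). After "service nagios stop" the returned list should be empty; anything left over is a stray daemon.

```python
import subprocess


def find_nagios_pids(ps_lines, name="nagios"):
    """Return the PIDs whose command name equals `name`.

    `ps_lines` is the output of `ps -eo pid,comm` split into lines,
    with the "PID COMMAND" header as the first line.
    """
    pids = []
    for line in ps_lines[1:]:              # skip the header line
        parts = line.split(None, 1)        # -> [pid, command]
        if len(parts) == 2 and parts[1].strip() == name:
            pids.append(int(parts[0]))
    return pids


def running_nagios_pids():
    """Ask ps for the live process list and filter it (hypothetical helper)."""
    out = subprocess.run(
        ["ps", "-eo", "pid,comm"],
        capture_output=True, text=True, check=True,
    ).stdout
    return find_nagios_pids(out.splitlines())
```

Note that a running Nagios forks short-lived children to execute checks, so more than one PID while the daemon is up is not proof of a duplicate by itself. The check is only conclusive right after a stop.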
Thanks!
Mike
Here are a couple of samples of my hosts/services from the sensor:
############################################################################
define host {
host_name Switch-35
alias Switch-35
address 10.xx.xx.xx
hostgroups Company_Switches
max_check_attempts 10
check_interval 1
active_checks_enabled 0
passive_checks_enabled 1
check_period 24x7
obsess_over_host 1
check_freshness 0
event_handler_enabled 1
flap_detection_enabled 1
process_perf_data 0
retain_status_information 1
retain_nonstatus_information 1
contact_groups Support
notification_interval 2
notification_period 24x7
notification_options d,u,r
notifications_enabled 0
register 1
}
############################################################################
############################################################################
define service {
hostgroup_name Company_Switches
service_description check_ping
is_volatile 1
check_command check_ping!150.0,20%!200.0,60%
max_check_attempts 2
normal_check_interval 1
retry_check_interval 1
passive_checks_enabled 0
active_checks_enabled 1
check_period 24x7
parallelize_check 0
obsess_over_service 1
check_freshness 0
event_handler_enabled 0
flap_detection_enabled 1
process_perf_data 1
retain_status_information 1
retain_nonstatus_information 1
contact_groups Support
notification_interval 99
notification_period 24x7
notification_options w,u,c,r,f
notifications_enabled 0
register 1
}
############################################################################
Hosts/Services from the Central Server:
############################################################################
define host {
host_name Switch-35
alias Switch-35
address 10.xx.xx.xx
hostgroups Company_Switches
max_check_attempts 1
check_interval 1
active_checks_enabled 0
passive_checks_enabled 1
check_period 24x7
obsess_over_host 1
check_freshness 0
event_handler_enabled 1
flap_detection_enabled 1
process_perf_data 0
retain_status_information 1
retain_nonstatus_information 1
contact_groups Support
notification_interval 1
notification_period 24x7
notification_options d,u,r
notifications_enabled 1
register 1
}
############################################################################
############################################################################
define service {
hostgroup_name Company_Switches
service_description check_ping
is_volatile 1
check_command check_stale
max_check_attempts 1
normal_check_interval 2
retry_check_interval 1
active_checks_enabled 0
passive_checks_enabled 1
check_period 24x7
parallelize_check 1
obsess_over_service 1
check_freshness 2
freshness_threshold 660
event_handler_enabled 1
low_flap_threshold 0
high_flap_threshold 0
flap_detection_enabled 1
process_perf_data 1
retain_status_information 1
retain_nonstatus_information 1
contact_groups Support
notification_interval 0
notification_period 24x7
notification_options w,u,c,r
notifications_enabled 1
register 1
}
############################################################################
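
For reference, the central server's check_ping service above is passive-only with freshness_threshold 660 and check_stale as its check_command. The rule this relies on can be sketched roughly as below, assuming freshness checking is actually in effect for the service and enabled globally. This is a simplified model, not Nagios's own code: the real daemon also makes allowances such as check latency, and `is_stale` is a made-up name.

```python
import time


def is_stale(last_check, threshold=660, now=None):
    """Return True when the last result is older than `threshold` seconds.

    `last_check` and `now` are Unix timestamps. 660 is the
    freshness_threshold from the service definition above.
    """
    if now is None:
        now = time.time()
    return (now - last_check) > threshold
```

When a result goes stale in this sense, Nagios runs the service's check_command (check_stale here) even though active checks are disabled, so a sensor that stops submitting results through NSCA shows up on the central server after roughly 11 minutes.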
-----Original Message-----
From: nagios-users-admin at lists.sourceforge.net
[mailto:nagios-users-admin at lists.sourceforge.net] On Behalf Of Marc
Powell
Sent: Wednesday, February 15, 2006 8:21 AM
To: Nagios Users
Subject: RE: [Nagios-users] Nagios Hang?
> -----Original Message-----
> From: nagios-users-admin at lists.sourceforge.net [mailto:nagios-users-
> admin at lists.sourceforge.net] On Behalf Of Mike Koponick
> Sent: Wednesday, February 15, 2006 10:10 AM
> To: Nagios Users
> Subject: [Nagios-users] Nagios Hang?
>
>
>
> I'm running Nagios 2.0 (Stable) on Redhat 9.0, in a distributed
> environment. I'm utilizing NSCA for checks and all appears to be
> working properly.
>
>
>
> I'm running into several issues that seemed to have "started all of a
> sudden".
>
>
>
> 1) On my distributed server, I don't see syslog messages any longer,
> with the exception of "INITIAL SERVICE STATE" messages. Syslog is
> working, and in the nagios.cfg file, "nagios.cfg:use_syslog=1" I used
> to see all the check messages, etc. Nothing in the configuration has
> changed to the best of my knowledge.
>
Make sure you haven't run out of disk space. Verify your log_ settings
in nagios.cfg.
>
> 2) Nagios appears to "hang" on the remote sensor. Once I receive
> notifications that network devices are down, I never see a recovery of
> the network devices, even though they are recovered. The work around is
> to restart nagios with "service nagios restart". Sometimes, this takes
> multiple tries.
Could be related to multiple nagios processes as below. One daemon sees
the down and another sees the up. What have you verified so far? I'd
check disk space, use strace to see what the daemon is doing, turn up
logging as much as possible for both nagios and nsca and watch the logs.
> 3) When I have a massive network outage, I receive the appropriate
> alerts but I receive multiple "PROBLEM" notifications. I'm only using
> service checks (I'm only using check_ping currently) and the
> notification_interval set to "0", which according to the documentation
> should limit the amount of messages I'm receiving to "1", unless I'm
> using the service escalations, which I am not at this time. I am not
> receiving multiple notifications for "OK" messages, which is what I
> would expect.
Without seeing any example host and service config information this
sounds very much like you might have multiple nagios daemons running at
the same time. Stop nagios, make sure they're _all_ stopped and restart
nagios.
--
Marc
-------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc. Do you grep through log
files
for problems? Stop! Download the new AJAX search engine that makes
searching your log files as easy as surfing the web. DOWNLOAD SPLUNK!
http://sel.as-us.falkag.net/sel?cmd=k&kid3432&bid#0486&dat1642
_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when
reporting any issue.
::: Messages without supporting info will risk being sent to /dev/null