Caching (?) problem with nagios 2.7
Thomas Schimpke
schimpke.thomas at bhn-services.com
Sat Feb 10 10:52:39 CET 2007
Hello,
For a few days now I have been having trouble with my Nagios setup. The
first indication was trouble sending out host notifications (but that
will be another thread soon). So this morning I decided to check whether
I also have trouble with service notifications.
I took a service that is checked frequently and changed the check
command so that it would fail, producing an error that results in a hard
state. The service definition looks like this:
# SAP Login
# ---------------------------------------------------------------------
#
define service {
    use                     sap_check
    host_name               eulep04
    service_description     SAP Logon
    check_command           check_sap!00
    servicegroups           SAP Logon ERP Prod
    max_check_attempts      2
    normal_check_interval   3
    retry_check_interval    1
    notification_interval   30
    notification_options    c,r
    contact_groups          rz
}
and the template:
define service {
    name                    check_sap
    use                     generic-service
    is_volatile             0
    freshness_threshold     0
    check_freshness         0
    notification_period     24x7
    process_perf_data       0
    register                0
}
and:
define service {
    name                            generic_service
    active_checks_enabled           1
    passive_checks_enabled          1
    parallelize_checks              1
    check_period                    24x7
    obsess_over_service             0
    notifications_enabled           1
    event_handler_enabled           1
    flap_detection_enabled          1
    process_perf_data               1
    retain_status_information       1
    retain_nonstatus_information    1
    register                        0
}
(I typed in the two templates by hand, so any syntax errors may be due
to transcription.) This configuration has worked for a long time, I think
without any problems.
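
A note on the template names as transcribed: the service uses `sap_check` while the template is named `check_sap`, and that template uses `generic-service` while the base template is named `generic_service`. Nagios normally refuses to load a configuration with an unresolved template, so this is most likely the transcription issue mentioned above, but it is cheap to rule out. The following is only a minimal sketch in Python: the embedded `CONFIG` text is trimmed from the definitions above, and the helper names `parse_objects` and `unresolved_templates` are hypothetical, not part of Nagios.

```python
import re

# Definitions as transcribed in the post, trimmed to the directives that
# matter for template resolution.
CONFIG = """
define service {
    use                     sap_check
    host_name               eulep04
    service_description     SAP Logon
}
define service {
    name                    check_sap
    use                     generic-service
    register                0
}
define service {
    name                    generic_service
    register                0
}
"""


def parse_objects(text):
    """Return one dict per 'define <type> { ... }' block (hypothetical helper)."""
    objects = []
    for match in re.finditer(r"define\s+(\w+)\s*\{(.*?)\}", text, re.DOTALL):
        obj = {"_type": match.group(1)}
        for line in match.group(2).splitlines():
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            parts = line.split(None, 1)  # directive name, then its value
            if len(parts) == 2:
                obj[parts[0]] = parts[1].strip()
        objects.append(obj)
    return objects


def unresolved_templates(objects):
    """Return every 'use' target that no object declares via 'name'."""
    declared = {o["name"] for o in objects if "name" in o}
    used = {o["use"] for o in objects if "use" in o}
    return sorted(used - declared)


missing = unresolved_templates(parse_objects(CONFIG))
print(missing)
```

Run against the real object files instead of the embedded text, an empty list would confirm that the mismatches exist only in this message.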
What I did was change the instance number in the check_command from 00
to 10. This check would fail, since we have no SAP system with instance
number 10. After saving my changes I reloaded the Nagios configuration
(/etc/rc.d/init.d/nagios reload). Then I waited. Actually I waited for a
long time, 15 minutes or so. The service stayed in state OK. I saw that
during this time Nagios did not perform *any* checks of this service (I
looked at the last check time in the service overview). I verified that
Nagios had re-read the new configuration successfully by looking at the
service check command under "view configuration". So I forced a service
check via the CGI. That helped...
[1171096494] EXTERNAL COMMAND: SCHEDULE_FORCED_SVC_CHECK;eulep04;SAP Logon;1171096485
[1171096499] SERVICE ALERT: eulep04;SAP Logon;CRITICAL;SOFT;1;SAP System on host xxx.xxx.xxx.xxx (instance 10 ) is down.
[1171096559] SERVICE ALERT: eulep04;SAP Logon;CRITICAL;HARD;2;SAP System on host xxx.xxx.xxx.xxx (instance 10 ) is down.
[1171096559] SERVICE NOTIFICATION: rz_call_home;eulep04;SAP Logon;CRITICAL;service_notify_by_call;SAP System on host xxx.xxx.xxx.xxx (instance 10 ) is down.
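
The bracketed numbers in these log lines are Unix epoch seconds, which makes the timing hard to read at a glance. As a rough sketch (Python, using only the four lines quoted above with the mail wrapping undone; `parse_log` is a hypothetical helper), the timestamps can be converted and the gaps between entries computed like this:

```python
import re
from datetime import datetime, timezone

# The four log entries quoted above, one entry per line.
LOG = """\
[1171096494] EXTERNAL COMMAND: SCHEDULE_FORCED_SVC_CHECK;eulep04;SAP Logon;1171096485
[1171096499] SERVICE ALERT: eulep04;SAP Logon;CRITICAL;SOFT;1;SAP System on host xxx.xxx.xxx.xxx (instance 10 ) is down.
[1171096559] SERVICE ALERT: eulep04;SAP Logon;CRITICAL;HARD;2;SAP System on host xxx.xxx.xxx.xxx (instance 10 ) is down.
[1171096559] SERVICE NOTIFICATION: rz_call_home;eulep04;SAP Logon;CRITICAL;service_notify_by_call;SAP System on host xxx.xxx.xxx.xxx (instance 10 ) is down.
"""

# "[<epoch>] <ENTRY TYPE>: <payload>"
LINE_RE = re.compile(r"^\[(\d+)\] ([^:]+): (.*)$")


def parse_log(text):
    """Return (epoch, entry_type, payload) tuples for each matching line."""
    entries = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            entries.append((int(m.group(1)), m.group(2), m.group(3)))
    return entries


entries = parse_log(LOG)
# Seconds elapsed between each entry and the one before it.
deltas = [later[0] - earlier[0] for earlier, later in zip(entries, entries[1:])]

for (ts, kind, _payload), delta in zip(entries, [0] + deltas):
    stamp = datetime.fromtimestamp(ts, tz=timezone.utc)
    print(f"{stamp:%Y-%m-%d %H:%M:%S} UTC  +{delta:>3}s  {kind}")
```

The SOFT and HARD alerts are 60 seconds apart, which is what `retry_check_interval 1` would give if `interval_length` is at its usual value of 60 seconds (an assumption; the main configuration file is not shown in this message). So once a check is triggered, retry scheduling appears to behave as configured.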
So my service notification worked: I received a call. BUT it only worked
after I forced the check.
So I decided to re-check: I changed the instance number back to 00 and
restarted Nagios:
[1171096695] Caught SIGHUP, restarting...
[1171096695] Nagios 2.7 starting... (PID=14025)
[1171096695] LOG VERSION: 2.0
[1171096696] INITIAL HOST STATE: apps;UP;HARD;1;PING OK - Packet loss
... (many more of the initial host/service states)
Then I waited for about 20 minutes. The service was never checked!
I forced the check and then:
[1171097971] EXTERNAL COMMAND: SCHEDULE_FORCED_SVC_CHECK;eulep04;SAP Logon;1171097965
[1171097976] SERVICE ALERT: eulep04;SAP Logon;OK;HARD;2;SAP System on xxx.xxx.xxx.xxx (instance 00) is up.
[1171097976] SERVICE NOTIFICATION: rz_call_home;eulep04;SAP Logon;OK;service_notify_by_call;SAP System on xxx.xxx.xxx.xxx (instance 00) is up.
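
To put a number on "never checked": the restart is logged at 1171096695 and the forced check at 1171097971. With `normal_check_interval 3`, and assuming `interval_length` is 60 seconds (the usual value, not confirmed here), the arithmetic can be sketched in Python as below; `missed_checks` is a hypothetical helper.

```python
# Seconds per interval unit. ASSUMPTION: the common Nagios setting of 60;
# the main configuration file is not shown in this message.
INTERVAL_LENGTH = 60

# From the service definition in this message.
NORMAL_CHECK_INTERVAL = 3

# Timestamps taken from the log excerpts in this message.
RESTART_TS = 1171096695  # "Caught SIGHUP, restarting..."
FORCED_TS = 1171097971   # second SCHEDULE_FORCED_SVC_CHECK


def missed_checks(start, end, interval_units, interval_length=INTERVAL_LENGTH):
    """Number of full check periods that fit between start and end."""
    period = interval_units * interval_length
    return (end - start) // period


gap_seconds = FORCED_TS - RESTART_TS
missed = missed_checks(RESTART_TS, FORCED_TS, NORMAL_CHECK_INTERVAL)

print(f"gap: {gap_seconds}s (~{gap_seconds / 60:.1f} min), "
      f"full check periods elapsed: {missed}")
```

That is a gap of 1276 seconds, or seven full 3-minute check periods with no check. Nagios does spread initial checks out after a (re)start, depending on settings such as `max_service_check_spread`, so the first check can be legitimately delayed; the figure is an upper bound on what steady-state scheduling would have produced.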
Does anyone have an idea what is going on here, or a hint on how to
resolve this issue? I feel quite bad about this situation because we
have a large installation and we (like everyone on this list, I suppose)
depend on Nagios... I'm not sure whether this was also an issue with
Nagios 2.5. As I wrote, I upgraded to 2.7 on Friday because I had (and
still have) problems with host notifications.
Also strange: we ran Nagios 2.3.1 for a long time on another machine
(RedHat 9, 32-bit, so very old) without any problems. The current
problems appear on a 64-bit FC5 machine (I migrated my Nagios
installation to Nagios 2.5 there several weeks ago and did not notice
these problems, but I may have overlooked them).
Thanks in advance for any help & ideas
Thomas