Query regarding Nagios notification
Satish Kumar P
satishkumarp2k1 at gmail.com
Wed Oct 14 07:43:08 CEST 2009
Hi,
We have a Nagios server that monitors around 300 production servers
and 2000+ services on those servers. Recently, the state of one of the
services on a particular host turned HARD, but Nagios didn't send a
notification, and I'm trying to understand why. Here's more information
about the configuration:
define service {
    service_description             MAILQ_1K_2K
    host_name                       server-name
    use                             generic-service
    check_command                   check_mailq_snmp!1000!2000
    contact_groups                  cg_server-name
}
define contactgroup {
    contactgroup_name               cg_server-name
    alias                           server-name Contact Group
    members                         team_emailpage-24x7
}
define contact {
    contact_name                    team_emailpage-24x7
    alias                           team_emailpage-24x7
    service_notification_period     24x7
    host_notification_period        24x7
    service_notification_options    c,r
    host_notification_options       d,r
    service_notification_commands   notify-by-page,notify-by-email
    host_notification_commands      host-notify-by-page,host-notify-by-email
    email                           email-address
    pager                           team
}
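To make the contact's filter concrete, here is a minimal Python sketch under a simplified model (an assumption, not the Nagios source): each service state maps to one letter of service_notification_options (w, u, c, r, with n meaning none), and flapping and downtime are ignored. The helper names are hypothetical. Under this model a contact with c,r would get CRITICAL and RECOVERY notifications but not WARNING ones, so a HARD CRITICAL state would be expected to qualify.

```python
# Simplified model (an assumption, not taken from the Nagios source) of how
# a contact's service_notification_options letters filter service
# notifications. Letters: w=WARNING, u=UNKNOWN, c=CRITICAL, r=RECOVERY;
# flapping (f) and downtime (s) are ignored here, and n means "none".

STATE_TO_OPTION = {
    "WARNING": "w",
    "UNKNOWN": "u",
    "CRITICAL": "c",
    "OK": "r",  # returning to OK is reported as a RECOVERY
}


def parse_options(option_string):
    """Turn a string such as 'c,r' into the set {'c', 'r'}."""
    return {part.strip() for part in option_string.split(",") if part.strip()}


def contact_accepts(state, option_string):
    """Return True if, under this simplified model, a contact with the given
    options would be sent a notification for a HARD service state."""
    options = parse_options(option_string)
    if "n" in options:
        return False
    letter = STATE_TO_OPTION.get(state)
    return letter is not None and letter in options


if __name__ == "__main__":
    # The contact above uses service_notification_options c,r
    for state in ("WARNING", "CRITICAL", "OK"):
        print(state, contact_accepts(state, "c,r"))
```

Running it prints WARNING False, CRITICAL True and OK True.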
Here are the relevant options defined in the "generic-service" template:
    check_period                    24x7
    normal_check_interval           5
    retry_check_interval            2
    max_check_attempts              5
    notification_period             24x7
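A quick back-of-the-envelope check of these settings, assuming the default interval_length of 60 seconds (so the intervals are in minutes): the service goes HARD on the attempt numbered max_check_attempts, so after the first SOFT failure it needs 4 more checks, 2 minutes apart, or about 8 minutes if every retry runs on schedule. The Python sketch below (function names are illustrative) compares that nominal figure with the elapsed time between the SOFT 1 and HARD 5 log entries.

```python
from datetime import datetime


def expected_seconds_to_hard(max_check_attempts, retry_check_interval, interval_length=60):
    """Nominal time from the first SOFT non-OK result to the HARD state,
    assuming every retry runs exactly on schedule. interval_length=60 is
    the assumed (default) number of seconds per interval unit."""
    return (max_check_attempts - 1) * retry_check_interval * interval_length


def observed_seconds(first, last, fmt="%H:%M:%S"):
    """Elapsed seconds between two same-day HH:MM:SS timestamps."""
    delta = datetime.strptime(last, fmt) - datetime.strptime(first, fmt)
    return int(delta.total_seconds())


if __name__ == "__main__":
    nominal = expected_seconds_to_hard(max_check_attempts=5, retry_check_interval=2)
    observed = observed_seconds("02:19:50", "02:36:53")
    print(f"nominal: {nominal} s, observed: {observed} s")
```

With the values above this prints a nominal 480 s against an observed 1023 s.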
And here are the corresponding log entries from when the service went down:
Oct 11 02:19:50 nagios-server nagios: SERVICE ALERT: server-name;MAILQ_1K_2K;WARNING;SOFT;1;mailq is 1358
Oct 11 02:22:49 nagios-server nagios: SERVICE ALERT: server-name;MAILQ_1K_2K;WARNING;SOFT;2;mailq is 1537
Oct 11 02:26:05 nagios-server nagios: SERVICE ALERT: server-name;MAILQ_1K_2K;WARNING;SOFT;3;mailq is 1799
Oct 11 02:28:59 nagios-server nagios: SERVICE ALERT: server-name;MAILQ_1K_2K;WARNING;SOFT;4;mailq is 1799
Oct 11 02:36:53 nagios-server nagios: SERVICE ALERT: server-name;MAILQ_1K_2K;CRITICAL;HARD;5;mailq is 2133
I have changed the server names. The WARNING threshold is 1000 and the
CRITICAL threshold is 2000. Roughly 45 minutes later the service
recovered, but Nagios did not send any notification for this service
during that whole period (i.e. until it was back in an OK state).
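For reference, the mapping from queue length to state can be sketched as follows. check_mailq_snmp appears to be a local plugin whose source isn't shown, so this Python sketch assumes it alerts when the count is greater than or equal to each threshold (1000 and 2000, as passed in check_command); mailq_state is a hypothetical name. The exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) are the standard Nagios plugin return codes.

```python
# Standard Nagios plugin exit codes.
EXIT_CODES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


def mailq_state(count, warning=1000, critical=2000):
    """Map a mail-queue length to a Nagios state.

    ASSUMPTION: check_mailq_snmp's real comparison logic is not shown; this
    sketch assumes it alerts when the count is greater than or equal to
    each threshold.
    """
    if count >= critical:
        return "CRITICAL"
    if count >= warning:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    # Queue lengths taken from the log entries in this post
    for count in (1358, 1537, 1799, 2133, 2968, 411):
        state = mailq_state(count)
        print(count, state, EXIT_CODES[state])
```

Under this assumption the sketch reproduces the states seen in the logs: WARNING for 1358 to 1799, CRITICAL for 2133 and 2968, and OK for 411.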
And these are the log entries from when the service came back:
Oct 11 03:20:20 nagios-server nagios: SERVICE ALERT: server-name;MAILQ_1K_2K;CRITICAL;SOFT;1;mailq is 2968
Oct 11 03:22:17 nagios-server nagios: SERVICE ALERT: server-name;MAILQ_1K_2K;CRITICAL;SOFT;2;mailq is 2968
Oct 11 03:24:17 nagios-server nagios: SERVICE ALERT: server-name;MAILQ_1K_2K;CRITICAL;SOFT;3;mailq is 2968
Oct 11 03:26:18 nagios-server nagios: SERVICE ALERT: server-name;MAILQ_1K_2K;OK;SOFT;4;mailq is 411
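When digging through a large log, it can help to pull the fields out of each SERVICE ALERT line (host;service;state;state type;attempt;plugin output) and keep only the HARD problem states. A minimal Python sketch, assuming the syslog-style prefix shown in the entries above; ServiceAlert, parse_alert and hard_problems are illustrative names, not part of Nagios.

```python
import re
from typing import List, NamedTuple, Optional


class ServiceAlert(NamedTuple):
    timestamp: str
    host: str
    service: str
    state: str
    state_type: str
    attempt: int
    output: str


# Matches syslog-style lines such as
# "Oct 11 02:36:53 nagios-server nagios: SERVICE ALERT: host;svc;CRITICAL;HARD;5;text"
ALERT_RE = re.compile(
    r"^(?P<ts>\w{3} +\d+ \d{2}:\d{2}:\d{2}) .*?SERVICE ALERT:\s*"
    r"(?P<host>[^;]+);(?P<service>[^;]+);(?P<state>[A-Z]+);"
    r"(?P<type>SOFT|HARD);(?P<attempt>\d+);(?P<output>.*)$"
)


def parse_alert(line: str) -> Optional[ServiceAlert]:
    """Split one SERVICE ALERT line into its fields, or return None."""
    m = ALERT_RE.match(line.strip())
    if m is None:
        return None
    return ServiceAlert(
        timestamp=m["ts"],
        host=m["host"],
        service=m["service"],
        state=m["state"],
        state_type=m["type"],
        attempt=int(m["attempt"]),
        output=m["output"],
    )


def hard_problems(lines: List[str]) -> List[ServiceAlert]:
    """Return only the HARD, non-OK alerts: the entries a problem
    notification would be expected to follow, subject to the contact's
    notification options and periods."""
    alerts = [parse_alert(line) for line in lines]
    return [a for a in alerts if a is not None and a.state_type == "HARD" and a.state != "OK"]


if __name__ == "__main__":
    sample = [
        "Oct 11 02:28:59 nagios-server nagios: SERVICE ALERT: server-name;MAILQ_1K_2K;WARNING;SOFT;4;mailq is 1799",
        "Oct 11 02:36:53 nagios-server nagios: SERVICE ALERT: server-name;MAILQ_1K_2K;CRITICAL;HARD;5;mailq is 2133",
    ]
    for alert in hard_problems(sample):
        print(alert.timestamp, alert.service, alert.state, alert.attempt)
```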
More info: From the Nagios documentation, I understand that Nagios
performs "on-demand host checks" when a service changes state. So my
guess is that Nagios performed a host check when the service turned
HARD (and at the same moment went from WARNING to CRITICAL). I also see
plenty of log entries for other services after this service turned
HARD, but I would still have expected a notification for this
particular service. Any thoughts?
Nagios version: 3.1.2
O.S: Debian 4.0 (Etch)
Thanks in advance.
Thanks,
Satish
_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users