passive on-demand host checks being converted from soft to hard
Frost, Mark {PBG}
mark.frost1 at pepsi.com
Mon Jul 7 22:38:22 CEST 2008
I've been seeing a problem with on-demand host checking since we moved
to a distributed setup. We're running Nagios 3.0.2 with a central
server that does virtually no checks. All checks are performed by 2
other distributed servers.
I have an example situation here where the distributed node detects a
service failure then host failure. On the distributed node, I see:
Host Down [07-07-2008 15:30:44] HOST ALERT: mfrost_win;DOWN;HARD;10;FPING CRITICAL - PB9700DL1JDGHD1.corp.pep.pvt (loss=100% )
Host Down [07-07-2008 15:29:42] HOST ALERT: mfrost_win;DOWN;SOFT;9;FPING CRITICAL - mfrost_win (loss=100% )
Host Down [07-07-2008 15:28:40] HOST ALERT: mfrost_win;DOWN;SOFT;8;FPING CRITICAL - mfrost_win (loss=100% )
Host Down [07-07-2008 15:27:38] HOST ALERT: mfrost_win;DOWN;SOFT;7;FPING CRITICAL - mfrost_win (loss=100% )
Host Down [07-07-2008 15:26:36] HOST ALERT: mfrost_win;DOWN;SOFT;6;FPING CRITICAL - mfrost_win (loss=100% )
Host Down [07-07-2008 15:25:34] HOST ALERT: mfrost_win;DOWN;SOFT;5;FPING CRITICAL - mfrost_win (loss=100% )
Host Down [07-07-2008 15:24:32] HOST ALERT: mfrost_win;DOWN;SOFT;4;FPING CRITICAL - mfrost_win (loss=100% )
Host Down [07-07-2008 15:23:30] HOST ALERT: mfrost_win;DOWN;SOFT;3;FPING CRITICAL - mfrost_win (loss=100% )
Host Down [07-07-2008 15:22:28] HOST ALERT: mfrost_win;DOWN;SOFT;2;FPING CRITICAL - mfrost_win (loss=100% )
Service Critical [07-07-2008 15:22:24] SERVICE ALERT: mfrost_win;C: Drive Space;CRITICAL;HARD;1;CHECK_NRPE: Socket timeout after 10 seconds.
Host Down [07-07-2008 15:21:26] HOST ALERT: mfrost_win;DOWN;SOFT;1;FPING CRITICAL - mfrost_win (loss=100% )
Service Critical [07-07-2008 15:21:24] SERVICE ALERT: mfrost_win;C: Drive Space;CRITICAL;SOFT;1;CHECK_NRPE: Socket timeout after 10 seconds.
But for the corresponding set of activities I see the following on the
central/reporting server:
Service Critical [07-07-2008 15:22:29] SERVICE ALERT: mfrost_win;C: Drive Space;CRITICAL;HARD;1;CHECK_NRPE: Socket timeout after 10 seconds.
Host Down [07-07-2008 15:21:33] HOST ALERT: mfrost_win;DOWN;HARD;1;FPING CRITICAL - mfrost_win (loss=100% )
Service Critical [07-07-2008 15:21:33] SERVICE ALERT: mfrost_win;C: Drive Space;CRITICAL;SOFT;1;CHECK_NRPE: Socket timeout after 10 seconds.
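
To compare the two logs mechanically, a small parser can pull the state type and attempt number out of each HOST ALERT entry. This is only a sketch: the helper names (parse_host_alert, hard_attempts) are made up for illustration, each entry is assumed to have been joined onto a single line, and the field layout (host;state;state_type;attempt;output) is inferred from the entries quoted above.

```python
import re

# Fields after "HOST ALERT:", as seen in the quoted entries:
#   host;state;state_type;attempt;plugin_output
HOST_ALERT = re.compile(
    r"HOST ALERT:\s*(?P<host>[^;]+);(?P<state>[^;]+);"
    r"(?P<state_type>SOFT|HARD);(?P<attempt>\d+);(?P<output>.*)"
)


def parse_host_alert(entry):
    """Return (host, state, state_type, attempt), or None if not a host alert."""
    m = HOST_ALERT.search(entry)
    if m is None:
        return None
    return (m.group("host"), m.group("state"),
            m.group("state_type"), int(m.group("attempt")))


def hard_attempts(entries):
    """Return the attempt numbers of all HARD host alerts, in input order."""
    parsed = (parse_host_alert(e) for e in entries)
    return [p[3] for p in parsed if p is not None and p[2] == "HARD"]


# Two entries from the distributed node and one from the central server,
# copied from the logs above with wrapped lines joined.
distributed = [
    "[07-07-2008 15:21:26] HOST ALERT: mfrost_win;DOWN;SOFT;1;"
    "FPING CRITICAL - mfrost_win (loss=100% )",
    "[07-07-2008 15:30:44] HOST ALERT: mfrost_win;DOWN;HARD;10;"
    "FPING CRITICAL - PB9700DL1JDGHD1.corp.pep.pvt (loss=100% )",
]
central = [
    "[07-07-2008 15:21:33] HOST ALERT: mfrost_win;DOWN;HARD;1;"
    "FPING CRITICAL - mfrost_win (loss=100% )",
]

print(hard_attempts(distributed), hard_attempts(central))  # prints: [10] [1]
```

The distributed node reaches HARD at attempt 10, while the central server reports HARD at attempt 1.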
The distributed node seems to do what it's supposed to do and keeps
retrying up to max_check_attempts (10). But when that first (soft) ping
failure gets passed to the central/reporting server, the central server
marks the host as hard/down at attempt 1 and sends an alert out
immediately. Meanwhile the distributed node keeps checking for about
nine more minutes (15:21:26 to 15:30:44) before it determines that the
host is in a hard/down state.
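
The difference between the two servers can be written down as a toy model. This is not Nagios code, just a sketch of the two behaviors visible in the logs: one path counts consecutive failures up to max_check_attempts before going HARD, the other treats every submitted result as HARD at attempt 1. The function name down_sequence and the passive_is_hard flag are hypothetical, and the second path is only an assumption about what the central server appears to be doing.

```python
def down_sequence(n_failures, max_check_attempts, passive_is_hard):
    """Model the (state_type, attempt) pairs for n consecutive DOWN results.

    passive_is_hard=False: count attempts and go HARD only once the
    attempt number reaches max_check_attempts (what the distributed
    node's log shows).
    passive_is_hard=True: every result is HARD at attempt 1 (what the
    central server's log shows; assumed, not taken from Nagios source).
    """
    sequence = []
    attempt = 0
    for _ in range(n_failures):
        if passive_is_hard:
            sequence.append(("HARD", 1))
            continue
        # Attempt counter stops growing once the limit is reached.
        attempt = min(attempt + 1, max_check_attempts)
        state_type = "HARD" if attempt >= max_check_attempts else "SOFT"
        sequence.append((state_type, attempt))
    return sequence


# Distributed node: ten failed checks with max_check_attempts 10.
for state_type, attempt in down_sequence(10, 10, passive_is_hard=False):
    print("distributed", state_type, attempt)

# Central server: the first submitted failure already shows up as HARD;1.
print("central", down_sequence(1, 10, passive_is_hard=True))
```

With max_check_attempts set to 10 on both servers, only the first path reproduces the SOFT;1 through HARD;10 progression; the central log matches the second.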
The settings for this host are as follows:

central server:
    max_check_attempts      10
    check_interval          0
    retry_interval          1
    obsess_over_host        0
    active_checks_enabled   0
    passive_checks_enabled  1
    check_freshness         1
    freshness_threshold     1200

distributed node:
    max_check_attempts      10
    check_interval          0
    retry_interval          1
    obsess_over_host        1
    active_checks_enabled   1
    passive_checks_enabled  0
    check_freshness         0
    freshness_threshold     1200

Everything else works fine monitoring-wise, but this problem has been
bugging me for months now. I'm at that crossroads where I'm trying to
determine if this is a bug or if I'm doing something wrong that I can't
figure out. As far as I can glean from the documentation, this isn't
how this is supposed to work given the way I've configured things.
Thanks
Mark