Perceived problem with host checks
John Fox
jjf at mind.net
Fri Sep 6 21:46:59 CEST 2002
Hello,
I'm configuring a Nagios 1.0b4 installation. It's the first time I've
used this product, and I've run into somewhat of a stumbling block.
Both hosts used in my tests are running FreeBSD 4.6-STABLE and nagios
is installed via the ports system.
That said, here are the details:
I've configured nagios to do host checks for host A and service
checks for HTTPD on A.
I start HTTPD on host A and fire up nagios (in daemon mode) on host B.
Everything is fine. Host and service are both marked as UP.
I use ipfw to disable ICMP on host A. This is done with the intent of
provoking a host check, knowing that the host-check test makes use of
ping.
Host continues to remain marked as up. This makes sense to me, given
that HTTPD is still running and accessible there.
I kill HTTPD on A.
Both host and service become marked as 'down' and I begin to
receive problem notifications.
I enable ICMP on A, knowing that the check-host-alive command
makes use of the 'check_ping' plugin, and expecting that host A will
soon be marked as 'UP'.
But that does not happen; the host continues to be marked as down. I
watch the various status screens and see multiple host tests
performed. I receive multiple problem notifications.
I'm flummoxed by this, and log in to host B (the nagios machine) and
verify that I can ping A from there. I can. I then run
check-host-alive's "check_ping" plugin from the command line. It
instantly returns with a "PING OK" response. (Note: I used the exact
same command structure as nagios would -- I took it from the
'check-host-alive' definition found in 'checkcommands.cfg'.)
Yet the 'Host Information' page shows the Status info as
"Critical -- Plugin timed out after 10 seconds".
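One way to make the comparison between the manual run and the monitored run more like-for-like is to run the plugin command line under the same 10-second limit and record both the exit code and the first line of output. The following is only a minimal sketch: `run_plugin` is a hypothetical helper, not part of Nagios, and it does not claim to reproduce how Nagios itself launches plugins. The actual command line from 'checkcommands.cfg' is not shown in this message, so a stand-in command is used in the example.

```python
import subprocess
import sys
from typing import List, Tuple

CRITICAL = 2  # conventional plugin exit code for a CRITICAL result


def run_plugin(argv: List[str], timeout_seconds: float = 10.0) -> Tuple[int, str]:
    """Run a command line and return (exit_code, first line of its output).

    Hypothetical helper: if the command exceeds the timeout, report it as
    CRITICAL with a message similar to the one seen on the status page.
    """
    try:
        done = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout_seconds
        )
    except subprocess.TimeoutExpired:
        return CRITICAL, f"Plugin timed out after {timeout_seconds:g} seconds"
    lines = done.stdout.strip().splitlines()
    return done.returncode, (lines[0] if lines else "")


if __name__ == "__main__":
    # Stand-in command; substitute the real check_ping command line here.
    print(run_plugin([sys.executable, "-c", "print('PING OK')"]))
```

Running it as the same user that the nagios daemon runs as would make the comparison closer still, since the environment then matches more nearly.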
So to all appearances, nagios and I are getting different results
from the exact same command line. I don't believe this is what's
really going on, because it seems absurd to me. So I go to the FAQ.
I see a question that seems to apply: "Hosts are incorrectly listed
as being DOWN or UNREACHABLE". But after reading it, I'm not sure
that it does apply.
The way I read it, nagios didn't perform any host checks on A until
A's HTTPD went down. Makes sense.
At which point a host check is performed -- if the host check doesn't
return 'OK', it is run again and again until it has made
max_check_attempts (from the host definition) attempts OR received
an 'OK' response.
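The retry behavior as read from the FAQ can be sketched as a short loop. This is only an illustration of that reading, not Nagios source code: `run_host_check` and the `check` callable are hypothetical names, and the only assumption about plugins is the usual convention that exit code 0 means OK.

```python
from typing import Callable, Tuple

OK = 0  # conventional plugin exit code for an OK result


def run_host_check(check: Callable[[], int], max_check_attempts: int) -> Tuple[str, int]:
    """Retry `check` until it returns OK or the attempt limit is reached.

    Returns the resulting host state and the number of attempts made.
    """
    for attempt in range(1, max_check_attempts + 1):
        if check() == OK:
            return "UP", attempt
    return "DOWN", max_check_attempts
```

With max_check_attempts set to 3 and a check that always fails, this makes exactly 3 attempts and reports the host as down; a check that succeeds on the second attempt stops there and reports the host as up.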
My max_check_attempts is set to 3. But in observing the various
status screens, I saw the "Last Status Check" value changing every 3
minutes. In the course of this test, I allowed the downtime to reach
46 minutes, which to me indicates that 15 host checks were run.
Obviously, this is a much larger number than 3. And certainly it
seems that the plugin never received an 'OK' response. This is quite
a conundrum to me!
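The arithmetic behind the figure of 15 can be spelled out as follows. It assumes that every change of the "Last Status Check" value corresponds to exactly one host check, which is an inference from the status screens rather than something confirmed.

```python
downtime_minutes = 46        # total downtime allowed during the test
check_interval_minutes = 3   # observed spacing of "Last Status Check" updates
max_check_attempts = 3       # value from the host definition

# Number of complete 3-minute intervals that fit into the downtime.
observed_checks = downtime_minutes // check_interval_minutes

print(observed_checks)                       # 15
print(observed_checks > max_check_attempts)  # True
```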
I then restarted HTTPD on host A. Within three minutes, this service
was once again marked as 'UP' and the host, too, was again marked as
'UP', with the 'Host Information' page's "Status Information" field
reading "PING OK...".
On the off chance that my IPFW/ping machinations were somehow causing
weirdness, I repeated the same basic experiment, but rather than
disabling ICMP, I ifconfig'd my network card down. And rather than
re-enabling ICMP, I ifconfig'd the interface back up. This resulted
in the same behavior as the previous test.
I don't see this as a major issue, given that a successful service
check causes the host to be again considered 'UP'. But it troubles me
to not understand the behavior I'm seeing, as I'm simply unable to
account for it.
Any advice or thoughts would be very much welcomed!
Thanks in advance,
John
--
+---------------------------------------------------------------------------+
| John Fox <jjf at mind.net> | System Administrator | Internet Ventures Oregon |
+---------------------------------------------------------------------------+
| "You can't talk about George W. without addressing the strange |
| Bilbo-Baginnian language that spurts out from between his lips like |
| melted marshmallows coming out of a squirt gun." -- Dennis Miller |
+---------------------------------------------------------------------------+