Lots of hosts, only a couple of services?
Andreas Ericsson
ae at op5.se
Wed Aug 25 16:42:56 CEST 2004
Jason Byrns wrote:
> Thanks to everyone for their input, I certainly appreciate it.
>
> To summarize, it sounds like the place to start is to change our service
> checks from ping to telnet checks. Or possibly even SNMP or something.
> I am also going to change check_host_alive settings, as it only sends
> one packet now. (It was already at five seconds and 100% packet loss
> for critical status, which still seems fair.)
>
> (Is there any advantage to checking SNMP instead of telnet?)
>
> As someone else already mentioned, check_telnet is basically already
> defined as "check_tcp -H (host address) -p 23".
>
Be a bit wary about that one. Some admittedly stupid switches and
routers seem to think they never got the RST when check_tcp drops the
connection, so you might find yourself locked out of your own
switch/router. Make sure you try it on one you can get your hands on for
an Attila-style reboot before setting it up to run against your favorite
satellite.
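
For a feel of what such a probe does on the wire, here is a minimal
Python sketch of a connect-and-disconnect check. It is not how check_tcp
is implemented; the function name tcp_port_open and the default timeout
are assumptions made for illustration. It shuts the connection down in
an orderly way before closing. Whether that keeps a particular switch
from holding on to a stale telnet session is untested, so the advice
about trying it on reachable hardware first still applies.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 10.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration; not part of Nagios or its plugins.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Refused, unreachable, or timed out: treat the port as down.
        return False
    try:
        # Orderly shutdown, so the peer sees the connection being closed
        # rather than a socket that was simply abandoned.
        sock.shutdown(socket.SHUT_RDWR)
    except OSError:
        # The peer may already have closed its side; nothing more to do.
        pass
    finally:
        sock.close()
    return True
```

Usage would be something like tcp_port_open("192.0.2.10", 23), where the
address is a placeholder for a device you can reboot by hand.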
> As for QoS, I'm not sure that's an option. If one of our wireless
> access points is too busy to reply, wouldn't the AP itself need some
> kind of QoS features to help us? I don't think they do, we've got a
> mixture of older and a few newer Cisco access points, and those are
> usually the ones that may miss a check or two here and there...
>
> As for the max_check_attempts, and how it relates to host and service
> checks, I believe I found my final answer in the Nagios FAQ pages.
> However, after searching yesterday I couldn't find it again. All I
> could find was this page, which mentions exceptions to the monitoring
> logic:
> http://nagios.sourceforge.net/docs/1_0/statetypes.html
>
> ...but says it will not discuss those exceptions for now.
>
> The information I found before basically stated what I said earlier:
> when a single service check fails, a host check is triggered. And if a
> host check then also fails, it then chooses to skip the "soft" error
> states and go straight to a "hard" error state. In other words, ignore
> the max_check_attempts and send out notifications right away. And not
> as a bug, but since, y'know, your HOST is down! Not just a service!
>
> But tweaking our host checks is probably the answer to any single false
> positive warning. Besides, I'm going to go ahead and slap Nagios onto
> one of my test servers, and put together a very simple setup to test
> again how Nagios handles service and host checks and max_check_attempts.
> I'm virtually certain that we were being warned every time, after any
> host failed just a single check, even though my settings look like it
> should take five failed checks in a row.
>
Again, host checks are run in a non-delayed serialized manner, meaning a
max_check_attempts of 4 would yield 40 seconds' worth of trying before
deciding it's down (assuming you use a check-host-alive timeout value of 10).
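
The arithmetic behind that figure, as a small Python sketch. The
function name is made up for illustration, and it assumes every attempt
runs to its full timeout with no delay between retries, which is the
worst case described above.

```python
def worst_case_host_check_seconds(max_check_attempts: int, timeout_seconds: int) -> int:
    """Upper bound on time spent retrying a host check back to back.

    Hypothetical helper: assumes each attempt uses its whole timeout and
    that retries follow one another with no delay in between.
    """
    return max_check_attempts * timeout_seconds


# max_check_attempts of 4 with a check-host-alive timeout of 10 seconds
print(worst_case_host_check_seconds(4, 10))  # 40
```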
> Thanks again, everybody!
>
> --
> Jason Byrns
> System Administrator, MicroLnk
>
--
Andreas Ericsson andreas.ericsson at op5.se
OP5 AB www.op5.se
Lead Developer
_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users