Dropped NSCA Packets? (WAS Re: Load issues with Nagios)
Ken Snider
ksnider at datawire.net
Fri Jan 31 05:54:44 CET 2003
Carroll, Jim P [Contractor] wrote:
> Ah yes, so you did. I think my brain was stuck on "the box is *always* at
> 100% CPU".
>
> Have you tried truss/strace/whatever is appropriate for the o/s you use?
Interestingly, rebooting the box (and moving to a newer kernel) eradicated
the issue. As is always the case when several variables change at once,
though, I am now unsure *which* of these things caused the "fix". I'll
revisit this should the symptoms recur.
And, for the sake of completeness, this was with Nagios 1.0.
There is another interesting issue, however.
I've written a small wrapper that lets me execute arbitrary plugins on a
remote host and "massage" the plugin output (essentially, add the hostname
and integer error code to it). This is combined with the output of any other
plugins running (newline between each) and piped to send_nsca.
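For anyone who wants to picture the wrapper, here is a rough Python sketch of
the idea. It is not the actual script. It assumes the usual send_nsca
service-check input of tab-separated host, service description, return code
and plugin output, one result per line. The function names, the plugin paths
and the send_nsca arguments shown in the comments are made up for
illustration.

```python
import socket
import subprocess


def run_plugin(service, argv, timeout=20):
    """Run one plugin; return (service, return code, first line of output)."""
    try:
        proc = subprocess.run(argv, stdout=subprocess.PIPE,
                              stderr=subprocess.STDOUT, text=True,
                              timeout=timeout)
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Report plugins that cannot run or that hang as UNKNOWN (3).
        return service, 3, "wrapper error: %s" % type(exc).__name__
    lines = proc.stdout.strip().splitlines()
    return service, proc.returncode, lines[0] if lines else "(no output)"


def format_result(host, service, code, output):
    """Build one line: host<TAB>service<TAB>return code<TAB>output."""
    return "\t".join([host, service, str(code), output]) + "\n"


def build_payload(host, plugins):
    """Run every plugin and join the results, one per line."""
    return "".join(format_result(host, *run_plugin(name, argv))
                   for name, argv in plugins.items())


def submit(payload, send_nsca_argv):
    """Pipe the combined results to send_nsca in a single connection."""
    subprocess.run(send_nsca_argv, input=payload, text=True, check=True)


if __name__ == "__main__":
    # Hypothetical plugin paths and thresholds; adjust for the local install.
    plugins = {
        "disk": ["/usr/local/nagios/libexec/check_disk", "-w", "20%", "-c", "10%"],
        "load": ["/usr/local/nagios/libexec/check_load", "-w", "5,4,3", "-c", "10,8,6"],
    }
    payload = build_payload(socket.gethostname(), plugins)
    print(payload, end="")
    # Hypothetical invocation, left commented out so the sketch has no side
    # effects:
    # submit(payload, ["send_nsca", "-H", "nagios.example.com",
    #                  "-c", "/etc/send_nsca.cfg"])
```

Sending all results in one send_nsca call means one connection per host per
run rather than one per plugin, which matters when many hosts report at
nearly the same moment.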
This works wonderfully. On most of our systems, 5 plugins report every two
minutes.
We have our "freshness" checking set to 8 minutes, or four iterations
without a response from a given passive check (in reality it is less than
that, because of check/processing latencies, but it should *more* than
suffice). Even with the freshness threshold set so high, I do notice
services occasionally entering soft "unknown" states (the result of a script
that runs when our services fail their freshness check).
Now, I see two possibilities here. The first is congestion. Since we use NTP
to sync our boxes, the clients *do* literally hammer the Nagios box within a
second or two of each other. However, I have nsca spawned through xinetd,
and the box seems to take the connections without issue. I also have
send_nsca itself set to time out at 30 seconds, which is more than enough
time to process the results: running nsca(d) in debug mode shows all results
processed in 8 seconds or so. So this possibility seems somewhat unlikely.
The second possibility is that we're hitting some sort of limit in Nagios
itself. Our command_check_interval is set to -1, while our reaper frequency
is 5 seconds, so I don't think it's a pipe-related issue (there are, perhaps,
50 servers that check in nearly simultaneously with about 1K of plugin data).
My question is twofold. First, has anyone else experienced this? Second,
does anyone understand the inner workings of send_nsca well enough to
explain how it deals with sporadic network latency, packet loss or blocking
issues? *Should* it ever drop a connection other than when it reaches the
30 second timeout I've set?
--
Ken Snider
Senior Systems Administrator
Datawire Communication Networks Inc.