Problem with some NSCA packets getting corrupted on 64-bit SLES 10
Frost, Mark {PBG}
mark.frost1 at pepsi.com
Thu Jan 17 16:37:22 CET 2008
I've recently begun moving our Nagios installation from a centralized
architecture to a distributed one. I had previously used NSCA for only a
few passive checks, and it works fine on a 32-bit Red Hat AS 3 platform
(the centralized server).
In testing the distributed architecture (which runs on 64-bit SUSE Linux
Enterprise Server (SLES) 10), I seem to have a problem with NSCA. (Note
that all Nagios and NSCA binaries and libraries were recompiled on the
64-bit platform.)
After I broke out all the checks to have 2 separate distributed nodes
send to a central server, I saw a few messages like this one in the
nagios.log file:
[1200583727] Warning: Passive check result was received for service '0'
on host 'HOSTXXX', but the service could not be found!
but only about 1 in every 10 or 20 results did this. The rest of the
results were correctly shown as "EXTERNAL COMMAND", and all expected
NSCA fields came up correctly (hostname, service desc, check result,
text output).
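
To put a number on "1 in 10 or 20", the two kinds of entries could be
counted straight from nagios.log. The following is only a rough sketch.
It assumes each warning sits on a single line in the real log (it is
wrapped in the excerpt above) and that passive service results are
logged as "EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;..." lines.
The function name summarize is made up for this example.

```python
import re
from collections import Counter

# Wording taken from the warning quoted above (assumed to be one line in the log).
WARN_RE = re.compile(
    r"^\[(\d+)\] Warning: Passive check result was received for service "
    r"'([^']*)' on host '([^']*)', but the service could not be found!"
)
# Assumed format of a logged passive service check result.
CMD_RE = re.compile(
    r"^\[(\d+)\] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;([^;]*);([^;]*);"
)


def summarize(lines):
    """Return (number of logged passive results, Counter of warnings by (host, service))."""
    commands = 0
    rejected = Counter()
    for line in lines:
        line = line.strip()
        warn = WARN_RE.match(line)
        if warn:
            # group(3) is the host, group(2) the (possibly bogus) service name
            rejected[(warn.group(3), warn.group(2))] += 1
        elif CMD_RE.match(line):
            commands += 1
    return commands, rejected


if __name__ == "__main__":
    import sys

    with open(sys.argv[1]) as log:
        commands, rejected = summarize(log)
    print("passive service results logged:", commands)
    for (host, service), count in rejected.most_common():
        print(f"{count:6d}  host={host} service={service!r}")
```

Run with the path to nagios.log as the only argument. Grouping the
warnings by host may also help show whether the failures cluster
anywhere.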
I then had the "send_nsca" script on the distributed nodes log what it
sends to a file. When I correlate that log with what the NSCA daemon
thinks it is receiving, the client is still sending the correct 4
fields, but it's as if the NSCA daemon drops the 2nd field (service
desc) and puts the check result in its place. So ultimately, it thinks
the service name is '0'.
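
The correlation described above could be automated along these lines.
This is a sketch only: it assumes the client-side log holds the exact
records handed to send_nsca, in its default tab-delimited layout of
host, service description, return code and plugin output. The helper
names parse_sent and looks_shifted are hypothetical, as is the sample
service name PING.

```python
def parse_sent(line, delim="\t"):
    """Split one send_nsca service-check record into its 4 fields."""
    parts = line.rstrip("\n").split(delim, 3)
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields, got {len(parts)}: {line!r}")
    host, service, code, output = parts
    return {"host": host, "service": service, "code": code, "output": output}


def looks_shifted(sent, received_service):
    """True if the daemon's service name is really the sent return code."""
    return (
        received_service != sent["service"]
        and received_service == sent["code"]
    )


if __name__ == "__main__":
    # Hypothetical record, shaped like what a node would pipe into send_nsca.
    sent = parse_sent("HOSTXXX\tPING\t0\tPING OK\n")
    print(looks_shifted(sent, "0"))     # service name replaced by return code
    print(looks_shifted(sent, "PING"))  # service name arrived intact
```

Raising on a record that does not have exactly 4 fields also catches
the case where the client itself wrote a malformed line, which would
rule the daemon out.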
I can't see that this matches a pattern (i.e. always on the same hosts
or same service checks). All I've seen so far is that it happens
whether I run NSCA as --single or --daemon. It also happens even if I
turn off one of the distributed nodes (that is, I can't see it being
volume related).
I have turned on debugging in the NSCA daemon to see what it thinks it's
getting and it echoes what the nagios.log shows:
SERVICE CHECK -> Host Name: 'HOSTXXX', Service Description: '0', Return
Code: '0', Output: ' rta=0.140000 ms)'
Again, this happens for maybe only 1 result in 10. Ultimately, it causes
the server to run an active check, since it thinks it never got a result
from the distributed node.
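
One way to hunt for a pattern in the daemon's debug output is to pull
out every entry whose service description equals its return code, which
is the symptom shown above, and tally those by host. This is a sketch
that assumes each debug entry is a single line in the real log (it is
wrapped in the excerpt above); the function name suspicious_by_host is
made up. A service that is legitimately named after a number would show
up as a false positive.

```python
import re
from collections import Counter

# Layout copied from the NSCA debug entry quoted above.
DEBUG_RE = re.compile(
    r"SERVICE CHECK -> Host Name: '(?P<host>[^']*)', "
    r"Service Description: '(?P<service>[^']*)', "
    r"Return Code: '(?P<code>\d+)', Output: '(?P<output>.*)'"
)


def suspicious_by_host(lines):
    """Count, per host, debug entries whose service name equals the return code."""
    hits = Counter()
    for line in lines:
        m = DEBUG_RE.search(line)
        if m and m.group("service") == m.group("code"):
            hits[m.group("host")] += 1
    return hits


if __name__ == "__main__":
    import sys

    with open(sys.argv[1]) as log:
        for host, count in suspicious_by_host(log).most_common():
            print(f"{count:6d}  {host}")
```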
I'm still trying to dig deeper, but it seems to me that this is
increasingly pointing to some issue with 64-bit SLES. Or perhaps some
variable type in NSCA daemon that's not quite right for 64-bit. It's
hard to tell with its intermittent nature and the fact that I have yet
to discover a pattern.
Has anyone seen anything like this before?
Thanks
Mark