Problem with high latencies after going distributed
Frost, Mark {PBG}
mark.frost1 at pepsi.com
Thu Jan 24 23:13:13 CET 2008
>-----Original Message-----
>From: Thomas Guyot-Sionnest [mailto:dermoth at aei.ca]
>Sent: Thursday, January 24, 2008 3:33 AM
>To: Frost, Mark {PBG}
>Cc: Nagios Users
>Subject: Re: [Nagios-users] Problem with high latencies after
>going distributed
>
>Some heavily broken indenting there (looks like my mail client gets
>confused)... don't trust the number of ">"!
>
>On 23/01/08 10:47 PM, Frost, Mark {PBG} wrote:
>>
>>
>>> -----Original Message-----
>>> From: Thomas Guyot-Sionnest [mailto:dermoth at aei.ca]
>>> Sent: Wednesday, January 23, 2008 10:24 PM
>>> To: Frost, Mark {PBG}
>>> Cc: Nagios Users
>>> Subject: Re: [Nagios-users] Problem with high latencies after
>>> going distributed
>> I don't think so. I remember an email from Ton Voon some time ago
>> asking Ethan why the oc[hs]p commands are run serially, but I don't
>> recall if there was a reply or what else was said...
>>
>> I believe it's either documented in the official doc or some
>> user-contributed doc that the oc[hs]p commands should return as soon
>> as possible. It's usually done in Perl using a fork:
>>
>> my $pid = fork();
>> if (defined($pid) && $pid == 0) {
>>     # child: send stuff via NSCA here...
>> }
>> exit(0);  # parent exits immediately so Nagios isn't kept waiting
>>
>>
>>> I guess what I'm thinking here is that unlike a custom check, I
>>> can't see most people needing to customize the passive check result
>>> process. All the solutions I've seen seem to include a named pipe.
>>> So why couldn't Nagios support making the ocsp/ochp "commands" just
>>> named pipes instead? Then instead of a standalone send_nsca binary,
>>> have the nsca source build a send_nscaD binary (I'm making that up)
>>> that reads from the pipe that Nagios writes to and sends directly
>>> to nsca on the server. That sort of eliminates the middle-man in
>>> the process of reporting passive check results.
>>>
>>> I know, I know, I'm free to write the send_nscaD.c code and send it
>>> to Ethan :-)
>
>Well... I was thinking about partly re-writing nsca as an event-based
>daemon (supporting only the --single mode, but that would be really
>scalable) using libevent, allowing the timestamp to be passed along
>(this is a recent feature request) and supporting multi-line responses
>(for Nagios 3) in the process, and finally suggesting this as a base
>for an NSCA v3... I'm not even sure I would have enough time, but
>since my main objective is to learn I wouldn't lose anything by
>trying :).
>
>In the unlikely event that I write it, in the same step I could surely
>do a C version of OCP_Daemon natively supporting the "NSCA v3"
>protocol (it wouldn't be hard)...
>
>I'll have to think about it... I guess the only sane separator for
>writing multiple multi-line results on a pipe would be \000 (NULL), so
>there would be three modes of operation for send_nsca (and two for
>nsca_sendd (don't you think it sounds better reversed?)):
>send_nsca: compatible (v2 behavior), single-check (additional lines
>are taken as additional output) and multi-check (NULL-separated);
>nsca_sendd: single-line (one check per line, OCP_Daemon style) and
>multi-line (NULL-separated).
>
>> I don't know how many people use OCP_Daemon, but I had reports from
>> a few people who greatly reduced their latency using it, and I
>> haven't had any bugs reported yet. I believe it's well documented as
>> well, but if you have any feedback on this I'll be happy to get it.
>>
>>> I'm playing with it a bit and have so far had good results. I'll
>>> have some feedback after I've played with it a bit longer. Thanks
>>> for writing it and writing up the docs for it as well!
>
>Pass the thanks over to Ethan who sent me a Nagios NSA t-shirt
>for it ;)
>
>Thomas
I can see that using the OCP_Daemon script cut down on my latencies
quite a lot. Unfortunately, I'm still seeing some "stale" checks on the
master server that I can't explain. I'm starting to get the feeling
that going distributed isn't all it's cracked up to be. I haven't seen
any mention in the docs of the caveats with oc[sh]p and latencies (my
books sure don't mention them), nor of the fact that the supplied
submit_service_check script in the distribution from Ethan is just a
shell script that pipes to send_nsca. I'm not all that excited about
having to do a workaround for this issue.
While OCP_Daemon seems to help me, I'm a little uncomfortable running
it as the solution to our issue. First, we don't normally have root
access on our boxes, so recreating the FIFOs could be a problem (or at
least a wait). I'm also concerned about depending on another process
external to Nagios: if OCP_Daemon dies at some point, my distributed
nodes are hosed. I also had a few issues getting Nagios and OCP_Daemon
started in the right order when playing with it last night. Once I got
it all going it worked well, but I'm not looking forward to explaining
all of this to someone here who isn't the Nagios person.
I was thinking about your fork/exit comment above. What if one were to
rewrite the "glue" shell script (the one that takes the output from
Nagios and pipes it to send_nsca) in C and do something similar: have
the parent fork and exit (so Nagios thinks the oc[sh]p command
completed very quickly), then have the child go on and send the output
to send_nsca separately? For my setup, this has the advantage of not
being a separate process that I need to make sure keeps running. It
also doesn't require synchronizing listeners on both ends of a pipe,
where one process would otherwise hang. It would be even better, it
seems to me, if this program could do the send_nsca work itself (again,
in the child) instead of having to call send_nsca at all. The biggest
drawback I can see is that you can't edit a C program to change the
destination server, etc. You'd just about have to pile on a ton of
command-line options or have a config file for it. Just thinking out
loud.
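
Roughly what I have in mind (completely untested; the binary name, the
server name and the config path below are made up, and the real thing
would take them from options or a config file):

/* ocsp_fork.c - sketch of a C "glue" command: return to Nagios
 * immediately, then hand the check result to send_nsca from a child
 * process. Run as the ocsp command, something like:
 *   ocsp_fork "$HOSTNAME$" "$SERVICEDESC$" "$SERVICESTATEID$" \
 *             "$SERVICEOUTPUT$"
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    pid_t pid;
    FILE *nsca;

    if (argc != 5) {
        fprintf(stderr, "usage: %s host service state output\n", argv[0]);
        return 1;
    }

    pid = fork();
    if (pid < 0)
        return 1;       /* fork failed; nothing else we can do */
    if (pid > 0)
        return 0;       /* parent: Nagios sees us finish instantly */

    /* Child: detach from Nagios, then pipe one tab-separated result
     * line to send_nsca (its normal stdin format). */
    setsid();
    nsca = popen("/usr/local/nagios/bin/send_nsca -H central-server "
                 "-c /usr/local/nagios/etc/send_nsca.cfg", "w");
    if (nsca == NULL)
        return 1;
    fprintf(nsca, "%s\t%s\t%s\t%s\n", argv[1], argv[2], argv[3], argv[4]);
    return pclose(nsca) == -1 ? 1 : 0;
}

That would at least prove out the fork-and-exit half before trying to
fold the send_nsca protocol itself into the child.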
On a related note, I see from my performance stats that some checks
are still taking a very long time to run. Is there some easy way to
see the execution time per check and track down which checks are
taking so long?
Thanks
Mark