Problem with high latencies after going distributed
Thomas Guyot-Sionnest
dermoth at aei.ca
Thu Jan 24 04:23:57 CET 2008
On 23/01/08 10:41 AM, Frost, Mark {PBG} wrote:
>
>
>> -----Original Message-----
>> From: Thomas Guyot-Sionnest [mailto:dermoth at aei.ca]
>> Sent: Tuesday, January 22, 2008 10:29 PM
>> To: Frost, Mark {PBG}
>> Cc: Nagios Users
>> Subject: Re: [Nagios-users] Problem with high latencies after
>> going distributed
>>
>> On 22/01/08 09:13 PM, Frost, Mark {PBG} wrote:
>>>
>>>
>>>> -----Original Message-----
>>>> From: Steve Shipway [mailto:s.shipway at auckland.ac.nz]
>>>> Sent: Tuesday, January 22, 2008 8:45 PM
>>>> To: Frost, Mark {PBG}; Nagios Users
>>>> Subject: RE: [Nagios-users] Problem with high latencies after
>>>> going distributed
>>>>
>>>> We've just done exactly the same (Nagios 2.9), and we have a comparable
>>>> size of system (actually a bit larger - 713 hosts, 5834 services).
>>>> After going distributed, we too have this insanely high latency on the
>>>> satellites.
>>>>
>>>> The only possible cause is the OCSP command slowing things down somehow.
>>>> This is using the supplied send_nsca call to send the status off to the
>>>> central server...
>>>>
>>>> define command {
>>>>     command_name    relay
>>>>     command_line    $USER1$/submit_check_result "$HOSTNAME$" "$SERVICEDESC$" "$SERVICESTATEID$" "$SERVICEOUTPUT$"
>>>> }
>>>>
>>>> So it should work. I guess things would be better if it packaged the
>>>> updates up into batches, although it can't do that normally.
>>>>
>>>> I think it might be better to make the OCSP command just dump the status
>>>> to a file, and then have a cronjob every 60 seconds that reads the file
>>>> and sends the statuses off as a batch. I will try this here, when I get
>>>> the chance.
>>>>
>>>> Steve
>>>
>>> But if the submit_check_result is running slowly, that would only affect
>>> the service execution time, wouldn't it? My understanding of check latency
>>> is that it's the difference in time between when Nagios schedules a check
>>> to run and when that check actually starts to execute.
>>
>> You're right, but you're just missing one detail. Nagios runs checks in
>> parallel and then reaps all the service results at once. While it's
>> reaping it can't schedule other checks, and it is in the reaping state
>> that Nagios runs host checks, event handlers, performance data commands
>> and oc[hs]p commands. All this is done serially and can significantly
>> slow down each service reaping run and thus delay the execution of
>> further checks.
>>
>> Although I never built a distributed system, I designed mine to be
>> easily distributed. Moreover, I used a technique I developed for
>> latency-free performance-data processing (which I still use heavily, BTW)
>> to create a way to distribute check results to a distributed central
>> server in the same latency-free way (it was more of a fun project, as I
>> don't use it myself yet).
>>
>> Basically you use the host/service performance data files to get the
>> data, but instead of writing to a file you write it to a named pipe
>> (fifo). That pipe is then read by a high-performance non-blocking
>> event-based Perl daemon (yeah I know that looks like marketing terms,
>> but I can explain each of them further if you like) that forks send_nsca
>> processes to send results in bulk (normally every few seconds though).
>>
>> So Nagios doesn't even lose time rotating a file and all your checks
>> are transmitted almost instantly. See this wiki page for details and code:
>>
>> http://www.nagioscommunity.org/wiki/index.php/OCP_Daemon
>>
>>
>> Thomas
>
> Interesting. Thanks for the explanation. If I understand this right, the
> reason I don't see this issue on my old non-distributed system is that when
> the reaping occurs there, it does not involve running oc[sh]p commands,
> which lops a good chunk of time off the reaping process. On the distributed
> node, the reaping takes so long that it affects Nagios' scheduling and
> actual check execution times and thereby affects latencies.
>
> This seems like a serious impediment to normal functioning of a distributed
> Nagios setup. That is, in order to make all but the smallest distributed
> node setups work you have to come up with this roll-your-own setup. I
> haven't read the "new in Nagios 3" doc in a while. Is this something that
> is fixed in some way there?

I don't think so. I remember an email from Ton Voon some time ago asking
Ethan why the oc[hs]p commands are run serially, but I don't recall if
there was a reply or what else was said...

I believe it's documented, either in the official doc or in some
user-contributed doc, that the oc[hs]p commands should return as soon as
possible. It's usually done in Perl using a fork:

    if (fork == 0) {
        # Child: send stuff via NSCA here...
    }
    # The parent exits right away; the child exits here once it is done.
    exit(0);

Although it may work for you, that solution will not scale as well as my
OCP_Daemon because running the Perl script to fork takes some time. Just
as an example, running the following command on my Nagios server takes
between 1 and 2.5 seconds:

    $ time for ((i=0; i<100; i++)); do
        perl -e 'if (fork==0) {
          open (CAT, "|/bin/cat >/dev/null") or die $!;
          print CAT "$ARGV[0]\t$ARGV[1]\t$ARGV[2]\t$ARGV[3]\n";
          close (CAT);
        }' host service status result
      done

That's obviously not counting the time it takes for Nagios to process
the macros, set the environment, etc. Also, send_nsca will add much more
load to the system than a "cat >/dev/null". On any system running near
Nagios' limits that additional time will just be too much.
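
For comparison, the spool-and-batch idea raised earlier in this thread
(have the OCSP command just append the result to a file, then send the
file in bulk on a timer) can be sketched in shell. This is only an
illustration under stated assumptions and is not the OCP_Daemon code: the
names SPOOL, SENDER, queue_result and flush_batch are made up here, and
"cat" stands in for the real sender, which on a satellite would presumably
be a send_nsca invocation reading tab-separated lines on stdin.

```shell
#!/bin/sh
# Hypothetical names throughout (SPOOL, SENDER, queue_result, flush_batch).
SPOOL="${SPOOL:-/tmp/ocsp_spool}"
# Stand-in sender (assumption): a real setup would put a send_nsca command here.
SENDER="${SENDER:-cat}"

# Meant to be called by the ocsp command: append one tab-separated
# result (host, service, state id, output) to the spool and return at once.
queue_result() {
    printf '%s\t%s\t%s\t%s\n' "$1" "$2" "$3" "$4" >> "$SPOOL"
}

# Meant to be called periodically (cron or a loop): move the spool aside
# and feed it to the sender as a single batch.
flush_batch() {
    batch="$SPOOL.sending"
    # Retry a batch left over from a failed send before taking a new one.
    if [ ! -s "$batch" ]; then
        [ -s "$SPOOL" ] || return 0        # nothing queued
        mv "$SPOOL" "$batch" || return 1   # new results go to a fresh spool
    fi
    # Remove the batch only if the sender reported success.
    $SENDER < "$batch" && rm -f "$batch"
}
```

Calling queue_result a few times and then flush_batch sends all queued
lines in one sender process. Note that this still starts one process per
result, so it only takes the network send out of the reaping path; the
named-pipe approach described earlier in the thread is meant to avoid
even that.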

I don't know how many people use OCP_Daemon, but I've had reports from a
few people who greatly reduced their latency using it and I haven't had
any bugs reported yet. I believe it's well documented as well, but if you
have any feedback on this I'll be happy to get it.

Thomas