Problem with high latencies after going distributed

Frost, Mark {PBG} mark.frost1 at pepsi.com
Wed Jan 23 16:41:21 CET 2008


 

>-----Original Message-----
>From: Thomas Guyot-Sionnest [mailto:dermoth at aei.ca] 
>Sent: Tuesday, January 22, 2008 10:29 PM
>To: Frost, Mark {PBG}
>Cc: Nagios Users
>Subject: Re: [Nagios-users] Problem with high latencies after 
>going distributed
>
>-----BEGIN PGP SIGNED MESSAGE-----
>Hash: SHA1
>
>On 22/01/08 09:13 PM, Frost, Mark {PBG} wrote:
>>  
>> 
>>> -----Original Message-----
>>> From: Steve Shipway [mailto:s.shipway at auckland.ac.nz] 
>>> Sent: Tuesday, January 22, 2008 8:45 PM
>>> To: Frost, Mark {PBG}; Nagios Users
>>> Subject: RE: [Nagios-users] Problem with high latencies after 
>>> going distributed
>>>
>>> We've just done exactly the same (Nagios 2.9), and we have a
>>> comparable size of system (actually a bit larger - 713 hosts,
>>> 5834 services).  After going distributed, we too have this
>>> insanely high latency on the satellites.
>>>
>>> The only possible cause is the OCSP command slowing things down
>>> somehow.  This is using the supplied send_nsca call to send the
>>> status off to the central server...
>>>
>>> define command {
>>>    command_name    relay
>>>    command_line    $USER1$/submit_check_result "$HOSTNAME$"
>>> "$SERVICEDESC$" "$SERVICESTATEID$" "$SERVICEOUTPUT$"
>>> }
>>>
>>> So it should work.  I guess things would be better if it packaged
>>> the updates up into batches, although it can't do that normally.
>>>
>>> I think it might be better to make the OCSP command just dump the
>>> status to a file, and then have a cron job every 60 seconds that
>>> reads the file and sends the statuses off as a batch.  I will try
>>> this here when I get the chance.
>>>
>>> Steve
>> 
>> 
>> But if the submit_check_result is running slowly, that would only
>> affect the service execution time, wouldn't it?  My understanding of
>> check latency is that it's the difference in time between when Nagios
>> schedules a check to run versus the time that the check actually
>> starts to execute.
>
>You're right, but you're just missing one detail. Nagios runs checks
>in parallel and then reaps all the service results at once. While it's
>reaping it can't schedule other checks, and it is in this reaping state
>that Nagios runs host checks, event handlers, performance data commands
>and oc[hs]p commands. All of this is done serially and can
>significantly slow down each service reaping run, thus delaying the
>execution of further checks.
>
>Although I never built a distributed system, I designed mine to be
>easily distributed. Moreover, I used a technique I developed for
>latency-free performance-data processing (which I still heavily use,
>BTW) to create a way to ship check results to a central server in the
>same latency-free way (it was more of a fun project, as I don't use it
>myself yet).
>
>Basically you use the host/service performance data files to get the
>data, but instead of writing to a file you write it to a named pipe
>(fifo). That pipe is then read by a high-performance, non-blocking,
>event-based Perl daemon (yeah, I know those look like marketing terms,
>but I can explain each of them further if you like) that forks
>send_nsca processes to send results in bulk (normally every few
>seconds).
>
>So Nagios doesn't even lose time rotating a file, and all your checks
>are transmitted almost instantly. See this wiki page for details and
>code:
>
>http://www.nagioscommunity.org/wiki/index.php/OCP_Daemon
>
>
>Thomas
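
[Editorial note: for concreteness, the Nagios side of the hookup Thomas
describes might look roughly like the nagios.cfg fragment below. This is
a sketch only -- the fifo path and template are illustrative, and the
actual daemon and configuration live on the wiki page above.]

```
# nagios.cfg -- illustrative fragment
# Create the pipe first:  mkfifo /var/nagios/service-perfdata.fifo
process_performance_data=1
service_perfdata_file=/var/nagios/service-perfdata.fifo
service_perfdata_file_mode=w
service_perfdata_file_template=$TIMET$\t$HOSTNAME$\t$SERVICEDESC$\t$SERVICESTATEID$\t$SERVICEOUTPUT$
# No periodic file processing: the daemon on the other end of the
# fifo consumes results as soon as they are written.
```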

Interesting.  Thanks for the explanation.  If I understand this right,
the reason I don't see this issue on my old non-distributed system is
that reaping there doesn't involve running an oc[sh]p command, which
eats up a good chunk of time on each reaping pass.  On the distributed
node, the reaping takes so long that it affects Nagios' scheduling and
actual check execution times, and thereby the latencies.
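
[Editorial note: Steve's spool-and-batch suggestion earlier in the
thread could be sketched as the following pair of shell functions. All
names, paths and the send_nsca invocation are illustrative, not taken
from the thread.]

```shell
# Hypothetical spool-and-flush pair for the cron-batch approach.
# All paths and names here are illustrative.

SPOOL=${SPOOL:-/var/spool/nagios/checkresults.spool}
# Command used to ship a batch; override SEND_NSCA when testing.
SEND_NSCA=${SEND_NSCA:-"send_nsca -H central-server -c /etc/nagios/send_nsca.cfg"}

# ocsp_spool: configured as the OCSP command in place of a direct
# send_nsca call; appends one tab-delimited result line and returns
# immediately.  Args: hostname, service description, state id, output.
ocsp_spool() {
    printf '%s\t%s\t%s\t%s\n' "$1" "$2" "$3" "$4" >> "$SPOOL"
}

# flush_spool: run from cron every 60 seconds; rename the spool
# (atomic on the same filesystem, so appenders never see a partial
# file) and feed the whole batch to a single send_nsca process.
flush_spool() {
    [ -s "$SPOOL" ] || return 0
    mv "$SPOOL" "$SPOOL.batch"
    $SEND_NSCA < "$SPOOL.batch"
    rm -f "$SPOOL.batch"
}
```

This way the OCSP command itself is just an append to a local file, so
each reaping pass pays almost nothing, and a single send_nsca process
per minute ships the whole batch.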

This seems like a serious impediment to the normal functioning of a
distributed Nagios setup.  That is, in order to make all but the
smallest distributed setups work, you have to come up with a
roll-your-own solution like this.  I haven't read the "new in Nagios 3"
doc in a while.  Is this something that is fixed in some way there?

Thanks

Mark

_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue. 
::: Messages without supporting info will risk being sent to /dev/null




