How to reduce a very high latency number

Marc Powell marc at ena.com
Wed May 24 16:47:07 CEST 2006


I have 3 separate central servers (2 production and 1 for testing), all
running nsca as a daemon and all receiving the same 3590 passive service
results every 5 minutes. I've never had a problem with missed checks or
high latency numbers. Stats from my clients --

Client1 submits 504(x3) passive checks -
Metric                  Min.      Max.       Average
Check Execution Time:   0.01 sec  21.40 sec  3.394 sec
Check Latency:          0.01 sec  2.18 sec   0.216 sec
Percent State Change:   0.00%     5.72%      0.07%

Client2 submits 1260(x3) passive checks -
Metric                  Min.      Max.       Average
Check Execution Time:   0.04 sec  35.03 sec  6.789 sec
Check Latency:          0.00 sec  3.95 sec   0.839 sec
Percent State Change:   0.00%     11.84%     0.08%

Client3 submits 824(x3) passive checks -
Metric                  Min.      Max.       Average
Check Execution Time:   0.09 sec  15.79 sec  7.999 sec
Check Latency:          0.01 sec  7.97 sec   2.034 sec
Percent State Change:   0.00%     5.86%      0.02%

Client4 submits 293(x3) passive checks -
Metric                  Min.      Max.       Average
Check Execution Time:   0.09 sec  10.36 sec  7.584 sec
Check Latency:          0.00 sec  1.15 sec   0.249 sec
Percent State Change:   0.00%     12.11%     0.12%

Client5 submits 720(x3) passive checks -
Metric                  Min.      Max.       Average
Check Execution Time:   0.11 sec  16.81 sec  8.844 sec
Check Latency:          0.00 sec  6.81 sec   1.009 sec
Percent State Change:   0.00%     11.84%     0.04%

All very reasonable to me. Clients 3-5 are single proc PIII-800's
running nagios and cricket and are due to be upgraded but even an
average latency of 2 seconds there is nothing to fret over at all.

--
Marc

> -----Original Message-----
> From: nagios-users-admin at lists.sourceforge.net
> [mailto:nagios-users-admin at lists.sourceforge.net] On Behalf Of Morris, Patrick
> Sent: Wednesday, May 24, 2006 3:54 AM
> To: Greg Cope; Jacob Ritorto
> Cc: nagios-users at lists.sourceforge.net
> Subject: RE: [Nagios-users] Re: How to reduce a very high latency number
> 
> How are you guys running the nsca daemon?  I've got systems that
> perform thousands of checks with no problem.
> 
> I'm looking at a system right now that submits over 5300 checks to a
> central server running nsca via xinetd, and it has an average service
> latency of .153 secs.
> 
> -----Original Message-----
> From: nagios-users-admin at lists.sourceforge.net
> [mailto:nagios-users-admin at lists.sourceforge.net] On Behalf Of Greg Cope
> Sent: Wednesday, May 24, 2006 1:47 AM
> To: Jacob Ritorto
> Cc: nagios-users at lists.sourceforge.net
> Subject: Re: [Nagios-users] Re: How to reduce a very high latency number
> 
> Jacob,
> 
> I noticed the same thing today.
> 
> We run a few distributed servers that do about 150 checks (at the
> moment) and submit this to our central server.
> 
> That's a lot of send_nsca processes that get spawned.
> 
> I like your fix!
> 
> send_nsca does not appear to be scalable for those running lots of
> passive checks with distributed systems.
> 
> Greg
> 
> On Tue, 2006-05-23 at 09:48 -0400, Jacob Ritorto wrote:
> > Greetings,
> >        A colleague of mine (poctum) and I ran into something like
> > this while using nsca and have crafted a similar solution.  We
> > observed that send_nsca was sending only one result to the central
> > Nagios server per connection.  Testing revealed that send_nsca was
> > capable of handling thousands of results per connection.  Sending
> > only one at a time was resulting in lots of dropped data because
> > there were nominally about 5 results derived per second.  We enabled
> > aggregate_status_updates in the nagios.cfg file, but this yielded no
> > improvement in the result submissions.  BTW, this is Nagios-2.2 and
> > nsca-2.6 on Solaris 10.  Our workaround is a quick and dirty but
> > efficient solution.  It may not be as refined as trask's and relies
> > on nuances of unix file handling algorithms to get the job done.
> > That said, it's working perfectly for us.  As this seems to work
> > well, but may violate Ethan's design intentions, your feedback/input
> > is requested.  Deploy at your own risk.
> >
> > Jacob Ritorto, Lead UNIX Server Operations Engineer InnovationsTech
> >
> > Here's our solution:
> >
> > 1) Altered last line in
> > /opt/nagios/libexec/eventhandlers/submit_check_result thusly.  It
> > basically concatenates check results to a temp file.
> >
> > #/bin/printf "%s\t%s\t%s\t%s\n" "$1" "$2" "$return_code" "$4" |
> > #  /opt/nagios/bin/send_nsca 172.16.x.x -c /opt/nagios/etc/send_nsca.cfg
> >
> > /bin/printf "%s\t%s\t%s\t%s\n" "$1" "$2" "$return_code" "$4" >> \
> >   /opt/nagios/var/results.waiting
> >
> >
> > 2) Created a daemon process called reap (managed by smf, but it has
> > been up for a month so far, so may be ok as an init.d script) to
> > pull aside the aforementioned temp file (results.waiting) every five
> > seconds and send the bits off to the central Nagios server (note
> > that original file is re-created immediately via step 1 above).
> > This probably only works perfectly on unix & unix-like systems due
> > to the nature of files hanging around intact until the last program
> > referencing them has exited.  It's been some time, but the last I
> > checked, DOS/WINxxxx doesn't treat files this way.  Here's the
> > simple little reap daemon:
> >
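> > # cat /opt/nagios/bin/reap
> > #!/usr/bin/tcsh
> > # Every 5 seconds, move the pending results aside and send them as one batch.
> > while (1)
> >   sleep 5
> >   # Skip the cycle when nothing is waiting, so an old batch is not re-sent.
> >   if ( -e /opt/nagios/var/results.waiting ) then
> >     mv /opt/nagios/var/results.waiting /opt/nagios/var/results.sending
> >     cat /opt/nagios/var/results.sending | /opt/nagios/bin/send_nsca \
> >       172.16.x.x -c /opt/nagios/etc/send_nsca.cfg >/dev/null
> >   endif
> > end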
> >
> >
> > Summary:  Slave Nagios servers now store up check results in the
> > temp file for 5 seconds, then they get shipped off to nsca on the
> > central Nagios machine in one swoop instead of one-at-a-time.
> >
> >
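The batching step described above can also be sketched in plain POSIX sh.
This is an illustration only, not the script from the post: the names
SPOOL, SEND_CMD and flush_results are hypothetical, and the default
send_nsca invocation is simply copied from the quoted message (including
its elided address). It assumes send_nsca reads one tab-separated result
per line on stdin, as the thread describes.

```shell
#!/bin/sh
# Sketch of the batching step: rotate the spool file, then send it in one go.
# SPOOL and SEND_CMD are hypothetical names; override them for your site.
SPOOL="${SPOOL:-/opt/nagios/var/results.waiting}"
SEND_CMD="${SEND_CMD:-/opt/nagios/bin/send_nsca 172.16.x.x -c /opt/nagios/etc/send_nsca.cfg}"

flush_results() {
    # Nothing queued since the last cycle: do not re-send an old batch.
    [ -s "$SPOOL" ] || return 0
    batch="$SPOOL.sending.$$"
    # A rename within one filesystem is atomic, so a writer appending with
    # ">>" either hits the old file (now the batch) or creates a fresh spool.
    mv "$SPOOL" "$batch" || return 1
    # One connection carries every queued result.
    if $SEND_CMD < "$batch" > /dev/null; then
        rm -f "$batch"
    else
        # Keep the batch on failure so it can be inspected or retried.
        echo "send failed, batch kept in $batch" >&2
        return 1
    fi
}
```

A driver would call flush_results in a loop with a sleep of about five
seconds between calls. One caveat of this design: a writer that still has
the old file open while the batch is being read could append a line after
the sender has finished, and that line would be lost; with short printf
appends the window is small but not zero.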
> > *~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~
> >
> >
> >
> > From: Trask <trasko at gm...>
> > Re: How to reduce a very high latency number
> > 2006-05-23 03:50
> >
> > On 5/22/06, srunschke at abit.de <srunschke at abit.de> wrote:
> > > nagios-users-admin at lists.sourceforge.net wrote on 17.05.2006 20:09:16:
> > >
> > > To me this is obviously a performance issue related to hardware.
> > > Your machines have far too little RAM. It is simply not possible to
> > > run 1800 checks on a 512MB machine in a timely manner.
> > >
> >
> > I figured this out this past Saturday.  It is not any lack of
> > hardware.  I was seeing negligible load and no excessive use of
> > memory.  No configuration change I made seemed to have any
> > appreciable effect on the latency times I was getting.  I ended up
> > doing a "top" with 1 second intervals and just watching it for a
> > while.  I noticed that sometimes there would be a good number of
> > nagios processes, 20-30-40 or so, but the majority of the time there
> > were only 2, 3 or 4 processes.  Although I do not know exactly *why*
> > this was happening, it turns out that during the times when there
> > were 2-4 processes running, 2 of them were always the
> > "submit_passive_check" script and "send_nsca".  It appears that this
> > is being done serially (i.e. not in parallel) and ends up blocking
> > subsequent checks until they are done.  I would see these 2
> > processes running (with steadily increasing PIDs) for up to a minute
> > and then a short-lived (4-5 seconds) "explosion" of nagios processes
> > (service/host checks).  After this flurry of activity, it would be
> > another 60 seconds or so of just 2-4 processes.
> >
> > I resolved this problem by changing my "submit_passive_check"
> > script.  Below are some sample scripts, both old and new.  The short
> > of it is like this:  Previously, the "submit_passive_check" script
> > did a printf of the data in the appropriate format and piped it to
> > the "send_nsca" command (in a shell script).  I have eliminated this
> > bottleneck by having "submit_passive_check" redirect its output to a
> > named pipe and then having another script feed "send_nsca" with that
> > data as it comes in to the named pipe.
> >
> > Latency times have dropped from 600-700 seconds to 0.2 seconds on
> > the worst server and from 45-55 seconds to 0.06 on the 2nd to worst.
> > That's more like it!
> >
> > Below are a few scripts w/ notes as to what each one is.  Thanks to
> > everyone who offered help.
> 
> 
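The scripts mentioned in the quoted message are not included in this
quote. As a rough illustration of the named-pipe hand-off it describes,
here is a minimal POSIX sh sketch. It is not Trask's script: FIFO,
SEND_CMD, submit_result and drain_fifo are hypothetical names, the path is
borrowed from the other quoted message, and the send_nsca invocation is
copied as posted there.

```shell
#!/bin/sh
# Sketch of a named-pipe hand-off. FIFO and SEND_CMD are hypothetical names.
FIFO="${FIFO:-/opt/nagios/var/nsca.fifo}"
SEND_CMD="${SEND_CMD:-/opt/nagios/bin/send_nsca 172.16.x.x -c /opt/nagios/etc/send_nsca.cfg}"

# Writer side: what the submit script would do instead of piping to send_nsca.
submit_result() {
    # $1=host  $2=service  $3=return code  $4=plugin output
    printf '%s\t%s\t%s\t%s\n' "$1" "$2" "$3" "$4" > "$FIFO"
}

# Reader side: one pass feeds the sender until every writer has closed the pipe.
drain_fifo() {
    [ -p "$FIFO" ] || mkfifo "$FIFO" || return 1
    $SEND_CMD < "$FIFO" > /dev/null
}
```

A long-running reader would call drain_fifo in an endless loop, since the
sender sees end-of-file whenever the last writer closes the pipe. Note
that opening a FIFO for writing blocks until a reader has it open, so
with this design the reader must be kept running or the submit script
itself will stall.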
> _______________________________________________
> Nagios-users mailing list
> Nagios-users at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-users
> ::: Please include Nagios version, plugin version (-v) and OS when
> reporting any issue.
> ::: Messages without supporting info will risk being sent to /dev/null

