How to reduce a very high latency number
Jacob Ritorto
jritorto at nut.net
Tue May 23 15:48:19 CEST 2006
Greetings,
A colleague of mine (poctum) and I ran into something like
this while using nsca and have crafted a similar solution. We
observed that send_nsca was sending only one result to the central
Nagios server per connection, even though testing revealed that
send_nsca can handle thousands of results per connection. Sending only
one at a time caused a lot of dropped data, because about 5 results
were being generated per second. We enabled aggregate_status_updates
in the nagios.cfg file, but this yielded no improvement in result
submission. BTW, this is Nagios-2.2 and nsca-2.6 on Solaris 10. Our
workaround is quick and dirty but efficient. It may not be as refined
as trask's, and it relies on nuances of unix file handling to get the
job done. That said, it's working perfectly for us. Since this seems
to work well but may violate Ethan's design intentions, your
feedback/input is requested. Deploy at your own risk.
Jacob Ritorto, Lead UNIX Server Operations Engineer
InnovationsTech
Here's our solution:
1) Altered last line in
/opt/nagios/libexec/eventhandlers/submit_check_result thusly. It
basically concatenates check results to a temp file.
#/bin/printf "%s\t%s\t%s\t%s\n" "$1" "$2" "$return_code" "$4" | /opt/nagios/bin/send_nsca 172.16.x.x -c /opt/nagios/etc/send_nsca.cfg
/bin/printf "%s\t%s\t%s\t%s\n" "$1" "$2" "$return_code" "$4" >> /opt/nagios/var/results.waiting
2) Created a daemon process called reap (managed by smf, but it has
been up for a month so far, so may be ok as an init.d script) to pull
aside the aforementioned temp file (results.waiting) every five
seconds and send the bits off to the central Nagios server (note that
original file is re-created immediately via step 1 above). This
probably only works perfectly on unix & unix-like systems due to the
nature of files hanging around intact until the last program
referencing them has exited. It's been some time, but the last I
checked, DOS/WINxxxx doesn't treat files this way. Here's the simple
little reap daemon:
# cat /opt/nagios/bin/reap
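#!/usr/bin/tcsh
while (1)
    sleep 5
    # Skip this pass if nothing was queued, so an old results.sending is not resent.
    if ( -e /opt/nagios/var/results.waiting ) then
        mv /opt/nagios/var/results.waiting /opt/nagios/var/results.sending
        cat /opt/nagios/var/results.sending | /opt/nagios/bin/send_nsca 172.16.x.x -c /opt/nagios/etc/send_nsca.cfg >/dev/null
    endif
end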
Summary: Slave Nagios servers now store up check results in the temp
file for 5 seconds, then they get shipped off to nsca on the central
Nagios machine in one swoop instead of one-at-a-time.
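For anyone who wants to try the idea outside of tcsh, the two steps above amount to a spool-and-batch pattern. Here is a minimal sh sketch of it, not the scripts we actually run. SPOOL_DIR, SEND_CMD, queue_result and flush_results are names made up for illustration, and SEND_CMD defaults to cat so the sketch can be run without an nsca server.

```shell
#!/bin/sh
# Sketch of the spool-and-batch pattern. All names here are illustrative.
SPOOL_DIR="${SPOOL_DIR:-/tmp/nsca-spool.$$}"
# Stand-in sender. In real use this would be something like:
#   /opt/nagios/bin/send_nsca 172.16.x.x -c /opt/nagios/etc/send_nsca.cfg
SEND_CMD="${SEND_CMD:-cat}"

# Append one tab-separated result (host, service, return code, output)
# to the spool file instead of opening a connection per result.
queue_result() {
    mkdir -p "$SPOOL_DIR" || return 1
    printf '%s\t%s\t%s\t%s\n' "$1" "$2" "$3" "$4" >> "$SPOOL_DIR/results.waiting"
}

# Move the spool aside and feed the whole batch to one sender process.
flush_results() {
    [ -s "$SPOOL_DIR/results.waiting" ] || return 0   # nothing queued
    mv "$SPOOL_DIR/results.waiting" "$SPOOL_DIR/results.sending" || return 1
    # SEND_CMD is left unquoted on purpose so its arguments are split.
    $SEND_CMD < "$SPOOL_DIR/results.sending" && rm -f "$SPOOL_DIR/results.sending"
}
```

To mimic the reap daemon, call flush_results from a loop with a "sleep 5" between passes and set SEND_CMD to your real send_nsca command line.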
*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~
From: Trask <trasko at gm...>
Re: How to reduce a very high latency number
2006-05-23 03:50
On 5/22/06, srunschke at abit.de <srunschke at abit.de> wrote:
> nagios-users-admin at lists.sourceforge.net schrieb am 17.05.2006 20:09:16:
>
> To me this is obviously a performance issue related to hardware.
> Your machines have way too few RAM. It is totally not possible to
> run 1800 checks on a 512MB machine in a timely manner.
>
I figured this out this past Saturday. It is not any lack of
hardware. I was seeing neither significant load nor excessive use of
memory. No configuration change I made seemed to have any appreciable
effect on the latency times I was getting. I ended up running "top"
at 1-second intervals and just watching it for a while. I noticed
that sometimes there would be a good number of nagios processes
(20, 30, 40 or so), but the majority of the time there were only 2, 3
or 4 processes. Although I do not know exactly *why* this was happening,
it turns out that during the times when there were 2-4 processes running,
2 of them were always the "submit_passive_check" script and
"send_nsca". It appears that this is being done serially (i.e. not in
parallel) and ends up blocking subsequent checks until they are done.
I would see these 2 processes running (with steadily increasing PIDs)
for up to a minute and then a short-lived (4-5 seconds) "explosion" of
nagios processes (service/host checks). After this flurry of
activity, it would be another 60 seconds or so of just 2-4 processes.
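If you would rather log this pattern than watch "top", something along these lines should show the same thing. It is only a sketch: count_procs and watch_procs are made-up names, and it assumes the processes show up in "ps -e -o comm=" under the names nagios and send_nsca.

```shell
#!/bin/sh
# Count the lines on stdin whose first field equals the given command name.
count_procs() {
    awk -v name="$1" '$1 == name { n++ } END { print n + 0 }'
}

# Print a timestamped count of nagios and send_nsca processes once a second.
# The process names are assumptions; adjust them to what ps shows on your box.
watch_procs() {
    while :; do
        printf '%s nagios=%s send_nsca=%s\n' "$(date +%T)" \
            "$(ps -e -o comm= | count_procs nagios)" \
            "$(ps -e -o comm= | count_procs send_nsca)"
        sleep 1
    done
}
```

Run watch_procs for a few minutes. Long stretches with a low nagios count and a send_nsca always present would match what I describe above.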
I resolved this problem by changing my "submit_passive_check" script.
Below are some sample scripts, both old and new. The short of it is
like this: Previously, the "submit_passive_check" script did a printf
of the data in the appropriate format and piped it to the "send_nsca"
command (in a shell script). I have eliminated this bottleneck by
having "submit_passive_check" redirect its output to a named pipe and
then having another script feed "send_nsca" with that data as it comes
in to the named pipe.
Latency times have dropped from 600-700 seconds to 0.2 seconds on
the worst server and from 45-55 seconds to 0.06 seconds on the 2nd worst.
That's more like it!
Below are a few scripts w/ notes as to what each one is. Thanks to
everyone who offered help.
~trask
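For illustration only, here is a minimal sh sketch of the named-pipe hand-off described in this message. It is not the set of scripts the message refers to. FIFO, SEND_CMD, submit_result, feed_once and feed_loop are hypothetical names, and SEND_CMD defaults to cat so the sketch runs without an nsca server.

```shell
#!/bin/sh
# Sketch of a named-pipe hand-off. All names here are illustrative.
FIFO="${FIFO:-/tmp/nsca.fifo.$$}"
# Stand-in sender; in real use this would be a send_nsca command line.
SEND_CMD="${SEND_CMD:-cat}"

# Writer side: what a submit script would do instead of piping straight
# into send_nsca. The write blocks only until a reader has the pipe open.
submit_result() {
    printf '%s\t%s\t%s\t%s\n' "$1" "$2" "$3" "$4" > "$FIFO"
}

# Reader side, one pass: hand whatever the current writers send to a
# single sender process, returning when the last writer closes the pipe.
feed_once() {
    # SEND_CMD is left unquoted on purpose so its arguments are split.
    $SEND_CMD < "$FIFO"
}

# Reader side, forever: create the pipe if needed and keep draining it.
feed_loop() {
    [ -p "$FIFO" ] || mkfifo "$FIFO" || return 1
    while :; do
        feed_once
    done
}
```

Start "feed_loop &" once (for example from an init script), then have the check-submission script call submit_result. The design point is that the submitting process no longer waits for send_nsca to finish its network round trip.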