OCSP command design and FIFO locking?
f1216 at yahoo.com
Sun Sep 11 18:37:20 CEST 2005
Thanks for the detailed reply. I've tried to be a bit clearer
in the comments below.
--- Marc Powell <marc at ena.com> wrote:
> > -----Original Message-----
> > From: nagios-users-admin at lists.sourceforge.net [mailto:nagios-users-admin at lists.sourceforge.net] On Behalf Of Fred
> > Sent: Sunday, September 11, 2005 9:12 AM
> > To: Nagios User
> > Subject: [Nagios-users] OCSP command design and FIFO locking?
> >
> >
> > Does anyone have an idea why the OCSP command (for distributed monitoring)
> > would kick off more than one command at a time? For example, if a number
> > of checks have completed, nagios kicks off multiple OCSP scripts (submit
> > commands).
> Since the OCSP command can be and do anything, it must be run once per
> check. Nagios can't predict what you're using the OCSP command for and
> whether batching, as you seem to desire, would be applicable.
> Distributed monitoring is just one application of OCSP. If you really
> want the batching behavior, build it into your OCSP command.
It was my impression that this command was intended for distributed monitoring
and that other hooks exist to provide control to other types of commands
for other purposes.
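For what it's worth, batching inside the OCSP command, as suggested above, could be sketched roughly as follows. This is only an illustration under assumptions, not what I actually run: the spool path and the function names append_result and drain_results are hypothetical. Each OCSP invocation appends one line to a local spool file under an exclusive flock(), and a separate periodic job drains the spool in one batch.

```python
import fcntl


def append_result(spool_path, line):
    """Append one check result to the spool file under an exclusive lock.

    Called once per OCSP invocation; concurrent invocations queue on the lock.
    """
    with open(spool_path, "a") as spool:
        fcntl.flock(spool, fcntl.LOCK_EX)  # blocks until other writers finish
        spool.write(line.rstrip("\n") + "\n")
        spool.flush()
        # the lock is released when the file is closed at the end of the block


def drain_results(spool_path):
    """Return all spooled lines and empty the spool, under the same lock."""
    with open(spool_path, "a+") as spool:
        fcntl.flock(spool, fcntl.LOCK_EX)
        spool.seek(0)
        lines = spool.read().splitlines()
        spool.truncate(0)  # later appends start again from an empty file
        return lines
```

The periodic job would then forward the whole batch to the master in one connection instead of one per check.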
> > This causes the design of the submit command to need to throttle the
> > access to whatever resources it might need to touch. If using the default
> > send_nsca command, there can now be multiple (and many multiple)
> > send_nsca's kicked off and each of these on the target server will all be
> > attempting to write to the nagios FIFO. The nagios FIFO can get horribly
> > overloaded. If the nagios master daemon is not aggressively reading the
> > FIFO (command_check_interval=-1) then the daemons can stack up and
> > eventually consume socket resources and
> I handle approximately 3300 passive checks every 5 minutes on somewhat
> commodity hardware (quad pIII 800) using NSCA with no problems. I
> anticipate that I can double and possibly triple that number as the FIFO
> is empty approximately 1/3 of the time. Are you doing significantly more
> passive checks than that?
Most likely. On one installation I have over 1040 nodes and over 10,500
checks, 99% of which are passive and involve plug-ins that write to the
nagios.cmd FIFO. Each compute node defines 10 passive service check
definitions; each service node defines an additional 10 active checks.
The nsca daemon forks children to write to nagios.cmd as a result of a
send_nsca connection request. If some plug-in tries to write to this file
at the same time, there is a good chance that the two writes will be
interleaved unless both the nsca process and the plug-in observe some kind
of lock mechanism. This can also occur when nagios forks off multiple
service check plug-ins that all want to write to the FIFO. It took a system
configuration of about 120 or so nodes for this to start happening for me.
It wasn't consistent and it isn't fatal. If you looked closely, nagios.log
would report an invalid command, and nagios would then read the next line of
the FIFO and move on; however, the data from the damaged line would be lost.
Since I implemented a lock around writing to the FIFO in all my plug-ins,
this has not occurred. Note that in my smaller configurations I don't use
nsca, as there is no distributed monitoring. The contention in these smaller
systems is between concurrently running plug-ins.
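A minimal sketch of the kind of lock I mean, assuming every writer cooperates by taking the same advisory lock first. The separate lock-file path and the name write_command are hypothetical, and this is an illustration rather than my actual plug-in code. An advisory lock does nothing against a writer that ignores it, which is why nsca remains a problem.

```python
import fcntl
import os


def write_command(cmd_path, lock_path, line):
    """Write one complete command line to the command file under flock()."""
    data = (line.rstrip("\n") + "\n").encode()
    with open(lock_path, "a") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # wait for any other cooperating writer
        # Note: opening a real FIFO write-only blocks until a reader has it open.
        fd = os.open(cmd_path, os.O_WRONLY | os.O_APPEND)
        try:
            offset = 0
            while offset < len(data):  # loop in case of a short write
                offset += os.write(fd, data[offset:])
        finally:
            os.close(fd)
        # the lock is released when the lock file is closed
```

A plug-in that emits many lines would call this once per line, or hold the lock around the whole batch if the lines must stay together.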
> > memory etc. As far as I can tell, nsca doesn't lock the FIFO, which also
> > means that writes will get intermixed with writes from plug-ins that might
> > be running on the master system. (I have seen this over and over)
> I don't see how. Local active checks, at least the standard plugins,
> don't use nagios.cmd in any way. This would also be contrary to the
> blocking behavior you comment on above where your OS is essentially
> 'locking' the FIFO until it has been cleared. As far as your OS is
> concerned, there is no distinction between NSCA trying to write to the
> pipe and some other process doing the same. While others are more versed
> in this than I am, it is my understanding that if the program is trying
> to write more data to the pipe than it can currently hold it will be
> prevented from doing so by the OS, only one process can write to the
> FIFO at a time and that all writes are atomic. This presumes that the
> plugin output is < the max FIFO length supported by your OS.
I use few local active checks. Those that I do use are typically kicked
off to generate per-node data that is written to the nagios.cmd FIFO, one
line per node. With the FIFO on a 4k block filesystem, there isn't much
room before it fills. At about 80-120 chars per message, it only takes
30-50 messages to fill the FIFO, and then the plug-in is blocked waiting for
nagios to read it. If nagios only reads it every 15 seconds, it could easily
take over a minute to read 128 messages (128 nodes). More than one process
can write to a FIFO at a time; it is just a unix file opened for append. The
OS doesn't control this; the user application has to. It gets worse: if
nagios spins off more than one plug-in that in turn writes to the FIFO, and
each of those wants to write, say, 128 lines of data, they can easily toast
each other. Nagios does have a setting to keep the number of concurrent
processes to 1, but that seems to be too big a hammer for this problem. In
any case, locking between plug-ins (and wrapping any existing ones with
locks) works well. I also set my nagios daemon to read aggressively from the
FIFO; otherwise things start timing out (with a service check timeout of,
say, 60-120 seconds).
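The arithmetic behind those numbers can be checked with a rough model. It assumes the FIFO holds 4096 bytes (my figure; actual capacity depends on the OS) and that nagios empties it completely on each read. The function names are only for illustration.

```python
import math


def messages_per_fill(fifo_bytes, msg_bytes):
    """How many whole messages fit before the FIFO is full."""
    return fifo_bytes // msg_bytes


def drain_seconds(total_msgs, fifo_bytes, msg_bytes, read_interval_s):
    """Time for nagios to consume total_msgs if it empties the FIFO once
    per read interval (simplified: ignores scheduling and plug-in delays)."""
    reads = math.ceil(total_msgs / messages_per_fill(fifo_bytes, msg_bytes))
    return reads * read_interval_s


for size in (80, 120):
    print(size, messages_per_fill(4096, size), drain_seconds(128, 4096, size, 15))
# prints:
# 80 51 45
# 120 34 60
```

So even this optimistic model needs 45 to 60 seconds for 128 nodes, before any other delay is added.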
While I have few local checks, they are the core of my monitoring system, as
they are responsible for filling in all the per-node information for the
majority of the passive checks. For example, I have a syslog monitor plug-in
that parses the recent syslog messages, compares them against interesting
patterns, and then formats a line for each node that has something
interesting and writes it to the FIFO. For nodes that have no interesting
content, it formats a line that says nothing matched (if I didn't do that,
the service check would never be filled in, or it would go stale). Other
plug-ins report per-node statistics and write them to the FIFO in the same
way. Each node has passive check definitions for these results.
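As a rough illustration of the per-node lines, here is a sketch that builds one PROCESS_SERVICE_CHECK_RESULT external command per node, including the "nothing matched" case. The service name "syslog", the message texts, the choice of WARNING for a match, and the function names are assumptions for the example, not my real plug-in.

```python
import time


def format_passive_result(host, service, return_code, output, now=None):
    """Format one passive service check result as a Nagios external command."""
    ts = int(time.time() if now is None else now)
    output = output.replace("\n", " ")  # each command must stay on one line
    return f"[{ts}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{return_code};{output}"


def syslog_results(nodes, matches, service="syslog", now=None):
    """Build one line per node; nodes without matches still get an OK line
    so that their passive check does not go stale."""
    lines = []
    for node in nodes:
        hits = matches.get(node, [])
        if hits:
            text = f"{len(hits)} interesting message(s): {hits[0]}"
            lines.append(format_passive_result(node, service, 1, text, now))  # 1 = WARNING
        else:
            lines.append(format_passive_result(node, service, 0, "nothing matched", now))  # 0 = OK
    return lines
```

Each returned line would then be written to nagios.cmd by whatever locked writer the plug-ins share.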
> >
> > To avoid this, I have had to implement serious locking in all plug-ins and
> > not use nsca as it has no locking mechanism (that I know of).
> I'm curious about how you've done this. What exactly are you locking?
> How is it helping? NSCA shouldn't need locking as it depends on your OS
> to control access to the FIFO.
> > Right now I am fighting with the OCSP commands that can launch dozens of
> > copies at a time and each of these (in my case) write to a local file that
> > will eventually be pushed up to the master and written (while locking) to
> > the nagios FIFO.
> >
> > So ... I guess my questions are:
> >
> > 1) Should nagios be forking off more than one OCSP command at a time?
> Yes, one per check.
> > 2) Has anyone else run into FIFO corruption because of the lack of
> > advisory locking in all the plug-ins?
> Not here in almost 4 years of using Nagios/Netsaint.
Again, thanks for the input.
> --
> Marc
> -------------------------------------------------------
> SF.Net email is Sponsored by the Better Software Conference & EXPO
> September 19-22, 2005 * San Francisco, CA * Development Lifecycle Practices
> Agile & Plan-Driven Development * Managing Projects & Teams * Testing & QA
> Security * Process Improvement & Measurement * http://www.sqe.com/bsce5sf
> _______________________________________________
> Nagios-users mailing list
> Nagios-users at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-users
> ::: Please include Nagios version, plugin version (-v) and OS when reporting
> any issue.
> ::: Messages without supporting info will risk being sent to /dev/null