RFC/RFP Nagios command workers
Matthieu Kermagoret
mkermagoret at merethis.com
Mon May 23 11:37:24 CEST 2011
On Wed, May 18, 2011 at 4:43 PM, Andreas Ericsson <ae at op5.se> wrote:
> Since discussion on the last requests for comments and patches has
> splintered off and gotten somewhere, it's time for the next mail in
> the series of what us awesome gods of the Nagios core decided to
> work on for the next grand version of Nagios.
>
Congratulations! I'm glad to see Nagios' development moving forward!
> This idea comes from Shinken, mod_gearman and DNX which have all
> implemented versions of it, so creds and kudos to the authors of
> those projects.
>
> Currently, Nagios eats quite a lot of I/O when writing, scanning for
> and reading the check result files. This becomes especially noticeable
> in large installations. There's also the problem of Nagios using a
> lot more copied memory per fork than it's supposed to, and the fact
> that embedding scripting languages inside the Nagios core to speed
> up execution is a potentially disastrous action (as the debacle with
> embedded Perl has proven to be).
>
Good analysis, with which I totally agree.
> The idea to solve all of that is to fork() off a set of worker
> threads at startup that free()'s all possible memory and re-connects
> to the master process via a unix domain socket (or network socket
> that by default only listens to the localhost address) to receive
> requests to run commands and return the results of those commands.
>
While I agree that distributing check execution among multiple
processes can be a really good idea, I don't know if this should be
implemented in the Core. This can add significant complexity to the
code while not being useful to all Nagios users. The Core already has
a proper API that allows modules to execute checks themselves, so why
not rely on it for distribution and improve the existing command
execution mechanism?
As you say, one of the root problems of the current implementation is
the use of temporary files, as this consumes a lot of I/O when writing,
scanning and reading them. Also, the Nagios Core process is fork()ed
multiple times, and this might consume unnecessary CPU time. So I
propose the following:
1) Remove the multiple fork system to execute a command. The Nagios
Core process directly forks the process that will exec the command
(more or less sh's parsing of the command line; I don't really know if
this could/should be integrated in the Core).
2) The root process and the subprocess are connected with a pipe() so
that the command output can be fetched by reading the pipe. Nagios
will maintain a list of currently running commands.
3) The event loop will multiplex processes' I/O and process them as necessary.
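The three steps above could look roughly like the following. This is only a minimal sketch under my own assumptions, not Nagios code: the names running_cmd, start_cmd and event_loop, the fixed buffer sizes and the use of poll() and /bin/sh are all made up for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_CMDS   16    /* illustrative limits, not Nagios values */
#define MAX_OUTPUT 4096

/* Step 2: one entry in the list of currently running commands. */
struct running_cmd {
    pid_t  pid;                 /* child executing the command */
    int    fd;                  /* read end of the pipe, -1 once closed */
    char   output[MAX_OUTPUT];  /* collected stdout, NUL-terminated */
    size_t len;
    int    status;              /* wait status, valid once fd == -1 */
};

/* Step 1: a single fork(); the child execs the command through sh. */
static int start_cmd(struct running_cmd *rc, const char *cmdline)
{
    int p[2];
    pid_t pid;

    assert(rc != NULL && cmdline != NULL);
    if (pipe(p) < 0)
        return -1;
    pid = fork();
    if (pid < 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }
    if (pid == 0) {                      /* child */
        close(p[0]);
        dup2(p[1], STDOUT_FILENO);
        close(p[1]);
        execl("/bin/sh", "sh", "-c", cmdline, (char *)NULL);
        _exit(127);                      /* exec failed */
    }
    close(p[1]);                         /* parent keeps the read end only */
    fcntl(p[0], F_SETFD, FD_CLOEXEC);    /* do not leak it to later children */
    rc->pid = pid;
    rc->fd = p[0];
    rc->len = 0;
    rc->output[0] = '\0';
    rc->status = -1;
    return 0;
}

/* Step 3: multiplex the pipes until every command has finished. */
static int event_loop(struct running_cmd *cmds, size_t n)
{
    struct pollfd pfds[MAX_CMDS];
    size_t open_fds = 0;

    if (n > MAX_CMDS)
        return -1;
    for (size_t i = 0; i < n; i++)
        if (cmds[i].fd >= 0)
            open_fds++;
    while (open_fds > 0) {
        for (size_t i = 0; i < n; i++) {
            pfds[i].fd = cmds[i].fd;     /* poll() ignores negative fds */
            pfds[i].events = POLLIN;
            pfds[i].revents = 0;
        }
        if (poll(pfds, n, -1) < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        for (size_t i = 0; i < n; i++) {
            char buf[512];
            ssize_t r;

            if (cmds[i].fd < 0 || pfds[i].revents == 0)
                continue;
            r = read(cmds[i].fd, buf, sizeof(buf));
            if (r > 0) {                 /* append, truncating if too long */
                size_t room = MAX_OUTPUT - 1 - cmds[i].len;
                size_t take = (size_t)r < room ? (size_t)r : room;
                memcpy(cmds[i].output + cmds[i].len, buf, take);
                cmds[i].len += take;
                cmds[i].output[cmds[i].len] = '\0';
            } else if (r == 0 || errno != EINTR) {
                /* EOF or error: reap the child. A real loop would not
                 * block in waitpid() here; this is a simplification. */
                close(cmds[i].fd);
                cmds[i].fd = -1;
                waitpid(cmds[i].pid, &cmds[i].status, 0);
                open_fds--;
            }
        }
    }
    return 0;
}
```

A caller would fill an array of running_cmd with start_cmd() and then hand it to event_loop(). In the Core, the poll() call would of course be merged into the main event loop instead of running to completion, and timeouts would have to be handled.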
> This has several benefits, although they're not immediately user
> visible.
> * I/O load will decrease significantly, leaving more disk throughput
> capacity for performance data graphing or status data database
> solutions.
Still holds but to a smaller extent, as the "problem of Nagios using a
lot more copied memory per fork than it's supposed to" is not solved.
This could be solved with a module however, see below.
> * Scripting languages can be embedded regardless of memory leaks and
> whatnot, since worker daemons can be killed off and respawned every
> 50000 checks (or something), thus causing the kernel to clean up
> any and all leaked memory.
There could be modules that override checks and forward them to
interpreter daemons on a per-language basis for example.
> * Nagios core can be single-threaded, which means higher portability,
> less memory usage and more robust code.
Still holds.
> * Eventbroker modules that use a socket to communicate with an external
> daemon can instead register a handler for inbound packets and then
> simply "own" that connection and get all future packets from it
> forwarded as eventbroker events. This will ofcourse reduce the module
> complexity quite a bit for nearly all much-used modules today (Merlin,
> livestatus, DNX, mod_gearman, NDOUtils, etc...)
Still holds: instead of multiplexing on socket FDs, multiplex on pipe FDs.
> * It becomes possible to receive responses from Nagios when submitting
> commands (the current FIFO pipe is one-way communication only).
>
See discussion about the command pipe below.
> Drawbacks:
> * It's quite a large and invasive change to the nagios core which
> will require a lot of testing.
>
This would be a less invasive and smaller change but would still
require testing ;-)
The worker system could still be implemented and used only by users
who need it (but that's what DNX and mod_gearman do). I believe it is
better to leave the default command execution system as simple as it
is right now (but improve it) and leave distribution algorithms to
modules. I can imagine multiple reasons why one would want to
distribute checks among workers:
- less overhead per fork() (the problem you raised)
- embedded interpreter (you raised this also)
- per network (the worker closer to a node executes its check)
- randomly (clustering)
- ...
So I don't know if embedding a particular policy within the Core is a
good thing. I'd rather see an official module (that might be included
by default) for the workers system.
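To illustrate why I would keep the policy out of the Core, here is a rough sketch of how a workers module might make the choice of worker pluggable. Everything in it is hypothetical: the worker and check structures, pick_worker_fn and the two policies are names I invented for the example and do not exist in Nagios.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical types, invented for this example. */
struct worker {
    const char *name;
    const char *network;   /* network the worker sits on */
    unsigned    running;   /* checks currently in flight */
};

struct check {
    const char *host;
    const char *network;   /* network of the node to check */
};

/* A policy returns the index of the worker that should run the check. */
typedef size_t (*pick_worker_fn)(const struct worker *w, size_t n,
                                 const struct check *c);

/* Policy: spread the load, pick the worker with the fewest checks. */
static size_t pick_least_loaded(const struct worker *w, size_t n,
                                const struct check *c)
{
    size_t best = 0;

    (void)c;               /* this policy ignores the check itself */
    for (size_t i = 1; i < n; i++)
        if (w[i].running < w[best].running)
            best = i;
    return best;
}

/* Policy: prefer a worker on the node's network, else least loaded. */
static size_t pick_same_network(const struct worker *w, size_t n,
                                const struct check *c)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(w[i].network, c->network) == 0)
            return i;
    return pick_least_loaded(w, n, c);
}

/* The module only calls the configured policy; n must be at least 1. */
static size_t dispatch(pick_worker_fn policy, struct worker *w, size_t n,
                       const struct check *c)
{
    size_t i;

    assert(n > 0);
    i = policy(w, n, c);
    w[i].running++;
    return i;
}
```

With such an interface, swapping pick_least_loaded for pick_same_network (or for a random pick) is a module configuration matter and the Core never needs to know about it.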
> Please note that a compatibility daemon which continues to parse the
> simple FIFO will ofcourse have to be implemented so that current scripts
> and whatnot keep on working, and the API to scan for and read check
> result files will also remain for the foreseeable future, although
> possibly implemented as an external helper program which can ship
> check results into the Nagios socket instead.
>
So in fact you plan to remove the old FIFO and do everything through
the socket? What about acknowledgements or downtimes? Could they be
sent through the socket too, or would there be another system?
Best regards,
--
Matthieu KERMAGORET | Developer
mkermagoret at merethis.com
MERETHIS is the publisher of the Centreon software.
_______________________________________________
Nagios-devel mailing list
Nagios-devel at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-devel