Core 4 Remote Workers
Andreas Ericsson
ae at op5.se
Sun Feb 3 00:52:56 CET 2013
On 02/02/2013 03:12 PM, Eric Stanley wrote:
> All,
>
> I've been giving some thought to remote workers for core 4 and wanted to
> run those thoughts by this list. I see remote workers as a very useful
> extension to the worker concept in core 4.
>
> To implement remote workers, I think there are about 4 basic things that
> would need to be done.
> 1. Implement the ability to listen to multiple query handler interfaces
> (precursor to #2)
This is trivial. Simply create an additional socket and set it up with
the iobroker the exact same way everything else is handled.
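Something along these lines would do it (sketch only; the handler is
whatever input handler the query handler already registers for its
first socket, and the flags are the same ones the current qh socket
uses):

  /* sketch: one extra listening socket, registered with the
   * iobroker exactly like the existing qh socket */
  int qh_listen_extra(iobroker_set *iobs, const char *path,
                      int (*qh_input)(int, int, void *))
  {
      int sd = nsock_unix(path, NSOCK_TCP | NSOCK_UNLINK);
      if (sd < 0)
          return sd;
      return iobroker_register(iobs, sd, NULL, qh_input);
  }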
> 2. Implement the ability to create and listen on TCP socket query
> handler interfaces.
This is also trivial, and the name "nsock_unix()" sort of suggests that
there will be an "nsock_inet()" coming along to keep it company (which
has been the thought all along).
However, I've always intended for that to be a separate daemon, which
can live in a chroot jail and only forward requests to the main Nagios
daemon that it knows is kosher. That would keep us from having to do
all the input validation and such in the core.
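For the record, a minimal nsock_inet() would look roughly like this
(sketch; real code would return the proper nsock error codes instead
of -1):

  #include <string.h>
  #include <unistd.h>
  #include <arpa/inet.h>
  #include <netinet/in.h>
  #include <sys/socket.h>

  /* sketch: create a bound, listening TCP socket */
  int nsock_inet(const char *addr, unsigned short port)
  {
      struct sockaddr_in sain;
      int sd, one = 1;

      memset(&sain, 0, sizeof(sain));
      sain.sin_family = AF_INET;
      sain.sin_port = htons(port);
      sain.sin_addr.s_addr = INADDR_ANY;
      if (addr && *addr && inet_pton(AF_INET, addr, &sain.sin_addr) != 1)
          return -1;

      if ((sd = socket(AF_INET, SOCK_STREAM, 0)) < 0)
          return -1;
      setsockopt(sd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
      if (bind(sd, (struct sockaddr *)&sain, sizeof(sain)) < 0 ||
          listen(sd, 3) < 0) {
          close(sd);
          return -1;
      }
      return sd;
  }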
> 3. Add a host key to the worker registration to allow workers to specify
> the host(s) for which it will handle checks.
Not really difficult, although I suspect one will want to use groups
instead of specific hosts, and also to use the address the other node
connects from as the host to monitor (so one can have self-monitoring
servers that phone in to Nagios with their results).
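The registration a worker does today is just a keyword=value string
sent to the query handler, so sneaking an extra key or two in is easy.
From memory, something like this (hostgroups= being the hypothetical
new key):

  @wproc register name=remote-worker1;pid=4242;hostgroups=core-routers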
> 4. Write a stand-alone remote worker that can connect to the core
> instance via TCP.
>
Trivial, since lib/worker.c contains 99% of the code needed to write a
worker.
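Roughly this, assuming enter_worker() keeps its current shape (the
callback is whatever function launches the jobs; worker.c's own is
static, so a stand-alone worker brings its own or we export it, and
the port and address here are invented):

  #include <string.h>
  #include <unistd.h>
  #include <arpa/inet.h>
  #include <netinet/in.h>
  #include <sys/socket.h>
  #include "nsock.h"   /* lib/nsock.h */
  #include "worker.h"  /* lib/worker.h */

  int my_start_cmd(child_process *cp); /* the job launcher */

  int main(void)
  {
      struct sockaddr_in sain;
      int sd;

      if ((sd = socket(AF_INET, SOCK_STREAM, 0)) < 0)
          return 1;
      memset(&sain, 0, sizeof(sain));
      sain.sin_family = AF_INET;
      sain.sin_port = htons(5668); /* invented port */
      inet_pton(AF_INET, "192.0.2.1", &sain.sin_addr); /* the core */
      if (connect(sd, (struct sockaddr *)&sain, sizeof(sain)) < 0)
          return 1;

      /* register, then let lib/worker.c run the show */
      nsock_printf_nul(sd, "@wproc register name=remote1;pid=%d",
                       (int)getpid());
      return enter_worker(sd, my_start_cmd);
  }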
> The reason I have steps 1 and 2, instead of combining them is first,
> because a generalized solution is more extensible and second, I think
> having multiple TCP listeners is a reasonable use case where you have a
> multi-homed system, but you may not want to listen on all interfaces.
>
That can be firewalled away quite trivially, so no need for us to handle
that with code that might break (as I suspect it will see little testing).
> The host key should be allowed to specify one or more IP addresses, IP
> subnets, contiguous IP address ranges, host names and host name
> patterns/wildcards (e.g. *.example.com). If multiple workers register
> for the same host, some sort of distribution mechanism should be used to
> load balance the workers.
>
Umm... Is this what the remote worker should request? If so, we're making
a pretty major change to Nagios, where a host's address has always been
just a string we pass to the plugins. It won't be long until people
start requesting regex matching, subdomain matching and whatnot for it,
and then we'll have to start resolving hostnames.
I'd say just go with hostgroups instead. It's easier, and people will
have to do some minor configuring of remote workers anyway, so adding
"hostgroups=core-routers" to that config alongside the ip and port of
the Nagios host isn't such a big chore.
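i.e. the whole worker-side config needn't be scarier than this
(syntax invented for the example):

  # remote worker config (invented syntax)
  address=nagios.example.com
  port=5668
  hostgroups=core-routers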
> Using the second criterion of host to determine which worker gets the
> check raises the question of the order of precedence for the criteria.
> Initially, I think the host should have precedence over plugin, but I
> can see implementing an order of precedence option in the core
> configuration file. This would be more important if additional worker
> selection criteria were added.
>
Object over check type, any day. We may have to add a "check_type" thing
to command objects though, so workers can register for only local checks
and still have their http checks and whatnot done from remote, where
they make more sense. This requires some thinking.
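Something in this direction, with check_type being the hypothetical
new directive:

  define command {
      command_name    check_raid
      command_line    $USER1$/check_raid
      check_type      local     ; hypothetical: must run on the host itself
  }
  define command {
      command_name    check_http
      command_line    $USER1$/check_http -I $HOSTADDRESS$
      check_type      network   ; hypothetical: fine to run from remote
  }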
> The communication between the remote worker and the core process should
> be able to be protected by SSL. The remote worker will need a mechanism
> to retry the connection in the event the network drops the connection.
>
Retrying the connection is the easy part. What should the worker do with
the jobs it's running while the upstream connection is dead? More
importantly, how should core Nagios react to the checks it's supposed to
run while the connection is down? Running "check_disk / -w 90 -c 95"
locally on the core when it was meant for a remote host is a pretty
bad idea.
Encryption is a must, of course, as the packets will have to contain
passwords some of the time. There's libssh2 available, which we should
be able to use to set up preshared key authentication with security
that even the NSA will approve of.
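The libssh2 side of it is pleasantly small. Handshake plus publickey
auth comes down to roughly this (error handling elided; user name and
key paths invented):

  #include <libssh2.h>

  /* sketch: authenticate an already-connected TCP socket
   * with a preshared keypair via libssh2 */
  static LIBSSH2_SESSION *secure_session(int sd)
  {
      LIBSSH2_SESSION *session;

      if (libssh2_init(0) != 0)
          return NULL;
      if (!(session = libssh2_session_init()))
          return NULL;
      if (libssh2_session_handshake(session, sd) != 0)
          return NULL;
      if (libssh2_userauth_publickey_fromfile(session, "nagios",
              "/etc/nagios/worker.pub", "/etc/nagios/worker.key",
              NULL) != 0)
          return NULL;
      return session;
  }

After that it's libssh2_channel_open_session() and pumping the worker
protocol through the channel.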
> I realize this is a sizable change and we may not want it to happen
> before the release of 4.0. Thoughts on this are welcome.
>
4.1, I'd say. At the earliest.
> Further down the road, I can see developing a remote worker proxy, whose
> sole job is to broker the communication between core and even more
> remote workers. This would enable a tree-shaped worker hierarchy for
> monitoring environments that are both large and dispersed geographically
> and/or topologically. This would require a re-registration process so
> the proxy workers could keep core updated with their abilities as
> leaf-node workers connected and disconnected.
>
Ugh, no. We'd be setting ourselves up for huge bottleneck issues with
that, and very, very few people would want to use it. Networks large
enough to use hundreds of workers will always have distributed
responsibility as well, so more than one node in the network will need
to have a user interface. We did this investigation pretty thoroughly
when we hacked up Merlin, and that's one of the reasons it's designed
the way it is.
--
Andreas Ericsson andreas.ericsson at op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231
Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.