[naemon-dev] Ideas about future features
Robin Sonefors
ozamosi at flukkost.nu
Sat Dec 28 00:22:46 CET 2013
On 2013-12-27 02:06, Matthias Eble wrote:
> 1) have a feature to monitor per metric rather than per check_command.
> * Today, many plugins check lots of things.
> * Typical example is check_disk, check_snmp.
> * Depending on the configuration method, acknowledging a problem
> with /mytmpmount also disables notifications for /var
> * To fix that, we'd need to create a stricter plugin output
> standard that contains per-metric status codes.
> * metrics would be /mytmpmount-freespace, TCP-response-time,
> http-status-code or http-match-string
> * The core would need to create sub-services at run-time and
> populate their results.
> * Benefit: per metric actions and logging. Especially per metric
> downtime and acknowledgements
> * Maybe it could also be used for receiving snmp traps or log
> pattern matching checks
> * different alerting for different patterns/traps
>
> * Today, many folks wrap the plugin call and submit results to a
> passive check.
> * works, but all possible services need to be in the config.
> * That's where folks start generating nagios configs and reload
> the daemon.
> * Is that what we want? Maybe?
> * problems arise when there are syntax problems
>
> * raw proposal:
> define service {
> ...
> check_command check_disk
> contact_group os_admins
> define metric {
> metric_name ^/oracle.*
> contact_group oracle_admins
> }
> }
>
> * maybe another layer could be added for check_multi-like plugins.
> * but they could also be forced to structure metric names
I've been thinking about plugins and plugin architecture a bit.
The nagiosplugins project is talking about a new threshold format -
https://www.nagios-plugins.org/doc/new-threshold-syntax.html - to
achieve the same thing you want to solve in-core.
I think the nagiosplugins approach - basically, update all plugins to
support a much more complex (though easier to understand) threshold
format, because the old one was too complicated - is wrong. Programmers
write buggy code, and telling programmers to write more code leads to
more bugs (or at least I write buggy code, and I'm too stupid to write
plugins already - that's why I stick to the core :P )
But I'm also not sure how far into the core I'd want to put it. What
if we, instead of changing either the core or the plugins, write a plugin
wrapper that takes a threshold as described by nagiosplugins and a
plugin command line? It would simply parse the perfdata from the plugin,
the threshold from the CLI, throw away the plugin exit code, and send a
new, "improved" exit code and stdout to naemon?
I feel this plugin wrapper approach would take the least amount of work
to implement. Which problems would it leave unsolved?
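
To make the wrapper idea concrete, here is a rough Python sketch. It is only an illustration under some loud assumptions: the LABEL:WARN:CRIT threshold format is made up for this example (it is NOT the nagiosplugins syntax linked above), "value >= threshold" is the only comparison supported, and all function names are hypothetical.

```python
import re
import subprocess

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}

# Matches label=value in perfdata; the label may be single-quoted.
# The unit and the ;warn;crit;min;max fields that may follow are ignored.
PERFDATA_RE = re.compile(r"(?:'([^']+)'|([^\s=']+))=(-?\d+(?:\.\d+)?)")


def parse_perfdata(plugin_output):
    """Return {label: float} from the text after the first '|'."""
    _, sep, perf = plugin_output.partition("|")
    if not sep:
        return {}
    return {quoted or bare: float(value)
            for quoted, bare, value in PERFDATA_RE.findall(perf)}


def parse_threshold(spec):
    """Parse the made-up LABEL:WARN:CRIT format (assumption, see above)."""
    label, warn, crit = spec.rsplit(":", 2)
    return label, float(warn), float(crit)


def evaluate(metrics, thresholds):
    """Return (worst state, summary text) for the given thresholds."""
    worst, parts = OK, []
    for label, warn, crit in thresholds:
        if label not in metrics:
            state = UNKNOWN
            parts.append(f"{label} missing from perfdata")
        else:
            value = metrics[label]
            state = CRITICAL if value >= crit else WARNING if value >= warn else OK
            parts.append(f"{label}={value:g} is {STATE_NAMES[state]}")
        # Simplification: the numerically highest state wins, so UNKNOWN
        # outranks CRITICAL here.
        worst = max(worst, state)
    return worst, ", ".join(parts)


def main(argv):
    """argv: LABEL:WARN:CRIT [...] -- plugin [plugin args...]"""
    split = argv.index("--")
    thresholds = [parse_threshold(spec) for spec in argv[:split]]
    proc = subprocess.run(argv[split + 1:], capture_output=True, text=True)
    # The plugin's own exit code (proc.returncode) is deliberately ignored.
    state, text = evaluate(parse_perfdata(proc.stdout), thresholds)
    perf = proc.stdout.partition("|")[2].strip()
    print(f"{STATE_NAMES[state]} - {text}" + (f" | {perf}" if perf else ""))
    return state
```

A script built around this would end with sys.exit(main(sys.argv[1:])) and be used as the check_command, with the real plugin command line placed after the "--".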
> [snip]
> What do you think? What's the focus of the dev-team?
So far, it seems the focus is mostly on cleanup. There's just *so*
*much* ancient *crap* lying around. Tens of thousands of lines of code
to create an ugly, useless web UI, which force me to ifdef every second
line - what? Three (or so, I lost count) different configuration parsers
for near-identical-yet-subtly-different configuration file formats -
really? And then the amount of special casing for things that you might
expect to behave similarly until you find out the hard way that they
really don't - for instance, I always thought the flapping calculation
was based on the last 20 (or so) check results, but nooo:
https://github.com/naemon/naemon-core/blob/master/naemon/flapping.c#L116
I think the current score is something like -120k lines compared to the
initial code import, but there's a lot more we could do.
Oh, and testing. One of the scariest things when starting on a new
codebase is realizing that there are no tests at all. The only thing
worse than that is finding a directory full of tests - granted, all
covered in cobweb and dust - and you think (or hope, or whatever you
call that feeling when you know you must never assume good things but
still want to) that you might have found The Book of Shadows in the
attic, but after dusting the code off and flipping through it, it turns
out that nobody has executed (or even compiled) any code here for
*years*, and half the tests test features that don't even exist
anymore. It dawns on you that somebody spent days - weeks, even -
writing tests to avoid regressions - and then didn't run the test and
thus didn't catch the regressions. See: t-tap, where a few of the test
files actually work, and none of them have a working build system ATM.
As far as I'm going to go in terms of longer-term vision and
the-way-to-go-iness, I'd like to modularize the crap out of the core.
The nagios "core" is anything but, as explained above. It would be neat
to lift out a bunch of nagios functionality into a bundle of
preinstalled modules. This would serve two purposes: it would force us
to dogfood the broker API and thus help us improve it, and it would
compartmentalize features (new, and old) to avoid weird interactions
with other features.
The broker API as it exists is terrible - you're just given all of the
naemon internals, spotty and inconsistent hooks, and a "good luck". This
means that, as a core developer, any change I make at all is bound to
break some module, while as a module author, I need to learn all of the
core to write a module. And you want to store your own add-on
configuration/data? Hah! So, in the end, it's just easier to become a
core contributor, because who has the time not to?
What would happen if, to take an example that sounds weird but makes
some kind of sense, the flapping functionality was a module? That would
require some extra module functionality - modules would have to be able
to add configuration statements to the config (global and per-object)
for configuring flapping thresholds, and modules would have to be able
to couple state (is_flapping, last 20 check results) with the object and
have it persist between restarts. Now, what if this was the easiest,
most concise, and easiest-to-find-out-how way to do it?
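
As a thought experiment only, this is roughly what such a module could look like, sketched in Python for brevity. Nothing here is an existing naemon API: the FlappingModule class, its config_options and new_state/on_check_result methods, and the JSON persistence helpers are all hypothetical, the threshold values are illustrative, and the calculation is the naive "share of state changes in the stored history" rule, not what flapping.c actually does.

```python
import json
from collections import deque

HISTORY_LEN = 21  # 21 stored states give 20 possible state changes


class FlappingModule:
    """Hypothetical module: everything about flapping lives in here."""

    # Config statements the module would add to the object config.
    # The values are illustrative, not real defaults.
    config_options = {"low_flap_threshold": 20.0, "high_flap_threshold": 30.0}

    def new_state(self):
        """Per-object state the core would attach and persist for us."""
        return {"is_flapping": False, "history": []}

    def on_check_result(self, config, state, new_status):
        """Update one object's state with a new check result."""
        history = deque(state["history"], maxlen=HISTORY_LEN)
        history.append(new_status)
        states = list(history)
        # Naive, unweighted percentage of state changes (an assumption).
        changes = sum(1 for a, b in zip(states, states[1:]) if a != b)
        percent = 100.0 * changes / (HISTORY_LEN - 1)
        if percent >= config["high_flap_threshold"]:
            state["is_flapping"] = True
        elif percent < config["low_flap_threshold"]:
            state["is_flapping"] = False
        state["history"] = states
        return state


def dump_states(states):
    """What a core could write to its retention data on shutdown."""
    return json.dumps(states)


def load_states(text):
    """What a core could hand back to the module after a restart."""
    return json.loads(text)
```

The point of the sketch is that the module only declares what it needs (two config options and a small blob of state per object), and the core takes care of parsing and persistence without knowing what flapping is.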
I think a module should be able to do all these things - and if it could
do that, and if flapping was a module, I would not ever again have to
worry about flapping in the remaining core, nor would I wonder where all
special cases for flapping are handled - heck, I could even see if the
flapping feature has tests and how extensive they are, just from looking
at github.com/naemon/flapping ! Today, almost all features - including
flapping - are handled by the pair of ogres known as
handle_async_service_check_result/handle_async_host_check_result -
looking at the code, I have no idea what it will actually end up doing
for each case, but I'm quite sure a few of the code paths are buggy -
because that many untested if conditions just aren't going to all be
correct. Modularizing away the if statements (all of them, all over the
core) should render a more consistent, less buggy monitoring solution.
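
A toy illustration of what "modularizing away the if statements" could mean, again in Python and again purely hypothetical: the Core class, the "check_result" and "state_change" event names, and the handler signature are invented for this sketch and are not the existing broker callbacks.

```python
from collections import defaultdict


class Core:
    """Hypothetical slimmed-down core: record the result, emit events."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event, handler):
        """Called by a module to register interest in an event."""
        self._handlers[event].append(handler)

    def emit(self, event, service, status):
        for handler in self._handlers[event]:
            handler(service, status)

    def handle_service_check_result(self, service, status):
        # The core no longer knows about flapping, notifications, etc.
        changed = service.get("status") != status
        service["status"] = status
        self.emit("check_result", service, status)
        if changed:
            self.emit("state_change", service, status)


log = []
core = Core()
# Two stand-in "modules"; real ones would do actual work.
core.subscribe("check_result", lambda svc, st: log.append(("perfdata", svc["name"], st)))
core.subscribe("state_change", lambda svc, st: log.append(("notify", svc["name"], st)))

disk = {"name": "disk", "status": 0}
core.handle_service_check_result(disk, 0)  # same state, only "check_result" fires
core.handle_service_check_result(disk, 2)  # OK -> CRITICAL, both events fire
```

Each feature then has one obvious place to live and to be tested, instead of being one more branch inside the check result handlers.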
tl;dr: naemon should allow contributors to write modules that are much
more powerful than today's broker modules, to make it possible and easy
to write a module to add seemingly built-in functionality, like metrics
and exceptions - then, we could start to write such modules, go crazy,
and see what comes out!