[Nagios-devel] Patches for inclusion to Nagios 4

Andreas Ericsson ae at op5.se
Thu Aug 8 17:52:22 CEST 2013


On 2013-08-08 15:17, Ton Voon wrote:
>
> On 7 Aug 2013, at 08:51, Andreas Ericsson wrote:
>
>> On 2013-08-06 19:16, Ton Voon wrote:
>>> Hi!
>>>
>>> We've published a list of patches for Nagios 4:
>>> http://www.opsview.com/whats-new/blog/opsview-patches-nagios-4
>>>
>>> We'd be happy if you could review if these are acceptable for
>>> future inclusion or if anyone else finds them useful.
>>>
>>
>> I'd like to get patches with commit messages and proper author and
>> signed-off-by info. Since we're using git for Nagios now, it'd go a
>> long way in making sure everyone gets credit for the work they've
>> done.
>>
>> The patches also need to apply cleanly to the latest master.
>>
>> You may want to clone the official Nagios repo, apply your patches
>> on top of it and then send me a pull request for github or some
>> such.
>>
>> git clone git://git.code.sf.net/p/nagios/nagios nagios-core
>>
>> should get you the very latest. If you apply your patches on top
>> of 'master' and make sure to always do "git pull --rebase" when you
>> want to get the latest and greatest you'll quickly see which
>> patches either have been applied or which no longer *can* be
>> applied. Then you can create a separate repository on github or
>> some such and push the changes there.
>
> OK, we'll convert them into git changes based off master.
> However....
>
> I'd like an assurance that the changes will be merged before we
> promise to convert. Style changes are fine, won't take much time and
> will not require retesting, but if we need to refactor object changes
> or make larger changes to logic, this will require retesting on our
> side. I'd like a reassurance that the time invested will result in a
> merge upstream, otherwise we're just wasting time for all of us.
>



> For instance, your assistance in the "environment macros per-command"
> was greatly appreciated and we've coded to the design agreed in the
> email conversations, but it hasn't been merged yet.
>

It has, but it hasn't. The original patch no longer applies, due to
extensive remodeling of the worker code. I still have it hanging
around though, and I'm fairly inclined towards using it for services
and hosts as well, so that environment macros can be set on a service 
level as well as on a command level. I've also been investigating the
chances of doing this without resorting to setenv() (in essence,
building a one-off block of the load-time environment macros, and then
extending that whenever we hit an environment variable).

But yea, I've got that one already.

> So I'd like to reverse the question and ask "which changes are most
> likely to be accepted, based on the amount of changes required" and
> we'll work through that list in order.
>

Thread-safe calls
It's harmless, and I'd apply it straight away if I had a sensible
commit message for it, but I don't know why you need this so I can't
really write one. While we're on the topic; Please write commit
messages in imperative form and present tense, as if you're giving
orders to the code for how it should change. Also give a short
rationale for why it should change that way. The rules about not
commenting out code you want to get rid of still applies though. It
just looks terribly hackish when patches meant for upstream contains
things like that.


Slice services within hosts
So long as it's configurable from cgi.cfg and the default stays the
same as it is today I'll apply it immediately. It has no impact on
loadable modules or other headaches, so that's a nobrainer, really.


Check command by time period
I feel this is somewhat lacking in efficiency and flexibility, and
a much cleaner solution would be to add a filtering functionality to
NERD so that checkresults can be shipped off to a third-party addon
which can transform checkresults and plugin output as much as it
likes.
Failing that (which would be enormously cool but also a lot of work),
I think it would most likely be best off as a separate module, with
custom variables or separate configuration to support it. Supporting
patches to run events when a timeperiod becomes active or inactive
would still be welcome, obviously.


Escalation via notification levels
This is best off written as a module, using custom variables to 
configure it. If core support is required to block notifications to a
particular contact, then such patches will naturally be accepted. The
normal NEBCALLBACK_CANCEL return code signalling should work just fine
for that.


Synchronising state data
Pretty invasive for quite a small benefit, and with enormous
complexities to deal with to make this work properly. How do you
handle reading the synced file while checks are in-flight and
awaiting being returned to the mothership? They may have been
spawned as a result of a bad checkresult on one node, but if the
sync status returns them to OK and the in-flight result sets them
as bad again, would that mean starting over on the retry-attempts?
I see nothing in the patch that handles situations like that, so
I'm forced to believe you haven't considered it. Considering the
fact that full reloads now take less than a second for your
recommended 250 hosts per system, I'm inclined to drop this patch
in favour of the regular retention data file unless you can
persuade me otherwise.


Dependency failure states
The patch is a bit hackish in that it internally submits a passive
check to make sure everything gets updated. That's not entirely
accurate though, and will cause statistical errors.
Rewrite it to just update the states as necessary and make sure
notifications and whatnot are sent if they should be.
That will also fix the problem of bugging out performance data
handling routines that expect to always get the same kind of data
from each check every time.
It will have to be configurable though. I'm thinking
"service_dependency_exec_fail_state=(warning,critical,unknown)"
and "service_dependency_exec_fail_output=Dependency failure".
I'd be all for changing the default, but for now the exec_fail_state
thing will have to be -1, meaning "leave it as-is", and I'll make
it default to UNKNOWN after 4.0 is released.
The same should go for service parents, btw.


Passive notifications respecting renotification interval
I'm inclined to making this default. It's not documented anywhere
which of those options take precedence. If notification_interval
is 0, then every notification should be sent for is_volatile
services, and if it's anything else we should respect the interval
no matter how many notifications happen to arrive. Make those
changes and I'll apply the patch and call it a bugfix.


Reducing SQL queries part 1:
No. Just no. This is a perfect place where you can use an array sized
after num_objects.services *inside the module* and look up the latest
version of the processed commandline there. Free it when you update it
and you'll be sure to always have it handy. Anyone who doesn't use
NDOUtils will otherwise be penalized for its shortcomings.


Reducing SQL queries part 2:
As above, but num_objects.hosts


Conditional debugging:
I'm more inclined towards making the debug-logging function a C99-style
variadic macro (or a gcc style one), while falling back to using the
current structure of a lot of unnecessary calls on too old compilers.
All that iffing should really be kept at arms length from the coder,
or debug-logging won't be used. I'm thinking
   dbg(DEBUGL_CHECKS, 0, "** Running async check of foo ... ");

so we get it short and sweet while we're at it.


FIFO processing of passive checkresults
I have several objections to this patch.
First off; You're sorting on alphanumerical name only, instead of mtime
first and then name in case of a tiebreak. That's just plain dumb and
has to be amended if this is to be anywhere near useful.
Secondly; This doesn't scale. I spent the better part of last year
removing everything that touches the filesystem on a scheduled and
frequent basis for those precise reasons. Putting crap like that back in
is not something I'll do without a very good reason.
Thirdly; It builds on obsolete code which is designated for oblivion,
so the code would be (relatively) shortlived anyway.
Fourthly; *Nothing* can be built on top of such a patch that doesn't
absolutely reek of hackishness and bad design.
A much more useful patch would be adding support for external commands
to the query handler, which doesn't have the limitations of the named
pipe (and can give responses to your requests so you know if they were
accepted or not) and use a secondary program to ship in passive results
any old way you want to the clean, nice and responsive interface which
is meant to be extended for things exactly like this.


Tolerant time jumps
I'm unsure how useful this is. How often does a system clock go
backwards except for daylight savings time? I also worry slightly
about the consequences. OTOH, most of the worry comes as a result from
the very lax default for time_change_threshold (900 seconds). A time
jump of 899 seconds wouldn't trigger a rescheduling, but would wreak
serious havoc with the checks.


Listing related contact groups
The patch needs to be changed so that the (possibly very long) list
of contactgroups isn't computed until it's requested instead of
before we even start macro substitution. Otherwise we'll be
computing that macro for each and every host and service check we
execute, which stupid beyond belief. Particularly considering the
fact that object config can't change during runtime (yet), so the
list could just as well be cached if one would so desire.
For bonus points, also add CONTACTLIST, which fetches contacts from
both the object directly and the contactgroups assigned.


Environment macros per-command
I wish you'd stop talking BS on that site of yours. Removal of
environment macros has very little to do with the speedups in Nagios 4.
Those improvements are the result of good design and proper
implementation of good design.
Now to technical issues; I have the patch (as I've said) and I do
intend to push it as soon as I've had time to polish it and bring it
forward to the current "master". It's quite invasive though, and
I've been trying to figure out what it is about it that feels wrong.
Then I've also forgotten about it, so this reminder was pretty good.
Thanks.


Allowing semi-colons in commands
Well, it's a neat idea, but the patch is so incredibly ugly that I
just can't bring myself to accept it. If you did something to keep
state and allowed semicolons in strings or as escaped creatures
(with a backslash, pretty please; There's enough home-cooked
escaping in Nagios already), I'd be willing to accept it.


Tilde symbol in commands
Absolutely out of the question. This patch deliberately breaks a
library to exploit its caller's fallback, which happens to suit your
particular needs in this case. No. Nonono. You can just configure
your way around it by using

    command_line  /bin/sh -c '$USER1$/plugin --arg=lala'

for the command in question, or implement it separately but making
it configurable as "expand_tilde_in_commands=<bool>" in nagios.cfg and
doing the replacement manually. It's not that hard, really.


Fix max_concurrent_checks decrementing
Obviously correct, although the patch is incomplete. In case we fail to
launch a job, it will have no reference in the scheduling queue but will
be marked as "is_executing" which will cause it to not even get
freshness checked. I'll amend the patch with everything it needs and
apply. Good catch though.


Extra external commands
This looks clean, correct and useful. I'll need a commit message, an
author and a signed-off-by before I can accept it though.


Thanks

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.

------------------------------------------------------------------------------
Get 100% visibility into Java/.NET code with AppDynamics Lite!
It's a free troubleshooting tool designed for production.
Get down to code-level detail for bottlenecks, with <2% overhead. 
Download for free and get started troubleshooting in minutes. 
http://pubads.g.doubleclick.net/gampad/clk?id=48897031&iu=/4140/ostg.clktrk
_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue. 
::: Messages without supporting info will risk being sent to /dev/null





More information about the Developers mailing list