max_check_attempts vs. logfiles vs. notification

Carroll, Jim P [Contractor] jcarro10 at sprintspectrum.com
Fri Dec 6 00:44:42 CET 2002
I've had a bit of a puzzler in logic.  I think.  It hasn't been a major
issue, but I suppose you could call it a 'fine tuning' aspect of Nagios
logic.

My host template for check-host-alive has max_check_attempts set to 10.

My service template for everything has max_check_attempts set to 3.

For the logfile scrubber checks (check_log via NRPE), I've set
max_check_attempts to 1.  Here's my reasoning.  Feel free to point out any
flaws (but be gentle ;).

If max_check_attempts is set to any value greater than 1 (e.g., 3):
	If logfile scrubber finds matching string (e.g., 'error') in /var/adm/messages:
		Return a 'critical'.
		Nagios increments count of soft errors of this service check.
	Else:
		Reset soft error count.
	If soft count equals max_check_attempts:
		Set hard failure to true.
	Else wait till next attempt.

But by its nature, check_log only reports on matches since the last check
(i.e., it doesn't re-examine parts of the logfile it's already examined).  So
on the next iteration, it will fail to find that string, the soft count
gets reset to 0, and no notifications are sent out.

That behaviour isn't desirable; it effectively renders check_log useless.

So, on setting max_check_attempts to 1 for checking logfiles, a soft failure
becomes a hard failure with the first iteration, notifications are sent out,
everyone's happy.
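
The soft/hard reasoning above can be sketched as a small simulation.  This is
only a simplified model of the retry logic as described in this message, not
Nagios's actual implementation; the Service class and run() helper are
hypothetical names made up for illustration.

```python
from dataclasses import dataclass

OK, CRITICAL = 0, 2  # plugin exit codes used in this sketch


@dataclass
class Service:
    """Simplified model of the soft/hard logic described above."""
    max_check_attempts: int
    attempt: int = 0            # consecutive non-OK results so far
    hard_critical: bool = False
    notifications: int = 0

    def process(self, result: int) -> None:
        if result == OK:
            # An OK result resets the soft count and clears any problem.
            self.attempt = 0
            self.hard_critical = False
            return
        if self.hard_critical:
            return  # already a hard problem; nothing new to report
        self.attempt += 1
        if self.attempt >= self.max_check_attempts:
            self.hard_critical = True
            self.notifications += 1


def run(max_check_attempts, results):
    """Feed a sequence of check results and count notifications sent."""
    svc = Service(max_check_attempts)
    for result in results:
        svc.process(result)
    return svc.notifications


# check_log sees 'error' once, then finds nothing new on later passes.
one_shot = [OK, CRITICAL, OK, OK]
print(run(3, one_shot))  # 0: soft count is reset before reaching 3
print(run(1, one_shot))  # 1: the first critical is already a hard failure
```

With max_check_attempts at 3, the single critical never becomes a hard
failure in this model; with 1, it does on the first result.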

(Sidebar:  I should mention that notification_options *only for check_log
checks* is set just to 'c'.  So I receive a page if 'error' is seen in the
logfile, but I don't get a page when the check recovers, e.g., the next time
check_log scrapes the logfile and doesn't see any new occurrences of
'error'.)

That approach in and of itself could be changed, if someone can provide a
convincing argument.  But that's not the main issue here.

In the situation where the NRPE daemon is down (unlikely, unless something's
misconfigured), all the various NRPE checks would normally return a critical
(due to a timeout).  I've created servicedependency rules for all the NRPE
checks, except for a dummy NRPE "is NRPE up" check; all the
servicedependency rules depend on the dummy check succeeding.

So far, so good.

However, the aforementioned check_log checks go critical after one
successful match, and they don't seem to care about the servicedependency
rule.  I thought that this might be because they hit the max_check_attempts
limit on the first go.  As a result, the check_log checks go critical if
NRPE is down, and at times they seem to be the first alert that a host is
down (or the only alert, if the host reboots quickly enough).  It doesn't
matter that I've defined the dependency on the dummy service; I won't see an
alert on that if the host itself is down, which is the proper response.  I'm
thinking that it all depends on the point in the service/host check cycle at
which the events play out.
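
One way to picture that hypothesis is the sketch below.  It rests on two
assumptions that are not confirmed in this message: that a service dependency
is evaluated against the master service's most recent hard state, and that
the dummy "is NRPE up" check inherits max_check_attempts 3 from the service
template.  The function name and the fixed ordering of results are
hypothetical.

```python
def dependent_notifications(master_max_attempts, dependent_max_attempts, rounds):
    """Both checks fail every round (NRPE unreachable).

    Returns the rounds on which the dependent service (check_log) notifies.
    ASSUMPTION: the dependency only suppresses a notification once the
    master ("is NRPE up") has reached a HARD failure.
    """
    master_attempt = 0
    dependent_attempt = 0
    master_hard_failed = False
    dependent_hard_failed = False
    notified_rounds = []

    for rnd in range(1, rounds + 1):
        # The master result arrives first, the most favourable ordering.
        master_attempt += 1
        if master_attempt >= master_max_attempts:
            master_hard_failed = True

        # The dependent result arrives next.
        dependent_attempt += 1
        if not dependent_hard_failed and dependent_attempt >= dependent_max_attempts:
            dependent_hard_failed = True
            if not master_hard_failed:
                # The master is only in a soft failure, so nothing
                # blocks the dependent's notification yet.
                notified_rounds.append(rnd)

    return notified_rounds


print(dependent_notifications(3, 1, 5))  # [1]: check_log pages first
print(dependent_notifications(1, 1, 5))  # []: dependency suppresses it
```

Under those assumptions, a check with max_check_attempts at 1 reaches its
hard state before the dummy check does, so the dependency has nothing to act
on yet.  Whether this matches what Nagios really does would need checking
against its documentation.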

Now that I've written this all out, I realize there are two problems.  I'm
uncertain how (in)separable they are from each other.

If I were to have a simple wish list of design regarding logfile checks, it
would be this:

- check for matching strings
- if string matches, send notification immediately, set critical alert in
web interface
- condition remains critical until acknowledged, even if string never shows
up again in logfile
- critical alert would need to be acknowledged

At this point, I haven't decided between the following paths:

- once acknowledged, alert is cleared from Nagios

or,

- once acknowledged, alert remains as acknowledged until subsequent action
is taken

If I were to go with the latter, I can envision a variation of check_log
which writes to a tempfile, and then a subsequent check_nrpe definition
would remove said tempfile.  Either that, or someone would need to log in to
the client host and remove the file manually.
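
A minimal sketch of that tempfile idea follows.  It is not the real check_log
plugin; sticky_log_check(), clear_flag(), and the offset and flag file paths
are hypothetical names chosen for illustration.  The check only reads log
content added since the last run, records any matches in a flag file, and
stays critical for as long as that flag file exists.

```python
import os

OK, CRITICAL = 0, 2  # plugin exit codes used in this sketch


def sticky_log_check(log_path, offset_path, flag_path, pattern="error"):
    """Scan new log content; stay CRITICAL until the flag file is removed."""
    # Load the byte offset reached by the previous run (0 on first run).
    try:
        with open(offset_path) as f:
            offset = int(f.read().strip() or 0)
    except (FileNotFoundError, ValueError):
        offset = 0

    # If the log shrank (rotated or truncated), start from the top again.
    if os.path.getsize(log_path) < offset:
        offset = 0

    with open(log_path, "rb") as log:
        log.seek(offset)
        new_data = log.read()
        new_offset = log.tell()

    with open(offset_path, "w") as f:
        f.write(str(new_offset))

    text = new_data.decode(errors="replace")
    matches = [line for line in text.splitlines() if pattern in line]
    if matches:
        # Record the matches; the flag file is what makes the alert sticky.
        with open(flag_path, "a") as flag:
            flag.write("\n".join(matches) + "\n")

    if os.path.exists(flag_path):
        return CRITICAL, "unacknowledged log matches recorded in " + flag_path
    return OK, "no unacknowledged log matches"


def clear_flag(flag_path):
    """The 'acknowledge' step: remove the flag file if it exists."""
    try:
        os.remove(flag_path)
    except FileNotFoundError:
        pass
```

Calling clear_flag() is the step that a second check_nrpe command, or someone
logged in to the client host, would perform.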

If we stick with the check_nrpe approach to removing the tempfile, I imagine
I'd have to create a CGI (referenced in the extended host info) which, once
clicked, would invoke the appropriate check_nrpe removal command.  Not
elegant, but workable.

Sorry for the ramble; things seem moderately clearer now.  If you've read
this far, I still wouldn't mind hearing about clever solutions to what I'm
trying to do.

jc

