How to troubleshoot when not receiving alerts?
Marc Powell
marc at ena.com
Fri Jul 25 19:00:15 CEST 2008
On Jul 25, 2008, at 10:12 AM, John Oliver wrote:
> On Thu, Jul 24, 2008 at 11:12:55PM -0500, Marc Powell wrote:
>>
> I just checked nagios.cfg and:
>
> interval_length=1
All your intervals are in seconds then; interval_length is the number
of seconds in one Nagios 'time unit'. The default is 60, which makes
intervals minutes.
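If you'd rather keep the conventional minute-based units, a minimal
nagios.cfg sketch (these are the stock defaults, not values from your
config):

    # nagios.cfg -- interval_length is the number of seconds per 'time unit'
    interval_length=60

    # With interval_length=60, a notification_interval of 180 means
    # 180 minutes (3 hours); with your interval_length=1, the same
    # number means 180 seconds.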
>>> thought I had the errors fixed... the last email I got said RECOVERED
>>> (even though I should be getting CRITICAL alerts, as there is 1% disk
>>> space left). I changed the notification_interval, and never saw
>>> another email.
>>
>> Does the web interface show the status as CRITICAL? If you received a
>> recovery notification the service was considered to be OK. What did
>> you fix?
>
> No. The web interface is really confusing for this server:
>
> Host  Status  Last Check  Duration
> ftp   UP      N/A         486d 17h 50m 1s
>
> It has not been up for 486 days. And this is the one device that has
You should verify your command{} definition for whatever the UP check
is. That's a check that you or your predecessor created and not a
'standard' check. If it's not been UP for 486 days then it seems
you're not checking what you think you're checking.
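For reference, a conventional ping-based host check looks something
like this (a sketch only; your actual command name and plugin path may
differ):

    # commands.cfg -- typical host-alive check (illustrative values)
    define command{
        command_name    check-host-alive
        command_line    /usr/lib/nagios/plugins/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 5
        }

Compare that against whatever command your ftp host definition
actually uses.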
> N/A for last check. It's green and "UP". But that doesn't change the
> fact that nrpe reports 1% of disk space left, and that the nagios
> server can see that, at least when I manually run the command.
Correct; the host UP/DOWN check and the NRPE disk check are completely
unrelated.
> I'm starting to read about is_volatile, but I'm not really
> understanding it. One example is "things that automatically reset
> themselves to an "OK" state each time they are checked". That
> certainly isn't the case with a disk space check.
Correct. Most services are not volatile. An example would be an SNMP
trap. For every trap you receive, you want to send a notification
regardless of the status of the previous trap. A volatile service
sends out a notification for *every* non-OK check result for that
service.
> command[check_disk]=/usr/lib/nagios/plugins/check_disk -w 20 -c 10 -p
> /dev/mapper/VolGroup00-LogVol00
Warn if less than 20MB are free, Critical if less than 10MB are free
-- that's the common mistake I referenced earlier.
> [root at cerberus ~]# su nagios -c "/usr/lib/nagios/plugins/check_nrpe -H
> ftp -c check_disk"
> DISK OK - free space: / 782 MB (0% inode=99%);|
> /=134653MB;142786;142796;0;142806
It's OK according to the criteria you've defined; you've got another
762M to go before warning ;-). 'check_disk --help' might be a good
read. You want to add a '%' to those numbers.
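With percent thresholds, the nrpe.cfg entry would look something like
this (same plugin path as your existing line; 20%/10% is just one
reasonable choice):

    # nrpe.cfg -- warn below 20% free space, critical below 10% free
    command[check_disk]=/usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /dev/mapper/VolGroup00-LogVol00

At 1% free, that check would return CRITICAL immediately.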
>> It seems to me you're not receiving notifications because hard state
>> changes are not occurring. This is generally desired behavior.
>
> That doesn't really make sense to me. I won't be alerted until the
> problem is fixed? Or gets worse?
You'll be alerted when the service changes state by default: OK ->
Warning, OK -> Critical, Warning -> Critical, Warning -> OK,
Critical -> OK. With a notification interval of 180, you should be
re-notified every 180 seconds _but_ only if the service is in a non-OK
state. You're not in a non-OK state, so your next notification will be
when a state change occurs to Warning or Critical.
http://nagios.sourceforge.net/docs/3_0/notifications.html
> Here's what I'd like to wind up with... if available disk space drops
> below a certain point, I'd like to have an alert go out maybe once per
> day. If it drops past another point, into critical territory, I'd like
You should have enough information to fix the disk check now. For the
notifications, set the service's notification_interval to 86400 (1
day, since your interval_length is 1 second).
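For example, in the service definition (a sketch; the host name,
service description, and template are placeholders, not taken from
your config):

    # services.cfg -- re-notify once per day while the service is non-OK
    define service{
        use                     generic-service   ; hypothetical template
        host_name               ftp
        service_description     Disk Space
        check_command           check_nrpe!check_disk
        notification_interval   86400             ; 1 day with interval_length=1
        }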
> alerts to be sent out more frequently. But, whatever the interval is,
This is not possible AFAIK; a service has a single, fixed
notification_interval. Using a shorter notification_interval together
with Escalations might be a solution. Another would be to put that
kind of logic in your notification script.
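A sketch of the escalation approach, assuming a base
notification_interval of 3600 (1 hour in your seconds-based setup) and
a made-up contact group:

    # escalations.cfg -- notify more frequently from the 3rd notification on
    define serviceescalation{
        host_name               ftp
        service_description     Disk Space
        first_notification      3
        last_notification       0        ; 0 = all remaining notifications
        notification_interval   1800     ; every 30 minutes once escalated
        contact_groups          admins   ; hypothetical group
        }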
> nagios should be alerting each time it sees low disk space. If it
Every check? If that's what you want then setting is_volatile would do
it.
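That's a one-directive change to the service definition (same
placeholder names as the sketch above):

    # a volatile service notifies on every non-OK check result
    define service{
        use                     generic-service   ; hypothetical template
        host_name               ftp
        service_description     Disk Space
        check_command           check_nrpe!check_disk
        is_volatile             1
        }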
> alerts once, and then assumes that it never has to alert again unless
> the problem gets fixed and then reappears, it's never going to get
> fixed. Once I have alerting working this way, I'll point the emails at
That sounds like a people issue ;) Normally that's the behavior, but
Escalations can help force the people issue.
--
Marc