How to troubleshoot when not receiving alerts?

John Oliver joliver at john-oliver.net
Fri Jul 25 17:12:13 CEST 2008
On Thu, Jul 24, 2008 at 11:12:55PM -0500, Marc Powell wrote:
> 
> On Jul 24, 2008, at 4:59 PM, John Oliver wrote:
> 
> > It was working yesterday.  I was getting emails from this plugin every
> > 24 minutes (notification_interval was 1440).  They were all errors.  I
> 
> Unless you've changed interval_length from its default of 60, all  
> _interval parameters are minutes, not seconds so that seems strange.

I haven't changed anything, but then, I didn't set this box up, either.
It's a legacy system to me.

I just checked nagios.cfg and:

interval_length=1
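
That setting would account for the 24 minutes. Below is a rough sketch of the arithmetic, assuming each _interval value is simply multiplied by interval_length (in seconds). The function name is made up for illustration and is not part of Nagios:

```python
# Hypothetical helper, not Nagios code: convert an *_interval value
# (in "time units") to seconds, where one unit is interval_length seconds.
def effective_seconds(interval_units, interval_length=60):
    return interval_units * interval_length

# With interval_length=1, notification_interval 1440 is 1440 seconds.
print(effective_seconds(1440, interval_length=1) / 60)  # minutes between emails

# With the default interval_length=60, the same value would be a full day.
print(effective_seconds(1440) / 3600)  # hours between emails
```

Under the same assumption, notification_interval 60 on this box means 60 seconds, and normal_check_interval 120 means a check every two minutes.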

> > thought I had the errors fixed... the last email I got said RECOVERED
> > (even though I should be getting CRITICAL alerts, as there is 1% disk
> > space left).  I changed the notification_interval, and never saw
> > another email.
> 
> Does the web interface show the status as CRITICAL? If you received a  
> recovery notification the service was considered to be OK. What did  
> you fix?

No.  The web interface is really confusing for this server:

ftp	UP      N/A     486d 17h 50m 1s

It has not been up for 486 days.  And this is the one device that has
N/A for last check.  It's green and "UP".  But that doesn't change the
fact that nrpe reports 1% of disk space left, and that the nagios server
can see that at least when I manually run the command.

> > This AM, I set notification_interval to 60.  I should get an email
> > every minute.  I'm not.  And, yes, I'm restarting nagios ;-)
> >
> > Here's the stanza in services.cfg:
> >
> > define service{
> >        use                             generic-service         ; Name of service template to use
> >        host_name                       ftp
> >        service_description             Disk Space
> >        is_volatile                     0
> >        check_period                    normalbusinesshours
> >        max_check_attempts              3
> >        normal_check_interval           120
> >        retry_check_interval            10
> >        contact_groups                  FTP_Alerts
> >        notification_interval           60
> >        notification_period             normalbusinesshours
> >        notification_options            w,u,c,r
> >        check_command                   check_remote_disk1
> >        register                        1
> >        }
> 
> Having notification_interval < normal_check_interval might be  
> problematic. I am under the distinct impression that notification  
> logic is only called after a check of the host/service. I don't have  
> convenient access to the source right now to verify though.

OK, I just made notification_interval 180 to test.

> Additionally, this service does not have is_volatile set (services  
> normally are not volatile). Nagios will only send a notification for it for a hard  
> state _change_ unless there is some other escalation definition  
> applied to it. This is normal.

I'm starting to read about is_volatile, but I'm not really understanding
it.  One example is "things that automatically reset themselves to an
'OK' state each time they are checked."  That certainly isn't the case
with a disk space check.

> > And I can check the remote system from the command line:
> >
> > [root@cerberus ~]# /usr/lib/nagios/plugins/check_nrpe -H ftp -c check_disk
> > DISK OK - free space: / 2321 MB (1% inode=99%);| /=133114MB;142786;142796;0;142806
> 
> We'd have to see the actual command definition for check_disk from  
> nrpe.conf on the remote host but it seems that you've indicated that  
> 1% free disk space is OK. Does it happen to be that you've specified  
> your warning and critical levels in KB, not %? That's an easy mistake  
> to make. Also, as a general rule you shouldn't test nagios plugins as  
> root. It's common, but not likely in this case, that you'll see  
> different results due to the difference in privilege levels between  
> nagios and root.

command[check_disk]=/usr/lib/nagios/plugins/check_disk -w 20 -c 10 -p /dev/mapper/VolGroup00-LogVol00

[root@cerberus ~]# su nagios -c "/usr/lib/nagios/plugins/check_nrpe -H ftp -c check_disk"
DISK OK - free space: / 782 MB (0% inode=99%);| /=134653MB;142786;142796;0;142806

:-)
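
The performance data in that output hints at why the check says OK. This is a minimal sketch that assumes the perfdata fields are used;warn;crit;min;max in MB, and that the warn and crit fields are the used-space levels at which the thresholds trip. The helper name is made up for illustration:

```python
# Hypothetical helper: recover the free-space thresholds (in MB) that the
# plugin is applying, from one perfdata item of the form
#   label=used;warn;crit;min;max
def free_thresholds_mb(perfdata):
    _label, values = perfdata.split("=", 1)
    _used, warn, crit, _minimum, maximum = values.split(";")
    total = int(maximum)
    # warn/crit are used-space levels, so free space = total - level.
    return total - int(warn), total - int(crit)

perf = "/=134653MB;142786;142796;0;142806"
warn_free, crit_free = free_thresholds_mb(perf)
print(warn_free, crit_free)  # 20 10

# The plugin reported 782 MB free, which is above both thresholds.
print(782 > warn_free)  # True
```

If that reading is right, "-w 20 -c 10" is being taken as 20 MB and 10 MB free rather than as percentages, so 782 MB free still counts as OK. Writing the thresholds as "-w 20% -c 10%" would be the likely fix.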

> > Yes, I just noticed the discrepancy between contact_groups in
> > services.cfg and hosts.cfg.  I doubt that's the issue, as I was getting
> > emails yesterday.
> 
> It seems to me you're not receiving notifications because hard state  
> changes are not occurring. This is generally desired behavior.

That doesn't really make sense to me.  I won't be alerted until the
problem is fixed?  Or gets worse?

Here's what I'd like to wind up with... if available disk space drops
below a certain point, I'd like to have an alert go out maybe once per
day.  If it drops past another point, into critical territory, I'd like
alerts to be sent out more frequently.  But, whatever the interval is,
nagios should be alerting each time it sees low disk space.  If it
alerts once, and then assumes that it never has to alert again unless
the problem gets fixed and then reappears, it's never going to get
fixed.  Once I have alerting working this way, I'll point the emails at
the people who are responsible and then I can forget about it.
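
To pin down the behaviour described above, here is a small sketch of the policy only. It is not how Nagios implements notifications. The function names and the hourly figure for CRITICAL are assumptions chosen for illustration:

```python
DAY = 24 * 60 * 60   # seconds
HOUR = 60 * 60       # seconds

# Hypothetical policy table: how often to repeat an alert for each state.
# "Once per day" for WARNING comes from the description above; hourly for
# CRITICAL is just a placeholder for "more frequently".
REPEAT_EVERY = {"WARNING": DAY, "CRITICAL": HOUR}

def should_notify(state, seconds_since_last):
    """Return True if an alert should go out for this check result.

    seconds_since_last is None when no alert has been sent yet.
    """
    interval = REPEAT_EVERY.get(state)
    if interval is None:
        # OK (or any state without a repeat interval): nothing to repeat.
        return False
    if seconds_since_last is None:
        return True
    return seconds_since_last >= interval

print(should_notify("WARNING", None))        # first alert goes out
print(should_notify("WARNING", 2 * HOUR))    # too soon to repeat a warning
print(should_notify("CRITICAL", 2 * HOUR))   # critical repeats sooner
```

The point is that the alert keeps repeating for as long as the state stays bad, at a rate that depends on severity.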

Thanks for all your help and input!

-- 
***********************************************************************
* John Oliver                             http://www.john-oliver.net/ *
*                                                                     *
***********************************************************************
