Restarts resetting soft critical states

Martin Melin mmelin at gmail.com
Tue Nov 3 22:04:33 CET 2009


On Tue, Nov 3, 2009 at 9:35 PM, Frost, Mark {PBG} <mark.frost1 at pepsi.com> wrote:

>
>
> >-----Original Message-----
> >From: Andreas Ericsson [mailto:ae at op5.se]
> >Sent: Monday, November 02, 2009 7:02 AM
> >To: Frost, Mark {PBG}
> >Cc: nagios-users at lists.sourceforge.net
> >Subject: Re: [Nagios-users] Restarts resetting soft critical states
> >
> >On 10/29/2009 08:50 PM, Frost, Mark {PBG} wrote:
> >> You think you know an application and every once in a while you get a
> >> surprise...
> >>
> >> Both the reporting server and the distributed node share the same
> >> attributes for retention and soft states:
> >>
> >> soft_state_dependencies=0
> >> passive_host_checks_are_soft=1
> >> retain_state_information=1
> >> use_retained_program_state=1
> >> use_retained_scheduling_info=1
> >> retained_host_attribute_mask=0
> >> retained_service_attribute_mask=0
> >> retained_process_host_attribute_mask=0
> >> retained_process_service_attribute_mask=0
> >> retained_contact_host_attribute_mask=0
> >> retained_contact_service_attribute_mask=0
> >>
> >> While I would assume the restarts would disrupt Nagios a bit what with
> >> having to do start-time tasks again, I would not have expected that it
> >> would "start over" with the status of some checks.
> >>
> >> What am I missing here?
> >>
> >
> >It seems you haven't grasped how bitmasks work. When you set the mask to 0,
> >you essentially tell it to not let anything through. Set them to -1, or
> >leave them at the default values and you'll get the kind of state retention
> >you want.
> >
>
> Thanks, Andreas.  Unfortunately, I'm still puzzled.  The mask values you
> refer to are already set to the defaults (they're all 0's).  I've never
> touched those or paid much attention to them until now.
>
> I'm actually confused by 2 aspects of this.  It seems to me that the things
> I'm trying to retain across a restart are soft check states (those are what
> is being reset).  Looking at the MODATTR arguments in include/common.h
> (3.0.6) I don't see which of those attributes would govern this.  There are
> the *ENABLED attributes, which really aren't changing here (and are
> retained).  All the other MODATTRs are (it seems to me) not changing in
> this case either.
>
> The second thing that confuses me here is the verbiage used to describe the
> mask functionality:
>
>        # RETAINED ATTRIBUTE MASKS (ADVANCED FEATURE)
>        # The following variables are used to specify specific host and
>        # service attributes that should *not* be retained by Nagios during
>        # program restarts.
>
> So if MODATTR is set to none, based on the comment doesn't this mean that
> "NONE" of the attributes are NOT retained?  I.e. all are retained
> (double-negative)?  The on-line doc for these masks says "By default, all
> host and service attributes are retained."
>

I don't know how the source code behaves, but I agree with this reading: a
default nagios.cfg has all of the masks set to zero, presumably so that
nothing is masked, i.e. so that the masks don't affect what's retained.
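
For what it's worth, here is a minimal sketch of that reading in Python. It
assumes (I have not checked the source) that a set bit in a mask means "do
not retain this attribute", so the mask would be applied as attrs & ~mask.
The bit names and values are made up for illustration and are not copied
from include/common.h.

```python
# Hypothetical attribute bits, for illustration only (not taken from common.h).
ATTR_NOTIFICATIONS_ENABLED = 1 << 0
ATTR_ACTIVE_CHECKS_ENABLED = 1 << 1
ATTR_EVENT_HANDLER_ENABLED = 1 << 2


def retained(modified_attrs: int, mask: int) -> int:
    """Return the bits that survive, assuming mask bits mean 'do NOT retain'."""
    return modified_attrs & ~mask


modified = ATTR_NOTIFICATIONS_ENABLED | ATTR_EVENT_HANDLER_ENABLED

# A mask of 0 excludes nothing, so every modified attribute is kept.
print(retained(modified, 0) == modified)

# Masking a single bit drops only that attribute.
print(retained(modified, ATTR_EVENT_HANDLER_ENABLED) == ATTR_NOTIFICATIONS_ENABLED)
```

Under that assumption a mask of 0 is the "retain everything" setting, which
matches the comment in nagios.cfg and the online documentation quoted above.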


>
> I do get masks, I just didn't see how these applied here.
>
> Your help is greatly appreciated.
>
>
I just did a quick experiment with the default values for *retain* variables
in nagios.cfg - which are exactly what you quote:

[1257281477] SERVICE ALERT: localhost;File age;CRITICAL;SOFT;1;FILE_AGE CRITICAL: File not found - /tmp/nagios
[1257281597] SERVICE ALERT: localhost;File age;CRITICAL;SOFT;2;FILE_AGE CRITICAL: File not found - /tmp/nagios
[1257281604] Caught SIGTERM, shutting down...
[1257281604] Successfully shutdown... (PID=9617)
[1257281605] Nagios 3.0.6 starting... (PID=9721)
[1257281605] Local time is Tue Nov 03 21:53:25 CET 2009
[1257281605] LOG VERSION: 2.0
[1257281605] Finished daemonizing... (New PID=9722)
[1257281715] SERVICE ALERT: localhost;File age;CRITICAL;HARD;3;FILE_AGE CRITICAL: File not found - /tmp/nagios

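Everything works as expected: the attempt count carries over the restart
(SOFT;2 before, HARD;3 after) instead of starting again at 1.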
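
If you want to run the same check against your own log, something along
these lines might do. It is only a sketch: it assumes each SERVICE ALERT
entry is a single line in the log file, with the semicolon-separated fields
in the order seen in my excerpt, and the alerts() helper is my own name, not
anything Nagios ships.

```python
import re

# Lines taken from the excerpt in this mail, one log entry per line.
LOG = """\
[1257281477] SERVICE ALERT: localhost;File age;CRITICAL;SOFT;1;FILE_AGE CRITICAL: File not found - /tmp/nagios
[1257281597] SERVICE ALERT: localhost;File age;CRITICAL;SOFT;2;FILE_AGE CRITICAL: File not found - /tmp/nagios
[1257281604] Caught SIGTERM, shutting down...
[1257281605] Nagios 3.0.6 starting... (PID=9721)
[1257281715] SERVICE ALERT: localhost;File age;CRITICAL;HARD;3;FILE_AGE CRITICAL: File not found - /tmp/nagios
"""

# [timestamp] SERVICE ALERT: host;service;state;state_type;attempt;output
ALERT = re.compile(
    r"^\[(\d+)\] SERVICE ALERT: ([^;]*);([^;]*);([^;]*);(SOFT|HARD);(\d+);"
)


def alerts(log_text, host, service):
    """Return (timestamp, state, state_type, attempt) for one service's alerts."""
    rows = []
    for line in log_text.splitlines():
        m = ALERT.match(line)
        if m and m.group(2) == host and m.group(3) == service:
            rows.append((int(m.group(1)), m.group(4), m.group(5), int(m.group(6))))
    return rows


rows = alerts(LOG, "localhost", "File age")
for ts, state, state_type, attempt in rows:
    print(ts, state, state_type, attempt)
```

If the attempt number drops back to 1 right after a "Nagios ... starting"
line, the soft state was not carried over the restart.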

I'm guessing you have some other issue that's affecting Nagios' ability to
save retention data.

What are the values of state_retention_file and retention_update_interval
in your config?

Have you checked that state_retention_file is actually updated while Nagios
runs, and that it isn't something basic like the disk being close to full?

Open up the file, grab the definition for the service in question, and see
what values are being saved.
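
A rough Python sketch for pulling one service out of the retention file
follows. It assumes the file is made of "type {" ... "}" blocks holding
key=value lines, and that the fields of interest are called current_state,
state_type and current_attempt; please verify both against your own file.
The sample text is made up, and the helper names are mine.

```python
def parse_blocks(text):
    """Yield (block_type, fields) for each 'type { key=value ... }' block."""
    block_type, fields = None, {}
    for raw in text.splitlines():
        line = raw.strip()
        if block_type is None:
            if line.endswith("{"):
                block_type, fields = line[:-1].strip(), {}
        elif line == "}":
            yield block_type, fields
            block_type = None
        elif "=" in line:
            key, _, value = line.partition("=")
            fields[key] = value


def find_service(text, host, description):
    """Return the field dict of the matching service block, or None."""
    for block_type, fields in parse_blocks(text):
        if (block_type == "service"
                and fields.get("host_name") == host
                and fields.get("service_description") == description):
            return fields
    return None


# Made-up sample in the assumed format; a real file has many more fields.
SAMPLE = """\
service {
host_name=localhost
service_description=File age
current_state=2
state_type=0
current_attempt=2
}
"""

svc = find_service(SAMPLE, "localhost", "File age")
print({k: svc[k] for k in ("current_state", "state_type", "current_attempt")})
```

To use it on the real thing, read the file that state_retention_file points
to and pass its contents in place of SAMPLE.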

HTH,

Regards,
Martin Melin