Restarts resetting soft critical states
Martin Melin
mmelin at gmail.com
Fri Nov 6 15:01:34 CET 2009
Hi Mark,
Sorry, I was caught up with other things and couldn't reply to this sooner.
It looks like your retention settings are fine. It would be interesting if
you were able to recreate the problem, but my guess is that it probably
requires the special circumstances of your production environment, so this
may not be prudent.
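For reference, the attribute-mask arithmetic debated in the quoted thread below can be sketched in a few lines of Python. This assumes the reading given in the nagios.cfg comment (bits set in the mask are the attributes that are *not* retained). I have not checked it against the source, and the ATTR_* names and bit values here are hypothetical, not the real MODATTR_* constants from include/common.h.

```python
# Hypothetical attribute bits, for illustration only. The real MODATTR_*
# values live in include/common.h and are not reproduced here.
ATTR_NOTIFICATIONS_ENABLED = 1 << 0
ATTR_ACTIVE_CHECKS_ENABLED = 1 << 1
ATTR_PASSIVE_CHECKS_ENABLED = 1 << 2


def retained_attributes(modified, mask):
    """Clear every bit that is set in `mask`; whatever is left is retained.

    Assumption: the mask lists attributes that should NOT be retained,
    as the nagios.cfg comment describes.
    """
    return modified & ~mask


if __name__ == "__main__":
    modified = ATTR_NOTIFICATIONS_ENABLED | ATTR_PASSIVE_CHECKS_ENABLED
    for mask in (0, ATTR_PASSIVE_CHECKS_ENABLED, -1):
        print(mask, bin(retained_attributes(modified, mask)))
```

Under that assumption a mask of 0 leaves every modified attribute in place, while -1 clears them all.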
Nagios will write retention.dat on every clean shutdown and every
retention_update_interval minutes. Your logs look like Nagios did a clean
shutdown, but perhaps for some reason it didn't write retention.dat.
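A quick way to test that guess is to compare the file's modification time with the configured interval. The following is a minimal Python sketch, not anything shipped with Nagios; the function name `retention_is_stale` and the `slack_seconds` allowance are made up for illustration, and the path and interval would come from state_retention_file and retention_update_interval in nagios.cfg.

```python
import os
import time


def retention_is_stale(path, interval_minutes, now=None, slack_seconds=120):
    """Return True if `path` was last modified longer ago than the
    update interval (plus a little slack) would allow.

    `now` can be passed in for testing; it defaults to the current time.
    """
    now = time.time() if now is None else now
    age_seconds = now - os.path.getmtime(path)
    return age_seconds > interval_minutes * 60 + slack_seconds


if __name__ == "__main__":
    # Values taken from the settings quoted later in this thread.
    path = "/usr/local/eam/nagios/var/retention.dat"
    if os.path.exists(path):
        print("stale" if retention_is_stale(path, 60) else "fresh")
    else:
        print("retention file not found")
```

Running it right after a restart should report the file as fresh if the shutdown really did write it.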
In any case, I can't really see anything wrong here, but if I have time I'm
going to see if I can replicate the behavior you experienced.
Best regards,
Martin Melin
On Wed, Nov 4, 2009 at 4:10 AM, Frost, Mark {PBG} <mark.frost1 at pepsi.com> wrote:
>
>
> >From: Martin Melin [mailto:mmelin at gmail.com]
> >Sent: Tuesday, November 03, 2009 4:05 PM
> >To: nagios-users at lists.sourceforge.net
> >Subject: Re: [Nagios-users] Restarts resetting soft critical states
> >
> >> On Tue, Nov 3, 2009 at 9:35 PM, Frost, Mark {PBG} <mark.frost1 at pepsi.com> wrote:
> >>
> >>
> >>-----Original Message-----
> >>>From: Andreas Ericsson [mailto:ae at op5.se]
> >>>Sent: Monday, November 02, 2009 7:02 AM
> >>>To: Frost, Mark {PBG}
> >>>Cc: nagios-users at lists.sourceforge.net
> >>>Subject: Re: [Nagios-users] Restarts resetting soft critical states
> >>>
> >>>> On 10/29/2009 08:50 PM, Frost, Mark {PBG} wrote:
> >>>>
> >>>> Both the reporting server and the distributed node share the same
> >>>> attributes for retention and soft states:
> >>>>
> >>>> soft_state_dependencies=0
> >>>> passive_host_checks_are_soft=1
> >>>> retain_state_information=1
> >>>> use_retained_program_state=1
> >>>> use_retained_scheduling_info=1
> >>>> retained_host_attribute_mask=0
> >>>> retained_service_attribute_mask=0
> >>>> retained_process_host_attribute_mask=0
> >>>> retained_process_service_attribute_mask=0
> >>>> retained_contact_host_attribute_mask=0
> >>>> retained_contact_service_attribute_mask=0
> >>>>
> >>>> While I would assume the restarts would disrupt Nagios a bit what with
> >>>> having to do start-time tasks again, I would not have expected that it
> >>>> would "start over" with the status of some checks.
> >>>>
> >>>> What am I missing here?
> >>>
> >>>
> >>> It seems you haven't grasped how bitmasks work. When you set the mask
> >>> to 0, you essentially tell it to not let anything through. Set them to
> >>> -1, or leave them at the default values and you'll get the kind of
> >>> state retention you want.
> >>
> >> Thanks, Andreas. Unfortunately, I'm still puzzled. The mask values you
> >> refer to are already set to the defaults (they're all 0's). I've never
> >> touched those or paid much attention to them until now.
> >>
> >> I'm actually confused by 2 aspects of this. It seems to me that the
> >> things I'm trying to retain across a restart are soft check states (those
> >> are what are being reset). Looking at the MODATTR arguments in
> >> include/common.h (3.0.6), I don't see which of those attributes would
> >> govern this. There are the *ENABLED attributes, which really aren't
> >> changing here (and are retained). All the other MODATTRs are (it seems
> >> to me) not changing in this case either.
> >>
> >> The second thing that confuses me here is the verbiage used to describe
> >> the mask functionality:
> >>
> >> # RETAINED ATTRIBUTE MASKS (ADVANCED FEATURE)
> >> # The following variables are used to specify specific host and
> >> # service attributes that should *not* be retained by Nagios during
> >> # program restarts.
> >>
> >> So if MODATTR is set to none, based on the comment doesn't this mean
> >> that "NONE" of the attributes are NOT retained? I.e. all are retained
> >> (double-negative)? The on-line doc for these masks says "By default, all
> >> host and service attributes are retained."
> >>
> >>
> > I don't know the source code behavior, but I agree with this, and a
> > default nagios.cfg has all of the masks set to zero, presumably to not
> > mask anything, i.e. to not affect what's retained.
> >>
> >>
> >> I do get masks, I just didn't see how these applied here.
> >>
> >> Your help is greatly appreciated.
> >>
> > I just did a quick experiment with the default values for *retain*
> > variables in nagios.cfg - which are exactly what you quote:
> >
> > [1257281477] SERVICE ALERT: localhost;File age;CRITICAL;SOFT;1;FILE_AGE CRITICAL: File not found - /tmp/nagios
> > [1257281597] SERVICE ALERT: localhost;File age;CRITICAL;SOFT;2;FILE_AGE CRITICAL: File not found - /tmp/nagios
> > [1257281604] Caught SIGTERM, shutting down...
> > [1257281604] Successfully shutdown... (PID=9617)
> > [1257281605] Nagios 3.0.6 starting... (PID=9721)
> > [1257281605] Local time is Tue Nov 03 21:53:25 CET 2009
> > [1257281605] LOG VERSION: 2.0
> > [1257281605] Finished daemonizing... (New PID=9722)
> > [1257281715] SERVICE ALERT: localhost;File age;CRITICAL;HARD;3;FILE_AGE CRITICAL: File not found - /tmp/nagios
> >
> > Everything works as expected.
> >
> > I'm guessing you have some other issue that's affecting Nagios' ability
> > to save retention data.
> >
> > What are the values of state_retention_file and retention_update_interval
> > for you?
> >
> > Have you checked that state_retention_file is updated while Nagios runs,
> > that the disk isn't close to capacity, and that nothing basic like that
> > is going on?
> >
> > Open up the file and grab the definition for the service in question, and
> > see what values are being saved.
> >
> > HTH,
> >
> > Regards,
> > Martin Melin
>
> Martin,
>
> state_retention_file=/usr/local/eam/nagios/var/retention.dat
> retention_update_interval=60
>
> I see that retention.dat was updated when I restarted Nagios maybe 20
> minutes ago. I just tested disabling notifications for a check, but I guess
> based on my retention_update_interval I won't see the retention.dat file
> change for another 40 minutes.
>
> Nagios monitors the filesystem itself (i.e. Nagios watches itself), but the
> filesystem it resides on is at 35% with 12GB free. If there were a problem
> with that or some other essential operation of Nagios, I think I'd see some
> other problem. In this case, I think an unusual set of circumstances was
> at play -- I was restarting Nagios every few minutes while a host was in the
> process of failing host checks as reported by the distributed nodes. I've
> never seen that before, but I've also probably never happened to do it that
> way.
>
> I looked at this item in retention.dat (it's a host check that we had this
> issue with, not a service check). This might not be all that useful, as the
> issue isn't happening at the moment. At present, I see the following of
> interest:
>
> last_state=0
> last_hard_state=0
> current_attempt=1
> max_attempts=10
> state_history=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
>
> I assume perhaps that "last_state" might mean last soft state. It would be
> interesting to see this value if I could find a practical way to replicate
> this condition. I would also expect current_attempt to be higher than 1 and
> the state_history to show some non-OK states while this issue was happening.
> As I say, I'd have to see these values while this was changing.
>
> Thanks
>
> Mark
>
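To watch those retention.dat fields while the problem is actually happening, something like the following Python sketch could pull one host's entry out of the file. It assumes the file is laid out as `type {` ... `key=value` lines ... `}` blocks with a `host_name` key in each host block; the function names `parse_blocks` and `find_host` and the host name in the usage example are hypothetical.

```python
def parse_blocks(text):
    """Parse 'type { key=value ... }' blocks into a list of (type, dict) pairs."""
    blocks = []
    current_type, current = None, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if current is None:
            if line.endswith("{"):
                current_type = line[:-1].strip()
                current = {}
        elif line == "}":
            blocks.append((current_type, current))
            current_type, current = None, None
        elif "=" in line:
            # Split on the first '=' only, so values may contain '='.
            key, _, value = line.partition("=")
            current[key] = value
    return blocks


def find_host(blocks, host_name):
    """Return the field dict of the first 'host' block matching host_name, or None."""
    for block_type, fields in blocks:
        if block_type == "host" and fields.get("host_name") == host_name:
            return fields
    return None


if __name__ == "__main__":
    # Path from the settings quoted above; "example-host" is a placeholder.
    with open("/usr/local/eam/nagios/var/retention.dat") as handle:
        host = find_host(parse_blocks(handle.read()), "example-host")
    if host is not None:
        for key in ("last_state", "last_hard_state", "current_attempt", "max_attempts"):
            print(key, host.get(key))
```

Running it just before and just after a restart, while a host is mid-way through its retries, would show whether current_attempt survives the restart.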